feat(adapter): trigger fallback on timeout + context exhaustion, wire manifest fallbacks #1472
Draft
nextlevelshit wants to merge 4 commits into
Conversation
… manifest fallbacks
The fallback chain previously only triggered on `rate_limit`. That left
the most common failure mode in local-Ollama setups — the model stalls
mid-stream and Wave classifies it as `timeout` — with no automatic
recovery. The pipeline just failed, even when a peer adapter was
configured and would have completed the work.
Two coupled changes:
- `isFallbackTrigger` now returns true for rate_limit, timeout, and
context_exhaustion. All three commonly resolve when retried on a peer
with different limits, model architecture, or context window. Other
classifications (general_error, validation, etc.) are intentionally
excluded — they typically indicate a bug or schema mismatch that will
fail the same way on any provider.
- `cmd/wave/commands/run.go` now passes `m.Runtime.Fallbacks` to the
adapter registry constructor. Previously the manifest field existed
but the runtime always passed nil — the chain was unreachable.
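The new trigger predicate can be sketched as follows. This is a minimal illustration, not Wave's actual code: the type and constant names are assumptions; only the classification strings (`rate_limit`, `timeout`, `context_exhaustion`, `general_error`, `validation`) come from this PR.

```go
package main

import "fmt"

// ErrorClassification mirrors Wave's failure taxonomy (type and
// constant names are assumptions for illustration).
type ErrorClassification string

const (
	ClassRateLimit         ErrorClassification = "rate_limit"
	ClassTimeout           ErrorClassification = "timeout"
	ClassContextExhaustion ErrorClassification = "context_exhaustion"
	ClassGeneralError      ErrorClassification = "general_error"
	ClassValidation        ErrorClassification = "validation"
)

// isFallbackTrigger returns true for classifications that commonly
// resolve when retried on a peer adapter with different limits, model
// architecture, or context window. general_error and validation stay
// excluded: they usually fail identically on any provider.
func isFallbackTrigger(c ErrorClassification) bool {
	switch c {
	case ClassRateLimit, ClassTimeout, ClassContextExhaustion:
		return true
	default:
		return false
	}
}

func main() {
	for _, c := range []ErrorClassification{ClassRateLimit, ClassTimeout, ClassContextExhaustion, ClassGeneralError} {
		fmt.Printf("%s → %v\n", c, isFallbackTrigger(c))
	}
}
```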
Plus a chain in this repo's wave.yaml that captures the local intent:
opencode-glm: [opencode-qwen, claude]
opencode-qwen: [claude]
Stalled glm hands off to qwen (different architecture, no GLM-Ollama
post-tool-call stall); stalled qwen hands off to claude.
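In manifest terms, that chain might be declared like this. The `runtime.fallbacks` key is named elsewhere in this PR; the exact YAML nesting shown here is an assumption:

```yaml
runtime:
  fallbacks:
    opencode-glm: [opencode-qwen, claude]
    opencode-qwen: [claude]
```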
Test changes mirror the new semantics:
- TestFallbackRunner_ContextExhaustionTriggersFallback (renamed +
inverted assertion)
- TestFallbackRunner_TimeoutTriggersFallback (new)
- TestIsFallbackTrigger updated to expect timeout + context_exhaustion
in the trigger set
This was referenced Apr 28, 2026
nextlevelshit added a commit that referenced this pull request on Apr 28, 2026
Two more schema rigidity points caught by the same live ops-pr-respond run on PR #1472:

1. `actionable[]` items also include the originating finding's `item` field (e.g. "function name", "route path", "type symbol") — the same provenance trail as `severity`/`file`/`line`. Add it to the actionable schema as an optional string.
2. `design_questions[].source_finding_id` was a required string. The triage planner reasonably emits `null` when the design question isn't tied to a specific source finding (composite or architecture-level questions). Allow null in addition to string.

Same fix shape as the deferred/rejected loosening from the parent commit. Mirrored to .agents/contracts/.
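The second loosening can be sketched in JSON Schema as a type union; `actionable[].item` gets the same treatment as a plain optional string. The `question` property and the surrounding object shape are assumptions for illustration — only `source_finding_id` and its nullability come from the commit:

```json
{
  "type": "object",
  "required": ["question"],
  "properties": {
    "question": { "type": "string" },
    "source_finding_id": { "type": ["string", "null"] }
  }
}
```

Because `source_finding_id` stays out of `required`, the planner may omit it, emit a string, or emit an explicit `null` without tripping the contract validator.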
nextlevelshit added a commit that referenced this pull request on Apr 28, 2026
…chemas

The schemas were locking shape with `additionalProperties: false` to catch typos. The cost: every time a prompt evolves to include a sensible context field (severity, file, line, item, source, category), the contract validator rejects the deliverable even though the run was correct in every other way. A live ops-pr-respond run on PR #1472 just hit this twice in 30 minutes: once for `deferred[]`/`rejected[]` extras (#1509 fix, narrow), once for `actionable[].item` and `design_questions[].source_finding_id: null` (also in #1509). The same pattern lurked in 24 other schemas.

Fix: blanket-flip `additionalProperties: false` to `true` across all LLM-output schemas. The `required` arrays still catch typo bugs (`descriptin` vs `description` is a missing-required failure, not an extra-properties failure). Extras are now welcomed as provenance. API-result schemas (issue-update-result, scope-filter-stats, gate results) are also flipped — they have strong `required` lists and extras would be benign forwards-compat surface.

Direct commit on main (no PR) per maintainer direction. 48 files touched (24 each in internal/defaults/contracts/ and .agents/contracts/), 36 individual `false → true` flips in each tree (some schemas had multiple nested item shapes). Verified: `go build ./...` green, `go test ./internal/contract/... ./internal/manifest/...` green.
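The claim that `required` alone still catches typos can be checked with a hand-rolled sketch. This is not Wave's validator — just a minimal required-keys check illustrating why a misspelled key surfaces as a missing-required failure rather than an extra-properties one:

```python
# Minimal sketch: with additionalProperties flipped to true, a typo'd
# key ("descriptin") still fails, because the correctly spelled key is
# now missing from the document's required set.
schema_required = {"description", "file"}

def missing_required(doc: dict) -> set:
    """Return the required keys the document does not supply."""
    return schema_required - doc.keys()

good = {"description": "fix", "file": "a.go", "severity": "low"}  # extra key: fine
typo = {"descriptin": "fix", "file": "a.go"}                      # typo'd key: caught

print(sorted(missing_required(good)), sorted(missing_required(typo)))
```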
Add `fmt.Errorf` `%w` wrapping cases (single and double), `errors.Join` with both skill and tool errors, and empty-slice nil returns. Use `reflect.DeepEqual` for exact `PreflightMetadata` field assertions per remediation.
…omposition

Adds unit tests for `dbLoggingEmitter.Emit` using a fake `EventEmitter` and a `StateStore` that records `LogEvent` calls. Covers:

- empty step_progress / stream_activity heartbeats are dropped (no `LogEvent`)
- stream_activity with `ToolName` composes a "&lt;tool&gt; &lt;target&gt;" message
- step_progress with `TokensUsed` > 0 or `DurationMs` > 0 is persisted
- normal Message events are persisted as-is
- `Event.PipelineID` overrides the emitter default runID
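The filtering rules under test can be sketched as a pure function. Field and function names here are assumptions inferred from the commit message; the real emitter also persists via the state store, which this sketch omits:

```go
package main

import "fmt"

// Event mirrors the fields the emitter consults (names are
// assumptions for illustration).
type Event struct {
	Type       string
	Message    string
	ToolName   string
	Target     string
	TokensUsed int
	DurationMs int
}

// composeMessage returns the persisted message and whether the event
// should be logged at all: empty heartbeats are dropped,
// stream_activity with a tool composes "<tool> <target>", and
// step_progress persists only when it carries token or duration data.
func composeMessage(e Event) (string, bool) {
	switch e.Type {
	case "stream_activity":
		if e.ToolName == "" {
			return "", false // empty heartbeat: drop
		}
		return e.ToolName + " " + e.Target, true
	case "step_progress":
		if e.TokensUsed == 0 && e.DurationMs == 0 {
			return "", false // empty heartbeat: drop
		}
		return e.Message, true
	default:
		return e.Message, true // normal messages persist as-is
	}
}

func main() {
	_, kept := composeMessage(Event{Type: "step_progress"})
	msg, _ := composeMessage(Event{Type: "stream_activity", ToolName: "grep", Target: "cmd/"})
	fmt.Println(kept, msg)
}
```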
nextlevelshit added a commit that referenced this pull request on Apr 28, 2026
Closes the silent-kill side of #1467.

The orphan reaper marks "running" rows as failed when their owning process is gone. Liveness is checked in priority order: heartbeat, PID, age fallback. Sub-pipeline child runs (composition: iterate, sub_pipeline, branch, loop) execute inside the parent process's goroutines and never get their own PID, never write their own heartbeats. Their liveness IS the parent's liveness. The age fallback then fired at 5 minutes and silently killed every active fan-out child.

Live ops-pr-respond resume on PR #1472 (this session): triage emitted 15 actionable findings, the loop spawned 6 parallel impl-finding children, and 5 of them were reaped at exactly 5m47s with status="failed", message="process gone (orphaned)". The sixth completed at 5m29s — it crossed the finish line a few seconds before the reaper fired. Real timeouts (default_timeout_minutes: 30, stall_timeout: 10m) had nothing to do with it; the reaper itself was the problem. apply-fix runs are LLM-driven worktree edits plus tests; 6-10 minutes is normal, not zombie.

Fix: skip the zombie check entirely for runs with a non-empty `ParentRunID`. Reap on the parent if anything; never on the child. The parent's heartbeat still covers the whole subtree.

This is exactly the failure mode the user described — "these timeouts should not break our legs; we should use them only rarely". The reaper as written was acting like a third hidden timeout on top of the configured ones.

Test: `TestReconcileZombies_SubPipelineChildSpared` seeds an aged child with PID=0/heartbeat=0/parent_run_id=parent and asserts the reaper does NOT touch it.
nextlevelshit added a commit that referenced this pull request on Apr 28, 2026
Live ops-pr-respond resume on PR #1472 (this session) failed at exactly 30m09s with cascading "context canceled" errors across in-flight apply-fix children. Loop fan-out across 15 findings × max 6 concurrent × ~10 min average per finding comes to >30 min wall-clock, exceeding the per-step 30m default that gets propagated to adapter calls inside the loop body. Bump the default to 90 minutes so a 15-finding fan-out finishes inside the budget. Per-step `timeout_minutes` overrides still apply.
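The back-of-envelope arithmetic behind the bump, using the numbers from the commit message (~10 min is an average, not a bound, so the real run can exceed it):

```python
import math

findings = 15
max_concurrent = 6
avg_minutes_per_finding = 10

# With 6-way concurrency, 15 children execute in 3 sequential waves.
waves = math.ceil(findings / max_concurrent)
wall_clock = waves * avg_minutes_per_finding  # right at the old 30 min budget

old_budget, new_budget = 30, 90
print(wall_clock, wall_clock >= old_budget, wall_clock < new_budget)
```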
Summary
Two coupled gaps surfaced while running local-Ollama pipelines:

- `runtime.fallbacks` was wired into the type but never reached the runtime. `cmd/wave/commands/run.go` constructed the adapter registry with `nil`, so any chain declared in the manifest was inert — silent dead config.
- The chain only triggered on `rate_limit`. The most common local-Ollama failure mode — the model stalls mid-stream, Wave classifies it as `timeout` — never invoked the chain. The pipeline just failed even when a configured peer would have succeeded.

Concrete trigger

`audit-doc-scan` on `opencode-glm`: the model emits a tool call, GLM-via-Ollama stops responding (documented bug, see PRs #1404 / #1468 review notes). The stream idle watchdog inside `opencode-patched` retries 19+ times, eventually the wave step timeout fires, and the pipeline fails. With this PR, the same failure hands off to `opencode-qwen` (different model architecture, no post-tool-call stall) and ultimately `claude`.

Changes
- `isFallbackTrigger` now returns true for `rate_limit`, `timeout`, and `context_exhaustion`. Other classifications (general_error, etc.) stay excluded — they typically indicate a bug that will fail the same way on any provider.
- `cmd/wave/commands/run.go` passes `m.Runtime.Fallbacks` to `adapter.NewAdapterRegistry`.
- `wave.yaml` declares the intent:
  opencode-glm: [opencode-qwen, claude]
  opencode-qwen: [claude]

Test plan

- `go test ./...` — full suite green
- `golangci-lint run ./...` — 0 issues
- `go run ./cmd/wave validate` — clean
- `TestFallbackRunner_ContextExhaustionTriggersFallback` (inverted from the previous "DoesNotTrigger" assertion)
- `TestFallbackRunner_TimeoutTriggersFallback` (new)
- `TestIsFallbackTrigger` updated for the new triggers

Reviewer notes
The semantic change to `isFallbackTrigger` is a deliberate behavior change. Two existing tests had asserted the old "context_exhaustion does not trigger" / "timeout does not trigger" behavior; both are inverted in this PR with comments explaining why.

If a project doesn't want this behavior, it can leave `runtime.fallbacks` empty (the historical default — same outcome as before). The change only takes effect when fallbacks are explicitly declared.