
feat(adapter): trigger fallback on timeout + context exhaustion, wire manifest fallbacks #1472

Draft
nextlevelshit wants to merge 4 commits into main from feat/local-adapter-fallback

Conversation

@nextlevelshit
Collaborator

Summary

Two coupled gaps surfaced while running local-Ollama pipelines:

  1. runtime.fallbacks was wired into the type but never reached the runtime. cmd/wave/commands/run.go constructed the adapter registry with nil, so any chain declared in the manifest was inert — silent dead config.
  2. Fallback only triggered on rate_limit. The most common local-Ollama failure mode — the model stalls mid-stream, Wave classifies it as timeout — never invoked the chain. The pipeline just failed even when a configured peer would have succeeded.

Concrete trigger

audit-doc-scan on opencode-glm: the model emits a tool call, then GLM-via-Ollama stops responding (documented bug, see PRs #1404 / #1468 review notes). The stream-idle watchdog inside opencode-patched retries 19+ times; eventually the Wave step timeout fires and the pipeline fails. With this PR, the same failure hands off to opencode-qwen (different model architecture, no post-tool-call stall) and ultimately claude.

Changes

  • isFallbackTrigger now returns true for rate_limit, timeout, and context_exhaustion. Other classifications (general_error, etc.) stay excluded — they typically indicate a bug that will fail the same way on any provider. A sketch of the predicate follows this list.
  • cmd/wave/commands/run.go passes m.Runtime.Fallbacks to adapter.NewAdapterRegistry.
  • This repo's wave.yaml declares the intent:
    runtime:
      fallbacks:
        opencode-glm: [opencode-qwen, claude]
        opencode-qwen: [claude]
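
A minimal sketch of the widened predicate, assuming string classifications (the real constant names and types in the adapter package may differ):

    // Sketch only; classification names assumed from this PR's text.
    package adapter

    func isFallbackTrigger(classification string) bool {
        switch classification {
        case "rate_limit", "timeout", "context_exhaustion":
            // All three commonly resolve when retried on a peer with
            // different limits, model architecture, or context window.
            return true
        default:
            // general_error, validation, etc. indicate bugs that would
            // fail the same way on any provider.
            return false
        }
    }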

Test plan

  • go test ./... — full suite green
  • golangci-lint run ./... — 0 issues
  • go run ./cmd/wave validate — clean
  • TestFallbackRunner_ContextExhaustionTriggersFallback (inverted from previous "DoesNotTrigger" assertion)
  • TestFallbackRunner_TimeoutTriggersFallback (new)
  • TestIsFallbackTrigger updated for new triggers

Reviewer notes

The semantic change to isFallbackTrigger is a deliberate behavior change. Two existing tests had asserted the old "context_exhaustion does not trigger" / "timeout does not trigger" behavior; both are inverted in this PR with comments explaining why.

If a project doesn't want this behavior, it can leave runtime.fallbacks empty (the historical default — same outcome as before). The change only takes effect when fallbacks are explicitly declared.

feat(adapter): trigger fallback on timeout + context exhaustion, wire manifest fallbacks

The fallback chain previously only triggered on `rate_limit`. That left
the most common failure mode in local-Ollama setups — the model stalls
mid-stream and Wave classifies it as `timeout` — with no automatic
recovery. The pipeline just failed, even when a peer adapter was
configured and would have completed the work.

Two coupled changes:

- `isFallbackTrigger` now returns true for rate_limit, timeout, and
  context_exhaustion. All three commonly resolve when retried on a peer
  with different limits, model architecture, or context window. Other
  classifications (general_error, validation, etc.) are intentionally
  excluded — they typically indicate a bug or schema mismatch that will
  fail the same way on any provider.
- `cmd/wave/commands/run.go` now passes `m.Runtime.Fallbacks` to the
  adapter registry constructor. Previously the manifest field existed
  but the runtime always passed nil — the chain was unreachable. See
  the sketch below.
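
A sketch of the one-line wiring fix; the constructor signature is assumed from the description above, not verified against the code:

    // Before: chains declared in the manifest never reached the
    // runtime, because the registry was always built with nil:
    //   registry := adapter.NewAdapterRegistry(nil)
    //
    // After: pass the manifest map through so declared chains go live.
    registry := adapter.NewAdapterRegistry(m.Runtime.Fallbacks)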

Plus a chain in this repo's wave.yaml that captures the local intent:

    opencode-glm: [opencode-qwen, claude]
    opencode-qwen: [claude]

Stalled glm hands off to qwen (different architecture, no GLM-Ollama
post-tool-call stall); stalled qwen hands off to claude.

Test changes mirror the new semantics:
- TestFallbackRunner_ContextExhaustionTriggersFallback (renamed +
  inverted assertion)
- TestFallbackRunner_TimeoutTriggersFallback (new)
- TestIsFallbackTrigger updated to expect timeout + context_exhaustion
  in the trigger set
@nextlevelshit nextlevelshit marked this pull request as draft April 28, 2026 16:39
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
Two more schema rigidity points caught by the same live ops-pr-respond
run on PR #1472:

1. `actionable[]` items also include the originating finding's `item`
   field (e.g. "function name", "route path", "type symbol") — same
   provenance trail as `severity/file/line`. Add it to the actionable
   schema as an optional string.

2. `design_questions[].source_finding_id` was a required string. The
   triage planner reasonably emits `null` when the design question
   isn't tied to a specific source finding (composite or
   architecture-level questions). Allow null in addition to string.

Same fix shape as the deferred/rejected loosening from the parent
commit. Mirrored to .agents/contracts/.
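
Illustrative JSON Schema fragments for the two loosenings (property placement is hypothetical; the real contracts may nest differently). On the actionable item schema, an optional provenance string:

    "item": { "type": "string" }

and on design_questions items, a nullable union type:

    "source_finding_id": { "type": ["string", "null"] }
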
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
…chemas

The schemas were locking shape with `additionalProperties: false` to
catch typos. The cost: every time a prompt evolves to include a
sensible context field (severity, file, line, item, source, category),
the contract validator rejects the deliverable even though the run
was correct in every other way.

Live ops-pr-respond run on PR #1472 just hit this twice in 30 minutes:
once for `deferred[]/rejected[]` extras (#1509 fix, narrow), once for
`actionable[].item` and `design_questions[].source_finding_id: null`
(also in #1509). Same pattern lurked in 24 other schemas.

Fix: blanket-flip `additionalProperties: false` to `true` across all
LLM-output schemas. The `required` arrays still catch typo bugs
(`descriptin` vs `description` is a missing-required failure, not an
extra-properties failure). Extras are now welcomed as provenance.
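
A sketch of the flipped shape on a hypothetical contract (field names illustrative):

    {
      "type": "object",
      "additionalProperties": true,
      "required": ["description", "severity"],
      "properties": {
        "description": { "type": "string" },
        "severity": { "type": "string" }
      }
    }

With `additionalProperties: true`, an emitted `descriptin` key passes as an extra property, but the object still fails validation because the required `description` is missing.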

API-result schemas (issue-update-result, scope-filter-stats, gate
results) are also flipped — they have strong `required` lists and
extras would be benign forwards-compat surface.

Direct commit on main (no PR) per maintainer direction. 48 files
touched (24 each in internal/defaults/contracts/ and .agents/contracts/),
36 individual `false → true` flips in each tree (some schemas had
multiple nested item shapes).

Verified: `go build ./...` green, `go test ./internal/contract/...
./internal/manifest/...` green.
Add fmt.Errorf %w wrapping cases (single and double), errors.Join with
both skill+tool errors, and empty-slice nil returns. Use reflect.DeepEqual
for exact PreflightMetadata field assertions per remediation.
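
A runnable sketch of the wrapping shapes under test; errSkill and errTool are hypothetical sentinels standing in for the real skill/tool errors:

    package main

    import (
        "errors"
        "fmt"
    )

    var (
        errSkill = errors.New("skill failed")
        errTool  = errors.New("tool failed")
    )

    func main() {
        // Single and double %w wrapping: errors.Is unwraps through both.
        wrapped := fmt.Errorf("load skill: %w", errSkill)
        double := fmt.Errorf("preflight: %w", wrapped)
        fmt.Println(errors.Is(double, errSkill)) // true

        // errors.Join carries skill + tool errors; errors.Is matches either.
        joined := errors.Join(errSkill, errTool)
        fmt.Println(errors.Is(joined, errTool)) // true
    }
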
…omposition

Adds unit tests for dbLoggingEmitter.Emit using a fake EventEmitter and a
StateStore that records LogEvent calls. Covers:

- empty step_progress / stream_activity heartbeats are dropped (no LogEvent)
- stream_activity with ToolName composes "<tool> <target>" message
- step_progress with TokensUsed > 0 or DurationMs > 0 is persisted
- normal Message events are persisted as-is
- Event.PipelineID overrides the emitter default runID
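
A sketch of the recording test double, assuming LogEvent takes a run ID and a message (the real StateStore interface is larger and its signature may differ):

    // recordingStore captures LogEvent calls so the tests can assert
    // which heartbeats were dropped and which were persisted.
    type recordingStore struct {
        messages []string
    }

    func (s *recordingStore) LogEvent(runID, message string) error {
        s.messages = append(s.messages, message)
        return nil
    }
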
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
Closes the silent-kill side of #1467.

The orphan reaper marks "running" rows as failed when their owning
process is gone. Liveness is checked in priority order: heartbeat,
PID, age fallback. Sub-pipeline child runs (composition: iterate,
sub_pipeline, branch, loop) execute inside the parent process's
goroutines and never get their own PID, never write their own
heartbeats. Their liveness IS the parent's liveness. The age
fallback then fired at 5 minutes and silently killed every active
fan-out child.

Live ops-pr-respond resume on PR #1472 (this session): triage emitted
15 actionable findings, the loop spawned 6 parallel impl-finding
children, 5 of them got reaped at exactly 5m47s with status="failed"
message="process gone (orphaned)". The sixth completed at 5m29s — it
crossed the finish line a few seconds before the reaper fired.

Real timeouts (default_timeout_minutes: 30, stall_timeout: 10m) had
nothing to do with it. The reaper itself was the problem. apply-fix
runs are LLM-driven worktree edits + tests; 6-10 minutes is normal,
not zombie.

Fix: skip the zombie check entirely for runs with a non-empty
ParentRunID. Reap on the parent if anything; never on the child. The
parent's heartbeat still covers the whole subtree.
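
A sketch of the guard inside the reaper's scan loop; runningRows and the field name are taken from the description, not the actual code:

    for _, run := range runningRows {
        if run.ParentRunID != "" {
            // Child runs live inside the parent's process: no PID,
            // no heartbeat of their own. Reap the parent, never the
            // child.
            continue
        }
        // ... existing heartbeat / PID / age liveness checks ...
    }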

Also cleared up: this is exactly the failure mode the user wrote
about — "these timeouts shouldn't break our legs; we should use them
only rarely" (originally in German). The reaper as currently written
was acting like a third hidden timeout on top of the configured ones.

Test: TestReconcileZombies_SubPipelineChildSpared seeds an aged
child with PID=0/heartbeat=0/parent_run_id=parent and asserts the
reaper does NOT touch it.
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
Live ops-pr-respond resume on PR #1472 (this session) failed at
exactly 30m09s with cascading "context canceled" across in-flight
apply-fix children. Loop fan-out across 15 findings at max 6
concurrent, ~10 min average per finding, means ceil(15/6) = 3
sequential waves of ~10 min each: over 30 min wall-clock as soon as
any finding runs long, exceeding the per-step 30m default that gets
propagated to adapter calls inside the loop body.

Bump to 90 minutes so a 15-finding fan-out finishes inside the
budget. Per-step `timeout_minutes` overrides still apply.
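
The corresponding manifest change, assuming the default lives under runtime in wave.yaml (exact key placement assumed):

    runtime:
      default_timeout_minutes: 90  # was 30; per-step timeout_minutes still overrides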