
feat(adapter): trigger fallback on timeout + context exhaustion, wire manifest fallbacks #1472

Draft
nextlevelshit wants to merge 4 commits into main from feat/local-adapter-fallback

Conversation

@nextlevelshit
Collaborator

Summary

Two coupled gaps surfaced while running local-Ollama pipelines:

  1. runtime.fallbacks was wired into the type but never reached the runtime. cmd/wave/commands/run.go constructed the adapter registry with nil, so any chain declared in the manifest was inert — silent dead config.
  2. Fallback only triggered on rate_limit. The most common local-Ollama failure mode — the model stalls mid-stream, Wave classifies it as timeout — never invoked the chain. The pipeline just failed even when a configured peer would have succeeded.

Concrete trigger

audit-doc-scan on opencode-glm: the model emits a tool call, then GLM-via-Ollama stops responding (documented bug, see PRs #1404 / #1468 review notes). The stream-idle watchdog inside opencode-patched retries 19+ times; eventually the Wave step timeout fires and the pipeline fails. With this PR, the same failure hands off to opencode-qwen (different model architecture, no post-tool-call stall) and ultimately claude.

Changes

  • isFallbackTrigger now returns true for rate_limit, timeout, and context_exhaustion. Other classifications (general_error, etc.) stay excluded — they typically indicate a bug that will fail the same way on any provider. A sketch of the predicate follows this list.
  • cmd/wave/commands/run.go passes m.Runtime.Fallbacks to adapter.NewAdapterRegistry.
  • This repo's wave.yaml declares the intent:
    runtime:
      fallbacks:
        opencode-glm: [opencode-qwen, claude]
        opencode-qwen: [claude]
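
A minimal sketch of the widened predicate, assuming string classifications (the real constant names and types in the adapter package may differ):

    // Sketch only; classification names assumed from this PR's text.
    package adapter

    func isFallbackTrigger(classification string) bool {
        switch classification {
        case "rate_limit", "timeout", "context_exhaustion":
            // All three commonly resolve when retried on a peer with
            // different limits, model architecture, or context window.
            return true
        default:
            // general_error, validation, etc. indicate bugs that would
            // fail the same way on any provider.
            return false
        }
    }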

Test plan

  • go test ./... — full suite green
  • golangci-lint run ./... — 0 issues
  • go run ./cmd/wave validate — clean
  • TestFallbackRunner_ContextExhaustionTriggersFallback (inverted from previous "DoesNotTrigger" assertion)
  • TestFallbackRunner_TimeoutTriggersFallback (new)
  • TestIsFallbackTrigger updated for new triggers

Reviewer notes

The semantic change to isFallbackTrigger is a deliberate behavior change. Two existing tests had asserted the old "context_exhaustion does not trigger" / "timeout does not trigger" behavior; both are inverted in this PR with comments explaining why.

If a project doesn't want this behavior, it can leave runtime.fallbacks empty (the historical default — same outcome as before). The change only takes effect when fallbacks are explicitly declared.

feat(adapter): trigger fallback on timeout + context exhaustion, wire manifest fallbacks

The fallback chain previously only triggered on `rate_limit`. That left
the most common failure mode in local-Ollama setups — the model stalls
mid-stream and Wave classifies it as `timeout` — with no automatic
recovery. The pipeline just failed, even when a peer adapter was
configured and would have completed the work.

Two coupled changes:

- `isFallbackTrigger` now returns true for rate_limit, timeout, and
  context_exhaustion. All three commonly resolve when retried on a peer
  with different limits, model architecture, or context window. Other
  classifications (general_error, validation, etc.) are intentionally
  excluded — they typically indicate a bug or schema mismatch that will
  fail the same way on any provider.
- `cmd/wave/commands/run.go` now passes `m.Runtime.Fallbacks` to the
  adapter registry constructor. Previously the manifest field existed
  but the runtime always passed nil — the chain was unreachable. See
  the sketch below.
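
A sketch of the one-line wiring fix; the constructor signature is assumed from the description above, not verified against the code:

    // Before: chains declared in the manifest never reached the
    // runtime, because the registry was always built with nil:
    //   registry := adapter.NewAdapterRegistry(nil)
    //
    // After: pass the manifest map through so declared chains go live.
    registry := adapter.NewAdapterRegistry(m.Runtime.Fallbacks)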

Plus a chain in this repo's wave.yaml that captures the local intent:

    opencode-glm: [opencode-qwen, claude]
    opencode-qwen: [claude]

Stalled glm hands off to qwen (different architecture, no GLM-Ollama
post-tool-call stall); stalled qwen hands off to claude.

Test changes mirror the new semantics:
- TestFallbackRunner_ContextExhaustionTriggersFallback (renamed +
  inverted assertion)
- TestFallbackRunner_TimeoutTriggersFallback (new)
- TestIsFallbackTrigger updated to expect timeout + context_exhaustion
  in the trigger set
@nextlevelshit nextlevelshit marked this pull request as draft April 28, 2026 16:39
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
Two more schema rigidity points caught by the same live ops-pr-respond
run on PR #1472:

1. `actionable[]` items also include the originating finding's `item`
   field (e.g. "function name", "route path", "type symbol") — same
   provenance trail as `severity/file/line`. Add it to the actionable
   schema as an optional string.

2. `design_questions[].source_finding_id` was a required string. The
   triage planner reasonably emits `null` when the design question
   isn't tied to a specific source finding (composite or
   architecture-level questions). Allow null in addition to string.

Same fix shape as the deferred/rejected loosening from the parent
commit. Mirrored to .agents/contracts/.
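
Illustrative JSON Schema fragments for the two loosenings (property placement is hypothetical; the real contracts may nest differently). On the actionable item schema, an optional provenance string:

    "item": { "type": "string" }

and on design_questions items, a nullable union type:

    "source_finding_id": { "type": ["string", "null"] }
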
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
…chemas

The schemas were locking shape with `additionalProperties: false` to
catch typos. The cost: every time a prompt evolves to include a
sensible context field (severity, file, line, item, source, category),
the contract validator rejects the deliverable even though the run
was correct in every other way.

Live ops-pr-respond run on PR #1472 just hit this twice in 30 minutes:
once for `deferred[]/rejected[]` extras (#1509 fix, narrow), once for
`actionable[].item` and `design_questions[].source_finding_id: null`
(also in #1509). Same pattern lurked in 24 other schemas.

Fix: blanket-flip `additionalProperties: false` to `true` across all
LLM-output schemas. The `required` arrays still catch typo bugs
(`descriptin` vs `description` is a missing-required failure, not an
extra-properties failure). Extras are now welcomed as provenance.
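
A sketch of the flipped shape on a hypothetical contract (field names illustrative):

    {
      "type": "object",
      "additionalProperties": true,
      "required": ["description", "severity"],
      "properties": {
        "description": { "type": "string" },
        "severity": { "type": "string" }
      }
    }

With `additionalProperties: true`, an emitted `descriptin` key passes as an extra property, but the object still fails validation because the required `description` is missing.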

API-result schemas (issue-update-result, scope-filter-stats, gate
results) are also flipped — they have strong `required` lists and
extras would be benign forwards-compat surface.

Direct commit on main (no PR) per maintainer direction. 48 files
touched (24 each in internal/defaults/contracts/ and .agents/contracts/),
36 individual `false → true` flips in each tree (some schemas had
multiple nested item shapes).

Verified: `go build ./...` green, `go test ./internal/contract/...
./internal/manifest/...` green.
Add fmt.Errorf %w wrapping cases (single and double), errors.Join with
both skill+tool errors, and empty-slice nil returns. Use reflect.DeepEqual
for exact PreflightMetadata field assertions per remediation.
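
A runnable sketch of the wrapping shapes under test; errSkill and errTool are hypothetical sentinels standing in for the real skill/tool errors:

    package main

    import (
        "errors"
        "fmt"
    )

    var (
        errSkill = errors.New("skill failed")
        errTool  = errors.New("tool failed")
    )

    func main() {
        // Single and double %w wrapping: errors.Is unwraps through both.
        wrapped := fmt.Errorf("load skill: %w", errSkill)
        double := fmt.Errorf("preflight: %w", wrapped)
        fmt.Println(errors.Is(double, errSkill)) // true

        // errors.Join carries skill + tool errors; errors.Is matches either.
        joined := errors.Join(errSkill, errTool)
        fmt.Println(errors.Is(joined, errTool)) // true
    }
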
…omposition

Adds unit tests for dbLoggingEmitter.Emit using a fake EventEmitter and a
StateStore that records LogEvent calls. Covers:

- empty step_progress / stream_activity heartbeats are dropped (no LogEvent)
- stream_activity with ToolName composes "<tool> <target>" message
- step_progress with TokensUsed > 0 or DurationMs > 0 is persisted
- normal Message events are persisted as-is
- Event.PipelineID overrides the emitter default runID
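
A sketch of the recording test double, assuming LogEvent takes a run ID and a message (the real StateStore interface is larger and its signature may differ):

    // recordingStore captures LogEvent calls so the tests can assert
    // which heartbeats were dropped and which were persisted.
    type recordingStore struct {
        messages []string
    }

    func (s *recordingStore) LogEvent(runID, message string) error {
        s.messages = append(s.messages, message)
        return nil
    }
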
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
Closes the silent-kill side of #1467.

The orphan reaper marks "running" rows as failed when their owning
process is gone. Liveness is checked in priority order: heartbeat,
PID, age fallback. Sub-pipeline child runs (composition: iterate,
sub_pipeline, branch, loop) execute inside the parent process's
goroutines and never get their own PID, never write their own
heartbeats. Their liveness IS the parent's liveness. The age
fallback then fired at 5 minutes and silently killed every active
fan-out child.

Live ops-pr-respond resume on PR #1472 (this session): triage emitted
15 actionable findings, the loop spawned 6 parallel impl-finding
children, 5 of them got reaped at exactly 5m47s with status="failed"
message="process gone (orphaned)". The sixth completed at 5m29s — it
crossed the finish line a few seconds before the reaper fired.

Real timeouts (default_timeout_minutes: 30, stall_timeout: 10m) had
nothing to do with it. The reaper itself was the problem. apply-fix
runs are LLM-driven worktree edits + tests; 6-10 minutes is normal,
not zombie.

Fix: skip the zombie check entirely for runs with a non-empty
ParentRunID. Reap on the parent if anything; never on the child. The
parent's heartbeat still covers the whole subtree.
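
A sketch of the guard inside the reaper's scan loop; runningRows and the field name are taken from the description, not the actual code:

    for _, run := range runningRows {
        if run.ParentRunID != "" {
            // Child runs live inside the parent's process: no PID,
            // no heartbeat of their own. Reap the parent, never the
            // child.
            continue
        }
        // ... existing heartbeat / PID / age liveness checks ...
    }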

Also cleared up: this is exactly the failure mode the user wrote
about — "these timeouts shouldn't break our legs; we should use them
only rarely" (originally in German). The reaper as currently written
was acting like a third hidden timeout on top of the configured ones.

Test: TestReconcileZombies_SubPipelineChildSpared seeds an aged
child with PID=0/heartbeat=0/parent_run_id=parent and asserts the
reaper does NOT touch it.
nextlevelshit added a commit that referenced this pull request Apr 28, 2026
Live ops-pr-respond resume on PR #1472 (this session) failed at
exactly 30m09s with cascading "context canceled" across in-flight
apply-fix children. Loop fan-out across 15 findings at max 6
concurrent, ~10 min average per finding, means ceil(15/6) = 3
sequential waves of ~10 min each: over 30 min wall-clock as soon as
any finding runs long, exceeding the per-step 30m default that gets
propagated to adapter calls inside the loop body.

Bump to 90 minutes so a 15-finding fan-out finishes inside the
budget. Per-step `timeout_minutes` overrides still apply.
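
The corresponding manifest change, assuming the default lives under runtime in wave.yaml (exact key placement assumed):

    runtime:
      default_timeout_minutes: 90  # was 30; per-step timeout_minutes still overrides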