Skip to content

feat(loops): configurable failure policy — fail-fast vs continue-on-error (#1167)#1339

Open
dolho wants to merge 2 commits into
devfrom
feature/1167-loop-failure-policy
Open

feat(loops): configurable failure policy — fail-fast vs continue-on-error (#1167)#1339
dolho wants to merge 2 commits into
devfrom
feature/1167-loop-failure-policy

Conversation

@dolho

@dolho dolho commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Sequential agent loops (#740) were fail-fast: the first failed iteration aborted the whole loop. This adds a per-loop failure policy so a batch-style loop can tolerate a bad iteration and keep going — bounded by a consecutive-failure circuit so a fully-broken agent still terminates. theme-reliability.

Config (default preserves current behavior)

  • on_failure: abort (default — fail-fast, backward compatible) or continue.
  • max_consecutive_failures: default 3, range 1–100. In continue mode, the loop aborts as failed (stop_reason=max_consecutive_failures) once that many iterations fail in a row; a success resets the streak.

Behavior

  • Both failure surfaces gated: a raised exception from execute_task AND a non-success TaskExecutionResult (TIMEOUT/AGENT_ERROR/CIRCUIT_OPEN/AUTH). Each failed iteration finalizes its agent_loop_runs row as failed, then continue mode proceeds to the next run.
  • Terminal status: reaching max_runs (or matching the stop-signal) with ≥1 tolerated failure → new completed_with_errors status; failed_runs count surfaced on the loop row + API + UI.
  • {{previous_response}} carries the last successful response — a failed iteration never overwrites it (documented semantics).

Surfaces (Invariant #13)

  • schema/migration: agent_loops gains on_failure, max_consecutive_failures, failed_runs (schema.py + tables.py Core object + versioned migration).
  • api: POST /api/agents/{name}/loops accepts the two fields; LoopStatusResponse surfaces failed_runs/on_failure/max_consecutive_failures.
  • mcp: run_agent_loop gains the two params (tools/loops.ts + client.ts).
  • ui: LoopsPanel.vue failure-policy controls + failed_runs/completed_with_errors surfacing.

Acceptance criteria

  • on_failure: abort|continue, default abort (backward compatible)
  • Continue bounded by max_consecutive_failures (cutoff aborts as failed)
  • Failed iteration finalizes its run row as failed, then proceeds
  • Both failure surfaces (exception + non-success result)
  • Partial-success terminal status documented (completed_with_errors + failed_runs)
  • {{previous_response}} on a continued iteration defined (last successful)
  • Plumbed through schema/migration, API, MCP, UI
  • Tests: abort unchanged, continue past a failed run, consecutive cutoff

Testing

42 passed — new TestFailurePolicy (abort default, continue past failure, consecutive cutoff, raised-exception surface, streak-reset) + existing loop suite (backward-compat test_iteration_exception_aborts_loop still green) + schema-parity + migrations.

Related to #1167

🤖 Generated with Claude Code

…rror (#1167)

Sequential agent loops (#740) were fail-fast: the first failed iteration aborted
the whole loop. Add a per-loop policy to tolerate failures and keep going,
bounded so a fully-broken agent still terminates.

- config: `on_failure` ('abort' default = current fail-fast, backward compatible;
  'continue' tolerates a failed iteration) + `max_consecutive_failures`
  (default 3, range 1–100). Both plumbed end to end (Invariant #13).
- runner (loop_service.py): both failure surfaces gated — raised exception AND
  non-success TaskExecutionResult. Continue mode finalizes the failed
  agent_loop_runs row, increments failed_runs/consecutive_failures, and proceeds;
  a success resets the streak. Reaching max_runs (or stop-signal) with tolerated
  failures finalizes as `completed_with_errors`; hitting the consecutive cap
  finalizes `failed`/`stop_reason=max_consecutive_failures`. {{previous_response}}
  keeps the last *successful* response (a failed iteration never overwrites it).
- schema/migration: agent_loops gains on_failure, max_consecutive_failures,
  failed_runs (schema.py + tables.py Core + versioned migration). New terminal
  status `completed_with_errors`.
- api: POST /loops accepts on_failure/max_consecutive_failures; LoopStatusResponse
  surfaces failed_runs/on_failure/max_consecutive_failures.
- mcp: run_agent_loop gains the two params (tools/loops.ts + client.ts).
- ui: LoopsPanel failure-policy controls + failed_runs/completed_with_errors
  surfacing.
- tests: abort unchanged, continue past a failed run (both surfaces), consecutive
  cutoff, streak-reset; schema-parity + migrations green. 42 passed.

Related to #1167

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@dolho dolho left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sequential review — #1167 loop failure policy

Reviewed the diff directly (correctness, backward-compat, plumbing). No blocking issues. One low-severity consistency nit and a few intentional-semantics notes.

Finding (low) — exception surface skips the inter-run delay in continue mode

In services/loop_service.py, the raised-exception failure path ends with continue, which jumps straight to the next iteration and skips the delay_seconds inter-run pause. The non-success-TaskExecutionResult path instead falls through to the delay block before the next run. So a loop whose iterations raise (vs. return a failure result) will retry back-to-back with no pacing.

  • Impact is bounded by max_consecutive_failures, so it can't spin forever, but a flapping agent could hammple the backend faster than intended when delay_seconds is set.
  • Suggest applying the delay before continue (or restructuring both surfaces to share one tail) for parity. Optional — not blocking.

Notes (intentional, documenting)

  • completed_with_errors + stop-signal: if the stop-signal matches on a success after earlier tolerated failures, status stays completed → promoted to completed_with_errors. Defensible (the run did have failures); flagging the semantic.
  • runs_completed counts attempted iterations including tolerated failures, paired with the new failed_runs chip in the UI — consistent with the prior non-success path.
  • max_consecutive_failures > max_runs is allowed (cutoff then never fires); harmless — the loop simply ends at max_runs as completed_with_errors.

Verified good

  • Both failure surfaces gated; backward-compat preserved (default abort; existing test_iteration_exception_aborts_loop still green).
  • {{previous_response}} keeps the last successful response on a tolerated failure (no poisoning of the next prompt).
  • Schema dual-definition correct (schema.py DDL and tables.py Core object) — schema-parity + migrations suites green.
  • on_failure validated via Literal; max_consecutive_failures bounded 1–100.
  • Minimal scope: stores/loops.js correctly left untouched (passes payload through); abort-mode requests don't even send the new fields.
  • Cooperative stop still honored each iteration in continue mode (stop check is at loop top).

Verdict: ship-ready; the delay-on-exception nit is the only thing worth a follow-up commit if you want surface parity.

…1167 review)

Continue mode skipped delay_seconds when an iteration *raised* (the exception
path `continue`d past the delay block), while a non-success result honored it.
Extract the inter-run delay (with its cooperative-stop check) into a helper and
call it on both surfaces so continue-mode pacing is consistent. Add a test
asserting the delay fires after a raised iteration.
@dolho

dolho commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the review finding (delay-on-exception parity) in ed9dce2.

The inter-run delay (with its cooperative-stop check) is now extracted into a single _inter_run_delay() helper and applied on both failure surfaces, so a continue-mode loop whose iterations raise paces with delay_seconds exactly like the non-success-result path. Added a test asserting the delay fires after a raised iteration. 22 passed.

The three semantic notes were intentional and left as-is.

@dolho dolho requested a review from vybe June 24, 2026 13:40

@vybe vybe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated via /validate-prblocking on Invariant #3 (dual-track migrations).

This adds 3 columns to agent_loops (on_failure, max_consecutive_failures, failed_runs) with a SQLite migration (_migrate_agent_loops_failure_policy in db/migrations.py) and updates db/schema.py + db/tables.py — but there's no Alembic revision under src/backend/migrations/versions/. On a PostgreSQL backend, init_database() runs alembic_runner.upgrade_to_head(), which won't add these columns, so loop_service crashes reading them. schema-parity is green because it only validates the SQLite track, not Alembic.

Fix: add a new Alembic revision adding the 3 agent_loops columns. The failure-policy logic, MCP surface, tests, and docs all look complete otherwise — this is the only blocker.

— posted via /validate-pr

@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants