feat(loops): loop-level wall-clock deadline (max_duration_seconds) (#1156)#1182
feat(loops): loop-level wall-clock deadline (max_duration_seconds) (#1156)#1182dolho wants to merge 3 commits into
Conversation
be4c6f5 to
bc23675
Compare
|
Resolve by running |
|
@dependabot rebase |
vybe
left a comment
There was a problem hiding this comment.
Thanks — the feature itself looks good (clean SQLite migration, schema.py/tables.py updated, docs + tests present). One blocker before merge:
Missing the PostgreSQL Alembic revision (dual-track migration, CLAUDE.md §9 / Invariant #9).
This PR adds agent_loops.max_duration_seconds via the SQLite track (_migrate_agent_loops_max_duration in db/migrations.py) but there is no matching Alembic revision under src/backend/migrations/versions/. On a PostgreSQL deployment the column would never be created, so the loop deadline feature would crash there.
Please add a revision (down_revision = current head, e.g. 0001_baseline) that runs op.add_column('agent_loops', sa.Column('max_duration_seconds', sa.Integer(), nullable=True)). Once that's in, this is good to go.
…1156) Adds an optional total-duration bound to sequential agent loops, the third hard stop alongside the max_runs iteration cap. A loop legally configured today (max_runs=100 × timeout_per_run up to 2h + delays) could run for days; max_duration_seconds caps total wall-clock time. - Schema: agent_loops.max_duration_seconds INTEGER (nullable) + idempotent migration (agent_loops_max_duration). - Runner (loop_service): deadline measured from started_at, checked only at iteration boundaries (before the next run + before/after the inter-run delay, which is capped to the remaining budget). An in-flight run is never killed mid-turn — overshoot is bounded by one timeout_per_run. Expiry stops with stop_reason="deadline_exceeded", terminal status "stopped". - Router: optional max_duration_seconds (1..604800); 400 when smaller than the effective per-run timeout (timeout_per_run, else agent execution_timeout). GET /api/loops/{id} returns max_duration_seconds + computed elapsed_seconds. - MCP run_agent_loop + UI Loops form/detail expose the parameter and deadline. - Tests: deadline stop at boundary, in-flight run completes (not killed), delay capped to remaining budget, no-deadline regression, and the four validation paths. Docs: architecture.md (feature + schema) and requirements.md §38.2. Closes #1156 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1156) The SQLite path got the column via db/migrations.py + schema.py, but the PostgreSQL backend is Alembic-owned and the sqlite PRAGMA migrations are skipped — so existing PG deployments never received the column, and loops.py's insert(agent_loops).values(max_duration_seconds=...) would fail on Postgres. Adds 0004_agent_loops_max_duration (down_revision 0003) mirroring the 0002/0003 precedent: idempotent ADD COLUMN IF NOT EXISTS so it's a no-op on fresh PG builds (baseline already creates it from schema.py:TABLES) and adds it in place on existing PG deployments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5546a4d to
30bbaf9
Compare
…#1156) The feature added `max_duration_seconds` but the run-agent-loop feature flow wasn't updated (flagged by /validate-pr). Weave the wall-clock deadline through the flow: tool/UI params, the create-time 400 validation + GET elapsed_seconds, the iteration-boundary deadline check and `deadline_exceeded` terminal reason, the dual-track migration (SQLite + Alembic 0005), error-handling rows, the 604800s limit, and the new tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR Validation Report —
|
| Category | Status | Notes |
|---|---|---|
| Commit messages | ✅ | conventional (feat/fix/docs) |
| Base branch | ✅ | → dev |
| PR size | ✅ | 15 files |
| Issue link | ✅ | resolves public #1156 (p1, type-feature); "Related to" per relate-don't-close convention |
| Requirements | ✅ | requirements.md updated (param 1–604800) |
| Architecture | ✅ | loops section + agent_loops DDL + endpoint behavior |
| Feature flow | ✅ | run-agent-loop.md updated in 53ed5aa (was the one |
| Security | ✅ | no secrets / emails / IPs / .env / cred files |
| Build/Config packaging | ➖ | no new top-level module; no new os.getenv() |
| Dual-track migration (Inv. #9) | ✅ | SQLite + Alembic 0005 + schema.py + tables.py consistent |
| Code quality | ✅ | iteration-boundary deadline, overshoot ≤ one timeout_per_run, 400 if deadline < per-run timeout |
Verification: schema-parity pytest ✅ (4) · loop tests ✅ (test_loop_service + test_loops_router_validation, 25).
Recommendation: APPROVE ✅
Clean, well-tested, dual-track migration correct after the rebase fix, docs complete. P1 feature — /review + /cso --diff remain part of the pipeline if not already run.
Summary
Adds an optional loop-level wall-clock deadline (
max_duration_seconds) to sequential agent loops — the third hard stop alongside themax_runsiteration cap (and the separately-tracked cost budget). A loop legally configured today (max_runs=100×timeout_per_runup to 2h +delay_seconds) can run for days; this bounds total duration.Related to #1156.
What changed (end-to-end)
agent_loops.max_duration_seconds INTEGER(nullable) inschema.py+ idempotentagent_loops_max_durationmigration for existing DBs.loop_service._run): deadline measured fromstarted_at, checked only at iteration boundaries — before the next run and before/after the inter-run delay (thedelay_secondssleep is capped to the remaining budget, never sleeping past the deadline). An in-flight run is never killed mid-turn, so overshoot is bounded by onetimeout_per_run. Expiry → terminal statusstopped,stop_reason="deadline_exceeded".max_duration_seconds(1–604800 = 7d); 400 when smaller than the effective per-run timeout (timeout_per_run, else the agent'sexecution_timeout_seconds).GET /api/loops/{id}now returnsmax_duration_seconds+ a computedelapsed_seconds.run_agent_loop+ Loops UI (form field + deadline/elapsed in detail,deadline_exceededlabel).architecture.md(feature + schema) andrequirements.md§38.2.Acceptance criteria
POST /loopsaccepts optionalmax_duration_seconds; persisted onagent_loopsstarted_at, checked at iteration boundaries; in-flight run completesstop_reason="deadline_exceeded", statusstopped< timeout_per_run(or agent execution timeout) with 400GET /loops/{id}returns deadline + elapsed; MCPrun_agent_loopexposes the paramTesting
tests/unit/test_loop_service.py(deadline suite) +tests/unit/test_loops_router_validation.py— 24 passed locally. Also verified the migration is idempotent andcreate_loop/get_loopround-trips the column on a fresh schema and via the ALTER path.🤖 Generated with Claude Code