feat(agent): turn-level resume for `ynh agent run` by eyelock · Pull Request #175 · eyelock/ynh

eyelock · 2026-06-20T09:37:11Z

Summary

Adds turn-level executional durability to ynh agent run. After a crash, interrupt, or SIGTERM, a relaunch with --resume <dir> continues from the last completed turn with budget counters and the worker conversation restored — at most one in-flight turn is ever redone. (This is the final piece on the yna-loop-agent branch; it also lands the prior turn-based loop work.)

Substrate (option B): a small checkpoint.json sidecar, written atomically (temp→fsync→rename) after each completed turn, is the resume source of truth; trajectory.jsonl stays the append-only audit record.

What changed

checkpoint.go (new) — Checkpoint struct + atomic write + read with distinct missing/corrupt/version errors.
budget.go — Resume()/WallConsumed() so caps (turns/tokens/wall-clock) carry across a relaunch; new exit codes ExitResumeError(21), ExitInterrupted(31).
trajectory.go — session_resumed event (SessionResumedData).
backends — WorkerSession.ResumeToken() + StartOptions.ResumeToken:
- claude: controls its own --session-id <uuid>; resumes via --resume <uuid>.
- cursor: persists/re-supplies its chatId (already disk-durable per turn).
- codex: captures the session id from the --json stream and resumes via codex exec resume <id> — best-effort / unverified (codex binary is broken in the dev env; may need a follow-up).
loop.go — --resume plumbing, append-on-resume trajectory, budget restore, checkpoint writes at session-start / plan-finalize / per-turn, session_resumed emit, cancelable ctx + SIGINT/SIGTERM handler + non-interactive interrupt (previously dropped), resume-past-exceeded-budget early exit, multi-resume carry-forward.
cmd/ynh/agent.go — --resume <dir> flag; task optional on resume.
capabilities — CapabilitiesVersion 0.5.0 → 0.6.0; x-capabilities propagated across docs/schema/** + the internal/clischema/schema/** mirror (parity), goldens (version/installed/list/fork/info), and docs.

Edge cases handled

Mid-turn crash redoes only the incomplete turn; resume past an exceeded budget emits budget_exceeded and exits without a new turn; pending approval gates re-emit via re-execution; stale/corrupt/missing checkpoint fails with a clear code (21); non---resume runs behave exactly as before.

Tests

New checkpoint_test.go + resume_test.go: checkpoint roundtrip & atomic write; missing/corrupt/version errors; resume restores budget and continues at turn N+1 with no replay; append-on-resume vs truncate-on-fresh; interrupt and SIGTERM both leave a resumable checkpoint; mid-turn crash redoes only the incomplete turn; double-interrupt → resume; resume-past-exceeded-budget exits without starting a worker; per-backend resume-token capture. make check green (-race, 0 lint), /evals PASS.

TermQ contract (out of scope here)

Resume: relaunch the same ynh agent run … --emit-jsonl <dir>/trajectory.jsonl plus --resume <dir>.
Feature-detect: ynh version --format json → capabilities ≥ 0.6.0.
New event: session_resumed heads a resumed trajectory (carries resumed_at_turn, restored budget).
Exit codes: 21 = stale/corrupt/missing checkpoint; 31 = interrupted-but-resumable.
Stop protocol unchanged ({"action":"interrupt"} then SIGTERM) — both now leave a resumable checkpoint.

🤖 Generated with Claude Code

Implements the Phase 1 loop driver as `ynh agent run`, embedding it in the ynh binary per the agent-loop plan. The loop driver is the missing orchestration layer that sits above ynh's sensor execution: it spawns a vendor agent subprocess, runs sensors between turns, synthesises feedback, and enforces budgets and stuckness limits until all sensors converge. Key design decisions: - WorkerBackend interface isolates all wire-format details inside each backend; the loop driver never sees stream-json specifics. Claude Code is the only v1 backend; the interface is ready for Codex (Phase 4) with ~200 incremental lines. - Sensor execution shells out to `ynh sensors run` (already shipped in v0.3.1) so loop-driver policy (pass/fail thresholds) stays separate from ynh's mechanical execution. - NDJSON trajectory writer emits one event per line to a JSONL file or stdout; TermQ's Inspector drives off this stream. - Stdin control protocol (approve_plan, reject_plan, interrupt, approve_turn, replace_feedback) allows TermQ and CI to steer the loop without polling. - Budget (turns/tokens/wall-clock) and stuckness watchdog (edit-loop + no-progress) are enforced in-process with typed exit codes (10-30) for CI integration. - srt sandbox support via --sandbox srt|none. - Plan/Act phase split: first turn writes plan.md, awaits approval, then enters the act loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add CodexBackend (codex exec --json) and CursorBackend (per-turn subprocess with --resume) so all three vendor CLIs are supported - Fix trajectory wire format to match TermQ consumer expectations: Event.Kind serialises as "type" (not "kind"), Event.Timestamp as "timestamp" (not "time") - BudgetExceededData gains a typed Budget field ("turns"/"tokens"/"wall_clock") - SessionEndData gains TotalTurns and TotalTokens on all exit paths - TurnApprovalData field renamed to SynthesizedFeedback (JSON: synthesized_feedback) - Budget.Exceeded() returns BudgetType as a third value; loop driver threads it through to the trajectory event Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cover NDJSON parsing, content accumulation, usage tracking, EOF handling, unknown-event skipping, Send wire format, and cursor session state (pending queue, firstTurn flag, Close no-op). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tching Loop driver accepts --sensor-overlay <json> (e.g. '{"build":{"source": {"command":"make fast"}}}') and passes each sensor's overlay to ynh sensors run via the new --sensor-overlay-json flag. ynh performs a shallow JSON field-merge over the base harness declaration before executing the sensor, keeping all execution logic inside ynh. TermQ uses this to let users tweak sensor declarations per-session in the Inspector without modifying the installed harness. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Go's json.Marshal escapes < and > as </> by default. Switching to json.NewEncoder + SetEscapeHTML(false) so usage strings like <harness-name> render literally in terminals and CI logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cover unstructured vs structured mode, JSON field values, no-HTML-escape behaviour (verifies SetEscapeHTML(false) is effective), and trailing newline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Local installs were storing the user-supplied relative path verbatim in the pointer record. Consumers loading the pointer from a different cwd (daemon, embedding host, CLI invoked from elsewhere) hit a misleading "manifest not found" error. Resolve via filepath.Abs at install time so the pointer stays valid regardless of the loader's cwd. Closes #167.

claude-code refuses --print --output-format=stream-json without --verbose. Without it, every claude-backend agent run failed at startup. Add --verbose unconditionally — the streaming wire format the loop driver depends on requires it.

Three related ynh agent run hardening fixes: * Add --profile <name>: applies the named profile overlay to the harness before assembly, mirroring ynh run --profile semantics. * Add --focus <name>: looks up a harness focus, sets the task to the focus prompt, and applies the focus's bound profile if any. Mutually exclusive with --task and --profile (the focus carries both). * Reject unknown backend names with a hint. Previously --backend claude-code was silently passed through to vendor.Get, which then failed with "unknown vendor". Now validateBackend rejects up front with: unknown backend "claude-code" (did you mean "claude"?).

Prepare the trajectory schema and exit codes for plan-phase iteration (replace_feedback during plan approval reissues the plan, mirroring the act-phase per-turn refinement). Bump CapabilitiesVersion 0.4.0 → 0.5.0. Schema additions: * KindPlanApprovalRequired event with PlanApprovalData{plan, iteration}. Plan-phase approval gates use this instead of the act-only KindTurnApprovalRequired (whose scope narrows to turn ≥ 1; no rename to avoid consumer churn). * KindPlanRevised event with PlanRevisedData{iteration, notes}, emitted at the start of plan iterations 2+ to delimit refinement boundaries in the trajectory stream. * ExitPlanIterationCap (15) for the new --max-plan-iterations guard. Goldens, doc examples, and x-capabilities annotations bumped to 0.5 across docs/schema/{cli,shared}/, test/golden/, docs/cli-structured.md, docs/schema-cli.md, docs/tutorial/16-structured-output.md. The loop driver wiring follows in subsequent commits.

Refactor the plan phase into an iteration loop. In interactive mode, ActionReplaceFeedback during plan approval reissues the plan with the user's notes instead of being silently discarded; ActionRejectPlan now accepts an optional feedback payload that surfaces in KindSessionEnd.Reason for telemetry. Behavioural contract: * Iteration 1: KindPlan boundary as before. * Iterations 2+: KindPlanRevised{iteration, notes} emitted before the revised-plan request is sent. * Each iteration ends with KindPlanApprovalRequired{plan, iteration} carrying the full plan text (consumers don't walk back through events). * MaxPlanIterations (default 5) caps the refine loop; exit code 15 (ExitPlanIterationCap) on overflow. * Tokens and wall-clock count across plan iterations; the act-phase turn cap (MaxTurns) does not. * Worker errors mid-refine still emit a terminal KindSessionEnd (no orphan KindPlanRevised events). waitForApproval now passes the optional feedback payload through on reject as well as approve. Backward-compatible: existing call sites that ignore the second return value see no change. KindTurnApprovalRequired stays as-is — its scope narrows to act-phase (turn ≥ 1) post-bump per the consumer-coordination decision (avoids a deprecation churn on every renderer that has the string baked in).

Seven tests covering the new plan-refine surface: * PlanApproveOnFirstIteration: regression guard for the no-refine path (zero KindPlanRevised, exactly one KindPlanApprovalRequired). * PlanRefineOnceThenApprove: end-to-end refine — assert KindPlanRevised payload {iteration: 2, notes} and that the worker received the user's notes verbatim in the revise prompt. * PlanRefineHitsCap: exits cleanly with ExitPlanIterationCap when refine exceeds MaxPlanIterations. * PlanRejectWithNotes: ExitError.Message and KindSessionEnd.Reason both carry the stable "plan rejected by user: <notes>" prefix consumers can pattern-match against. * PlanRefineTokenBudget: tokens consumed across iterations roll up into budget.Exceeded; second-iteration overflow exits ExitTokenBudget. * PlanRefineWallBudget: wall-clock check is phase-agnostic — sleeping past MaxWall during plan iteration 2 trips ExitWallClock. * PlanRefineWorkerError: a worker error after KindPlanRevised still emits a terminal KindSessionEnd (no orphan in-flight events). mockBackend gains optional `delays` and `errs` parallel slices for timing/error injection. Both nil-safe — existing tests continue to work unchanged.

internal/clischema embeds a copy of the docs/schema/ tree for runtime validation. The capabilities-version commit updated docs/schema/ but missed the embedded copies; TestSchemaParityWithDocs flagged the drift. Bump the embedded x-capabilities annotations to match.

Wire opts.MaxPlanIterations through cmd/ynh/agent.go so consumers can override the library default of 5 from the command line. Mirrors the parsing shape of --max-turns: required value, non-negative integer, zero (or omitted) keeps the library default. Without this, the field on RunOptions was reachable only via the Go API — fine for embedded use, but TermQ and other CLI-driven consumers had no way to surface the budget control to users.

After adding ExitPlanIterationCap, gofmt re-aligns the const block to the longest identifier. Whitespace-only.

Two related fixes to the plan-phase prompt and plan→act handoff. (1) Plan prompts no longer ask the worker to write plan.md. The previous prompts ("Document it clearly in plan.md in the current directory" / "Update plan.md accordingly") demanded a file write the loop never read back — plan.md was vestigial decoration. claude in plan mode is read-only by design, so the worker stalled at its own permission gate and replied with a question about the write instead of a plan. ynh then captured that question as the plan content and surfaced it through KindPlanApprovalRequired, leaving consumers with nothing to approve. Both prompts now ask only for an inline reply. Plan mode being read-only is the feature, not a constraint to work around. The act phase has write access if a plan file is wanted later. (2) Approved plan content is forwarded into the act phase. Previously the plan→act boundary collapsed to "Plan approved. Proceed with implementation: <task>" — the worker entered act mode with only the original task and lost every word of the plan it just generated (or refined). Now the act-phase first message embeds the approved plan text alongside the task so refinement work survives the phase boundary. Tests assert: plan prompts contain neither "plan.md" nor "current directory"; the act-phase first message contains both the final approved plan content and the original task.

Adds KindBudgetSnapshot, emitted once per turn in both the plan and act phases immediately after the worker's reply is recorded against the budget. Carries running totals (Turns, Tokens) so consumers can render live progress without shadowing the driver's accounting. Why now: tokens were previously visible only in terminal events (KindBudgetExceeded, KindSessionEnd). Live UIs had no way to render a "7/25 turns · 142k/500k tokens" strip without reimplementing budget bookkeeping from KindAssistantMessage payloads — and turn.Usage doesn't flow into the trajectory at all today, so consumers can't even shadow. Plan-phase snapshots intentionally report Turns=0; the act-phase turn counter has not started in the plan loop. Plan-iteration count is a distinct concept already exposed via PlanApprovalData.Iteration. The two are deliberately not folded into one field — separate budgets, separate surfaces. Documented as a field-level invariant. No CapabilitiesVersion bump: KindBudgetSnapshot is a new member of the open-set EventKind enum. Per docs/cli-structured.md consumers MUST tolerate unknown kinds, so this is purely additive. Pre-feature ynh just doesn't emit the event; consumers degrade gracefully. Tests: * per-turn act-phase emission with monotonic accumulation * plan-phase emission with Turns=0 invariant * parity between final snapshot and KindSessionEnd totals (drift guard) * wire-shape regression on field names and types

Make `ynh agent run` resumable at turn granularity. After a crash, interrupt, or SIGTERM, a relaunch with `--resume <dir>` continues from the last completed turn with budget counters and the worker conversation restored — at most one in-flight turn is ever redone. Substrate (option B): a small checkpoint.json sidecar, written atomically after each completed turn, is the resume source of truth; trajectory.jsonl stays the append-only audit record. - checkpoint.go: Checkpoint struct + atomic write (temp+fsync+rename) + read with distinct missing/corrupt/version errors. - budget.go: Resume()/WallConsumed() so caps carry across a relaunch; ExitResumeError(21), ExitInterrupted(31). - backends: WorkerSession.ResumeToken() + StartOptions.ResumeToken. claude controls its own --session-id and resumes via --resume; cursor persists/re-supplies its chatId; codex captures the session id from the --json stream and resumes via `exec resume` (best-effort, unverified). - loop.go: --resume plumbing, append-on-resume trajectory, budget restore, checkpoint writes at session-start/plan-finalize/per-turn, session_resumed event, cancelable ctx + SIGINT/SIGTERM handler + non-interactive interrupt, resume-past-exceeded-budget early exit. - cmd/ynh/agent.go: --resume <dir> flag; task optional on resume. Bumps CapabilitiesVersion 0.5.0 -> 0.6.0 so TermQ can feature-detect --resume; propagates x-capabilities across docs/schema + the clischema mirror, goldens, and docs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

David Collie and others added 20 commits May 11, 2026 17:31

test(cli): add tests for structured error envelope format

b89da63

Cover unstructured vs structured mode, JSON field values, no-HTML-escape behaviour (verifies SetEscapeHTML(false) is effective), and trailing newline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/develop' into feat/yna-loop-agent

5fb60ac

style(agent): gofmt column alignment for exit-code block

67512b6

After adding ExitPlanIterationCap, gofmt re-aligns the const block to the longest identifier. Whitespace-only.

Merge remote-tracking branch 'origin/develop' into feat/yna-loop-agent

54eeb45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent): turn-level resume for `ynh agent run`#175

feat(agent): turn-level resume for `ynh agent run`#175
eyelock wants to merge 20 commits into
developfrom
feat/yna-loop-agent

eyelock commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eyelock commented Jun 20, 2026

Summary

What changed

Edge cases handled

Tests

TermQ contract (out of scope here)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant