Skip to content

feat(agent): turn-level resume for ynh agent run#175

Open
eyelock wants to merge 20 commits into
developfrom
feat/yna-loop-agent
Open

feat(agent): turn-level resume for ynh agent run#175
eyelock wants to merge 20 commits into
developfrom
feat/yna-loop-agent

Conversation

@eyelock

@eyelock eyelock commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

Adds turn-level executional durability to ynh agent run. After a crash, interrupt, or SIGTERM, a relaunch with --resume <dir> continues from the last completed turn with budget counters and the worker conversation restored — at most one in-flight turn is ever redone. (This is the final piece on the yna-loop-agent branch; it also lands the prior turn-based loop work.)

Substrate (option B): a small checkpoint.json sidecar, written atomically (temp→fsync→rename) after each completed turn, is the resume source of truth; trajectory.jsonl stays the append-only audit record.

What changed

  • checkpoint.go (new) — Checkpoint struct + atomic write + read with distinct missing/corrupt/version errors.
  • budget.goResume()/WallConsumed() so caps (turns/tokens/wall-clock) carry across a relaunch; new exit codes ExitResumeError(21), ExitInterrupted(31).
  • trajectory.gosession_resumed event (SessionResumedData).
  • backendsWorkerSession.ResumeToken() + StartOptions.ResumeToken:
    • claude: controls its own --session-id <uuid>; resumes via --resume <uuid>.
    • cursor: persists/re-supplies its chatId (already disk-durable per turn).
    • codex: captures the session id from the --json stream and resumes via codex exec resume <id>best-effort / unverified (codex binary is broken in the dev env; may need a follow-up).
  • loop.go--resume plumbing, append-on-resume trajectory, budget restore, checkpoint writes at session-start / plan-finalize / per-turn, session_resumed emit, cancelable ctx + SIGINT/SIGTERM handler + non-interactive interrupt (previously dropped), resume-past-exceeded-budget early exit, multi-resume carry-forward.
  • cmd/ynh/agent.go--resume <dir> flag; task optional on resume.
  • capabilitiesCapabilitiesVersion 0.5.0 → 0.6.0; x-capabilities propagated across docs/schema/** + the internal/clischema/schema/** mirror (parity), goldens (version/installed/list/fork/info), and docs.

Edge cases handled

Mid-turn crash redoes only the incomplete turn; resume past an exceeded budget emits budget_exceeded and exits without a new turn; pending approval gates re-emit via re-execution; stale/corrupt/missing checkpoint fails with a clear code (21); non---resume runs behave exactly as before.

Tests

New checkpoint_test.go + resume_test.go: checkpoint roundtrip & atomic write; missing/corrupt/version errors; resume restores budget and continues at turn N+1 with no replay; append-on-resume vs truncate-on-fresh; interrupt and SIGTERM both leave a resumable checkpoint; mid-turn crash redoes only the incomplete turn; double-interrupt → resume; resume-past-exceeded-budget exits without starting a worker; per-backend resume-token capture. make check green (-race, 0 lint), /evals PASS.

TermQ contract (out of scope here)

  • Resume: relaunch the same ynh agent run … --emit-jsonl <dir>/trajectory.jsonl plus --resume <dir>.
  • Feature-detect: ynh version --format jsoncapabilities ≥ 0.6.0.
  • New event: session_resumed heads a resumed trajectory (carries resumed_at_turn, restored budget).
  • Exit codes: 21 = stale/corrupt/missing checkpoint; 31 = interrupted-but-resumable.
  • Stop protocol unchanged ({"action":"interrupt"} then SIGTERM) — both now leave a resumable checkpoint.

🤖 Generated with Claude Code

David Collie and others added 20 commits May 11, 2026 17:31
Implements the Phase 1 loop driver as `ynh agent run`, embedding
it in the ynh binary per the agent-loop plan. The loop driver is
the missing orchestration layer that sits above ynh's sensor
execution: it spawns a vendor agent subprocess, runs sensors
between turns, synthesises feedback, and enforces budgets and
stuckness limits until all sensors converge.

Key design decisions:
- WorkerBackend interface isolates all wire-format details inside
  each backend; the loop driver never sees stream-json specifics.
  Claude Code is the only v1 backend; the interface is ready for
  Codex (Phase 4) with ~200 incremental lines.
- Sensor execution shells out to `ynh sensors run` (already
  shipped in v0.3.1) so loop-driver policy (pass/fail thresholds)
  stays separate from ynh's mechanical execution.
- NDJSON trajectory writer emits one event per line to a JSONL
  file or stdout; TermQ's Inspector drives off this stream.
- Stdin control protocol (approve_plan, reject_plan, interrupt,
  approve_turn, replace_feedback) allows TermQ and CI to steer
  the loop without polling.
- Budget (turns/tokens/wall-clock) and stuckness watchdog
  (edit-loop + no-progress) are enforced in-process with typed
  exit codes (10-30) for CI integration.
- srt sandbox support via --sandbox srt|none.
- Plan/Act phase split: first turn writes plan.md, awaits
  approval, then enters the act loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add CodexBackend (codex exec --json) and CursorBackend (per-turn
  subprocess with --resume) so all three vendor CLIs are supported
- Fix trajectory wire format to match TermQ consumer expectations:
  Event.Kind serialises as "type" (not "kind"), Event.Timestamp as
  "timestamp" (not "time")
- BudgetExceededData gains a typed Budget field ("turns"/"tokens"/"wall_clock")
- SessionEndData gains TotalTurns and TotalTokens on all exit paths
- TurnApprovalData field renamed to SynthesizedFeedback (JSON: synthesized_feedback)
- Budget.Exceeded() returns BudgetType as a third value; loop driver
  threads it through to the trajectory event

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover NDJSON parsing, content accumulation, usage tracking, EOF handling,
unknown-event skipping, Send wire format, and cursor session state
(pending queue, firstTurn flag, Close no-op).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tching

Loop driver accepts --sensor-overlay <json> (e.g. '{"build":{"source":
{"command":"make fast"}}}') and passes each sensor's overlay to
ynh sensors run via the new --sensor-overlay-json flag. ynh performs a
shallow JSON field-merge over the base harness declaration before
executing the sensor, keeping all execution logic inside ynh.

TermQ uses this to let users tweak sensor declarations per-session in
the Inspector without modifying the installed harness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Go's json.Marshal escapes < and > as </> by default.
Switching to json.NewEncoder + SetEscapeHTML(false) so usage strings
like <harness-name> render literally in terminals and CI logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover unstructured vs structured mode, JSON field values, no-HTML-escape
behaviour (verifies SetEscapeHTML(false) is effective), and trailing newline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Local installs were storing the user-supplied relative path verbatim in
the pointer record. Consumers loading the pointer from a different cwd
(daemon, embedding host, CLI invoked from elsewhere) hit a misleading
"manifest not found" error.

Resolve via filepath.Abs at install time so the pointer stays valid
regardless of the loader's cwd.

Closes #167.
claude-code refuses --print --output-format=stream-json without
--verbose. Without it, every claude-backend agent run failed at startup.
Add --verbose unconditionally — the streaming wire format the loop
driver depends on requires it.
Three related ynh agent run hardening fixes:

* Add --profile <name>: applies the named profile overlay to the
  harness before assembly, mirroring ynh run --profile semantics.

* Add --focus <name>: looks up a harness focus, sets the task to the
  focus prompt, and applies the focus's bound profile if any. Mutually
  exclusive with --task and --profile (the focus carries both).

* Reject unknown backend names with a hint. Previously --backend
  claude-code was silently passed through to vendor.Get, which then
  failed with "unknown vendor". Now validateBackend rejects up front
  with: unknown backend "claude-code" (did you mean "claude"?).
Prepare the trajectory schema and exit codes for plan-phase iteration
(replace_feedback during plan approval reissues the plan, mirroring the
act-phase per-turn refinement). Bump CapabilitiesVersion 0.4.0 → 0.5.0.

Schema additions:
* KindPlanApprovalRequired event with PlanApprovalData{plan, iteration}.
  Plan-phase approval gates use this instead of the act-only
  KindTurnApprovalRequired (whose scope narrows to turn ≥ 1; no rename
  to avoid consumer churn).
* KindPlanRevised event with PlanRevisedData{iteration, notes}, emitted
  at the start of plan iterations 2+ to delimit refinement boundaries
  in the trajectory stream.
* ExitPlanIterationCap (15) for the new --max-plan-iterations guard.

Goldens, doc examples, and x-capabilities annotations bumped to 0.5
across docs/schema/{cli,shared}/, test/golden/, docs/cli-structured.md,
docs/schema-cli.md, docs/tutorial/16-structured-output.md.

The loop driver wiring follows in subsequent commits.
Refactor the plan phase into an iteration loop. In interactive mode,
ActionReplaceFeedback during plan approval reissues the plan with the
user's notes instead of being silently discarded; ActionRejectPlan now
accepts an optional feedback payload that surfaces in
KindSessionEnd.Reason for telemetry.

Behavioural contract:
* Iteration 1: KindPlan boundary as before.
* Iterations 2+: KindPlanRevised{iteration, notes} emitted before the
  revised-plan request is sent.
* Each iteration ends with KindPlanApprovalRequired{plan, iteration}
  carrying the full plan text (consumers don't walk back through events).
* MaxPlanIterations (default 5) caps the refine loop; exit code 15
  (ExitPlanIterationCap) on overflow.
* Tokens and wall-clock count across plan iterations; the act-phase
  turn cap (MaxTurns) does not.
* Worker errors mid-refine still emit a terminal KindSessionEnd
  (no orphan KindPlanRevised events).

waitForApproval now passes the optional feedback payload through on
reject as well as approve. Backward-compatible: existing call sites
that ignore the second return value see no change.

KindTurnApprovalRequired stays as-is — its scope narrows to act-phase
(turn ≥ 1) post-bump per the consumer-coordination decision (avoids
a deprecation churn on every renderer that has the string baked in).
Seven tests covering the new plan-refine surface:

* PlanApproveOnFirstIteration: regression guard for the no-refine path
  (zero KindPlanRevised, exactly one KindPlanApprovalRequired).
* PlanRefineOnceThenApprove: end-to-end refine — assert KindPlanRevised
  payload {iteration: 2, notes} and that the worker received the user's
  notes verbatim in the revise prompt.
* PlanRefineHitsCap: exits cleanly with ExitPlanIterationCap when refine
  exceeds MaxPlanIterations.
* PlanRejectWithNotes: ExitError.Message and KindSessionEnd.Reason both
  carry the stable "plan rejected by user: <notes>" prefix consumers
  can pattern-match against.
* PlanRefineTokenBudget: tokens consumed across iterations roll up into
  budget.Exceeded; second-iteration overflow exits ExitTokenBudget.
* PlanRefineWallBudget: wall-clock check is phase-agnostic — sleeping
  past MaxWall during plan iteration 2 trips ExitWallClock.
* PlanRefineWorkerError: a worker error after KindPlanRevised still
  emits a terminal KindSessionEnd (no orphan in-flight events).

mockBackend gains optional `delays` and `errs` parallel slices for
timing/error injection. Both nil-safe — existing tests continue to
work unchanged.
internal/clischema embeds a copy of the docs/schema/ tree for runtime
validation. The capabilities-version commit updated docs/schema/ but
missed the embedded copies; TestSchemaParityWithDocs flagged the drift.
Bump the embedded x-capabilities annotations to match.
Wire opts.MaxPlanIterations through cmd/ynh/agent.go so consumers can
override the library default of 5 from the command line. Mirrors the
parsing shape of --max-turns: required value, non-negative integer,
zero (or omitted) keeps the library default.

Without this, the field on RunOptions was reachable only via the Go
API — fine for embedded use, but TermQ and other CLI-driven consumers
had no way to surface the budget control to users.
After adding ExitPlanIterationCap, gofmt re-aligns the const block to
the longest identifier. Whitespace-only.
Two related fixes to the plan-phase prompt and plan→act handoff.

(1) Plan prompts no longer ask the worker to write plan.md.

The previous prompts ("Document it clearly in plan.md in the current
directory" / "Update plan.md accordingly") demanded a file write the
loop never read back — plan.md was vestigial decoration. claude in
plan mode is read-only by design, so the worker stalled at its own
permission gate and replied with a question about the write instead
of a plan. ynh then captured that question as the plan content and
surfaced it through KindPlanApprovalRequired, leaving consumers with
nothing to approve. Both prompts now ask only for an inline reply.

Plan mode being read-only is the feature, not a constraint to work
around. The act phase has write access if a plan file is wanted later.

(2) Approved plan content is forwarded into the act phase.

Previously the plan→act boundary collapsed to "Plan approved. Proceed
with implementation: <task>" — the worker entered act mode with only
the original task and lost every word of the plan it just generated
(or refined). Now the act-phase first message embeds the approved plan
text alongside the task so refinement work survives the phase boundary.

Tests assert: plan prompts contain neither "plan.md" nor "current
directory"; the act-phase first message contains both the final
approved plan content and the original task.
Adds KindBudgetSnapshot, emitted once per turn in both the plan and act
phases immediately after the worker's reply is recorded against the
budget. Carries running totals (Turns, Tokens) so consumers can render
live progress without shadowing the driver's accounting.

Why now: tokens were previously visible only in terminal events
(KindBudgetExceeded, KindSessionEnd). Live UIs had no way to render a
"7/25 turns · 142k/500k tokens" strip without reimplementing budget
bookkeeping from KindAssistantMessage payloads — and turn.Usage doesn't
flow into the trajectory at all today, so consumers can't even shadow.

Plan-phase snapshots intentionally report Turns=0; the act-phase turn
counter has not started in the plan loop. Plan-iteration count is a
distinct concept already exposed via PlanApprovalData.Iteration. The
two are deliberately not folded into one field — separate budgets,
separate surfaces. Documented as a field-level invariant.

No CapabilitiesVersion bump: KindBudgetSnapshot is a new member of the
open-set EventKind enum. Per docs/cli-structured.md consumers MUST
tolerate unknown kinds, so this is purely additive. Pre-feature ynh
just doesn't emit the event; consumers degrade gracefully.

Tests:
* per-turn act-phase emission with monotonic accumulation
* plan-phase emission with Turns=0 invariant
* parity between final snapshot and KindSessionEnd totals (drift guard)
* wire-shape regression on field names and types
Make `ynh agent run` resumable at turn granularity. After a crash,
interrupt, or SIGTERM, a relaunch with `--resume <dir>` continues from
the last completed turn with budget counters and the worker conversation
restored — at most one in-flight turn is ever redone.

Substrate (option B): a small checkpoint.json sidecar, written
atomically after each completed turn, is the resume source of truth;
trajectory.jsonl stays the append-only audit record.

- checkpoint.go: Checkpoint struct + atomic write (temp+fsync+rename) +
  read with distinct missing/corrupt/version errors.
- budget.go: Resume()/WallConsumed() so caps carry across a relaunch;
  ExitResumeError(21), ExitInterrupted(31).
- backends: WorkerSession.ResumeToken() + StartOptions.ResumeToken.
  claude controls its own --session-id and resumes via --resume; cursor
  persists/re-supplies its chatId; codex captures the session id from the
  --json stream and resumes via `exec resume` (best-effort, unverified).
- loop.go: --resume plumbing, append-on-resume trajectory, budget
  restore, checkpoint writes at session-start/plan-finalize/per-turn,
  session_resumed event, cancelable ctx + SIGINT/SIGTERM handler +
  non-interactive interrupt, resume-past-exceeded-budget early exit.
- cmd/ynh/agent.go: --resume <dir> flag; task optional on resume.

Bumps CapabilitiesVersion 0.5.0 -> 0.6.0 so TermQ can feature-detect
--resume; propagates x-capabilities across docs/schema + the clischema
mirror, goldens, and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant