feat(agent): turn-level resume for ynh agent run #175
Open
eyelock wants to merge 20 commits into
Implements the Phase 1 loop driver as `ynh agent run`, embedding it in the ynh binary per the agent-loop plan. The loop driver is the missing orchestration layer that sits above ynh's sensor execution: it spawns a vendor agent subprocess, runs sensors between turns, synthesises feedback, and enforces budgets and stuckness limits until all sensors converge.

Key design decisions:
- WorkerBackend interface isolates all wire-format details inside each backend; the loop driver never sees stream-json specifics. Claude Code is the only v1 backend; the interface is ready for Codex (Phase 4) with ~200 incremental lines.
- Sensor execution shells out to `ynh sensors run` (already shipped in v0.3.1) so loop-driver policy (pass/fail thresholds) stays separate from ynh's mechanical execution.
- NDJSON trajectory writer emits one event per line to a JSONL file or stdout; TermQ's Inspector drives off this stream.
- Stdin control protocol (approve_plan, reject_plan, interrupt, approve_turn, replace_feedback) allows TermQ and CI to steer the loop without polling.
- Budget (turns/tokens/wall-clock) and stuckness watchdog (edit-loop + no-progress) are enforced in-process with typed exit codes (10-30) for CI integration.
- srt sandbox support via --sandbox srt|none.
- Plan/Act phase split: first turn writes plan.md, awaits approval, then enters the act loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add CodexBackend (codex exec --json) and CursorBackend (per-turn
subprocess with --resume) so all three vendor CLIs are supported
- Fix trajectory wire format to match TermQ consumer expectations:
Event.Kind serialises as "type" (not "kind"), Event.Timestamp as
"timestamp" (not "time")
- BudgetExceededData gains a typed Budget field ("turns"/"tokens"/"wall_clock")
- SessionEndData gains TotalTurns and TotalTokens on all exit paths
- TurnApprovalData field renamed to SynthesizedFeedback (JSON: synthesized_feedback)
- Budget.Exceeded() returns BudgetType as a third value; loop driver
threads it through to the trajectory event
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover NDJSON parsing, content accumulation, usage tracking, EOF handling, unknown-event skipping, Send wire format, and cursor session state (pending queue, firstTurn flag, Close no-op). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tching
Loop driver accepts --sensor-overlay <json> (e.g. '{"build":{"source":
{"command":"make fast"}}}') and passes each sensor's overlay to
ynh sensors run via the new --sensor-overlay-json flag. ynh performs a
shallow JSON field-merge over the base harness declaration before
executing the sensor, keeping all execution logic inside ynh.
TermQ uses this to let users tweak sensor declarations per-session in
the Inspector without modifying the installed harness.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
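One plausible reading of the "shallow JSON field-merge" above is that each top-level field in the overlay replaces the same field in the base declaration wholesale, with no recursion into nested objects. The sketch below illustrates that reading only; `shallowMerge` and the `timeout` field are hypothetical and are not taken from ynh.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// shallowMerge overlays the top-level fields of overlay onto base.
// A field present in overlay replaces the base value entirely (no deep merge);
// fields absent from overlay are kept unchanged.
func shallowMerge(base, overlay []byte) ([]byte, error) {
	var b, o map[string]json.RawMessage
	if err := json.Unmarshal(base, &b); err != nil {
		return nil, fmt.Errorf("base: %w", err)
	}
	if err := json.Unmarshal(overlay, &o); err != nil {
		return nil, fmt.Errorf("overlay: %w", err)
	}
	if b == nil { // base was JSON null
		b = map[string]json.RawMessage{}
	}
	for k, v := range o {
		b[k] = v
	}
	return json.Marshal(b) // map keys are emitted in sorted order
}

func main() {
	// "timeout" is an illustrative field, not a documented sensor property.
	base := []byte(`{"source":{"command":"make build"},"timeout":"5m"}`)
	overlay := []byte(`{"source":{"command":"make fast"}}`)

	merged, err := shallowMerge(base, overlay)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(merged))
}
```

Under this reading, an overlay that sets `source` drops any sibling keys the base had inside `source`, which keeps the merge predictable at the cost of requiring the overlay to restate the whole nested object.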
Go's json.Marshal escapes < and > as \u003c and \u003e by default. Switch to json.NewEncoder + SetEscapeHTML(false) so usage strings like <harness-name> render literally in terminals and CI logs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
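The difference between the two encoders can be seen in a few lines. This is a small sketch using only the standard library; the `encodeNoEscape` helper and the sample usage string are illustrative, not the project's code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// encodeNoEscape marshals v without HTML-escaping <, >, and &.
// Encoder.Encode appends a trailing newline to its output.
func encodeNoEscape(v any) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	usage := map[string]string{"usage": "<harness-name>"}

	escaped, err := json.Marshal(usage)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(escaped)) // {"usage":"\u003charness-name\u003e"}

	literal, err := encodeNoEscape(usage)
	if err != nil {
		panic(err)
	}
	fmt.Print(literal) // {"usage":"<harness-name>"}
}
```

Both forms decode to the same string, so the change affects only how the raw JSON reads to a human.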
Cover unstructured vs structured mode, JSON field values, no-HTML-escape behaviour (verifies SetEscapeHTML(false) is effective), and trailing newline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Local installs were storing the user-supplied relative path verbatim in the pointer record. Consumers loading the pointer from a different cwd (daemon, embedding host, CLI invoked from elsewhere) hit a misleading "manifest not found" error. Resolve via filepath.Abs at install time so the pointer stays valid regardless of the loader's cwd. Closes #167.
claude-code refuses --print --output-format=stream-json without --verbose. Without it, every claude-backend agent run failed at startup. Add --verbose unconditionally — the streaming wire format the loop driver depends on requires it.
Three related ynh agent run hardening fixes:
* Add --profile <name>: applies the named profile overlay to the harness before assembly, mirroring ynh run --profile semantics.
* Add --focus <name>: looks up a harness focus, sets the task to the focus prompt, and applies the focus's bound profile if any. Mutually exclusive with --task and --profile (the focus carries both).
* Reject unknown backend names with a hint. Previously --backend claude-code was silently passed through to vendor.Get, which then failed with "unknown vendor". Now validateBackend rejects up front with: unknown backend "claude-code" (did you mean "claude"?).
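An up-front check with a hint could look roughly like the sketch below. The error text matches the message quoted above, but the list of backend names and the prefix-based hint heuristic are assumptions; the real `validateBackend` may choose its suggestion differently.

```go
package main

import (
	"fmt"
	"strings"
)

// knownBackends is assumed from the backends named in this PR.
var knownBackends = []string{"claude", "codex", "cursor"}

// validateBackend rejects unknown names before any vendor lookup and offers a
// hint when the input merely extends a known name (e.g. "claude-code").
func validateBackend(name string) error {
	for _, b := range knownBackends {
		if name == b {
			return nil
		}
	}
	for _, b := range knownBackends {
		if strings.HasPrefix(name, b) { // hypothetical hint heuristic
			return fmt.Errorf("unknown backend %q (did you mean %q?)", name, b)
		}
	}
	return fmt.Errorf("unknown backend %q", name)
}

func main() {
	fmt.Println(validateBackend("claude"))
	fmt.Println(validateBackend("claude-code"))
	fmt.Println(validateBackend("gpt"))
}
```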
Prepare the trajectory schema and exit codes for plan-phase iteration
(replace_feedback during plan approval reissues the plan, mirroring the
act-phase per-turn refinement). Bump CapabilitiesVersion 0.4.0 → 0.5.0.
Schema additions:
* KindPlanApprovalRequired event with PlanApprovalData{plan, iteration}.
Plan-phase approval gates use this instead of the act-only
KindTurnApprovalRequired (whose scope narrows to turn ≥ 1; no rename
to avoid consumer churn).
* KindPlanRevised event with PlanRevisedData{iteration, notes}, emitted
at the start of plan iterations 2+ to delimit refinement boundaries
in the trajectory stream.
* ExitPlanIterationCap (15) for the new --max-plan-iterations guard.
Goldens, doc examples, and x-capabilities annotations bumped to 0.5
across docs/schema/{cli,shared}/, test/golden/, docs/cli-structured.md,
docs/schema-cli.md, docs/tutorial/16-structured-output.md.
The loop driver wiring follows in subsequent commits.
Refactor the plan phase into an iteration loop. In interactive mode,
ActionReplaceFeedback during plan approval reissues the plan with the
user's notes instead of being silently discarded; ActionRejectPlan now
accepts an optional feedback payload that surfaces in
KindSessionEnd.Reason for telemetry.
Behavioural contract:
* Iteration 1: KindPlan boundary as before.
* Iterations 2+: KindPlanRevised{iteration, notes} emitted before the
revised-plan request is sent.
* Each iteration ends with KindPlanApprovalRequired{plan, iteration}
carrying the full plan text (consumers don't walk back through events).
* MaxPlanIterations (default 5) caps the refine loop; exit code 15
(ExitPlanIterationCap) on overflow.
* Tokens and wall-clock count across plan iterations; the act-phase
turn cap (MaxTurns) does not.
* Worker errors mid-refine still emit a terminal KindSessionEnd
(no orphan KindPlanRevised events).
waitForApproval now passes the optional feedback payload through on
reject as well as approve. Backward-compatible: existing call sites
that ignore the second return value see no change.
KindTurnApprovalRequired stays as-is — its scope narrows to act-phase
(turn ≥ 1) post-bump per the consumer-coordination decision (avoids
a deprecation churn on every renderer that has the string baked in).
Seven tests covering the new plan-refine surface:
* PlanApproveOnFirstIteration: regression guard for the no-refine path
(zero KindPlanRevised, exactly one KindPlanApprovalRequired).
* PlanRefineOnceThenApprove: end-to-end refine — assert KindPlanRevised
payload {iteration: 2, notes} and that the worker received the user's
notes verbatim in the revise prompt.
* PlanRefineHitsCap: exits cleanly with ExitPlanIterationCap when refine
exceeds MaxPlanIterations.
* PlanRejectWithNotes: ExitError.Message and KindSessionEnd.Reason both
carry the stable "plan rejected by user: <notes>" prefix consumers
can pattern-match against.
* PlanRefineTokenBudget: tokens consumed across iterations roll up into
budget.Exceeded; second-iteration overflow exits ExitTokenBudget.
* PlanRefineWallBudget: wall-clock check is phase-agnostic — sleeping
past MaxWall during plan iteration 2 trips ExitWallClock.
* PlanRefineWorkerError: a worker error after KindPlanRevised still
emits a terminal KindSessionEnd (no orphan in-flight events).
mockBackend gains optional `delays` and `errs` parallel slices for
timing/error injection. Both nil-safe — existing tests continue to
work unchanged.
internal/clischema embeds a copy of the docs/schema/ tree for runtime validation. The capabilities-version commit updated docs/schema/ but missed the embedded copies; TestSchemaParityWithDocs flagged the drift. Bump the embedded x-capabilities annotations to match.
Wire opts.MaxPlanIterations through cmd/ynh/agent.go so consumers can override the library default of 5 from the command line. Mirrors the parsing shape of --max-turns: required value, non-negative integer, zero (or omitted) keeps the library default. Without this, the field on RunOptions was reachable only via the Go API — fine for embedded use, but TermQ and other CLI-driven consumers had no way to surface the budget control to users.
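The parsing rule described above (required value, non-negative integer, zero keeps the default) can be sketched as two small steps. The function names and the error wording are hypothetical; the default of 5 is the library default named in this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultMaxPlanIterations mirrors the library default mentioned in the PR.
const defaultMaxPlanIterations = 5

// parseMaxPlanIterations validates the raw flag value.
func parseMaxPlanIterations(raw string) (int, error) {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("--max-plan-iterations: want a non-negative integer, got %q", raw)
	}
	return n, nil
}

// effectiveMaxPlanIterations applies the "zero means default" rule.
func effectiveMaxPlanIterations(n int) int {
	if n == 0 {
		return defaultMaxPlanIterations
	}
	return n
}

func main() {
	for _, raw := range []string{"3", "0", "-1", "many"} {
		n, err := parseMaxPlanIterations(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", effectiveMaxPlanIterations(n))
	}
}
```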
After adding ExitPlanIterationCap, gofmt re-aligns the const block to the longest identifier. Whitespace-only.
Two related fixes to the plan-phase prompt and plan→act handoff.
(1) Plan prompts no longer ask the worker to write plan.md.
The previous prompts ("Document it clearly in plan.md in the current
directory" / "Update plan.md accordingly") demanded a file write the
loop never read back — plan.md was vestigial decoration. claude in
plan mode is read-only by design, so the worker stalled at its own
permission gate and replied with a question about the write instead
of a plan. ynh then captured that question as the plan content and
surfaced it through KindPlanApprovalRequired, leaving consumers with
nothing to approve. Both prompts now ask only for an inline reply.
Plan mode being read-only is the feature, not a constraint to work
around. The act phase has write access if a plan file is wanted later.
(2) Approved plan content is forwarded into the act phase.
Previously the plan→act boundary collapsed to "Plan approved. Proceed
with implementation: <task>" — the worker entered act mode with only
the original task and lost every word of the plan it just generated
(or refined). Now the act-phase first message embeds the approved plan
text alongside the task so refinement work survives the phase boundary.
Tests assert: plan prompts contain neither "plan.md" nor "current
directory"; the act-phase first message contains both the final
approved plan content and the original task.
Adds KindBudgetSnapshot, emitted once per turn in both the plan and act phases immediately after the worker's reply is recorded against the budget. Carries running totals (Turns, Tokens) so consumers can render live progress without shadowing the driver's accounting.

Why now: tokens were previously visible only in terminal events (KindBudgetExceeded, KindSessionEnd). Live UIs had no way to render a "7/25 turns · 142k/500k tokens" strip without reimplementing budget bookkeeping from KindAssistantMessage payloads, and turn.Usage doesn't flow into the trajectory at all today, so consumers can't even shadow.

Plan-phase snapshots intentionally report Turns=0; the act-phase turn counter has not started in the plan loop. Plan-iteration count is a distinct concept already exposed via PlanApprovalData.Iteration. The two are deliberately not folded into one field: separate budgets, separate surfaces. Documented as a field-level invariant.

No CapabilitiesVersion bump: KindBudgetSnapshot is a new member of the open-set EventKind enum. Per docs/cli-structured.md consumers MUST tolerate unknown kinds, so this is purely additive. Pre-feature ynh just doesn't emit the event; consumers degrade gracefully.

Tests:
* per-turn act-phase emission with monotonic accumulation
* plan-phase emission with Turns=0 invariant
* parity between final snapshot and KindSessionEnd totals (drift guard)
* wire-shape regression on field names and types
Make `ynh agent run` resumable at turn granularity. After a crash, interrupt, or SIGTERM, a relaunch with `--resume <dir>` continues from the last completed turn with budget counters and the worker conversation restored; at most one in-flight turn is ever redone.

Substrate (option B): a small checkpoint.json sidecar, written atomically after each completed turn, is the resume source of truth; trajectory.jsonl stays the append-only audit record.

- checkpoint.go: Checkpoint struct + atomic write (temp+fsync+rename) + read with distinct missing/corrupt/version errors.
- budget.go: Resume()/WallConsumed() so caps carry across a relaunch; ExitResumeError(21), ExitInterrupted(31).
- backends: WorkerSession.ResumeToken() + StartOptions.ResumeToken. claude controls its own --session-id and resumes via --resume; cursor persists/re-supplies its chatId; codex captures the session id from the --json stream and resumes via `exec resume` (best-effort, unverified).
- loop.go: --resume plumbing, append-on-resume trajectory, budget restore, checkpoint writes at session-start/plan-finalize/per-turn, session_resumed event, cancelable ctx + SIGINT/SIGTERM handler + non-interactive interrupt, resume-past-exceeded-budget early exit.
- cmd/ynh/agent.go: --resume <dir> flag; task optional on resume.

Bumps CapabilitiesVersion 0.5.0 -> 0.6.0 so TermQ can feature-detect --resume; propagates x-capabilities across docs/schema + the clischema mirror, goldens, and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds turn-level executional durability to `ynh agent run`. After a crash, interrupt, or SIGTERM, a relaunch with `--resume <dir>` continues from the last completed turn with budget counters and the worker conversation restored; at most one in-flight turn is ever redone. (This is the final piece on the `yna-loop-agent` branch; it also lands the prior turn-based loop work.)

Substrate (option B): a small `checkpoint.json` sidecar, written atomically (temp→fsync→rename) after each completed turn, is the resume source of truth; `trajectory.jsonl` stays the append-only audit record.

What changed
- `checkpoint.go` (new): `Checkpoint` struct + atomic write + read with distinct missing/corrupt/version errors.
- `budget.go`: `Resume()`/`WallConsumed()` so caps (turns/tokens/wall-clock) carry across a relaunch; new exit codes `ExitResumeError` (21), `ExitInterrupted` (31).
- `trajectory.go`: `session_resumed` event (`SessionResumedData`).
- Backends: `WorkerSession.ResumeToken()` + `StartOptions.ResumeToken`:
  - claude controls its own `--session-id <uuid>`; resumes via `--resume <uuid>`.
  - cursor persists and re-supplies its `chatId` (already disk-durable per turn).
  - codex captures the session id from the `--json` stream and resumes via `codex exec resume <id>`; best-effort / unverified (codex binary is broken in the dev env; may need a follow-up).
- `loop.go`: `--resume` plumbing, append-on-resume trajectory, budget restore, checkpoint writes at session-start / plan-finalize / per-turn, `session_resumed` emit, cancelable ctx + SIGINT/SIGTERM handler + non-interactive interrupt (previously dropped), resume-past-exceeded-budget early exit, multi-resume carry-forward.
- `cmd/ynh/agent.go`: `--resume <dir>` flag; task optional on resume.
- `CapabilitiesVersion` 0.5.0 → 0.6.0; `x-capabilities` propagated across `docs/schema/**` + the `internal/clischema/schema/**` mirror (parity), goldens (version/installed/list/fork/info), and docs.

Edge cases handled
Mid-turn crash redoes only the incomplete turn; resume past an exceeded budget emits `budget_exceeded` and exits without a new turn; pending approval gates re-emit via re-execution; stale/corrupt/missing checkpoint fails with a clear code (21); non-`--resume` runs behave exactly as before.

Tests
New `checkpoint_test.go` + `resume_test.go`: checkpoint roundtrip & atomic write; missing/corrupt/version errors; resume restores budget and continues at turn N+1 with no replay; append-on-resume vs truncate-on-fresh; interrupt and SIGTERM both leave a resumable checkpoint; mid-turn crash redoes only the incomplete turn; double-interrupt → resume; resume-past-exceeded-budget exits without starting a worker; per-backend resume-token capture. `make check` green (`-race`, 0 lint), `/evals` PASS.

TermQ contract (out of scope here)
- `ynh agent run … --emit-jsonl <dir>/trajectory.jsonl` plus `--resume <dir>`.
- `ynh version --format json` → capabilities ≥ 0.6.0.
- `session_resumed` heads a resumed trajectory (carries `resumed_at_turn`, restored budget).
- `21` = stale/corrupt/missing checkpoint; `31` = interrupted-but-resumable.
- `{"action":"interrupt"}` then SIGTERM: both now leave a resumable checkpoint.

🤖 Generated with Claude Code