Problem
Sub-agents time out after 120s with uncommitted work sitting in a worktree. The agent's reasoning — why it chose that approach, what it had left to do — is lost. Manual recovery requires reading the partial diff and guessing the intent. This happened to agent_dbb22444 (#3977, 693 lines uncommitted) and agent_683da1af (#3954) today.
Scope
Add checkpoint-and-resume for sub-agents:
- Auto-checkpoint — Every N turns (or on tool calls that mutate state), serialize the agent's reasoning state + file diffs to a checkpoint store
- Continuation prompt — On resume, the agent receives its own checkpoint + a "continue from here" instruction
- Filesystem durability — Uncommitted worktree changes are committed to a checkpoint branch (not the main branch), so no data is lost on timeout
- Resume trigger — Orchestrator detects timeout → spawns a new agent with the checkpoint as input
- Checkpoint cleanup — On successful completion, checkpoint branches are pruned
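A minimal sketch of how the auto-checkpoint rule and the continuation prompt could fit together. Everything here is an assumption for discussion: `Checkpoint`, `CheckpointPolicy`, `continuation_prompt`, the field names, and the prompt wording are hypothetical and do not reflect existing types in `crates/tui`.

```rust
/// Hypothetical checkpoint record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Checkpoint {
    pub agent_id: String,
    pub turn: u32,
    /// Free-form notes: why the agent chose its approach and what is left to do.
    pub reasoning: String,
    /// Checkpoint branch that holds the otherwise uncommitted worktree changes.
    pub branch: String,
}

/// Hypothetical policy: checkpoint every N turns, or after a state-mutating tool call.
pub struct CheckpointPolicy {
    pub every_n_turns: u32,
}

impl CheckpointPolicy {
    /// Returns true when a checkpoint should be written at this turn.
    /// An interval of 0 disables the turn-based trigger.
    pub fn should_checkpoint(&self, turn: u32, mutated_state: bool) -> bool {
        let on_interval =
            self.every_n_turns > 0 && turn > 0 && turn % self.every_n_turns == 0;
        mutated_state || on_interval
    }
}

/// Builds the prompt a resumed agent would receive as its initial context.
pub fn continuation_prompt(cp: &Checkpoint) -> String {
    format!(
        "You are resuming agent {} from a checkpoint taken at turn {}.\n\
         Your partial work is on branch `{}`.\n\
         Your notes from before the timeout:\n{}\n\
         Continue from here.",
        cp.agent_id, cp.turn, cp.branch, cp.reasoning
    )
}
```

Keeping the trigger as a pure function of the turn number and a "mutated state" flag would make the policy easy to unit test without spawning an agent.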
Non-goals
- Not full session replay. This is single-agent resume, not multi-agent replay.
- Not changing the 120s timeout. That's a provider constraint; checkpointing works around it.
Acceptance
- Agent times out with uncommitted changes → checkpoint is automatically saved
- Resumed agent receives the checkpoint as its initial context
- Resumed agent can continue from where the original left off
- Checkpoint branch is cleaned up after successful completion
- Tests cover: timeout checkpoint save, resume with partial work, cleanup on success
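The acceptance criteria can be read as a small lifecycle: timeout saves a checkpoint, resume carries it forward, success prunes it. The sketch below models that as a pure state transition so the three test cases can be checked without real agents. `AgentState`, `Event`, `step`, and the `checkpoint/<agent_id>` branch naming are all hypothetical, and the model is simplified: it only tracks a checkpoint branch created at timeout, not periodic checkpoints taken while running.

```rust
/// Hypothetical lifecycle states for a sub-agent, as seen by the orchestrator.
#[derive(Debug, Clone, PartialEq)]
pub enum AgentState {
    Running,
    TimedOut { checkpoint_branch: String },
    Resumed { checkpoint_branch: String },
    Completed,
}

/// Hypothetical events the orchestrator reacts to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    Timeout,
    ResumeSpawned,
    Success,
}

/// Advances the lifecycle by one event.
/// Returns the next state plus the checkpoint branch to prune, if any.
pub fn step(state: AgentState, event: Event, agent_id: &str) -> (AgentState, Option<String>) {
    match (state, event) {
        // First timeout: uncommitted work is saved to a checkpoint branch.
        (AgentState::Running, Event::Timeout) => (
            AgentState::TimedOut {
                checkpoint_branch: format!("checkpoint/{agent_id}"),
            },
            None,
        ),
        // A new agent is spawned with the checkpoint as input.
        (AgentState::TimedOut { checkpoint_branch }, Event::ResumeSpawned) => {
            (AgentState::Resumed { checkpoint_branch }, None)
        }
        // A resumed agent can time out again; the same branch is reused.
        (AgentState::Resumed { checkpoint_branch }, Event::Timeout) => {
            (AgentState::TimedOut { checkpoint_branch }, None)
        }
        // Success after resume: report the branch so it can be pruned.
        (AgentState::Resumed { checkpoint_branch }, Event::Success) => {
            (AgentState::Completed, Some(checkpoint_branch))
        }
        // Success without any timeout: nothing to prune in this simplified model.
        (AgentState::Running, Event::Success) => (AgentState::Completed, None),
        // Any other combination leaves the state unchanged.
        (state, _) => (state, None),
    }
}
```

Returning the branch to prune as data, rather than deleting it inside `step`, keeps the git side effect in the caller and lets the "cleanup on success" test assert on a plain value.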
Related
- crates/tui/src/tools/subagent/mod.rs
- crates/tui/src/fleet/worker_runtime.rs