
v0.8.68 WhaleFlow: agent checkpoint and resume on timeout/interrupt #4011

Description

@Hmbown

Problem

Sub-agents time out after 120s with uncommitted work sitting in a worktree. The agent's reasoning (why it chose its approach and what it had left to do) is lost, so manual recovery means reading the partial diff and guessing the intent. This happened to agent_dbb22444 (#3977, 693 lines uncommitted) and agent_683da1af (#3954) today.

Scope

Add checkpoint-and-resume for sub-agents:

  1. Auto-checkpoint: Every N turns (or on tool calls that mutate state), serialize the agent's reasoning state and file diffs to a checkpoint store.
  2. Continuation prompt: On resume, the agent receives its own checkpoint plus a "continue from here" instruction.
  3. Filesystem durability: Uncommitted worktree changes are committed to a checkpoint branch (not the main branch), so no data is lost on timeout.
  4. Resume trigger: When the orchestrator detects a timeout, it spawns a new agent with the checkpoint as input.
  5. Checkpoint cleanup: On successful completion, checkpoint branches are pruned.
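
To make the checkpoint, continuation prompt, and resume trigger concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Checkpoint` fields, the JSON-file `CheckpointStore`, the `should_checkpoint` rule, and the `run_with_resume` loop are hypothetical names, not existing WhaleFlow APIs. A timeout is modelled as a `TimeoutError` raised by the agent runner, which may not match how the orchestrator actually detects it.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record; every field name is an assumption."""
    agent_id: str
    turn: int
    reasoning: str                                   # why the agent chose its approach
    remaining_steps: list[str] = field(default_factory=list)
    diff: str = ""                                   # uncommitted worktree changes


class CheckpointStore:
    """Keeps the latest checkpoint per agent as one JSON file."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, agent_id: str) -> Path:
        return self.root / f"{agent_id}.json"

    def save(self, cp: Checkpoint) -> None:
        # Write to a temp file, then rename, so an interrupt mid-write
        # cannot leave a truncated checkpoint behind.
        tmp = self._path(cp.agent_id).with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(cp)), encoding="utf-8")
        tmp.replace(self._path(cp.agent_id))

    def load(self, agent_id: str) -> Checkpoint | None:
        path = self._path(agent_id)
        if not path.exists():
            return None
        return Checkpoint(**json.loads(path.read_text(encoding="utf-8")))

    def delete(self, agent_id: str) -> None:
        self._path(agent_id).unlink(missing_ok=True)


def should_checkpoint(turn: int, every_n: int, mutated_state: bool) -> bool:
    """Checkpoint every N turns, or right after a state-mutating tool call."""
    return mutated_state or (turn > 0 and turn % every_n == 0)


def continuation_prompt(task: str, cp: Checkpoint) -> str:
    """Build the resume prompt: original task, saved state, then the instruction."""
    steps = "\n".join(f"- {s}" for s in cp.remaining_steps) or "- (none recorded)"
    return (
        f"Original task: {task}\n"
        f"Checkpoint from turn {cp.turn}.\n"
        f"Reasoning so far: {cp.reasoning}\n"
        f"Remaining steps:\n{steps}\n"
        f"Uncommitted diff:\n{cp.diff}\n"
        "Continue from here."
    )


def run_with_resume(
    agent_id: str,
    task: str,
    run_agent: Callable[[str], str],
    store: CheckpointStore,
    max_attempts: int = 3,
) -> str:
    """Run an agent; on timeout, respawn it with its latest checkpoint as input."""
    prompt = task
    for _ in range(max_attempts):
        try:
            result = run_agent(prompt)
        except TimeoutError:
            cp = store.load(agent_id)
            # Without a checkpoint there is nothing to resume from, so retry the task.
            prompt = continuation_prompt(task, cp) if cp else task
            continue
        store.delete(agent_id)  # cleanup on successful completion
        return result
    raise TimeoutError(f"{agent_id} did not finish in {max_attempts} attempts")
```

In this sketch the agent loop would call `should_checkpoint` after each turn and `store.save` when it returns true, so the orchestrator always has a recent checkpoint to hand to the replacement agent.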

Non-goals

  • Not full session replay. This is single-agent resume, not multi-agent replay.
  • Not changing the 120s timeout. That's a provider constraint; checkpointing works around it.

Acceptance

  • Agent times out with uncommitted changes → checkpoint is automatically saved
  • Resumed agent receives the checkpoint as its initial context
  • Resumed agent can continue from where the original left off
  • Checkpoint branch is cleaned up after successful completion
  • Tests cover: timeout checkpoint save, resume with partial work, cleanup on success
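
The save and cleanup criteria above could be met with plain git commands. The sketch below is one possible approach, not a settled design: the `checkpoint/<agent_id>` branch naming, the function names, and the injectable `run` parameter are all assumptions. It uses `git checkout -B`, `git add -A`, `git commit --allow-empty`, and `git branch -D`. The runner is injectable so tests can record the commands instead of touching a real repository.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes git arguments and a working directory.
Runner = Callable[[Sequence[str], str], None]


def run_git(args: Sequence[str], cwd: str) -> None:
    """Default runner: invoke git and raise if the command fails."""
    subprocess.run(["git", *args], cwd=cwd, check=True)


def checkpoint_branch(agent_id: str) -> str:
    # Hypothetical naming scheme for checkpoint branches.
    return f"checkpoint/{agent_id}"


def save_worktree(agent_id: str, worktree: str, run: Runner = run_git) -> str:
    """Commit all uncommitted changes in the agent's worktree to a checkpoint branch."""
    branch = checkpoint_branch(agent_id)
    # -B creates the branch at HEAD (or resets it to HEAD) and keeps local changes.
    run(["checkout", "-B", branch], worktree)
    # -A stages modified, deleted, and untracked files.
    run(["add", "-A"], worktree)
    # --allow-empty keeps the commit from failing when nothing changed.
    run(["commit", "--allow-empty", "-m", f"checkpoint: {agent_id}"], worktree)
    return branch


def prune_checkpoint(agent_id: str, repo: str, run: Runner = run_git) -> None:
    """Delete the checkpoint branch after the agent completes successfully."""
    # Fails if the branch is still checked out in some worktree,
    # so that worktree must be removed or switched away first.
    run(["branch", "-D", checkpoint_branch(agent_id)], repo)
```

A test for "timeout checkpoint save" can pass a fake `run` that appends each call to a list and then assert on the recorded commands. A test for "cleanup on success" can do the same for `prune_checkpoint`.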

Related

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
    v0.8.68: Targeting v0.8.68
    whaleflow: WhaleFlow branch/leaf workflow runtime and workflow mode

    Projects

    Status
    Backlog
