feat(runtime): AcpxRuntime supervisor registry with respawn + auto-resume (#273) #465

Merged: windoliver merged 22 commits into main from feat/273-acpx-supervisor on May 30, 2026

@windoliver
Owner

Closes #273.

Builds AcpxSupervisor — a grove-owned registry of acpx runtime handles with death-detection, respawn, best-effort auto-resume, and SessionLost surfacing on AgentTask, adopted behind a flag. Also delivers the #210 runtime-adapter conformance matrix.

Phases (each TDD'd, two-stage reviewed)

  • P1 — Death-detection seam. AgentDisconnectedError + optional onDisconnect on AgentRuntime; AcpRuntime detects an unexpected child exit (new onExit on LaunchResult), marks the session crashed, and rejects the in-flight send() with stopReason: "error". Intentional close() never fires it.
  • P2 — Registry core + adapter matrix (#210, "fix: restore green runtime adapter tests for acpx/tmux session identity and discovery"). AcpxSupervisor exposes ensure/get/stop/list (idempotent + single-flight) behind an AgentRuntime façade. New shared runRuntimeAdapterMatrix(label, factory) runs Mock + Acp + Supervisor through one conformance suite. (Caught + fixed a real MockRuntime.close() contract bug along the way.)
  • P3 — Respawn + auto-resume + seq. On disconnect: running → resuming → running, with a fresh session within the shared runtime (a new wireSessionId, never the dead one; see #319, "bug(acp): parser rejects all acpx session/update frames as _sessionMismatch"), exponential backoff, and maxRespawns → dead. The monotonic per-slot seq survives the respawn boundary (no reset).
  • P4 — SessionLost on AgentTask. New Resuming/SessionLost condition types; a thin wiring layer (src/server/acpx-supervisor-wiring.ts) translates respawn events into task conditions — a transient blip stays Running, a permanent death goes Failed.
  • P5 — Adoption + lease release + E2E. selectRuntime wraps in the supervisor behind GROVE_SUPERVISOR=1; the server activates the respawn→task wiring. Claims carry context.agentTaskId (stamped via MCP from GROVE_AGENT_TASK_ID) so onDead releases exactly that task's leases instead of stranding them.
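
The P3 transitions can be sketched as a small pure state machine. This is only an illustration under stated assumptions: `Slot`, `RespawnPolicy`, `backoffDelayMs`, `onDisconnect`, and `onRespawned` are hypothetical names, not grove's actual API, and whether the real supervisor counts respawns exactly this way is not confirmed by this PR text.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not grove's real types.
type SlotState = "running" | "resuming" | "dead";

interface Slot {
  state: SlotState;
  wireSessionId: string; // id of the live session; replaced on every respawn
  respawns: number; // respawns completed so far
}

interface RespawnPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxRespawns: number;
}

// Exponential backoff: base * 2^attempt, capped at maxDelayMs.
function backoffDelayMs(attempt: number, policy: RespawnPolicy): number {
  return Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** attempt);
}

// On disconnect, either schedule a respawn (resuming) or give up (dead).
function onDisconnect(
  slot: Slot,
  policy: RespawnPolicy,
): { slot: Slot; delayMs: number | null } {
  if (slot.respawns >= policy.maxRespawns) {
    return { slot: { ...slot, state: "dead" }, delayMs: null };
  }
  return {
    slot: { ...slot, state: "resuming" },
    delayMs: backoffDelayMs(slot.respawns, policy),
  };
}

// A successful respawn must come back with a fresh wireSessionId.
function onRespawned(slot: Slot, freshWireSessionId: string): Slot {
  if (freshWireSessionId === slot.wireSessionId) {
    throw new Error("respawn must not reuse the dead wireSessionId");
  }
  return {
    state: "running",
    wireSessionId: freshWireSessionId,
    respawns: slot.respawns + 1,
  };
}
```

Keeping the transitions pure makes the `maxRespawns → dead` boundary easy to unit-test without timers; the caller would own the actual `setTimeout` for `delayMs`.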

Key design decisions (the issue is ahead of current code in several ways)

  • Class is AcpRuntime, not AcpxRuntime; the registry holds it.
  • One shared AcpRuntime for all slots, not one per slot. grove's AcpRuntime already spawns one adapter subprocess per spawn(), so process-per-slot isolation (the issue's intent) is preserved while a shared client gives a single monotonic id counter and one event sink to demux. Recorded in the design doc.
  • Respawn-as-new, not session/load (upstream-unsupported; consistent with grove-direct-acp). SessionLost is always surfaced; seq/acpxRecordId are forward-compatible if session/load ever lands.
  • seq lives at the eventSink, not publishTurnToNexus (which has no production caller today).
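
To illustrate why stamping seq at the event sink keeps it monotonic across a respawn, here is a minimal sketch. The types `RuntimeEvent`, `SequencedEvent`, and the helper `createSequencingSink` are assumptions made for this example; the PR only states that seq is per-slot, monotonic, and assigned at the eventSink.

```typescript
// Hypothetical sketch: event shapes and helper name are illustrative only.
interface RuntimeEvent {
  slotId: string; // stable across respawns
  wireSessionId: string; // changes on every respawn
  payload: unknown;
}

interface SequencedEvent extends RuntimeEvent {
  seq: number;
}

type EventSink = (event: SequencedEvent) => void;

// Wraps a downstream sink and stamps a per-slot counter on each event.
// The counter is keyed by slotId, not wireSessionId, so a respawn
// (new wireSessionId, same slot) continues the sequence instead of resetting it.
function createSequencingSink(downstream: EventSink): (event: RuntimeEvent) => void {
  const counters = new Map<string, number>();
  return (event) => {
    const seq = (counters.get(event.slotId) ?? 0) + 1;
    counters.set(event.slotId, seq);
    downstream({ ...event, seq });
  };
}
```

With one shared runtime feeding one sink, demultiplexing by `slotId` is enough; no per-session counter has to be carried over when the session is replaced.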

Design + plan: docs/superpowers/specs/2026-05-29-acpx-supervisor-design.md, docs/superpowers/plans/2026-05-29-acpx-supervisor.md.

Verification

  • Full suite: 8426 pass / 0 fail / 7 skip. bun test still exits non-zero because of a pre-existing coverage threshold on use-text-input.ts (also present on main), which is unrelated to this branch.
  • bun run build is green (after bun install for the ask-user SDK); typecheck shows only the pre-existing packages/ask-user @anthropic-ai/sdk errors.

Deferred / called out honestly

  • The real-process kill-PID E2E is authored but NOT-YET-RUN (tests/e2e/acpx-supervisor-respawn-tmux.ts + runbook). It needs a live grove+Nexus stack; two spots are marked TODO(verify-on-stack) (acpx child-PID discovery, AgentTask PUT schema/readiness). Do not treat the respawn path as E2E-validated until it runs green on a stack.
  • seq is not wire-observable today — it lives on the in-process eventSink (→ AcpSessionStore, TUI-local), so the E2E asserts AgentTask phase + SessionLost condition + sessionId-change instead. Seq continuity is covered by the unit test acpx-supervisor.respawn.test.ts. Exposing seq over HTTP/SSE would be a separate follow-up.

🤖 Generated with Claude Code

windoliver added 22 commits May 29, 2026 13:26
… contract (#273)

Delete the session map entry on close() instead of setting status="stopped",
so listSessions() no longer returns closed sessions — matching the real
AcpRuntime behavior and the AgentRuntime interface doc ("List all active
sessions").  Updated the one collateral test in agent-runtime.test.ts that
was asserting the old non-contract behavior; matrix now 12/0.
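
A minimal sketch of the contract this commit restores, assuming a simplified synchronous mock (`MockRuntimeSketch` and `SessionInfo` are hypothetical stand-ins; the real runtime interface is likely asynchronous and richer):

```typescript
// Hypothetical, simplified stand-in for a mock runtime.
interface SessionInfo {
  id: string;
  status: "running";
}

class MockRuntimeSketch {
  private readonly sessions = new Map<string, SessionInfo>();
  private nextId = 1;

  spawn(): SessionInfo {
    const session: SessionInfo = { id: `mock-${this.nextId++}`, status: "running" };
    this.sessions.set(session.id, session);
    return session;
  }

  // Delete the entry rather than flagging it "stopped", so a closed
  // session can never leak into listSessions().
  close(id: string): void {
    this.sessions.delete(id);
  }

  // "List all active sessions": only sessions that have not been closed.
  listSessions(): SessionInfo[] {
    return Array.from(this.sessions.values());
  }
}
```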

Add `agentTaskId?: string` to McpDeps and thread GROVE_AGENT_TASK_ID from the
stdio MCP server env into every grove_claim call so the supervisor can release
the agent's leases when its AgentTask permanently dies.
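
The threading described here might look roughly like the following. Only `agentTaskId?: string` on `McpDeps`, the `GROVE_AGENT_TASK_ID` variable, and `context.agentTaskId` come from this PR; `ClaimRequest`, `depsFromEnv`, and `stampClaim` are hypothetical names, and the environment is passed in as a plain record to keep the sketch self-contained.

```typescript
// Hypothetical sketch: only agentTaskId, GROVE_AGENT_TASK_ID and
// context.agentTaskId are taken from the PR; the rest is illustrative.
interface McpDeps {
  agentTaskId?: string;
}

interface ClaimRequest {
  targetRef: string;
  context?: Record<string, unknown>;
}

// Read the task id once at server start-up from the stdio server's env.
function depsFromEnv(env: Record<string, string | undefined>): McpDeps {
  const id = env["GROVE_AGENT_TASK_ID"];
  return id ? { agentTaskId: id } : {};
}

// Stamp every claim with the task id, preserving any existing context keys.
function stampClaim(request: ClaimRequest, deps: McpDeps): ClaimRequest {
  if (deps.agentTaskId === undefined) return request;
  return {
    ...request,
    context: { ...request.context, agentTaskId: deps.agentTaskId },
  };
}
```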

Replace the onDead TODO no-op in createTaskControllerWiring with a real
implementation: when a slot permanently dies, list all active claims and
release those whose context.agentTaskId matches the dead slotId.
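
As a rough sketch of that filter-and-release step, assuming a synchronous in-memory claim store (`Claim`, `ClaimStore`, and `releaseLeasesForDeadSlot` are hypothetical names; the real store is presumably asynchronous):

```typescript
// Hypothetical sketch: store interface and function name are illustrative only.
interface Claim {
  claimId: string;
  context?: { agentTaskId?: string };
}

interface ClaimStore {
  listActiveClaims(): Claim[];
  release(claimId: string): void;
}

// Release exactly the claims stamped with the dead slot's task id.
// Claims with no agentTaskId, or a different one, are left alone.
function releaseLeasesForDeadSlot(deadSlotId: string, store: ClaimStore): string[] {
  const released: string[] = [];
  for (const claim of store.listActiveClaims()) {
    if (claim.context?.agentTaskId === deadSlotId) {
      store.release(claim.claimId);
      released.push(claim.claimId);
    }
  }
  return released;
}
```

Matching on an explicit id rather than on agent identity is what keeps a dead task from stranding its leases without touching claims held by other tasks.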
…ors blocking pre-push typecheck

- AgentDisconnectedError: split readonly parameter property into separate field decl + constructor assignment (erasableSyntaxOnly compat)
- task-controller-wiring.test.ts: add missing putAgentTaskSpec/listAgentTasks to fakeTaskStore; fix onRespawnCalls type to AcpxRespawnEvent (not Parameters<> nesting); import AcpxRespawnEvent; fix biome import order + empty block warnings
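
For context on the first fix: TypeScript's `erasableSyntaxOnly` option rejects constructor parameter properties because they emit runtime code. A minimal sketch of the split form follows; the `sessionId` field and the default message are assumptions for illustration, since the PR does not show the real class body.

```typescript
// Rejected under erasableSyntaxOnly (parameter property):
//   constructor(readonly sessionId: string) { super(...); }

// Accepted: explicit field declaration plus constructor assignment.
// The field name and message are hypothetical.
class AgentDisconnectedErrorSketch extends Error {
  readonly sessionId: string;

  constructor(sessionId: string, message?: string) {
    super(message ?? `agent session ${sessionId} disconnected unexpectedly`);
    this.name = "AgentDisconnectedError";
    this.sessionId = sessionId;
  }
}
```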

windoliver merged commit 18d07a5 into main on May 30, 2026
1 check passed