Summary
When a batch starts and the supervisor activates, pi can crash with an uncaught exception if a background supervisor timer calls pi.sendMessage() after the extension context has been replaced or reloaded.
This kills the entire pi session. The orchestrator engine may continue in a worker thread, but the operator loses the supervisor UI and monitoring.
Environment
- taskplane: 0.30.1 (npm latest)
- pi: @earendil-works/pi-coding-agent@0.77.0
- OS: macOS (darwin 24.6.0)
- Mode: repo mode, supervised autonomy
Observed behavior
- /orch all starts the batch and wave 1 successfully
- Operator sees normal supervisor/orchestrator output, e.g.:
🌊 Wave 1 starting with 8 task(s) across 3 lanes.
🔀 Orchestrator · repo mode · 1...
- Pi exits immediately after with:
pi exiting due to uncaughtException:
Error: This extension ctx is stale after session replacement or reload. Do not use a captured pi or command ctx after ctx.newSession(), ctx.fork(), ctx.switchSession(), or ctx.reload(). For newSession, fork, and switchSession, move post-replacement work into withSession and use the ctx passed to withSession. For reload, do not use the old ctx after await ctx.reload().
at Object.assertActive (file:///usr/local/lib/node_modules/@earendil-works/pi-coding-agent/dist/core/extensions/loader.js:105:19)
at Object.sendMessage (file:///usr/local/lib/node_modules/@earendil-works/pi-coding-agent/dist/core/extensions/loader.js:197:21)
at Timeout.<anonymous> (/Users/<user>/.pi/agent/npm/node_modules/taskplane/extensions/taskplane/supervisor.ts:3736:12)
Root cause (analysis)
In extensions/taskplane/supervisor.ts, activateSupervisor() starts background timers that capture pi: ExtensionAPI in closures:
- startHeartbeat(): 30s interval; line ~3736 calls pi.sendMessage() in the takeover/yield branch
- startEventTailer(): 10s interval; the notify() callback also calls pi.sendMessage()
Pi forbids using a captured extension API after session replacement/reload. When a timer fires with a stale handle, assertActive throws. The exception is not caught, so it becomes a process-fatal uncaughtException.
Additional lifecycle gap: activateSupervisor() assigns state.heartbeatTimer = startHeartbeat(...) without clearing any existing heartbeat/event tailer timers first (cleanup only happens in deactivateSupervisor()). Re-activation or session churn can leave orphaned timers holding stale pi references.
Relevant code on main (still present in 0.30.1):
// activateSupervisor — no timer teardown before starting new ones
state.heartbeatTimer = startHeartbeat(stateRoot, state, pi);
startEventTailer(pi, state.eventTailer, state, ...);
// startHeartbeat — stale pi.sendMessage on takeover detection
if (currentLock && currentLock.sessionId !== sessionId) {
clearInterval(timer);
pi.sendMessage({ customType: "supervisor-yield", ... }, { triggerTurn: false });
deactivateSupervisor(pi, state);
}
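To make the failure mode concrete, here is a minimal sketch that reproduces the mechanism without pi. FakeExtensionAPI, createFakeHandle, and makeHeartbeatTick are hypothetical stand-ins invented for illustration, not pi or taskplane APIs. The only behavior modeled is the one visible in the stack trace: sendMessage throws once the handle has been invalidated.

```typescript
// Hypothetical stand-in for pi's ExtensionAPI; only sendMessage is modeled.
interface FakeExtensionAPI {
  sendMessage(message: { customType: string }): void;
}

// Builds a handle plus an invalidate() switch that mimics session replacement.
function createFakeHandle(): {
  api: FakeExtensionAPI;
  invalidate: () => void;
  sent: string[];
} {
  let active = true;
  const sent: string[] = [];
  const api: FakeExtensionAPI = {
    sendMessage(message) {
      // Plays the role of assertActive in the stack trace: throw once stale.
      if (!active) {
        throw new Error(
          "This extension ctx is stale after session replacement or reload."
        );
      }
      sent.push(message.customType);
    },
  };
  return { api, invalidate: () => { active = false; }, sent };
}

// The tick captures `api` in a closure, the way startHeartbeat() captures pi.
function makeHeartbeatTick(api: FakeExtensionAPI): () => void {
  return () => api.sendMessage({ customType: "supervisor-yield" });
}

const handle = createFakeHandle();
const tick = makeHeartbeatTick(handle.api);

tick();              // handle still active: the message is recorded
handle.invalidate(); // simulates ctx.newSession() / ctx.reload()

let thrownMessage = "";
try {
  tick();            // same closure, now-stale handle
} catch (err) {
  thrownMessage = err instanceof Error ? err.message : String(err);
}
```

Inside a real setInterval callback there is no enclosing try/catch, so the same throw has nowhere to go except the process-level uncaughtException handler.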
Expected behavior
- Supervisor timers should either resolve a fresh extension context (e.g. via withSession) or treat stale ctx as a shutdown signal
- Timer callbacks should never crash pi: wrap pi.sendMessage() in try/catch and silently stop timers on stale ctx
- activateSupervisor() should tear down existing heartbeat/event tailer timers before starting new ones
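One way to read "treat stale ctx as a shutdown signal" is a small wrapper around every send made from a timer. The sketch below is an assumption-heavy illustration: isStaleExtensionCtx and safeSend are hypothetical helpers, and detection is done by matching the error message from the stack trace above, since it is not confirmed whether pi exposes a dedicated error class or code for this case.

```typescript
// Hypothetical detector: matches the message text seen in the stack trace.
// Assumption: pi does not expose a typed error for this; adjust if it does.
function isStaleExtensionCtx(err: unknown): boolean {
  return err instanceof Error && err.message.includes("extension ctx is stale");
}

type SendResult = "sent" | "stale" | "failed";

// Runs `send` and never rethrows, so a timer callback cannot crash the process.
// On a stale handle it calls `onStale`, where the caller stops its own timers.
function safeSend(send: () => void, onStale: () => void): SendResult {
  try {
    send();
    return "sent";
  } catch (err) {
    if (isStaleExtensionCtx(err)) {
      onStale(); // e.g. clearInterval(timer) plus local cleanup
      return "stale";
    }
    // Unrelated failure: report it, but keep it out of uncaughtException.
    console.error("supervisor send failed:", err);
    return "failed";
  }
}

// Demo with three kinds of send: healthy, stale handle, unrelated error.
let stopped = 0;
const okResult = safeSend(() => {}, () => { stopped++; });
const staleResult = safeSend(
  () => {
    throw new Error(
      "This extension ctx is stale after session replacement or reload."
    );
  },
  () => { stopped++; }
);
const failedResult = safeSend(
  () => { throw new Error("disk full"); },
  () => { stopped++; }
);
```

Returning a result instead of throwing lets the heartbeat and the event tailer share one guard while each decides how to wind itself down.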
Suggested fix
- At the top of activateSupervisor() (before starting timers):
stopEventTailer(state.eventTailer);
if (state.heartbeatTimer) {
clearInterval(state.heartbeatTimer);
state.heartbeatTimer = null;
}
- In startHeartbeat() and the event tailer notify():
try {
pi.sendMessage(...);
} catch (err) {
if (isStaleExtensionCtx(err)) {
clearInterval(timer);
// deactivate without rethrowing — do not crash pi
return;
}
}
- Consider deferring timer start until after the activation triggerTurn completes, or use pi's recommended withSession pattern for any post-session-replacement work.
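The teardown-before-start item above can be sketched as an idempotent activation. This is a reduced illustration, not the taskplane implementation: Scheduler, TimerState, stopTimers, and activate are hypothetical names, and the scheduler seam exists only so the behavior can be checked without waiting on real intervals.

```typescript
// Hypothetical scheduler seam so the sketch can run without real timers.
interface Scheduler<T> {
  start(fn: () => void, ms: number): T;
  stop(id: T): void;
}

// Hypothetical, reduced state: the real supervisor state carries more fields.
interface TimerState<T> {
  heartbeatTimer: T | null;
  tailerTimer: T | null;
}

// Safe to call repeatedly: stops whatever is running and nulls the slots.
function stopTimers<T>(state: TimerState<T>, scheduler: Scheduler<T>): void {
  if (state.heartbeatTimer !== null) {
    scheduler.stop(state.heartbeatTimer);
    state.heartbeatTimer = null;
  }
  if (state.tailerTimer !== null) {
    scheduler.stop(state.tailerTimer);
    state.tailerTimer = null;
  }
}

// Always tears down first, so re-activation cannot orphan an older timer.
function activate<T>(
  state: TimerState<T>,
  scheduler: Scheduler<T>,
  onHeartbeat: () => void,
  onTail: () => void
): void {
  stopTimers(state, scheduler);
  state.heartbeatTimer = scheduler.start(onHeartbeat, 30_000);
  state.tailerTimer = scheduler.start(onTail, 10_000);
}

// Counting fake that records which timer ids are still live.
const live = new Set<number>();
let nextId = 1;
const fakeScheduler: Scheduler<number> = {
  start: () => {
    const id = nextId++;
    live.add(id);
    return id;
  },
  stop: (id) => {
    live.delete(id);
  },
};

const timers: TimerState<number> = { heartbeatTimer: null, tailerTimer: null };
activate(timers, fakeScheduler, () => {}, () => {});
activate(timers, fakeScheduler, () => {}, () => {}); // re-activation
const liveAfterReactivation = live.size;
```

With real timers the scheduler would simply forward to setInterval and clearInterval. After two activations only the second pair of timers is live, which is the property the current activateSupervisor() lacks.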
Impact
- Severity: high — process crash during normal batch startup
- Workaround: restart pi and /orch-resume; batch state may survive but supervisor monitoring is unreliable until fixed
- Not a task/worker failure — crash is in supervisor infrastructure, not lane worker code
Repro notes
Observed during batch startup with 3 lanes / 8 tasks in wave 1. Exact session-replacement trigger not confirmed, but crash site matches pi's stale ctx guard on a captured pi in a supervisor timer callback.