Supervisor heartbeat timer crashes pi with stale extension ctx (uncaughtException in startHeartbeat sendMessage) #597

@beettlle

Summary

When a batch starts and the supervisor activates, pi can crash with an uncaught exception if a background supervisor timer calls pi.sendMessage() after the extension context has been replaced or reloaded.

This kills the entire pi session. The orchestrator engine may continue in a worker thread, but the operator loses the supervisor UI and monitoring.

Environment

  • taskplane: 0.30.1 (npm latest)
  • pi: @earendil-works/pi-coding-agent@0.77.0
  • OS: macOS (darwin 24.6.0)
  • Mode: repo mode, supervised autonomy

Observed behavior

  1. /orch all starts batch and wave 1 successfully
  2. Operator sees normal supervisor/orchestrator output, e.g.:
    • 🌊 Wave 1 starting with 8 task(s) across 3 lanes.
    • 🔀 Orchestrator · repo mode · 1...
  3. Pi exits immediately after with:
pi exiting due to uncaughtException:
Error: This extension ctx is stale after session replacement or reload. Do not use a captured pi or command ctx after ctx.newSession(), ctx.fork(), ctx.switchSession(), or ctx.reload(). For newSession, fork, and switchSession, move post-replacement work into withSession and use the ctx passed to withSession. For reload, do not use the old ctx after await ctx.reload().
    at Object.assertActive (file:///usr/local/lib/node_modules/@earendil-works/pi-coding-agent/dist/core/extensions/loader.js:105:19)
    at Object.sendMessage (file:///usr/local/lib/node_modules/@earendil-works/pi-coding-agent/dist/core/extensions/loader.js:197:21)
    at Timeout.<anonymous> (/Users/<user>/.pi/agent/npm/node_modules/taskplane/extensions/taskplane/supervisor.ts:3736:12)

Root cause (analysis)

In extensions/taskplane/supervisor.ts, activateSupervisor() starts background timers that capture pi: ExtensionAPI in closures:

  • startHeartbeat() — 30s interval, line ~3736 calls pi.sendMessage() in the takeover/yield branch
  • startEventTailer() — 10s interval, notify() callback also calls pi.sendMessage()

Pi forbids using a captured extension API after session replacement/reload. When a timer fires with a stale handle, assertActive throws inside the timer callback. Nothing on that call stack catches the exception, so it becomes a process-fatal uncaughtException.

Additional lifecycle gap: activateSupervisor() assigns state.heartbeatTimer = startHeartbeat(...) without clearing any existing heartbeat/event tailer timers first (cleanup only happens in deactivateSupervisor()). Re-activation or session churn can leave orphaned timers holding stale pi references.
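The orphaned-timer gap can be shown in isolation. This is a minimal sketch, not taskplane code: Scheduler, createFakeScheduler, TimerState, activateLeaky and activateClean are hypothetical names, and the fake scheduler stands in for setInterval/clearInterval so live timers can be counted without waiting.

```typescript
// Hypothetical scheduler abstraction so live timers can be inspected directly.
interface Scheduler {
  set(callback: () => void): number;
  clear(id: number): void;
  liveIds(): number[];
}

function createFakeScheduler(): Scheduler {
  const live = new Map<number, () => void>();
  let nextId = 1;
  return {
    set(callback) {
      const id = nextId++;
      live.set(id, callback);
      return id;
    },
    clear(id) {
      live.delete(id);
    },
    liveIds() {
      return Array.from(live.keys());
    },
  };
}

interface TimerState {
  heartbeatTimer: number | null;
}

// Mirrors the pattern described above: the new timer overwrites the stored
// handle, so the previous timer keeps running with nothing left to clear it.
function activateLeaky(state: TimerState, scheduler: Scheduler): void {
  state.heartbeatTimer = scheduler.set(() => {});
}

// Variant that tears down any existing timer before starting a new one.
function activateClean(state: TimerState, scheduler: Scheduler): void {
  if (state.heartbeatTimer !== null) {
    scheduler.clear(state.heartbeatTimer);
  }
  state.heartbeatTimer = scheduler.set(() => {});
}
```

Calling activateLeaky twice on the same state leaves two live timers while the state only tracks the second. Calling activateClean twice leaves one.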

Relevant code on main (still present in 0.30.1):

// activateSupervisor — no timer teardown before starting new ones
state.heartbeatTimer = startHeartbeat(stateRoot, state, pi);
startEventTailer(pi, state.eventTailer, state, ...);

// startHeartbeat — stale pi.sendMessage on takeover detection
if (currentLock && currentLock.sessionId !== sessionId) {
  clearInterval(timer);
  pi.sendMessage({ customType: "supervisor-yield", ... }, { triggerTurn: false });
  deactivateSupervisor(pi, state);
}
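The failure mode can be reproduced without pi. The sketch below assumes only what the stack trace shows, namely that sendMessage throws once the handle is stale. FakeExtensionApi, createFakePi and makeHeartbeatTick are hypothetical stand-ins, not pi or taskplane APIs.

```typescript
// Stand-in for pi's extension handle; the real ExtensionAPI is not reproduced here.
interface FakeExtensionApi {
  sendMessage(message: { customType: string }): void;
}

// Hypothetical factory: returns a handle plus a function that marks it stale,
// mimicking what session replacement or reload does to a captured handle.
function createFakePi(): { pi: FakeExtensionApi; invalidate: () => void; sent: string[] } {
  let active = true;
  const sent: string[] = [];
  const pi: FakeExtensionApi = {
    sendMessage(message) {
      if (!active) {
        throw new Error("This extension ctx is stale after session replacement or reload.");
      }
      sent.push(message.customType);
    },
  };
  return { pi, invalidate: () => { active = false; }, sent };
}

// Body of a heartbeat tick as the issue describes it: the closure captures
// `pi` once and calls it with no guard.
function makeHeartbeatTick(pi: FakeExtensionApi): () => void {
  return () => pi.sendMessage({ customType: "supervisor-yield" });
}

const fake = createFakePi();
const tick = makeHeartbeatTick(fake.pi);

tick();            // handle still active: the message is delivered
fake.invalidate(); // simulate session replacement or reload

let thrown: unknown = null;
try {
  tick();          // same closure, now holding a stale handle
} catch (err) {
  thrown = err;    // under setInterval there is no such catch, hence the crash
}
```

The try/catch around the second tick exists only so the sketch can observe the error. In the reported crash the equivalent call runs directly from the interval callback.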

Expected behavior

  • Supervisor timers should either resolve a fresh extension context (e.g. via withSession) or treat stale ctx as a shutdown signal
  • Timer callbacks should never crash pi — wrap pi.sendMessage() in try/catch and silently stop timers on stale ctx
  • activateSupervisor() should tear down existing heartbeat/event tailer timers before starting new ones

Suggested fix

  1. At top of activateSupervisor() (before starting timers):

    stopEventTailer(state.eventTailer);
    if (state.heartbeatTimer) {
      clearInterval(state.heartbeatTimer);
      state.heartbeatTimer = null;
    }

  2. In startHeartbeat() and event tailer notify():

    try {
      pi.sendMessage(...);
    } catch (err) {
      if (isStaleExtensionCtx(err)) {
        clearInterval(timer);
        // deactivate without rethrowing — do not crash pi
        return;
      }
    }

  3. Consider deferring timer start until after the activation triggerTurn completes, or using pi's recommended withSession pattern for any post-session-replacement work.
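Steps 1 and 2 could be combined along these lines. This is a sketch under assumptions, not a patch: MessageSender and SupervisorTimers are simplified stand-ins for pi's ExtensionAPI and the supervisor state, safeSend and stopHeartbeat are hypothetical helpers, and isStaleExtensionCtx matches on the error message text from the trace above because this issue does not establish whether pi exports a dedicated error class.

```typescript
// Minimal structural type for the one method used here; not pi's real ExtensionAPI.
interface MessageSender {
  sendMessage(message: unknown, options?: { triggerTurn: boolean }): void;
}

type IntervalHandle = ReturnType<typeof setInterval>;

// Simplified stand-in for the supervisor state; only the heartbeat handle is modeled.
interface SupervisorTimers {
  heartbeatTimer: IntervalHandle | null;
}

// Assumption: the stale-ctx guard is recognized by its message prefix.
function isStaleExtensionCtx(err: unknown): boolean {
  return err instanceof Error && err.message.startsWith("This extension ctx is stale");
}

// Clears the heartbeat interval if one is running. Safe to call repeatedly,
// so it can run at the top of activation as well as on shutdown.
function stopHeartbeat(state: SupervisorTimers): void {
  if (state.heartbeatTimer !== null) {
    clearInterval(state.heartbeatTimer);
    state.heartbeatTimer = null;
  }
}

// Sends a message from a timer callback without letting any error escape.
// Returns false when the handle is stale (the heartbeat is stopped first),
// telling the caller to bail out. Other errors are logged, not rethrown,
// because a throw from a timer callback surfaces as an uncaughtException.
function safeSend(pi: MessageSender, state: SupervisorTimers, message: unknown): boolean {
  try {
    pi.sendMessage(message, { triggerTurn: false });
    return true;
  } catch (err) {
    if (isStaleExtensionCtx(err)) {
      stopHeartbeat(state);
      return false;
    }
    console.error("supervisor sendMessage failed:", err);
    return true;
  }
}
```

Matching on message text is brittle if pi rewords the error. A typed error or an exported predicate from pi, if one exists, would be the better check.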

Impact

  • Severity: high — process crash during normal batch startup
  • Workaround: restart pi and /orch-resume; batch state may survive but supervisor monitoring is unreliable until fixed
  • Not a task/worker failure — crash is in supervisor infrastructure, not lane worker code

Repro notes

Observed during batch startup with 3 lanes / 8 tasks in wave 1. Exact session-replacement trigger not confirmed, but crash site matches pi's stale ctx guard on a captured pi in a supervisor timer callback.
