ACP session spawn fails silently after gateway restart — error swallowing leads to dead session loop #182

@anchor-jevons

Bug Description

ACP sessions (e.g. Codex, Claude) permanently fail to start after a gateway restart. Sessions are marked as dead and enter a retry loop that never succeeds. The gateway's acp startup identity reconcile reports resolved=0 failed=N — all sessions fail, none recover.

Steps to Reproduce

  1. Have one or more ACP agents configured in ~/.acpx/config.json (e.g. Codex via npx @zed-industries/codex-acp)
  2. Start the gateway — sessions initialize successfully
  3. Call openclaw gateway restart
  4. After restart, observe that ACP sessions are dead and never recover

Expected Behavior

After a gateway restart, ACP sessions should be re-established automatically, or a clear error should propagate to the user.

Actual Behavior

  • Log: acp startup identity reconcile: checked=N resolved=0 failed=N — all sessions fail, zero resolved
  • Log: acpx exited with code 3 — exit code 3 = TIMEOUT
  • Sessions remain dead; the system enters an infinite recovery loop

Root Cause Analysis

Three interconnected bugs in error handling:

Bug 1 — createNamedSession silently swallows errors

In cli.js, spawnAndCollect runs sessions new via the acpx CLI. When the CLI exits with code 3 (TIMEOUT, because the queue owner is dead), spawnAndCollect yields an error event. createNamedSession wraps the call in try/catch and returns null on any error, discarding the original error:

// cli.js — createNamedSession
try {
    events = await runControlCommand([...]);
    return { events };
} catch (err) {
    return null; // ALL errors become null; original error discarded
}

The caller cannot distinguish "session already exists" from "session creation completely failed."
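The ambiguity can be reproduced in isolation. The sketch below is only an illustration: fakeRunControlCommand, FailureMode, and the idea that "already exists" surfaces as a thrown error are assumptions made for the example, not names or behavior taken from the acpx source.

```typescript
// Hypothetical failure modes; whether "already exists" surfaces as a thrown
// error in the real code is an assumption made for this illustration.
type FailureMode = "already_exists" | "timeout";

// Stand-in for runControlCommand: always fails, in one of two ways.
async function fakeRunControlCommand(mode: FailureMode): Promise<string[]> {
  if (mode === "already_exists") {
    throw new Error("session already exists");
  }
  throw new Error("acpx exited with code 3");
}

// Same try/catch shape as the excerpt above: every error becomes null.
async function createNamedSessionSwallowing(
  mode: FailureMode,
): Promise<{ events: string[] } | null> {
  try {
    const events = await fakeRunControlCommand(mode);
    return { events };
  } catch {
    return null;
  }
}

// Resolves to true when both failure modes are indistinguishable to the caller.
async function failuresLookIdentical(): Promise<boolean> {
  const benign = await createNamedSessionSwallowing("already_exists");
  const fatal = await createNamedSessionSwallowing("timeout");
  return benign === null && fatal === null;
}
```

Both calls return null, so nothing above this layer can tell a benign condition from a spawn failure.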

Bug 2 — ensureSession passes null upward without context

// session module — ensureSession
const result = await createNamedSession({ ... });
if (result == null) return null; // ambiguous: "exists" vs "failed"?

When null arrives at recoverEnsureFailure, it is incorrectly treated as success:

// index.js — recoverEnsureFailure
if (result === null) {
    resolved += 1; // incorrectly counts a failed creation as "resolved"
    return;
}

Bug 3 — recoverEnsureFailure assumes null = session already exists

The null-check logic was likely written expecting null to mean "session already exists, no action needed." But the actual path is "creation failed entirely." This mislabeling prevents proper retry escalation.

Technical Details

  • Exit code 3 maps to TIMEOUT in EXIT_CODES (in queue-ipc-EQLpBMKv.js)
  • TIMEOUT occurs when connection.newSession() waits for the queue owner, which is dead after restart
  • The 60-second timeout (resolveClaudeAcpSessionCreateTimeoutMs()) eventually fires
  • createSession catches the timeout and throws ClaudeAcpSessionCreateTimeoutError
  • This becomes an error event in spawnAndCollect → caught as null by createNamedSession
  • The null bubbles up and is misinterpreted at every layer
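The chain above can be modelled with a small timeout wrapper. This is a sketch under assumptions: SessionCreateTimeoutError and withTimeout are hypothetical stand-ins, not the real ClaudeAcpSessionCreateTimeoutError or createSession, and a promise that never settles plays the role of the dead queue owner.

```typescript
// Hypothetical stand-in for the timeout error named in the list above.
class SessionCreateTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`session creation timed out after ${timeoutMs} ms`);
    this.name = "SessionCreateTimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

// Rejects with SessionCreateTimeoutError if `work` has not settled
// within `timeoutMs`; otherwise forwards the result or error of `work`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new SessionCreateTimeoutError(timeoutMs)),
      timeoutMs,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Models waiting on a dead queue owner: the promise never settles.
function waitForDeadQueueOwner(): Promise<never> {
  return new Promise<never>(() => {});
}

// Preserves the failure detail instead of collapsing it to null.
async function describeCreateFailure(timeoutMs: number): Promise<string> {
  try {
    await withTimeout(waitForDeadQueueOwner(), timeoutMs);
    return "created";
  } catch (err) {
    return err instanceof Error ? `${err.name}: ${err.message}` : "unknown failure";
  }
}
```

With a catch block like the one in describeCreateFailure, the timeout stays visible to callers as a named error rather than disappearing into null.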

Suggested Fix

  1. createNamedSession: Return a discriminated result type instead of null:

    type CreateResult =
      | { ok: true; events: Event[] }
      | { ok: false; reason: "already_exists" | "spawn_error"; error?: Error };

  2. ensureSession / recoverEnsureFailure: Distinguish "session already exists" from "creation failed," and retry with exponential backoff on spawn failures.

  3. Add observability: Log the specific failure reason at the recoverEnsureFailure level so future errors are easier to diagnose.
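A rough sketch of how the three suggestions could fit together follows. Everything except the shape of CreateResult is an assumption for illustration: SessionEvent, backoffDelayMs, recoverWithBackoff, the default delays, and the attempt count are hypothetical and not part of the current OpenClaw or acpx code.

```typescript
// Hypothetical event shape; the real Event type is not shown in this issue.
type SessionEvent = { type: string };

type CreateResult =
  | { ok: true; events: SessionEvent[] }
  | { ok: false; reason: "already_exists" | "spawn_error"; error?: Error };

type RecoveryOutcome = "resolved" | "failed";

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retries `create` on spawn errors, logging the reason for each failure.
// `sleep` and `log` are injectable so the loop can be tested without waiting.
async function recoverWithBackoff(
  create: () => Promise<CreateResult>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  log: (message: string) => void = console.warn,
): Promise<RecoveryOutcome> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await create();
    // Success and "already exists" both mean a usable session is present.
    if (result.ok || result.reason === "already_exists") {
      return "resolved";
    }
    const detail = result.error?.message ?? result.reason;
    log(`session create failed (attempt ${attempt + 1}/${maxAttempts}): ${detail}`);
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts - 1) {
      await sleep(backoffDelayMs(attempt));
    }
  }
  return "failed";
}
```

Only an explicit "already_exists" counts as resolved here; a spawn error is logged with its reason and retried, and the session is reported as failed once the attempts run out.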

Environment

  • macOS (Darwin arm64)
  • Node v24.13.0
  • OpenClaw gateway: recent stable
  • acpx: 0.3.1 (bundled with OpenClaw)
  • @zed-industries/codex-acp: ^0.9.5
