Bug Description
ACP sessions (e.g. Codex, Claude) permanently fail to start after a gateway restart. Sessions are marked as dead and enter a retry loop that never succeeds. The gateway's acp startup identity reconcile reports resolved=0 failed=N — all sessions fail, none recover.
Steps to Reproduce
- Have one or more ACP agents configured in ~/.acpx/config.json (e.g. Codex via npx @zed-industries/codex-acp)
- Start the gateway — sessions initialize successfully
- Call openclaw gateway restart
- After restart, observe that ACP sessions are dead and never recover
Expected Behavior
After a gateway restart, ACP sessions should be re-established automatically, or a clear error should propagate to the user.
Actual Behavior
- Log: acp startup identity reconcile: checked=N resolved=0 failed=N (all sessions fail, zero resolved)
- Log: acpx exited with code 3 (exit code 3 = TIMEOUT)
- Sessions remain dead; the system enters an infinite recovery loop
Root Cause Analysis
Three interconnected bugs in error handling:
Bug 1 — createNamedSession silently swallows errors
In cli.js, spawnAndCollect runs sessions new via the acpx CLI. When the CLI exits with code 3 (TIMEOUT, because the queue owner is dead), the error event is yielded. createNamedSession wraps this in try/catch and returns null on any error:
    // cli.js, createNamedSession
    try {
      events = await runControlCommand([...]);
      return { events };
    } catch (err) {
      return null; // ALL errors become null; original error discarded
    }
The caller cannot distinguish "session already exists" from "session creation completely failed."
Bug 2 — ensureSession passes null upward without context
// session module — ensureSession
const result = await createNamedSession({ ... });
if (result == null) return null; // ambiguous: "exists" vs "failed"?
When null arrives at recoverEnsureFailure, it is incorrectly treated as success:
    // index.js, recoverEnsureFailure
    if (result === null) {
      resolved += 1; // incorrectly counts a failed creation as "resolved"
      return;
    }
Bug 3 — recoverEnsureFailure assumes null = session already exists
The null-check logic was likely written expecting null to mean "session already exists, no action needed." But the actual path is "creation failed entirely." This mislabeling prevents proper retry escalation.
Technical Details
- Exit code 3 maps to TIMEOUT in EXIT_CODES (in queue-ipc-EQLpBMKv.js)
- TIMEOUT occurs when connection.newSession() waits for the queue owner, which is dead after restart
- The 60-second timeout (resolveClaudeAcpSessionCreateTimeoutMs()) eventually fires
- createSession catches the timeout and throws ClaudeAcpSessionCreateTimeoutError
- This becomes an error event in spawnAndCollect → caught as null by createNamedSession
- The null bubbles up and is misinterpreted at every layer
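The chain above loses the exit reason at the first catch. As a rough illustration of keeping it, the sketch below turns an exit code into a small structured value that could travel upward instead of null. Only the 3 = TIMEOUT mapping comes from this report; the `describeExit` helper, the `ExitInfo` shape, and the treatment of code 0 as success are assumptions made for illustration, not the contents of EXIT_CODES.

```typescript
// Hypothetical reason labels; only "timeout" (exit code 3) is taken from the report.
type ExitReason = "ok" | "timeout" | "unknown";

interface ExitInfo {
  code: number;
  reason: ExitReason;
  message: string;
}

// Assumption: exit code 0 means success, as is conventional for CLIs.
// Any other code that is not 3 is reported as "unknown" rather than guessed.
function describeExit(code: number): ExitInfo {
  if (code === 0) {
    return { code, reason: "ok", message: "acpx exited cleanly" };
  }
  if (code === 3) {
    return { code, reason: "timeout", message: "acpx exited with code 3 (TIMEOUT)" };
  }
  return { code, reason: "unknown", message: `acpx exited with code ${code}` };
}
```

A caller that receives an `ExitInfo` with reason "timeout" can tell a dead queue owner apart from a session that already exists, which a bare null cannot express.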
Suggested Fix
- createNamedSession: Return a discriminated result type instead of null:
    type CreateResult =
      | { ok: true; events: Event[] }
      | { ok: false; reason: "already_exists" | "spawn_error"; error?: Error };
- ensureSession / recoverEnsureFailure: Distinguish "session already exists" from "creation failed," and retry with exponential backoff on spawn failures.
- Add observability: Log the specific failure reason at the recoverEnsureFailure level so future errors are easier to diagnose.
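A minimal sketch of how the three suggestions could fit together, under stated assumptions. The `run` and `isAlreadyExists` parameters are hypothetical stand-ins for the real control command and for however the CLI signals an existing session (this report does not say how). `recoverWithBackoff`, its options, and the `Event` shape are likewise invented for illustration and are not the actual OpenClaw code.

```typescript
// Stand-in for whatever event type the acpx runtime yields (assumption).
type Event = { type: string };

type CreateResult =
  | { ok: true; events: Event[] }
  | { ok: false; reason: "already_exists" | "spawn_error"; error?: Error };

// Hypothetical wrapper: keeps the original error instead of collapsing it to null.
async function createNamedSession(
  run: () => Promise<Event[]>,
  isAlreadyExists: (err: Error) => boolean,
): Promise<CreateResult> {
  try {
    return { ok: true, events: await run() };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    const reason = isAlreadyExists(error) ? "already_exists" : "spawn_error";
    return { ok: false, reason, error };
  }
}

type Outcome = "resolved" | "failed";

// Retries spawn failures with exponential backoff and logs each failure reason.
// "already_exists" counts as resolved; only real failures are retried.
async function recoverWithBackoff(
  create: () => Promise<CreateResult>,
  opts: { maxAttempts: number; baseDelayMs: number; log?: (msg: string) => void },
): Promise<Outcome> {
  const log = opts.log ?? console.warn;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const result = await create();
    if (result.ok || result.reason === "already_exists") return "resolved";
    log(
      `session create failed (attempt ${attempt}/${opts.maxAttempts}): ` +
        (result.error?.message ?? result.reason),
    );
    if (attempt < opts.maxAttempts) {
      // Delay doubles each attempt: base, 2x base, 4x base, ...
      const delayMs = opts.baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return "failed";
}
```

With this shape, a reconcile pass would increment resolved only for the "resolved" outcome and failed otherwise, and every failure leaves a log line naming its cause. Capping the attempts keeps a permanently dead queue owner from turning into an infinite recovery loop.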
Environment
- macOS (Darwin arm64)
- Node v24.13.0
- OpenClaw gateway: recent stable
- acpx: 0.3.1 (bundled with OpenClaw)
- @zed-industries/codex-acp: ^0.9.5