fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184
fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184jonny981 wants to merge 1 commit intoopenai:mainfrom
Conversation
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. Three independent root causes combine to produce the wedge: 1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is ever delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit` but the turn-completion promise is unrelated to those, so it sits forever. 2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no timeout fallback. When `captureTurn` hangs, the entire companion process hangs, and the per-job state file is never transitioned to a terminal status. 3. `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker. 4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do not probe whether tracked PIDs are still alive. Once a job is wedged, every status query reports it as running. Fixes: - captureTurn now races `state.completion` against `client.exitPromise`. If the client closes before a terminal event arrives, the turn is rejected with the recorded exit error and progress emits a `failed` notification. - runTrackedJob races the runner against a hard timeout (default 30 minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the `timeoutMs` option). On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck. - The broker now subscribes to its appClient's `exitPromise`. When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly. - A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)` to probe pid liveness without affecting the process. - A new `markDeadPidJobFailed` and `reconcileIfDead` helper in lib/job-control.mjs reconcile any active job whose tracked PID is gone. Reconciliation now runs on every status read path, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately. Closes openai#183 (runTrackedJob/captureTurn hang in finalizing). Refs openai#176, openai#164 (dead PID detection on status reads). Tests: - tests/process.test.mjs: 3 new cases for `isProcessAlive`. - tests/dead-pid-reconcile.test.mjs: 6 new cases covering `markDeadPidJobFailed` guards (active state, PID match) and the reconcile-on-status behaviour for both single-job and full snapshots. - tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard timeout transitions a hung job to `failed` and that successful runners still complete normally. All 11 new tests pass. The 5 failing tests under the existing suite (`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and `resolveStateDir`) are pre-existing on `main` and unrelated to this patch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d13e46207
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const rawJobs = sortJobsNewestFirst(filterJobsForCurrentSession(listJobs(workspaceRoot), options)); | ||
| // Reconcile any active jobs whose tracked PID is dead before partitioning, | ||
| // so a single status read surfaces stuck workers immediately. | ||
| const jobs = rawJobs.map((job) => reconcileIfDead(workspaceRoot, job)); |
There was a problem hiding this comment.
Re-sort jobs after dead-PID reconciliation
buildStatusSnapshot sorts once (rawJobs) and then mutates job states via reconcileIfDead, but it never re-sorts before computing latestFinished/recent. If a stale running job near the end of the list is reconciled to failed, it keeps its old position and can be dropped by the maxJobs slice, so a single /codex:status call may reconcile a dead worker without showing it anywhere in the output. I reproduced this with >8 newer completed jobs: the reconciled dead job disappeared from both running and recent despite being newly failed.
Useful? React with 👍 / 👎.
Wrap the runner with Promise.race against a 30-minute default timeout. On expiry the job transitions to failed/phase:failed so zombie 'running' rows can't accumulate when a runner hangs. OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default. Closes #41. Port of openai/codex-plugin-cc#184.
* fix: quote $ARGUMENTS in cancel/result/status commands Unquoted $ARGUMENTS allows shell splitting on user-supplied job IDs containing metacharacters. Wrap in double quotes to match review.md and adversarial-review.md. Closes #38. Port of openai/codex-plugin-cc#168. * fix: declare model: sonnet in opencode-rescue agent frontmatter Without a model declaration the agent tier was unpredictable. The rescue subagent is a thin forwarder that invokes the companion via a single Bash call and applies trivial routing logic — sonnet is sufficient and gives users a cost guarantee. Closes #39. Port of openai/codex-plugin-cc#169. * fix: scope /opencode:cancel default to current Claude session Without ref, resolveCancelableJob now filters running jobs by sessionId so a cancel in session A cannot kill jobs in session B. Explicit ref still searches all sessions — naming a job counts as intent. Closes #45. Port of openai/codex-plugin-cc#84. * fix: enforce hard wall-clock timeout on runTrackedJob Wrap the runner with Promise.race against a 30-minute default timeout. On expiry the job transitions to failed/phase:failed so zombie 'running' rows can't accumulate when a runner hangs. OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default. Closes #41. Port of openai/codex-plugin-cc#184. * fix: reconcile dead-PID jobs on every status read Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs / markDeadPidJobFailed in job-control. buildStatusSnapshot and the handleResult/handleCancel paths now probe kill(pid, 0) on any active-state job and rewrite dead ones to failed before consuming the list. A single /opencode:status / result / cancel surfaces stuck workers without waiting for SessionEnd. markDeadPidJobFailed is race-safe: it re-reads state and refuses to downgrade terminal states or rewrite when the pid has changed. Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184. * fix: avoid embedding large diffs in review prompts Classify review scope before building the prompt. When the diff exceeds ~5 files or ~256 KB, fall back to a lightweight context (status, changed-files, diff_stat) and tell OpenCode to inspect the diff itself via read-only git commands. Prevents HTTP 400 / shallow findings on moderate-to-large changesets. Adversarial template grows a {{REVIEW_COLLECTION_GUIDANCE}} slot. Thresholds overridable via opts.maxInlineDiffFiles/Bytes. Closes #40. Port of openai/codex-plugin-cc#179. * fix: respect \$SHELL on Windows when spawning child processes Add platformShellOption() helper that picks false on POSIX, and \$SHELL || true on win32 so Git Bash users get their shell while cmd fallback still resolves .cmd/.bat shims. Apply to runCommand, spawnDetached, resolveOpencodeBinary, getOpencodeVersion, and the ensureServer spawn of 'opencode serve'. Uses 'where' instead of 'which' on win32, and parses the first line of its CRLF-separated output. Closes #46. Port of openai/codex-plugin-cc#178. * fix: migrate tmpdir state to CLAUDE_PLUGIN_DATA + fix /tmp literal The fallback path was hard-coded to '/tmp' — broken on Windows. Use os.tmpdir() so Windows and other platforms get a real tmp path. Additionally: when CLAUDE_PLUGIN_DATA is set on a later call but state was previously written to the tmpdir fallback, copy it into the plugin-data dir and rewrite absolute path references inside state.json and jobs/*.json so logFile pointers don't dangle. Prevents job history from being silently dropped when commands run under different env contexts within one Claude session. Closes #47. Port of openai/codex-plugin-cc#125. * feat: pass last review findings to rescue automatically After a successful /opencode:review or /opencode:adversarial-review, save the rendered output to ~/.opencode-companion/last-review-<hash>.md (per repo, SHA-256 of workspace path). Add a new 'last-review' subcommand that reports availability or streams the content. rescue.md now checks for a saved review when invoked without task text and asks via AskUserQuestion whether to fix the prior findings or describe a new task. The save is best-effort — a failed persistence never fails the review itself. Closes #44. Port of openai/codex-plugin-cc#129 (simplified: logic lives in the companion script rather than an inline node -e one-liner). * feat: throttle controls for stop-time review gate Add --review-gate-max and --review-gate-cooldown flags to /opencode:setup so users can bound the review gate's spend: /opencode:setup --review-gate-max 5 /opencode:setup --review-gate-cooldown 10 /opencode:setup --review-gate-max off The stop hook now loads state before touching stdin, checks reviewGateMaxPerSession and reviewGateCooldownMinutes against the current session's reviewGateUsage entry, and allows the stop without running OpenCode when a limit would be exceeded. Usage entries older than 7 days are pruned on each successful run so state.json doesn't grow unbounded. renderSetup surfaces the configured limits. Closes #48. Port of openai/codex-plugin-cc#20. * feat: --worktree flag for isolated write-capable rescue tasks Add a disposable-git-worktree mode so /opencode:rescue --write --worktree runs OpenCode inside .worktrees/opencode-<ts> on a fresh opencode/<ts> branch instead of editing the working tree in place. Useful for exploratory runs, parallel rescues, and running against a dirty tree. Pieces: - lib/git.mjs: createWorktree / removeWorktree / deleteWorktreeBranch / getWorktreeDiff / applyWorktreePatch. Adds .worktrees/ to .git/info/exclude on first use so the dir never shows in status. - lib/worktree.mjs: session wrapper — createWorktreeSession, diffWorktreeSession, cleanupWorktreeSession (keep applies patch back, discard just removes). - opencode-companion.mjs: handleTask threads --worktree + swaps cwd + stores session data on the job record + renders a keep/discard footer. New worktree-cleanup subcommand reads the stored session and runs the keep or discard path. - agents/opencode-rescue.md, commands/rescue.md, skills/opencode-runtime: propagate --worktree through the forwarding layer. - tests/worktree.test.mjs: create, diff, keep-applies, discard, no-change no-op. Closes #43. Port of openai/codex-plugin-cc#137. * fix: address pr51 review findings * fix: keep tracked job timeout referenced * fix: address pr51 review conversations * fix: add exclusive file lock to updateState for concurrency safety updateState's read-modify-write cycle was not protected against concurrent companion processes (background worker + status/cancel handler), which could silently lose each other's writes. Acquire an exclusive lock file (state.json.lock via O_EXCL) before reading, hold it through mutation and write, release in finally. Stale locks older than 30s are evicted. Blocks up to 5s with retry. Closes the pre-existing concurrency race amplified by PR #51's dead-PID reconciliation (which adds upsertJob calls on every status read). * fix: address brownfield discovery bugs Critical/high fixes: - BUG-1: saveLastReview use copyFileSync+unlinkSync instead of renameSync (fixes Windows compatibility issue where rename fails if target exists) - BUG-2: handleTask worktree leak - wrap in try/finally to guarantee cleanup - BUG-3: State migration race - add fallback directory lock during migration - BUG-4/13: handleTaskWorker missing signal handlers for graceful shutdown Medium fixes: - BUG-5: releaseStateLock now fsyncs directory after lock removal - BUG-11: Error from getConfiguredProviders now logged instead of swallowed Low fixes: - BUG-6: PR number validation now rejects negative values - BUG-7: getBundledConfigDir checks directory exists before returning - BUG-8: tailLines now properly filters empty lines after split - BUG-9: resolveReviewAgent always returns tools property (undefined if not used) - BUG-10: Diff retrieval failure now logs warning instead of silent swallow - BUG-12: resolveOpencodeBinary now handles spawn errors properly Additional pre-existing work included: - safe-command.mjs wrapper for secure command execution - Command documentation updates - Test improvements * fix: polish pr51 follow-up fixes * fix: address Copilot PR#51 review comments Four findings, all valid: C1 (prompts.mjs, git.mjs, process.mjs) — buildReviewPrompt materialized the full diff string before checking thresholds. For huge diffs the git/gh subprocess could OOM the companion before the size check ran. Fix: runCommand gains maxOutputBytes, killing the child and reporting overflowed once the cap is exceeded. getDiff and getPrDiff thread maxBytes through. buildReviewPrompt now bounds the read at maxBytes+1 and treats overflow as over-byte-limit without ever materializing the rest. C2 (git.mjs) — getDiffByteSize had a docstring claiming it avoided streaming the full contents, but the implementation did exactly that. It was also dead code (zero callers). Removed. C3 (tests/state-lock.test.mjs) — the test injected path.resolve(...) into a generated ESM import specifier. On Windows that path contains backslashes and a drive letter, producing an invalid module specifier. Fix: pathToFileURL(...).href for the injected specifier. C4 (tests/dead-pid-reconcile.test.mjs) — beforeEach mutated the object returned by loadState() without saving it back, leaving on-disk state from earlier tests intact. Fix: saveState(workspace, { config:{}, jobs:[] }). Adds coverage: - tests/process.test.mjs: runCommand overflow path and non-overflow path. - tests/review-prompt-size.test.mjs: bounds-huge-diff end-to-end test that writes a 50k-byte file and asserts fewer than 10k 'x' chars land in the prompt. All 167 tests pass. * fix: keep reading fallback state while migrate lock is held Addresses a Codex P1 on PR#51: when another migrator holds \`primaryDir.migrate.lock\`, migrateTmpdirStateIfNeeded waits up to 2s and then returns without copying anything. Before this fix, stateRoot still returned primaryDir — but primary/state.json didn't exist yet, so loadState returned an empty state and the next upsertJob created primary/state.json with only the new entry, orphaning every seeded fallback job. Fix: after a migration attempt, if primary/state.json is absent and fallback/state.json is present, stateRoot returns the fallback dir. Reads see real data, writes land in fallback, and a later migration retry can pick them up cleanly. Adds a regression test that pre-creates the migrate lock, seeds fallback with a job, switches to CLAUDE_PLUGIN_DATA, and verifies that stateRoot falls back, loadState sees the seeded job, and a subsequent write preserves both the seeded and the in-flight rows. The symlink-refusal test had to be updated because it was reusing stateRoot to name the "primary" dir — with the new fallback guard, that call now returns fallbackDir on failed migration. The test now computes the expected primary path directly.
Summary
A Codex task can become stuck in
status: runningindefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that/codex:statuskeeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next/codex:cancelreportsECONNREFUSEDon the broker socket. This PR fixes four independent root causes that combine to produce the wedge.Closes #183. Refs #176, #164.
Reproduction
Run any moderately heavy task that exercises the broker for ~3-5 minutes. In my case it was a Codex review of a ~4,500-line markdown corpus. After the codex CLI processed several
nl/jq/rgshell commands and started reasoning about the result, the worker fell silent. The job-store kept reportingphase: runningfor 9+ minutes with no log progress./codex:cancelreturnedECONNREFUSED /var/folders/.../cxc-XXXXXX/broker.sock. Both the companion PID and the broker PID were dead by the time I cancelled.Root causes
captureTurnhangs forever on disconnection.lib/codex.mjsawaitsstate.completionwith no race against the client'sexitPromise. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is delivered, so the await never resolves. Pending RPC requests are rejected byhandleExitbut the turn-completion promise is unrelated to those, so it sits forever.runTrackedJobhas no timeout fallback.lib/tracked-jobs.mjsawaits the runner without a timeout. WhencaptureTurnhangs, the entire companion process hangs and the per-job state file is never transitioned to a terminal status.The broker zombifies on app-server death.
app-server-broker.mjsnever listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stalebroker.jsonthat points at a half-dead broker.Status reads do not probe PID liveness.
buildStatusSnapshotandbuildSingleJobSnapshotnever check whether tracked PIDs are still alive. Once a job is wedged, every/codex:statusquery reports it as running.Fixes
captureTurnraces against client exit (lib/codex.mjs). A newexitWatchpromise rejectsstate.completionwith the recorded exit error when the client closes before a terminal event arrives. Progress emits afailednotification so the user sees what happened.runTrackedJobenforces a hard timeout (lib/tracked-jobs.mjs). Default 30 minutes, override viaCODEX_COMPANION_JOB_TIMEOUT_MSenv var or thetimeoutMsoption. On timeout the job transitions tofailedwith a clear error message instead of leaving the companion stuck.Broker subscribes to appClient exit (app-server-broker.mjs). When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a
notifications/broker/shuttingDownevent to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint viaensureBrokerSessionand respawn cleanly.isProcessAlivehelper (lib/process.mjs). Useskill(pid, 0)to probe pid liveness without affecting the process.markDeadPidJobFailed+reconcileIfDead(lib/job-control.mjs). Reconcile any active job whose tracked PID is gone. Reconciliation runs on every status read path now, not only the--waitpolling loop, so a single/codex:statuscall surfaces dead workers immediately. This is broader than Fix stuck running jobs by detecting dead review/task PIDs #176, which only reconciled insidewaitForSingleJobSnapshot.Relationship to existing PRs/issues
status --wait. If Fix stuck running jobs by detecting dead review/task PIDs #176 lands first, the overlap is small (the helper functions and a couple of import lines).Test plan
tests/process.test.mjs— 3 new cases forisProcessAlive(invalid input, current pid, exited child).tests/dead-pid-reconcile.test.mjs— 6 new cases coveringmarkDeadPidJobFailedguards and reconcile-on-status for both single-job and full snapshots.tests/tracked-jobs-timeout.test.mjs— 2 new cases proving the hard timeout transitions a hung job tofailedand that fast runners still complete normally.node --test tests/*.test.mjs→ 91 pass, 5 fail. The 5 failures (collectReviewContext skips untracked directories, threeruntime.test.mjssnapshot tests, andresolveStateDir uses a temp-backed per-workspace directory) are pre-existing onmainand unrelated to this patch — I confirmed by running the same suite against pristinemain.Test plan (manual)
/codex:statusnow reconciles a dead job tofailedon a single call.failedstatus.process.kill(pid, 0)on Unix paths and the platform branch interminateProcessTreeis unchanged).🤖 Generated with Claude Code