fix(runtime): prevent tracked jobs hanging forever on broker disconnect by jonny981 · Pull Request #184 · openai/codex-plugin-cc

jonny981 · 2026-04-08T15:50:39Z

Summary

A Codex task can become stuck in status: running indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that /codex:status keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next /codex:cancel reports ECONNREFUSED on the broker socket. This PR fixes four independent root causes that combine to produce the wedge.

Closes #183. Refs #176, #164.

Reproduction

Run any moderately heavy task that exercises the broker for ~3-5 minutes. In my case it was a Codex review of a ~4,500-line markdown corpus. After the codex CLI processed several nl/jq/rg shell commands and started reasoning about the result, the worker fell silent. The job-store kept reporting phase: running for 9+ minutes with no log progress. /codex:cancel returned ECONNREFUSED /var/folders/.../cxc-XXXXXX/broker.sock. Both the companion PID and the broker PID were dead by the time I cancelled.

Root causes

captureTurn hangs forever on disconnection. lib/codex.mjs awaits state.completion with no race against the client's exitPromise. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is delivered, so the await never resolves. Pending RPC requests are rejected by handleExit but the turn-completion promise is unrelated to those, so it sits forever.
runTrackedJob has no timeout fallback. lib/tracked-jobs.mjs awaits the runner without a timeout. When captureTurn hangs, the entire companion process hangs and the per-job state file is never transitioned to a terminal status.
The broker zombifies on app-server death. app-server-broker.mjs never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker.
Status reads do not probe PID liveness. buildStatusSnapshot and buildSingleJobSnapshot never check whether tracked PIDs are still alive. Once a job is wedged, every /codex:status query reports it as running.

Fixes

captureTurn races against client exit (lib/codex.mjs). A new exitWatch promise rejects state.completion with the recorded exit error when the client closes before a terminal event arrives. Progress emits a failed notification so the user sees what happened.
runTrackedJob enforces a hard timeout (lib/tracked-jobs.mjs). Default 30 minutes, override via CODEX_COMPANION_JOB_TIMEOUT_MS env var or the timeoutMs option. On timeout the job transitions to failed with a clear error message instead of leaving the companion stuck.
Broker subscribes to appClient exit (app-server-broker.mjs). When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a notifications/broker/shuttingDown event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via ensureBrokerSession and respawn cleanly.
isProcessAlive helper (lib/process.mjs). Uses kill(pid, 0) to probe pid liveness without affecting the process.
markDeadPidJobFailed + reconcileIfDead (lib/job-control.mjs). Reconcile any active job whose tracked PID is gone. Reconciliation runs on every status read path now, not only the --wait polling loop, so a single /codex:status call surfaces dead workers immediately. This is broader than Fix stuck running jobs by detecting dead review/task PIDs #176, which only reconciled inside waitForSingleJobSnapshot.

Relationship to existing PRs/issues

runTrackedJob / captureTurn can hang in phase: finalizing indefinitely (no timeout) #183 (runTrackedJob/captureTurn hang in finalizing) — fully addressed by the captureTurn race + the hard timeout.
Fix stuck running jobs by detecting dead review/task PIDs #176 (Fix stuck running jobs by detecting dead review/task PIDs) — this PR includes equivalent dead-PID detection but also reconciles on every status read, not only status --wait. If Fix stuck running jobs by detecting dead review/task PIDs #176 lands first, the overlap is small (the helper functions and a couple of import lines).
Review jobs stuck in 'running' after process death — no dead-PID detection #164 (Review jobs stuck in 'running' after process death) — also addressed.

Test plan

New unit tests (11 cases, all pass):
- tests/process.test.mjs — 3 new cases for isProcessAlive (invalid input, current pid, exited child).
- tests/dead-pid-reconcile.test.mjs — 6 new cases covering markDeadPidJobFailed guards and reconcile-on-status for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs — 2 new cases proving the hard timeout transitions a hung job to failed and that fast runners still complete normally.
Full suite: node --test tests/*.test.mjs → 91 pass, 5 fail. The 5 failures (collectReviewContext skips untracked directories, three runtime.test.mjs snapshot tests, and resolveStateDir uses a temp-backed per-workspace directory) are pre-existing on main and unrelated to this patch — I confirmed by running the same suite against pristine main.

Test plan (manual)

Reproduced the wedge with a long-running Codex task on macOS.
Verified /codex:status now reconciles a dead job to failed on a single call.
Verified the new hard timeout fires and writes a failed status.
Smoke-test on Windows (I do not have a Windows host; the changes use only process.kill(pid, 0) on Unix paths and the platform branch in terminateProcessTree is unchanged).

🤖 Generated with Claude Code

A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. Three independent root causes combine to produce the wedge: 1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is ever delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit` but the turn-completion promise is unrelated to those, so it sits forever. 2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no timeout fallback. When `captureTurn` hangs, the entire companion process hangs, and the per-job state file is never transitioned to a terminal status. 3. `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker. 4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do not probe whether tracked PIDs are still alive. Once a job is wedged, every status query reports it as running. Fixes: - captureTurn now races `state.completion` against `client.exitPromise`. If the client closes before a terminal event arrives, the turn is rejected with the recorded exit error and progress emits a `failed` notification. - runTrackedJob races the runner against a hard timeout (default 30 minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the `timeoutMs` option). On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck. - The broker now subscribes to its appClient's `exitPromise`. When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly. - A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)` to probe pid liveness without affecting the process. - A new `markDeadPidJobFailed` and `reconcileIfDead` helper in lib/job-control.mjs reconcile any active job whose tracked PID is gone. Reconciliation now runs on every status read path, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately. Closes openai#183 (runTrackedJob/captureTurn hang in finalizing). Refs openai#176, openai#164 (dead PID detection on status reads). Tests: - tests/process.test.mjs: 3 new cases for `isProcessAlive`. - tests/dead-pid-reconcile.test.mjs: 6 new cases covering `markDeadPidJobFailed` guards (active state, PID match) and the reconcile-on-status behaviour for both single-job and full snapshots. - tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard timeout transitions a hung job to `failed` and that successful runners still complete normally. All 11 new tests pass. The 5 failing tests under the existing suite (`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and `resolveStateDir`) are pre-existing on `main` and unrelated to this patch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d13e46207

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-08T15:56:08Z

plugins/codex/scripts/lib/job-control.mjs

+  const rawJobs = sortJobsNewestFirst(filterJobsForCurrentSession(listJobs(workspaceRoot), options));
+  // Reconcile any active jobs whose tracked PID is dead before partitioning,
+  // so a single status read surfaces stuck workers immediately.
+  const jobs = rawJobs.map((job) => reconcileIfDead(workspaceRoot, job));


Re-sort jobs after dead-PID reconciliation

buildStatusSnapshot sorts once (rawJobs) and then mutates job states via reconcileIfDead, but it never re-sorts before computing latestFinished/recent. If a stale running job near the end of the list is reconciled to failed, it keeps its old position and can be dropped by the maxJobs slice, so a single /codex:status call may reconcile a dead worker without showing it anywhere in the output. I reproduced this with >8 newer completed jobs: the reconciled dead job disappeared from both running and recent despite being newly failed.

Useful? React with 👍 / 👎.

Wrap the runner with Promise.race against a 30-minute default timeout. On expiry the job transitions to failed/phase:failed so zombie 'running' rows can't accumulate when a runner hangs. OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default. Closes #41. Port of openai/codex-plugin-cc#184.

* fix: quote $ARGUMENTS in cancel/result/status commands Unquoted $ARGUMENTS allows shell splitting on user-supplied job IDs containing metacharacters. Wrap in double quotes to match review.md and adversarial-review.md. Closes #38. Port of openai/codex-plugin-cc#168. * fix: declare model: sonnet in opencode-rescue agent frontmatter Without a model declaration the agent tier was unpredictable. The rescue subagent is a thin forwarder that invokes the companion via a single Bash call and applies trivial routing logic — sonnet is sufficient and gives users a cost guarantee. Closes #39. Port of openai/codex-plugin-cc#169. * fix: scope /opencode:cancel default to current Claude session Without ref, resolveCancelableJob now filters running jobs by sessionId so a cancel in session A cannot kill jobs in session B. Explicit ref still searches all sessions — naming a job counts as intent. Closes #45. Port of openai/codex-plugin-cc#84. * fix: enforce hard wall-clock timeout on runTrackedJob Wrap the runner with Promise.race against a 30-minute default timeout. On expiry the job transitions to failed/phase:failed so zombie 'running' rows can't accumulate when a runner hangs. OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default. Closes #41. Port of openai/codex-plugin-cc#184. * fix: reconcile dead-PID jobs on every status read Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs / markDeadPidJobFailed in job-control. buildStatusSnapshot and the handleResult/handleCancel paths now probe kill(pid, 0) on any active-state job and rewrite dead ones to failed before consuming the list. A single /opencode:status / result / cancel surfaces stuck workers without waiting for SessionEnd. markDeadPidJobFailed is race-safe: it re-reads state and refuses to downgrade terminal states or rewrite when the pid has changed. Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184. * fix: avoid embedding large diffs in review prompts Classify review scope before building the prompt. When the diff exceeds ~5 files or ~256 KB, fall back to a lightweight context (status, changed-files, diff_stat) and tell OpenCode to inspect the diff itself via read-only git commands. Prevents HTTP 400 / shallow findings on moderate-to-large changesets. Adversarial template grows a {{REVIEW_COLLECTION_GUIDANCE}} slot. Thresholds overridable via opts.maxInlineDiffFiles/Bytes. Closes #40. Port of openai/codex-plugin-cc#179. * fix: respect \$SHELL on Windows when spawning child processes Add platformShellOption() helper that picks false on POSIX, and \$SHELL || true on win32 so Git Bash users get their shell while cmd fallback still resolves .cmd/.bat shims. Apply to runCommand, spawnDetached, resolveOpencodeBinary, getOpencodeVersion, and the ensureServer spawn of 'opencode serve'. Uses 'where' instead of 'which' on win32, and parses the first line of its CRLF-separated output. Closes #46. Port of openai/codex-plugin-cc#178. * fix: migrate tmpdir state to CLAUDE_PLUGIN_DATA + fix /tmp literal The fallback path was hard-coded to '/tmp' — broken on Windows. Use os.tmpdir() so Windows and other platforms get a real tmp path. Additionally: when CLAUDE_PLUGIN_DATA is set on a later call but state was previously written to the tmpdir fallback, copy it into the plugin-data dir and rewrite absolute path references inside state.json and jobs/*.json so logFile pointers don't dangle. Prevents job history from being silently dropped when commands run under different env contexts within one Claude session. Closes #47. Port of openai/codex-plugin-cc#125. * feat: pass last review findings to rescue automatically After a successful /opencode:review or /opencode:adversarial-review, save the rendered output to ~/.opencode-companion/last-review-<hash>.md (per repo, SHA-256 of workspace path). Add a new 'last-review' subcommand that reports availability or streams the content. rescue.md now checks for a saved review when invoked without task text and asks via AskUserQuestion whether to fix the prior findings or describe a new task. The save is best-effort — a failed persistence never fails the review itself. Closes #44. Port of openai/codex-plugin-cc#129 (simplified: logic lives in the companion script rather than an inline node -e one-liner). * feat: throttle controls for stop-time review gate Add --review-gate-max and --review-gate-cooldown flags to /opencode:setup so users can bound the review gate's spend: /opencode:setup --review-gate-max 5 /opencode:setup --review-gate-cooldown 10 /opencode:setup --review-gate-max off The stop hook now loads state before touching stdin, checks reviewGateMaxPerSession and reviewGateCooldownMinutes against the current session's reviewGateUsage entry, and allows the stop without running OpenCode when a limit would be exceeded. Usage entries older than 7 days are pruned on each successful run so state.json doesn't grow unbounded. renderSetup surfaces the configured limits. Closes #48. Port of openai/codex-plugin-cc#20. * feat: --worktree flag for isolated write-capable rescue tasks Add a disposable-git-worktree mode so /opencode:rescue --write --worktree runs OpenCode inside .worktrees/opencode-<ts> on a fresh opencode/<ts> branch instead of editing the working tree in place. Useful for exploratory runs, parallel rescues, and running against a dirty tree. Pieces: - lib/git.mjs: createWorktree / removeWorktree / deleteWorktreeBranch / getWorktreeDiff / applyWorktreePatch. Adds .worktrees/ to .git/info/exclude on first use so the dir never shows in status. - lib/worktree.mjs: session wrapper — createWorktreeSession, diffWorktreeSession, cleanupWorktreeSession (keep applies patch back, discard just removes). - opencode-companion.mjs: handleTask threads --worktree + swaps cwd + stores session data on the job record + renders a keep/discard footer. New worktree-cleanup subcommand reads the stored session and runs the keep or discard path. - agents/opencode-rescue.md, commands/rescue.md, skills/opencode-runtime: propagate --worktree through the forwarding layer. - tests/worktree.test.mjs: create, diff, keep-applies, discard, no-change no-op. Closes #43. Port of openai/codex-plugin-cc#137. * fix: address pr51 review findings * fix: keep tracked job timeout referenced * fix: address pr51 review conversations * fix: add exclusive file lock to updateState for concurrency safety updateState's read-modify-write cycle was not protected against concurrent companion processes (background worker + status/cancel handler), which could silently lose each other's writes. Acquire an exclusive lock file (state.json.lock via O_EXCL) before reading, hold it through mutation and write, release in finally. Stale locks older than 30s are evicted. Blocks up to 5s with retry. Closes the pre-existing concurrency race amplified by PR #51's dead-PID reconciliation (which adds upsertJob calls on every status read). * fix: address brownfield discovery bugs Critical/high fixes: - BUG-1: saveLastReview use copyFileSync+unlinkSync instead of renameSync (fixes Windows compatibility issue where rename fails if target exists) - BUG-2: handleTask worktree leak - wrap in try/finally to guarantee cleanup - BUG-3: State migration race - add fallback directory lock during migration - BUG-4/13: handleTaskWorker missing signal handlers for graceful shutdown Medium fixes: - BUG-5: releaseStateLock now fsyncs directory after lock removal - BUG-11: Error from getConfiguredProviders now logged instead of swallowed Low fixes: - BUG-6: PR number validation now rejects negative values - BUG-7: getBundledConfigDir checks directory exists before returning - BUG-8: tailLines now properly filters empty lines after split - BUG-9: resolveReviewAgent always returns tools property (undefined if not used) - BUG-10: Diff retrieval failure now logs warning instead of silent swallow - BUG-12: resolveOpencodeBinary now handles spawn errors properly Additional pre-existing work included: - safe-command.mjs wrapper for secure command execution - Command documentation updates - Test improvements * fix: polish pr51 follow-up fixes * fix: address Copilot PR#51 review comments Four findings, all valid: C1 (prompts.mjs, git.mjs, process.mjs) — buildReviewPrompt materialized the full diff string before checking thresholds. For huge diffs the git/gh subprocess could OOM the companion before the size check ran. Fix: runCommand gains maxOutputBytes, killing the child and reporting overflowed once the cap is exceeded. getDiff and getPrDiff thread maxBytes through. buildReviewPrompt now bounds the read at maxBytes+1 and treats overflow as over-byte-limit without ever materializing the rest. C2 (git.mjs) — getDiffByteSize had a docstring claiming it avoided streaming the full contents, but the implementation did exactly that. It was also dead code (zero callers). Removed. C3 (tests/state-lock.test.mjs) — the test injected path.resolve(...) into a generated ESM import specifier. On Windows that path contains backslashes and a drive letter, producing an invalid module specifier. Fix: pathToFileURL(...).href for the injected specifier. C4 (tests/dead-pid-reconcile.test.mjs) — beforeEach mutated the object returned by loadState() without saving it back, leaving on-disk state from earlier tests intact. Fix: saveState(workspace, { config:{}, jobs:[] }). Adds coverage: - tests/process.test.mjs: runCommand overflow path and non-overflow path. - tests/review-prompt-size.test.mjs: bounds-huge-diff end-to-end test that writes a 50k-byte file and asserts fewer than 10k 'x' chars land in the prompt. All 167 tests pass. * fix: keep reading fallback state while migrate lock is held Addresses a Codex P1 on PR#51: when another migrator holds \`primaryDir.migrate.lock\`, migrateTmpdirStateIfNeeded waits up to 2s and then returns without copying anything. Before this fix, stateRoot still returned primaryDir — but primary/state.json didn't exist yet, so loadState returned an empty state and the next upsertJob created primary/state.json with only the new entry, orphaning every seeded fallback job. Fix: after a migration attempt, if primary/state.json is absent and fallback/state.json is present, stateRoot returns the fallback dir. Reads see real data, writes land in fallback, and a later migration retry can pick them up cleanly. Adds a regression test that pre-creates the migrate lock, seeds fallback with a job, switches to CLAUDE_PLUGIN_DATA, and verifies that stateRoot falls back, loadState sees the seeded job, and a subsequent write preserves both the seeded and the in-flight rows. The symlink-refusal test had to be updated because it was reusing stateRoot to name the "primary" dir — with the new fallback guard, that call now returns fallbackDir on failed migration. The test now computes the expected primary path directly.

jonny981 requested a review from a team April 8, 2026 15:50

chatgpt-codex-connector bot reviewed Apr 8, 2026

View reviewed changes

This was referenced Apr 12, 2026

Port: tracked-job hard timeout + exit-promise race (ref impl for #13) JohnnyVicious/opencode-plugin-cc#41

Closed

Port: reconcile dead-PID jobs on every status read (ref impl for #14) JohnnyVicious/opencode-plugin-cc#42

Closed

JohnnyVicious mentioned this pull request Apr 12, 2026

Port 11 high-quality fixes/features from openai/codex-plugin-cc JohnnyVicious/opencode-plugin-cc#51

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184

fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184
jonny981 wants to merge 1 commit intoopenai:mainfrom
jonny981:fix/captureturn-hang-and-dead-pid-detection

jonny981 commented Apr 8, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jonny981 commented Apr 8, 2026

Summary

Reproduction

Root causes

Fixes

Relationship to existing PRs/issues

Test plan

Test plan (manual)

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant