Skip to content

fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184

Open
jonny981 wants to merge 1 commit intoopenai:mainfrom
jonny981:fix/captureturn-hang-and-dead-pid-detection
Open

fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184
jonny981 wants to merge 1 commit intoopenai:mainfrom
jonny981:fix/captureturn-hang-and-dead-pid-detection

Conversation

@jonny981
Copy link
Copy Markdown

@jonny981 jonny981 commented Apr 8, 2026

Summary

A Codex task can become stuck in status: running indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that /codex:status keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next /codex:cancel reports ECONNREFUSED on the broker socket. This PR fixes four independent root causes that combine to produce the wedge.

Closes #183. Refs #176, #164.

Reproduction

Run any moderately heavy task that exercises the broker for ~3-5 minutes. In my case it was a Codex review of a ~4,500-line markdown corpus. After the codex CLI processed several nl/jq/rg shell commands and started reasoning about the result, the worker fell silent. The job-store kept reporting phase: running for 9+ minutes with no log progress. /codex:cancel returned ECONNREFUSED /var/folders/.../cxc-XXXXXX/broker.sock. Both the companion PID and the broker PID were dead by the time I cancelled.

Root causes

  1. captureTurn hangs forever on disconnection. lib/codex.mjs awaits state.completion with no race against the client's exitPromise. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is delivered, so the await never resolves. Pending RPC requests are rejected by handleExit but the turn-completion promise is unrelated to those, so it sits forever.

  2. runTrackedJob has no timeout fallback. lib/tracked-jobs.mjs awaits the runner without a timeout. When captureTurn hangs, the entire companion process hangs and the per-job state file is never transitioned to a terminal status.

  3. The broker zombifies on app-server death. app-server-broker.mjs never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker.

  4. Status reads do not probe PID liveness. buildStatusSnapshot and buildSingleJobSnapshot never check whether tracked PIDs are still alive. Once a job is wedged, every /codex:status query reports it as running.

Fixes

  • captureTurn races against client exit (lib/codex.mjs). A new exitWatch promise rejects state.completion with the recorded exit error when the client closes before a terminal event arrives. Progress emits a failed notification so the user sees what happened.

  • runTrackedJob enforces a hard timeout (lib/tracked-jobs.mjs). Default 30 minutes, override via CODEX_COMPANION_JOB_TIMEOUT_MS env var or the timeoutMs option. On timeout the job transitions to failed with a clear error message instead of leaving the companion stuck.

  • Broker subscribes to appClient exit (app-server-broker.mjs). When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a notifications/broker/shuttingDown event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via ensureBrokerSession and respawn cleanly.

  • isProcessAlive helper (lib/process.mjs). Uses kill(pid, 0) to probe pid liveness without affecting the process.

  • markDeadPidJobFailed + reconcileIfDead (lib/job-control.mjs). Reconcile any active job whose tracked PID is gone. Reconciliation runs on every status read path now, not only the --wait polling loop, so a single /codex:status call surfaces dead workers immediately. This is broader than Fix stuck running jobs by detecting dead review/task PIDs #176, which only reconciled inside waitForSingleJobSnapshot.

Relationship to existing PRs/issues

Test plan

  • New unit tests (11 cases, all pass):
    • tests/process.test.mjs — 3 new cases for isProcessAlive (invalid input, current pid, exited child).
    • tests/dead-pid-reconcile.test.mjs — 6 new cases covering markDeadPidJobFailed guards and reconcile-on-status for both single-job and full snapshots.
    • tests/tracked-jobs-timeout.test.mjs — 2 new cases proving the hard timeout transitions a hung job to failed and that fast runners still complete normally.
  • Full suite: node --test tests/*.test.mjs → 91 pass, 5 fail. The 5 failures (collectReviewContext skips untracked directories, three runtime.test.mjs snapshot tests, and resolveStateDir uses a temp-backed per-workspace directory) are pre-existing on main and unrelated to this patch — I confirmed by running the same suite against pristine main.

Test plan (manual)

  • Reproduced the wedge with a long-running Codex task on macOS.
  • Verified /codex:status now reconciles a dead job to failed on a single call.
  • Verified the new hard timeout fires and writes a failed status.
  • Smoke-test on Windows (I do not have a Windows host; the changes use only process.kill(pid, 0) on Unix paths and the platform branch in terminateProcessTree is unchanged).

🤖 Generated with Claude Code

A Codex task can become stuck in `status: running` indefinitely with no
way to recover short of manually wiping the job-store and broker session.
The user-visible symptom is that `/codex:status` keeps reporting a job as
running for many minutes after the underlying codex CLI worker has died,
and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket.

Three independent root causes combine to produce the wedge:

1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race
   against the client's `exitPromise`. When the underlying codex
   app-server (or the broker socket) dies mid-turn, no terminal event is
   ever delivered, so the await never resolves. Pending RPC requests are
   rejected by `handleExit` but the turn-completion promise is unrelated
   to those, so it sits forever.

2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no
   timeout fallback. When `captureTurn` hangs, the entire companion
   process hangs, and the per-job state file is never transitioned to a
   terminal status.

3. `app-server-broker.mjs` never listens for the underlying app-server
   client's exit. When the codex CLI process dies, the broker keeps
   running with a dead client, accepting new connections but unable to
   serve them. The next companion sees a stale broker.json that points
   at a half-dead broker.

4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do
   not probe whether tracked PIDs are still alive. Once a job is wedged,
   every status query reports it as running.

Fixes:

- captureTurn now races `state.completion` against `client.exitPromise`.
  If the client closes before a terminal event arrives, the turn is
  rejected with the recorded exit error and progress emits a `failed`
  notification.

- runTrackedJob races the runner against a hard timeout (default 30
  minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the
  `timeoutMs` option). On timeout the job transitions to `failed` with
  a clear error message instead of leaving the companion stuck.

- The broker now subscribes to its appClient's `exitPromise`. When the
  underlying codex CLI exits unexpectedly the broker logs the reason,
  fans out a `notifications/broker/shuttingDown` event to any connected
  socket, tears down the unix socket and pid file, then exits with
  status 1. The next companion will detect a dead endpoint via
  `ensureBrokerSession` and respawn cleanly.

- A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)`
  to probe pid liveness without affecting the process.

- A new `markDeadPidJobFailed` and `reconcileIfDead` helper in
  lib/job-control.mjs reconcile any active job whose tracked PID is
  gone. Reconciliation now runs on every status read path, not only the
  `--wait` polling loop, so a single `/codex:status` call surfaces dead
  workers immediately.

Closes openai#183 (runTrackedJob/captureTurn hang in finalizing).
Refs openai#176, openai#164 (dead PID detection on status reads).

Tests:

- tests/process.test.mjs: 3 new cases for `isProcessAlive`.
- tests/dead-pid-reconcile.test.mjs: 6 new cases covering
  `markDeadPidJobFailed` guards (active state, PID match) and the
  reconcile-on-status behaviour for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard
  timeout transitions a hung job to `failed` and that successful
  runners still complete normally.

All 11 new tests pass. The 5 failing tests under the existing suite
(`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and
`resolveStateDir`) are pre-existing on `main` and unrelated to this
patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonny981 jonny981 requested a review from a team April 8, 2026 15:50
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d13e46207

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +312 to +315
const rawJobs = sortJobsNewestFirst(filterJobsForCurrentSession(listJobs(workspaceRoot), options));
// Reconcile any active jobs whose tracked PID is dead before partitioning,
// so a single status read surfaces stuck workers immediately.
const jobs = rawJobs.map((job) => reconcileIfDead(workspaceRoot, job));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-sort jobs after dead-PID reconciliation

buildStatusSnapshot sorts once (rawJobs) and then mutates job states via reconcileIfDead, but it never re-sorts before computing latestFinished/recent. If a stale running job near the end of the list is reconciled to failed, it keeps its old position and can be dropped by the maxJobs slice, so a single /codex:status call may reconcile a dead worker without showing it anywhere in the output. I reproduced this with >8 newer completed jobs: the reconciled dead job disappeared from both running and recent despite being newly failed.

Useful? React with 👍 / 👎.

JohnnyVicious added a commit to JohnnyVicious/opencode-plugin-cc that referenced this pull request Apr 12, 2026
Wrap the runner with Promise.race against a 30-minute default timeout.
On expiry the job transitions to failed/phase:failed so zombie
'running' rows can't accumulate when a runner hangs.

OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default.

Closes #41. Port of openai/codex-plugin-cc#184.
JohnnyVicious added a commit to JohnnyVicious/opencode-plugin-cc that referenced this pull request Apr 12, 2026
* fix: quote $ARGUMENTS in cancel/result/status commands

Unquoted $ARGUMENTS allows shell splitting on user-supplied job IDs
containing metacharacters. Wrap in double quotes to match review.md
and adversarial-review.md.

Closes #38. Port of openai/codex-plugin-cc#168.

* fix: declare model: sonnet in opencode-rescue agent frontmatter

Without a model declaration the agent tier was unpredictable. The
rescue subagent is a thin forwarder that invokes the companion via a
single Bash call and applies trivial routing logic — sonnet is
sufficient and gives users a cost guarantee.

Closes #39. Port of openai/codex-plugin-cc#169.

* fix: scope /opencode:cancel default to current Claude session

Without ref, resolveCancelableJob now filters running jobs by
sessionId so a cancel in session A cannot kill jobs in session B.
Explicit ref still searches all sessions — naming a job counts as
intent.

Closes #45. Port of openai/codex-plugin-cc#84.

* fix: enforce hard wall-clock timeout on runTrackedJob

Wrap the runner with Promise.race against a 30-minute default timeout.
On expiry the job transitions to failed/phase:failed so zombie
'running' rows can't accumulate when a runner hangs.

OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default.

Closes #41. Port of openai/codex-plugin-cc#184.

* fix: reconcile dead-PID jobs on every status read

Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs /
markDeadPidJobFailed in job-control. buildStatusSnapshot and the
handleResult/handleCancel paths now probe kill(pid, 0) on any
active-state job and rewrite dead ones to failed before consuming the
list. A single /opencode:status / result / cancel surfaces stuck
workers without waiting for SessionEnd.

markDeadPidJobFailed is race-safe: it re-reads state and refuses to
downgrade terminal states or rewrite when the pid has changed.

Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184.

* fix: avoid embedding large diffs in review prompts

Classify review scope before building the prompt. When the diff exceeds
~5 files or ~256 KB, fall back to a lightweight context (status,
changed-files, diff_stat) and tell OpenCode to inspect the diff itself
via read-only git commands. Prevents HTTP 400 / shallow findings on
moderate-to-large changesets.

Adversarial template grows a {{REVIEW_COLLECTION_GUIDANCE}} slot.
Thresholds overridable via opts.maxInlineDiffFiles/Bytes.

Closes #40. Port of openai/codex-plugin-cc#179.

* fix: respect \$SHELL on Windows when spawning child processes

Add platformShellOption() helper that picks false on POSIX, and
\$SHELL || true on win32 so Git Bash users get their shell while cmd
fallback still resolves .cmd/.bat shims. Apply to runCommand,
spawnDetached, resolveOpencodeBinary, getOpencodeVersion, and the
ensureServer spawn of 'opencode serve'.

Uses 'where' instead of 'which' on win32, and parses the first line
of its CRLF-separated output.

Closes #46. Port of openai/codex-plugin-cc#178.

* fix: migrate tmpdir state to CLAUDE_PLUGIN_DATA + fix /tmp literal

The fallback path was hard-coded to '/tmp' — broken on Windows. Use
os.tmpdir() so Windows and other platforms get a real tmp path.

Additionally: when CLAUDE_PLUGIN_DATA is set on a later call but
state was previously written to the tmpdir fallback, copy it into
the plugin-data dir and rewrite absolute path references inside
state.json and jobs/*.json so logFile pointers don't dangle.

Prevents job history from being silently dropped when commands run
under different env contexts within one Claude session.

Closes #47. Port of openai/codex-plugin-cc#125.

* feat: pass last review findings to rescue automatically

After a successful /opencode:review or /opencode:adversarial-review,
save the rendered output to ~/.opencode-companion/last-review-<hash>.md
(per repo, SHA-256 of workspace path). Add a new 'last-review'
subcommand that reports availability or streams the content.

rescue.md now checks for a saved review when invoked without task
text and asks via AskUserQuestion whether to fix the prior findings
or describe a new task.

The save is best-effort — a failed persistence never fails the review
itself.

Closes #44. Port of openai/codex-plugin-cc#129 (simplified: logic
lives in the companion script rather than an inline node -e one-liner).

* feat: throttle controls for stop-time review gate

Add --review-gate-max and --review-gate-cooldown flags to
/opencode:setup so users can bound the review gate's spend:

  /opencode:setup --review-gate-max 5
  /opencode:setup --review-gate-cooldown 10
  /opencode:setup --review-gate-max off

The stop hook now loads state before touching stdin, checks
reviewGateMaxPerSession and reviewGateCooldownMinutes against the
current session's reviewGateUsage entry, and allows the stop without
running OpenCode when a limit would be exceeded.

Usage entries older than 7 days are pruned on each successful run so
state.json doesn't grow unbounded. renderSetup surfaces the configured
limits.

Closes #48. Port of openai/codex-plugin-cc#20.

* feat: --worktree flag for isolated write-capable rescue tasks

Add a disposable-git-worktree mode so /opencode:rescue --write --worktree
runs OpenCode inside .worktrees/opencode-<ts> on a fresh opencode/<ts>
branch instead of editing the working tree in place. Useful for
exploratory runs, parallel rescues, and running against a dirty tree.

Pieces:
- lib/git.mjs: createWorktree / removeWorktree / deleteWorktreeBranch /
  getWorktreeDiff / applyWorktreePatch. Adds .worktrees/ to
  .git/info/exclude on first use so the dir never shows in status.
- lib/worktree.mjs: session wrapper — createWorktreeSession,
  diffWorktreeSession, cleanupWorktreeSession (keep applies patch back,
  discard just removes).
- opencode-companion.mjs: handleTask threads --worktree + swaps cwd +
  stores session data on the job record + renders a keep/discard
  footer. New worktree-cleanup subcommand reads the stored session and
  runs the keep or discard path.
- agents/opencode-rescue.md, commands/rescue.md, skills/opencode-runtime:
  propagate --worktree through the forwarding layer.
- tests/worktree.test.mjs: create, diff, keep-applies, discard,
  no-change no-op.

Closes #43. Port of openai/codex-plugin-cc#137.

* fix: address pr51 review findings

* fix: keep tracked job timeout referenced

* fix: address pr51 review conversations

* fix: add exclusive file lock to updateState for concurrency safety

updateState's read-modify-write cycle was not protected against
concurrent companion processes (background worker + status/cancel
handler), which could silently lose each other's writes.

Acquire an exclusive lock file (state.json.lock via O_EXCL) before
reading, hold it through mutation and write, release in finally.
Stale locks older than 30s are evicted. Blocks up to 5s with retry.

Closes the pre-existing concurrency race amplified by PR #51's
dead-PID reconciliation (which adds upsertJob calls on every status
read).

* fix: address brownfield discovery bugs

Critical/high fixes:
- BUG-1: saveLastReview use copyFileSync+unlinkSync instead of renameSync
  (fixes Windows compatibility issue where rename fails if target exists)
- BUG-2: handleTask worktree leak - wrap in try/finally to guarantee cleanup
- BUG-3: State migration race - add fallback directory lock during migration
- BUG-4/13: handleTaskWorker missing signal handlers for graceful shutdown

Medium fixes:
- BUG-5: releaseStateLock now fsyncs directory after lock removal
- BUG-11: Error from getConfiguredProviders now logged instead of swallowed

Low fixes:
- BUG-6: PR number validation now rejects negative values
- BUG-7: getBundledConfigDir checks directory exists before returning
- BUG-8: tailLines now properly filters empty lines after split
- BUG-9: resolveReviewAgent always returns tools property (undefined if not used)
- BUG-10: Diff retrieval failure now logs warning instead of silent swallow
- BUG-12: resolveOpencodeBinary now handles spawn errors properly

Additional pre-existing work included:
- safe-command.mjs wrapper for secure command execution
- Command documentation updates
- Test improvements

* fix: polish pr51 follow-up fixes

* fix: address Copilot PR#51 review comments

Four findings, all valid:

C1 (prompts.mjs, git.mjs, process.mjs) — buildReviewPrompt materialized
the full diff string before checking thresholds. For huge diffs the
git/gh subprocess could OOM the companion before the size check ran.
Fix: runCommand gains maxOutputBytes, killing the child and reporting
overflowed once the cap is exceeded. getDiff and getPrDiff thread
maxBytes through. buildReviewPrompt now bounds the read at maxBytes+1
and treats overflow as over-byte-limit without ever materializing
the rest.

C2 (git.mjs) — getDiffByteSize had a docstring claiming it avoided
streaming the full contents, but the implementation did exactly that.
It was also dead code (zero callers). Removed.

C3 (tests/state-lock.test.mjs) — the test injected path.resolve(...)
into a generated ESM import specifier. On Windows that path contains
backslashes and a drive letter, producing an invalid module specifier.
Fix: pathToFileURL(...).href for the injected specifier.

C4 (tests/dead-pid-reconcile.test.mjs) — beforeEach mutated the object
returned by loadState() without saving it back, leaving on-disk state
from earlier tests intact. Fix: saveState(workspace, { config:{}, jobs:[] }).

Adds coverage:
- tests/process.test.mjs: runCommand overflow path and non-overflow path.
- tests/review-prompt-size.test.mjs: bounds-huge-diff end-to-end test
  that writes a 50k-byte file and asserts fewer than 10k 'x' chars land
  in the prompt.

All 167 tests pass.

* fix: keep reading fallback state while migrate lock is held

Addresses a Codex P1 on PR#51: when another migrator holds
\`primaryDir.migrate.lock\`, migrateTmpdirStateIfNeeded waits up to 2s
and then returns without copying anything. Before this fix, stateRoot
still returned primaryDir — but primary/state.json didn't exist yet,
so loadState returned an empty state and the next upsertJob created
primary/state.json with only the new entry, orphaning every seeded
fallback job.

Fix: after a migration attempt, if primary/state.json is absent and
fallback/state.json is present, stateRoot returns the fallback dir.
Reads see real data, writes land in fallback, and a later migration
retry can pick them up cleanly.

Adds a regression test that pre-creates the migrate lock, seeds
fallback with a job, switches to CLAUDE_PLUGIN_DATA, and verifies
that stateRoot falls back, loadState sees the seeded job, and a
subsequent write preserves both the seeded and the in-flight rows.

The symlink-refusal test had to be updated because it was reusing
stateRoot to name the "primary" dir — with the new fallback guard,
that call now returns fallbackDir on failed migration. The test
now computes the expected primary path directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

runTrackedJob / captureTurn can hang in phase: finalizing indefinitely (no timeout)

1 participant