Fix stuck running jobs by detecting dead review/task PIDs#176
Fix stuck running jobs by detecting dead review/task PIDs#176Abdooo2235 wants to merge 12 commits intoopenai:mainfrom
Conversation
- detect dead tracked PIDs and reconcile queued/running jobs to failed\n- persist failure state to state index and job file with clear diagnostics\n- surface error details and wait timeout context in status output\n- add regression test covering dead-PID status --wait behavior
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39b560f530
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Addressed the Codex race-condition feedback in commit 5ff5fc5. Changes made:
Validation: npm test passes (68/68). |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ff5fc5d60
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Dead-PID Handling UpdateImplemented on branch SummaryThis PR now handles stale What changed
Validationnpm test
@codex review |
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
- Add isProcessAlive(pid) helper to process.mjs - Add markDeadPidJobFailed() to job-control.mjs with race guards - Call dead-PID check inside waitForSingleJobSnapshot poll loop - Surface errorMessage in pushJobDetails for failed jobs - Add regression tests: dead-PID fast-fail and concurrent-completion race Fixes openai#164.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2dd44a044
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. Three independent root causes combine to produce the wedge: 1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is ever delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit` but the turn-completion promise is unrelated to those, so it sits forever. 2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no timeout fallback. When `captureTurn` hangs, the entire companion process hangs, and the per-job state file is never transitioned to a terminal status. 3. `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker. 4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do not probe whether tracked PIDs are still alive. Once a job is wedged, every status query reports it as running. Fixes: - captureTurn now races `state.completion` against `client.exitPromise`. If the client closes before a terminal event arrives, the turn is rejected with the recorded exit error and progress emits a `failed` notification. - runTrackedJob races the runner against a hard timeout (default 30 minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the `timeoutMs` option). On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck. - The broker now subscribes to its appClient's `exitPromise`. When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly. - A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)` to probe pid liveness without affecting the process. - A new `markDeadPidJobFailed` and `reconcileIfDead` helper in lib/job-control.mjs reconcile any active job whose tracked PID is gone. Reconciliation now runs on every status read path, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately. Closes openai#183 (runTrackedJob/captureTurn hang in finalizing). Refs openai#176, openai#164 (dead PID detection on status reads). Tests: - tests/process.test.mjs: 3 new cases for `isProcessAlive`. - tests/dead-pid-reconcile.test.mjs: 6 new cases covering `markDeadPidJobFailed` guards (active state, PID match) and the reconcile-on-status behaviour for both single-job and full snapshots. - tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard timeout transitions a hung job to `failed` and that successful runners still complete normally. All 11 new tests pass. The 5 failing tests under the existing suite (`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and `resolveStateDir`) are pre-existing on `main` and unrelated to this patch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect dead tracked processes throughout status wait polling, harden reconciliation race guards with PID normalization, and surface clear timeout/failure messaging. Add regression coverage for immediate timeout reconciliation and string/number PID matching. Made-with: Cursor
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b155dab283
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Repair runtime.test parse error, ensure status wait consistently reconciles dead PIDs across polling phases, and restore persisted-state preference for stale running index entries without mutating read paths broadly. Also remove duplicate failed-job error rendering and add reconciliation error breadcrumbs. Made-with: Cursor
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6132e48506
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c0d7cdac9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0b9379efa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Update: metadata consistency follow-up addressed and pushed to this PR branch. Summary of changes
Validation
Branch status
|
|
Updated PR branch with the latest fix. Commit
Validation
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10f8528fb1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| function normalizeTrackedPid(pid) { | ||
| const numeric = Number(pid); | ||
| return Number.isFinite(numeric) && numeric > 0 ? numeric : null; | ||
| } |
There was a problem hiding this comment.
Reintroduce session-scoping task helper functions
This hunk replaces the block that previously defined getCurrentClaudeSessionId, filterJobsForCurrentClaudeSession, and findLatestResumableTaskJob, but call sites for those names still exist in resolveLatestTrackedTaskThread and handleTaskResumeCandidate. In this commit, invoking task-resume-candidate (for example node plugins/codex/scripts/codex-companion.mjs task-resume-candidate --json) now fails with ReferenceError: getCurrentClaudeSessionId is not defined, and resume-last task flows crash for the same reason. Please restore these helpers (or update the callers to a replacement API) so resume commands remain usable.
Useful? React with 👍 / 👎.
|
Updated PR branch with the latest fix. Summary
Commit
Validation
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9e80a5c3e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return jobs | ||
| .filter((job) => job.jobClass === "task" && job.status === "completed") | ||
| .find((job) => job.threadId != null); |
There was a problem hiding this comment.
Include failed/cancelled jobs in resumable task lookup
findLatestResumableTaskJob now only considers task jobs with status === "completed", but both task-resume-candidate and task --resume-last depend on this helper. That means a most-recent task that ended as failed or cancelled (including the new dead-PID failure path) is treated as non-resumable even when it has a valid threadId, causing users to get “No previous Codex task thread…” or resume an older thread instead of the latest one.
Useful? React with 👍 / 👎.
|
Updated PR branch with the latest fix for the new suggestion. Summary
Commit
Validation
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05a55a5d48
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (didFail) { | ||
| return storedJob; |
There was a problem hiding this comment.
Preserve recency fields when returning dead-PID reconciliation
After marking a dead PID as failed, this branch returns storedJob from the per-job file, but that file does not contain updatedAt. In the same command invocation, callers immediately sort by updatedAt, so the just-failed job can be treated as oldest and /codex:status or default /codex:result may show an older finished job instead of the newly failed one. This is visible when there are multiple finished jobs and dead-PID reconciliation happens during the read path.
Useful? React with 👍 / 👎.
Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs / markDeadPidJobFailed in job-control. buildStatusSnapshot and the handleResult/handleCancel paths now probe kill(pid, 0) on any active-state job and rewrite dead ones to failed before consuming the list. A single /opencode:status / result / cancel surfaces stuck workers without waiting for SessionEnd. markDeadPidJobFailed is race-safe: it re-reads state and refuses to downgrade terminal states or rewrite when the pid has changed. Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184.
* fix: quote $ARGUMENTS in cancel/result/status commands Unquoted $ARGUMENTS allows shell splitting on user-supplied job IDs containing metacharacters. Wrap in double quotes to match review.md and adversarial-review.md. Closes #38. Port of openai/codex-plugin-cc#168. * fix: declare model: sonnet in opencode-rescue agent frontmatter Without a model declaration the agent tier was unpredictable. The rescue subagent is a thin forwarder that invokes the companion via a single Bash call and applies trivial routing logic — sonnet is sufficient and gives users a cost guarantee. Closes #39. Port of openai/codex-plugin-cc#169. * fix: scope /opencode:cancel default to current Claude session Without ref, resolveCancelableJob now filters running jobs by sessionId so a cancel in session A cannot kill jobs in session B. Explicit ref still searches all sessions — naming a job counts as intent. Closes #45. Port of openai/codex-plugin-cc#84. * fix: enforce hard wall-clock timeout on runTrackedJob Wrap the runner with Promise.race against a 30-minute default timeout. On expiry the job transitions to failed/phase:failed so zombie 'running' rows can't accumulate when a runner hangs. OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default. Closes #41. Port of openai/codex-plugin-cc#184. * fix: reconcile dead-PID jobs on every status read Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs / markDeadPidJobFailed in job-control. buildStatusSnapshot and the handleResult/handleCancel paths now probe kill(pid, 0) on any active-state job and rewrite dead ones to failed before consuming the list. A single /opencode:status / result / cancel surfaces stuck workers without waiting for SessionEnd. markDeadPidJobFailed is race-safe: it re-reads state and refuses to downgrade terminal states or rewrite when the pid has changed. Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184. * fix: avoid embedding large diffs in review prompts Classify review scope before building the prompt. When the diff exceeds ~5 files or ~256 KB, fall back to a lightweight context (status, changed-files, diff_stat) and tell OpenCode to inspect the diff itself via read-only git commands. Prevents HTTP 400 / shallow findings on moderate-to-large changesets. Adversarial template grows a {{REVIEW_COLLECTION_GUIDANCE}} slot. Thresholds overridable via opts.maxInlineDiffFiles/Bytes. Closes #40. Port of openai/codex-plugin-cc#179. * fix: respect \$SHELL on Windows when spawning child processes Add platformShellOption() helper that picks false on POSIX, and \$SHELL || true on win32 so Git Bash users get their shell while cmd fallback still resolves .cmd/.bat shims. Apply to runCommand, spawnDetached, resolveOpencodeBinary, getOpencodeVersion, and the ensureServer spawn of 'opencode serve'. Uses 'where' instead of 'which' on win32, and parses the first line of its CRLF-separated output. Closes #46. Port of openai/codex-plugin-cc#178. * fix: migrate tmpdir state to CLAUDE_PLUGIN_DATA + fix /tmp literal The fallback path was hard-coded to '/tmp' — broken on Windows. Use os.tmpdir() so Windows and other platforms get a real tmp path. Additionally: when CLAUDE_PLUGIN_DATA is set on a later call but state was previously written to the tmpdir fallback, copy it into the plugin-data dir and rewrite absolute path references inside state.json and jobs/*.json so logFile pointers don't dangle. Prevents job history from being silently dropped when commands run under different env contexts within one Claude session. Closes #47. Port of openai/codex-plugin-cc#125. * feat: pass last review findings to rescue automatically After a successful /opencode:review or /opencode:adversarial-review, save the rendered output to ~/.opencode-companion/last-review-<hash>.md (per repo, SHA-256 of workspace path). Add a new 'last-review' subcommand that reports availability or streams the content. rescue.md now checks for a saved review when invoked without task text and asks via AskUserQuestion whether to fix the prior findings or describe a new task. The save is best-effort — a failed persistence never fails the review itself. Closes #44. Port of openai/codex-plugin-cc#129 (simplified: logic lives in the companion script rather than an inline node -e one-liner). * feat: throttle controls for stop-time review gate Add --review-gate-max and --review-gate-cooldown flags to /opencode:setup so users can bound the review gate's spend: /opencode:setup --review-gate-max 5 /opencode:setup --review-gate-cooldown 10 /opencode:setup --review-gate-max off The stop hook now loads state before touching stdin, checks reviewGateMaxPerSession and reviewGateCooldownMinutes against the current session's reviewGateUsage entry, and allows the stop without running OpenCode when a limit would be exceeded. Usage entries older than 7 days are pruned on each successful run so state.json doesn't grow unbounded. renderSetup surfaces the configured limits. Closes #48. Port of openai/codex-plugin-cc#20. * feat: --worktree flag for isolated write-capable rescue tasks Add a disposable-git-worktree mode so /opencode:rescue --write --worktree runs OpenCode inside .worktrees/opencode-<ts> on a fresh opencode/<ts> branch instead of editing the working tree in place. Useful for exploratory runs, parallel rescues, and running against a dirty tree. Pieces: - lib/git.mjs: createWorktree / removeWorktree / deleteWorktreeBranch / getWorktreeDiff / applyWorktreePatch. Adds .worktrees/ to .git/info/exclude on first use so the dir never shows in status. - lib/worktree.mjs: session wrapper — createWorktreeSession, diffWorktreeSession, cleanupWorktreeSession (keep applies patch back, discard just removes). - opencode-companion.mjs: handleTask threads --worktree + swaps cwd + stores session data on the job record + renders a keep/discard footer. New worktree-cleanup subcommand reads the stored session and runs the keep or discard path. - agents/opencode-rescue.md, commands/rescue.md, skills/opencode-runtime: propagate --worktree through the forwarding layer. - tests/worktree.test.mjs: create, diff, keep-applies, discard, no-change no-op. Closes #43. Port of openai/codex-plugin-cc#137. * fix: address pr51 review findings * fix: keep tracked job timeout referenced * fix: address pr51 review conversations * fix: add exclusive file lock to updateState for concurrency safety updateState's read-modify-write cycle was not protected against concurrent companion processes (background worker + status/cancel handler), which could silently lose each other's writes. Acquire an exclusive lock file (state.json.lock via O_EXCL) before reading, hold it through mutation and write, release in finally. Stale locks older than 30s are evicted. Blocks up to 5s with retry. Closes the pre-existing concurrency race amplified by PR #51's dead-PID reconciliation (which adds upsertJob calls on every status read). * fix: address brownfield discovery bugs Critical/high fixes: - BUG-1: saveLastReview use copyFileSync+unlinkSync instead of renameSync (fixes Windows compatibility issue where rename fails if target exists) - BUG-2: handleTask worktree leak - wrap in try/finally to guarantee cleanup - BUG-3: State migration race - add fallback directory lock during migration - BUG-4/13: handleTaskWorker missing signal handlers for graceful shutdown Medium fixes: - BUG-5: releaseStateLock now fsyncs directory after lock removal - BUG-11: Error from getConfiguredProviders now logged instead of swallowed Low fixes: - BUG-6: PR number validation now rejects negative values - BUG-7: getBundledConfigDir checks directory exists before returning - BUG-8: tailLines now properly filters empty lines after split - BUG-9: resolveReviewAgent always returns tools property (undefined if not used) - BUG-10: Diff retrieval failure now logs warning instead of silent swallow - BUG-12: resolveOpencodeBinary now handles spawn errors properly Additional pre-existing work included: - safe-command.mjs wrapper for secure command execution - Command documentation updates - Test improvements * fix: polish pr51 follow-up fixes * fix: address Copilot PR#51 review comments Four findings, all valid: C1 (prompts.mjs, git.mjs, process.mjs) — buildReviewPrompt materialized the full diff string before checking thresholds. For huge diffs the git/gh subprocess could OOM the companion before the size check ran. Fix: runCommand gains maxOutputBytes, killing the child and reporting overflowed once the cap is exceeded. getDiff and getPrDiff thread maxBytes through. buildReviewPrompt now bounds the read at maxBytes+1 and treats overflow as over-byte-limit without ever materializing the rest. C2 (git.mjs) — getDiffByteSize had a docstring claiming it avoided streaming the full contents, but the implementation did exactly that. It was also dead code (zero callers). Removed. C3 (tests/state-lock.test.mjs) — the test injected path.resolve(...) into a generated ESM import specifier. On Windows that path contains backslashes and a drive letter, producing an invalid module specifier. Fix: pathToFileURL(...).href for the injected specifier. C4 (tests/dead-pid-reconcile.test.mjs) — beforeEach mutated the object returned by loadState() without saving it back, leaving on-disk state from earlier tests intact. Fix: saveState(workspace, { config:{}, jobs:[] }). Adds coverage: - tests/process.test.mjs: runCommand overflow path and non-overflow path. - tests/review-prompt-size.test.mjs: bounds-huge-diff end-to-end test that writes a 50k-byte file and asserts fewer than 10k 'x' chars land in the prompt. All 167 tests pass. * fix: keep reading fallback state while migrate lock is held Addresses a Codex P1 on PR#51: when another migrator holds \`primaryDir.migrate.lock\`, migrateTmpdirStateIfNeeded waits up to 2s and then returns without copying anything. Before this fix, stateRoot still returned primaryDir — but primary/state.json didn't exist yet, so loadState returned an empty state and the next upsertJob created primary/state.json with only the new entry, orphaning every seeded fallback job. Fix: after a migration attempt, if primary/state.json is absent and fallback/state.json is present, stateRoot returns the fallback dir. Reads see real data, writes land in fallback, and a later migration retry can pick them up cleanly. Adds a regression test that pre-creates the migrate lock, seeds fallback with a job, switches to CLAUDE_PLUGIN_DATA, and verifies that stateRoot falls back, loadState sees the seeded job, and a subsequent write preserves both the seeded and the in-flight rows. The symlink-refusal test had to be updated because it was reusing stateRoot to name the "primary" dir — with the new fallback guard, that call now returns fallbackDir on failed migration. The test now computes the expected primary path directly.
Summary
Fixes stuck Codex jobs that stayed in
queued/runningafter the tracked process died unexpectedly.Closes #164.
Problem
When a review/task process exited unexpectedly, job state could remain
runningindefinitely.That caused
status --waitto keep polling until timeout with no actionable failure reason.What Changed
failed.errorMessagein status output for easier diagnosis.status --waitbehavior.Behavior After Fix
failedinstead of remainingrunning.status --waitreturns quickly with failure state rather than timing out.Test Plan
npm teststatus --wait.