Skip to content

Fix stuck running jobs by detecting dead review/task PIDs#176

Open
Abdooo2235 wants to merge 12 commits intoopenai:mainfrom
Abdooo2235:fix/issue-164-dead-pid-detection
Open

Fix stuck running jobs by detecting dead review/task PIDs#176
Abdooo2235 wants to merge 12 commits intoopenai:mainfrom
Abdooo2235:fix/issue-164-dead-pid-detection

Conversation

@Abdooo2235
Copy link
Copy Markdown

@Abdooo2235 Abdooo2235 commented Apr 7, 2026

Summary

Fixes stuck Codex jobs that stayed in queued/running after the tracked process died unexpectedly.

Closes #164.

Problem

When a review/task process exited unexpectedly, job state could remain running indefinitely.
That caused status --wait to keep polling until timeout with no actionable failure reason.

What Changed

  • Added PID liveness detection for tracked jobs.
  • Added reconciliation that converts stale active jobs with dead PIDs to failed.
  • Persisted reconciliation to both the state index and per-job file.
  • Stored a clear failure message and completion timestamp when dead PID is detected.
  • Surfaced errorMessage in status output for easier diagnosis.
  • Included wait-timeout context in single-job status rendering.
  • Added a regression test for dead-PID plus status --wait behavior.

Behavior After Fix

  • Dead worker/reviewer PID is detected promptly.
  • Job transitions to failed instead of remaining running.
  • status --wait returns quickly with failure state rather than timing out.
  • Diagnostic reason is visible in status output and persisted job artifacts.

Test Plan

  • Ran full suite: npm test
  • Result: 67 passed, 0 failed.
  • Includes regression coverage for dead-PID reconciliation in status --wait.

- detect dead tracked PIDs and reconcile queued/running jobs to failed\n- persist failure state to state index and job file with clear diagnostics\n- surface error details and wait timeout context in status output\n- add regression test covering dead-PID status --wait behavior
@Abdooo2235 Abdooo2235 requested a review from a team April 7, 2026 16:16
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39b560f530

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Abdooo2235
Copy link
Copy Markdown
Author

Addressed the Codex race-condition feedback in commit 5ff5fc5.

Changes made:

  • Updated markDeadPidJobFailed to re-read the latest persisted job before applying failedPatch.
  • Added guards so failure is only written when the latest persisted state is still active (queued/running) and PID still matches the originally observed PID.
  • This prevents downgrading legitimately completed/failed jobs during status polling races.
  • Added regression test: "status dead-pid reconciliation does not downgrade a concurrently completed job".

Validation: npm test passes (68/68).

@Abdooo2235
Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 7, 2026

Implemented follow-up fix for the race noted in Codex feedback (commit 5ff5fc5), including persisted-state + PID guard checks and a regression test.

Validation is green: npm test (68/68).

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ff5fc5d60

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Abdooo2235
Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 7, 2026

Dead-PID Handling Update

Implemented on branch fix/issue-164-dead-pid-detection.

Summary

This PR now handles stale running/queued jobs whose tracked process has died by reconciling them into a consistent terminal state.

What changed

  • Added dead-PID detection and reconciliation in plugins/codex/scripts/lib/job-control.mjs.
  • Added race guards so terminal states are not overwritten during polling races.
  • Synced state.json from the latest persisted snapshot on early-return paths to prevent index/file split-brain.
  • Improved status rendering so failure details are visible in job output.
  • Added regression coverage for dead-PID timeout behavior and concurrent-completion race cases.

Validation

npm test
  • 68 passed, 0 failed.

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Add isProcessAlive(pid) helper to process.mjs
- Add markDeadPidJobFailed() to job-control.mjs with race guards
- Call dead-PID check inside waitForSingleJobSnapshot poll loop
- Surface errorMessage in pushJobDetails for failed jobs
- Add regression tests: dead-PID fast-fail and concurrent-completion race

Fixes openai#164.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2dd44a044

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

jonny981 added a commit to jonny981/codex-plugin-cc that referenced this pull request Apr 8, 2026
A Codex task can become stuck in `status: running` indefinitely with no
way to recover short of manually wiping the job-store and broker session.
The user-visible symptom is that `/codex:status` keeps reporting a job as
running for many minutes after the underlying codex CLI worker has died,
and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket.

Three independent root causes combine to produce the wedge:

1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race
   against the client's `exitPromise`. When the underlying codex
   app-server (or the broker socket) dies mid-turn, no terminal event is
   ever delivered, so the await never resolves. Pending RPC requests are
   rejected by `handleExit` but the turn-completion promise is unrelated
   to those, so it sits forever.

2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no
   timeout fallback. When `captureTurn` hangs, the entire companion
   process hangs, and the per-job state file is never transitioned to a
   terminal status.

3. `app-server-broker.mjs` never listens for the underlying app-server
   client's exit. When the codex CLI process dies, the broker keeps
   running with a dead client, accepting new connections but unable to
   serve them. The next companion sees a stale broker.json that points
   at a half-dead broker.

4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do
   not probe whether tracked PIDs are still alive. Once a job is wedged,
   every status query reports it as running.

Fixes:

- captureTurn now races `state.completion` against `client.exitPromise`.
  If the client closes before a terminal event arrives, the turn is
  rejected with the recorded exit error and progress emits a `failed`
  notification.

- runTrackedJob races the runner against a hard timeout (default 30
  minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the
  `timeoutMs` option). On timeout the job transitions to `failed` with
  a clear error message instead of leaving the companion stuck.

- The broker now subscribes to its appClient's `exitPromise`. When the
  underlying codex CLI exits unexpectedly the broker logs the reason,
  fans out a `notifications/broker/shuttingDown` event to any connected
  socket, tears down the unix socket and pid file, then exits with
  status 1. The next companion will detect a dead endpoint via
  `ensureBrokerSession` and respawn cleanly.

- A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)`
  to probe pid liveness without affecting the process.

- A new `markDeadPidJobFailed` and `reconcileIfDead` helper in
  lib/job-control.mjs reconcile any active job whose tracked PID is
  gone. Reconciliation now runs on every status read path, not only the
  `--wait` polling loop, so a single `/codex:status` call surfaces dead
  workers immediately.

Closes openai#183 (runTrackedJob/captureTurn hang in finalizing).
Refs openai#176, openai#164 (dead PID detection on status reads).

Tests:

- tests/process.test.mjs: 3 new cases for `isProcessAlive`.
- tests/dead-pid-reconcile.test.mjs: 6 new cases covering
  `markDeadPidJobFailed` guards (active state, PID match) and the
  reconcile-on-status behaviour for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard
  timeout transitions a hung job to `failed` and that successful
  runners still complete normally.

All 11 new tests pass. The 5 failing tests under the existing suite
(`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and
`resolveStateDir`) are pre-existing on `main` and unrelated to this
patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect dead tracked processes throughout status wait polling, harden reconciliation race guards with PID normalization, and surface clear timeout/failure messaging. Add regression coverage for immediate timeout reconciliation and string/number PID matching.

Made-with: Cursor
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b155dab283

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Repair runtime.test parse error, ensure status wait consistently reconciles dead PIDs across polling phases, and restore persisted-state preference for stale running index entries without mutating read paths broadly. Also remove duplicate failed-job error rendering and add reconciliation error breadcrumbs.

Made-with: Cursor
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6132e48506

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c0d7cdac9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0b9379efa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 9, 2026

Update: metadata consistency follow-up addressed and pushed to this PR branch.

Summary of changes

  • Kept single-job status reconciliation writeback conditional using index diff checks in plugins/codex/scripts/lib/job-control.mjs so read-only status calls do not rewrite updatedAt or reorder recency.
  • Added fallback handling for malformed per-job JSON during status polling.
  • Added dead-PID metadata preservation assertions in tests/runtime.test.mjs so index fields remain intact, including jobClass and summary.

Validation

  • node --test tests/runtime.test.mjs
  • npm test
  • Result: all passing in this environment.

Branch status

  • Originally prepared as local commit e06618b.
  • Rebased and published to this PR branch as head commit 52e4f07.
  • PR head is now up to date with these changes.

@Abdooo2235
Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 9, 2026

Updated PR branch with the latest fix.

Commit

  • 10f8528 fix: improve dead-PID reconciliation by returning a built job snapshot

Validation

  • node --test tests/runtime.test.mjs
  • npm test
  • Result: all passing in this environment

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10f8528fb1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +292 to 295
function normalizeTrackedPid(pid) {
const numeric = Number(pid);
return Number.isFinite(numeric) && numeric > 0 ? numeric : null;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reintroduce session-scoping task helper functions

This hunk replaces the block that previously defined getCurrentClaudeSessionId, filterJobsForCurrentClaudeSession, and findLatestResumableTaskJob, but call sites for those names still exist in resolveLatestTrackedTaskThread and handleTaskResumeCandidate. In this commit, invoking task-resume-candidate (for example node plugins/codex/scripts/codex-companion.mjs task-resume-candidate --json) now fails with ReferenceError: getCurrentClaudeSessionId is not defined, and resume-last task flows crash for the same reason. Please restore these helpers (or update the callers to a replacement API) so resume commands remain usable.

Useful? React with 👍 / 👎.

@Abdooo2235
Copy link
Copy Markdown
Author

Updated PR branch with the latest fix.

Summary

  • Restored session-scoping task helper functions used by resume flows.
  • Kept task-resume-candidate and task --resume-last paths functional by restoring:
    • getCurrentClaudeSessionId
    • filterJobsForCurrentClaudeSession
    • findLatestResumableTaskJob

Commit

  • f9e80a5 fix: restore session-scoping task helpers

Validation

  • node plugins/codex/scripts/codex-companion.mjs task-resume-candidate --json
  • node --test tests/runtime.test.mjs
  • Result: all passing in this environment (62/62 tests).

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9e80a5c3e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +305 to +307
return jobs
.filter((job) => job.jobClass === "task" && job.status === "completed")
.find((job) => job.threadId != null);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include failed/cancelled jobs in resumable task lookup

findLatestResumableTaskJob now only considers task jobs with status === "completed", but both task-resume-candidate and task --resume-last depend on this helper. That means a most-recent task that ended as failed or cancelled (including the new dead-PID failure path) is treated as non-resumable even when it has a valid threadId, causing users to get “No previous Codex task thread…” or resume an older thread instead of the latest one.

Useful? React with 👍 / 👎.

@Abdooo2235
Copy link
Copy Markdown
Author

Updated PR branch with the latest fix for the new suggestion.

Summary

  • Updated resumable task lookup to include non-active task states, not only completed tasks.
  • This now treats failed and cancelled task jobs with a valid threadId as resumable candidates.
  • Prevents resume flows from skipping the latest failed/cancelled task thread and falling back to older entries.

Commit

  • 05a55a5 fix: include failed and cancelled tasks in resumable lookup

Validation

  • node --test tests/runtime.test.mjs
  • New regression case observed passing in suite output: task-resume-candidate includes failed/cancelled tasks in resumable lookup

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05a55a5d48

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +127 to +128
if (didFail) {
return storedJob;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve recency fields when returning dead-PID reconciliation

After marking a dead PID as failed, this branch returns storedJob from the per-job file, but that file does not contain updatedAt. In the same command invocation, callers immediately sort by updatedAt, so the just-failed job can be treated as oldest and /codex:status or default /codex:result may show an older finished job instead of the newly failed one. This is visible when there are multiple finished jobs and dead-PID reconciliation happens during the read path.

Useful? React with 👍 / 👎.

JohnnyVicious added a commit to JohnnyVicious/opencode-plugin-cc that referenced this pull request Apr 12, 2026
Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs /
markDeadPidJobFailed in job-control. buildStatusSnapshot and the
handleResult/handleCancel paths now probe kill(pid, 0) on any
active-state job and rewrite dead ones to failed before consuming the
list. A single /opencode:status / result / cancel surfaces stuck
workers without waiting for SessionEnd.

markDeadPidJobFailed is race-safe: it re-reads state and refuses to
downgrade terminal states or rewrite when the pid has changed.

Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184.
JohnnyVicious added a commit to JohnnyVicious/opencode-plugin-cc that referenced this pull request Apr 12, 2026
* fix: quote $ARGUMENTS in cancel/result/status commands

Unquoted $ARGUMENTS allows shell splitting on user-supplied job IDs
containing metacharacters. Wrap in double quotes to match review.md
and adversarial-review.md.

Closes #38. Port of openai/codex-plugin-cc#168.

* fix: declare model: sonnet in opencode-rescue agent frontmatter

Without a model declaration the agent tier was unpredictable. The
rescue subagent is a thin forwarder that invokes the companion via a
single Bash call and applies trivial routing logic — sonnet is
sufficient and gives users a cost guarantee.

Closes #39. Port of openai/codex-plugin-cc#169.

* fix: scope /opencode:cancel default to current Claude session

Without ref, resolveCancelableJob now filters running jobs by
sessionId so a cancel in session A cannot kill jobs in session B.
Explicit ref still searches all sessions — naming a job counts as
intent.

Closes #45. Port of openai/codex-plugin-cc#84.

* fix: enforce hard wall-clock timeout on runTrackedJob

Wrap the runner with Promise.race against a 30-minute default timeout.
On expiry the job transitions to failed/phase:failed so zombie
'running' rows can't accumulate when a runner hangs.

OPENCODE_COMPANION_JOB_TIMEOUT_MS overrides the default.

Closes #41. Port of openai/codex-plugin-cc#184.

* fix: reconcile dead-PID jobs on every status read

Adds isProcessAlive helper and reconcileIfDead / reconcileAllJobs /
markDeadPidJobFailed in job-control. buildStatusSnapshot and the
handleResult/handleCancel paths now probe kill(pid, 0) on any
active-state job and rewrite dead ones to failed before consuming the
list. A single /opencode:status / result / cancel surfaces stuck
workers without waiting for SessionEnd.

markDeadPidJobFailed is race-safe: it re-reads state and refuses to
downgrade terminal states or rewrite when the pid has changed.

Closes #42. Port of openai/codex-plugin-cc#176 + dead-PID parts of #184.

* fix: avoid embedding large diffs in review prompts

Classify review scope before building the prompt. When the diff exceeds
~5 files or ~256 KB, fall back to a lightweight context (status,
changed-files, diff_stat) and tell OpenCode to inspect the diff itself
via read-only git commands. Prevents HTTP 400 / shallow findings on
moderate-to-large changesets.

Adversarial template grows a {{REVIEW_COLLECTION_GUIDANCE}} slot.
Thresholds overridable via opts.maxInlineDiffFiles/Bytes.

Closes #40. Port of openai/codex-plugin-cc#179.

* fix: respect \$SHELL on Windows when spawning child processes

Add platformShellOption() helper that picks false on POSIX, and
\$SHELL || true on win32 so Git Bash users get their shell while cmd
fallback still resolves .cmd/.bat shims. Apply to runCommand,
spawnDetached, resolveOpencodeBinary, getOpencodeVersion, and the
ensureServer spawn of 'opencode serve'.

Uses 'where' instead of 'which' on win32, and parses the first line
of its CRLF-separated output.

Closes #46. Port of openai/codex-plugin-cc#178.

* fix: migrate tmpdir state to CLAUDE_PLUGIN_DATA + fix /tmp literal

The fallback path was hard-coded to '/tmp' — broken on Windows. Use
os.tmpdir() so Windows and other platforms get a real tmp path.

Additionally: when CLAUDE_PLUGIN_DATA is set on a later call but
state was previously written to the tmpdir fallback, copy it into
the plugin-data dir and rewrite absolute path references inside
state.json and jobs/*.json so logFile pointers don't dangle.

Prevents job history from being silently dropped when commands run
under different env contexts within one Claude session.

Closes #47. Port of openai/codex-plugin-cc#125.

* feat: pass last review findings to rescue automatically

After a successful /opencode:review or /opencode:adversarial-review,
save the rendered output to ~/.opencode-companion/last-review-<hash>.md
(per repo, SHA-256 of workspace path). Add a new 'last-review'
subcommand that reports availability or streams the content.

rescue.md now checks for a saved review when invoked without task
text and asks via AskUserQuestion whether to fix the prior findings
or describe a new task.

The save is best-effort — a failed persistence never fails the review
itself.

Closes #44. Port of openai/codex-plugin-cc#129 (simplified: logic
lives in the companion script rather than an inline node -e one-liner).

* feat: throttle controls for stop-time review gate

Add --review-gate-max and --review-gate-cooldown flags to
/opencode:setup so users can bound the review gate's spend:

  /opencode:setup --review-gate-max 5
  /opencode:setup --review-gate-cooldown 10
  /opencode:setup --review-gate-max off

The stop hook now loads state before touching stdin, checks
reviewGateMaxPerSession and reviewGateCooldownMinutes against the
current session's reviewGateUsage entry, and allows the stop without
running OpenCode when a limit would be exceeded.

Usage entries older than 7 days are pruned on each successful run so
state.json doesn't grow unbounded. renderSetup surfaces the configured
limits.

Closes #48. Port of openai/codex-plugin-cc#20.

* feat: --worktree flag for isolated write-capable rescue tasks

Add a disposable-git-worktree mode so /opencode:rescue --write --worktree
runs OpenCode inside .worktrees/opencode-<ts> on a fresh opencode/<ts>
branch instead of editing the working tree in place. Useful for
exploratory runs, parallel rescues, and running against a dirty tree.

Pieces:
- lib/git.mjs: createWorktree / removeWorktree / deleteWorktreeBranch /
  getWorktreeDiff / applyWorktreePatch. Adds .worktrees/ to
  .git/info/exclude on first use so the dir never shows in status.
- lib/worktree.mjs: session wrapper — createWorktreeSession,
  diffWorktreeSession, cleanupWorktreeSession (keep applies patch back,
  discard just removes).
- opencode-companion.mjs: handleTask threads --worktree + swaps cwd +
  stores session data on the job record + renders a keep/discard
  footer. New worktree-cleanup subcommand reads the stored session and
  runs the keep or discard path.
- agents/opencode-rescue.md, commands/rescue.md, skills/opencode-runtime:
  propagate --worktree through the forwarding layer.
- tests/worktree.test.mjs: create, diff, keep-applies, discard,
  no-change no-op.

Closes #43. Port of openai/codex-plugin-cc#137.

* fix: address pr51 review findings

* fix: keep tracked job timeout referenced

* fix: address pr51 review conversations

* fix: add exclusive file lock to updateState for concurrency safety

updateState's read-modify-write cycle was not protected against
concurrent companion processes (background worker + status/cancel
handler), which could silently lose each other's writes.

Acquire an exclusive lock file (state.json.lock via O_EXCL) before
reading, hold it through mutation and write, release in finally.
Stale locks older than 30s are evicted. Blocks up to 5s with retry.

Closes the pre-existing concurrency race amplified by PR #51's
dead-PID reconciliation (which adds upsertJob calls on every status
read).

* fix: address brownfield discovery bugs

Critical/high fixes:
- BUG-1: saveLastReview use copyFileSync+unlinkSync instead of renameSync
  (fixes Windows compatibility issue where rename fails if target exists)
- BUG-2: handleTask worktree leak - wrap in try/finally to guarantee cleanup
- BUG-3: State migration race - add fallback directory lock during migration
- BUG-4/13: handleTaskWorker missing signal handlers for graceful shutdown

Medium fixes:
- BUG-5: releaseStateLock now fsyncs directory after lock removal
- BUG-11: Error from getConfiguredProviders now logged instead of swallowed

Low fixes:
- BUG-6: PR number validation now rejects negative values
- BUG-7: getBundledConfigDir checks directory exists before returning
- BUG-8: tailLines now properly filters empty lines after split
- BUG-9: resolveReviewAgent always returns tools property (undefined if not used)
- BUG-10: Diff retrieval failure now logs warning instead of silent swallow
- BUG-12: resolveOpencodeBinary now handles spawn errors properly

Additional pre-existing work included:
- safe-command.mjs wrapper for secure command execution
- Command documentation updates
- Test improvements

* fix: polish pr51 follow-up fixes

* fix: address Copilot PR#51 review comments

Four findings, all valid:

C1 (prompts.mjs, git.mjs, process.mjs) — buildReviewPrompt materialized
the full diff string before checking thresholds. For huge diffs the
git/gh subprocess could OOM the companion before the size check ran.
Fix: runCommand gains maxOutputBytes, killing the child and reporting
overflowed once the cap is exceeded. getDiff and getPrDiff thread
maxBytes through. buildReviewPrompt now bounds the read at maxBytes+1
and treats overflow as over-byte-limit without ever materializing
the rest.

C2 (git.mjs) — getDiffByteSize had a docstring claiming it avoided
streaming the full contents, but the implementation did exactly that.
It was also dead code (zero callers). Removed.

C3 (tests/state-lock.test.mjs) — the test injected path.resolve(...)
into a generated ESM import specifier. On Windows that path contains
backslashes and a drive letter, producing an invalid module specifier.
Fix: pathToFileURL(...).href for the injected specifier.

C4 (tests/dead-pid-reconcile.test.mjs) — beforeEach mutated the object
returned by loadState() without saving it back, leaving on-disk state
from earlier tests intact. Fix: saveState(workspace, { config:{}, jobs:[] }).

Adds coverage:
- tests/process.test.mjs: runCommand overflow path and non-overflow path.
- tests/review-prompt-size.test.mjs: bounds-huge-diff end-to-end test
  that writes a 50k-byte file and asserts fewer than 10k 'x' chars land
  in the prompt.

All 167 tests pass.

* fix: keep reading fallback state while migrate lock is held

Addresses a Codex P1 on PR#51: when another migrator holds
\`primaryDir.migrate.lock\`, migrateTmpdirStateIfNeeded waits up to 2s
and then returns without copying anything. Before this fix, stateRoot
still returned primaryDir — but primary/state.json didn't exist yet,
so loadState returned an empty state and the next upsertJob created
primary/state.json with only the new entry, orphaning every seeded
fallback job.

Fix: after a migration attempt, if primary/state.json is absent and
fallback/state.json is present, stateRoot returns the fallback dir.
Reads see real data, writes land in fallback, and a later migration
retry can pick them up cleanly.

Adds a regression test that pre-creates the migrate lock, seeds
fallback with a job, switches to CLAUDE_PLUGIN_DATA, and verifies
that stateRoot falls back, loadState sees the seeded job, and a
subsequent write preserves both the seeded and the in-flight rows.

The symlink-refusal test had to be updated because it was reusing
stateRoot to name the "primary" dir — with the new fallback guard,
that call now returns fallbackDir on failed migration. The test
now computes the expected primary path directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Review jobs stuck in 'running' after process death — no dead-PID detection

1 participant