Skip to content

runTrackedJob / captureTurn can hang in phase: finalizing indefinitely (no timeout) #183

@lumatic2

Description

@lumatic2

Summary

A job can be left forever in `status: "running"` + `phase: "finalizing"` after either (a) a successful spark-model turn or (b) a `cancel` that didn't deliver an `interrupted` notification. The job file never transitions to a terminal state, the lockfile persists, and subsequent `--resume-last` against the same workspace is blocked by the zombie.

Plugin version: 1.0.3
Files: `scripts/lib/tracked-jobs.mjs`, `scripts/lib/codex.mjs`

Repro patterns

  • Cancel path: start a long job, `cancel` it. Sometimes the AppServer never emits a `turn/completed (status=interrupted)` or the worker exits before reading it. Job file remains `running/finalizing` forever.
  • Spark path: run a short spark task. Some fraction of turns appear to never emit `final_answer`. `state.finalAnswerSeen` stays false → `scheduleInferredCompletion` doesn't fire → `captureTurn`'s promise stays pending.

Root cause hypothesis

`scripts/lib/tracked-jobs.mjs:142` `runTrackedJob` does:

```js
const execution = await runner(); // <-- no timeout
```

`runner()` resolves only when `captureTurn` (in `scripts/lib/codex.mjs`) sees a terminal AppServer event (`turn/completed`, or an inferred completion via `final_answer`). If neither arrives, the await never resolves and no `catch` runs, so the job file is never written to a terminal status. There is no idle/heartbeat fallback either.

Suggested fix (any of, ideally all)

  1. Hard timeout in `runTrackedJob`: add `taskTimeoutMs` option (default ~15 min, configurable). On expiry, write `failed` + `phase: "timeout"` and reject the runner promise.
  2. Idle watchdog in `captureTurn`: track the last received progress event timestamp; if `now - lastProgress > N` (e.g. 90 s) reject with `no notifications`.
  3. Forced reject after interrupt: `interruptAppServerTurn` should arm a short timer; if the worker hasn't observed an `interrupted` turn within T seconds, reject `captureTurn` with `cancel-not-acknowledged`.

(1) alone is enough to stop the zombies; (2) catches subtler hang modes; (3) makes cancel reliable.

Workaround

None clean. My MCP wrapper can't reach into the companion's job store without coupling to its on-disk format (unstable across upgrades). Operationally I just `--fresh` past the zombie and document manual job-store wipes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions