runTrackedJob / captureTurn can hang in phase: finalizing indefinitely (no timeout) #183

## Summary

A job can be left forever in `status: "running"` + `phase: "finalizing"` after either (a) a successful spark-model turn or (b) a `cancel` that didn't deliver an `interrupted` notification. The job file never transitions to a terminal state, the lockfile persists, and subsequent `--resume-last` against the same workspace is blocked by the zombie.

**Plugin version**: 1.0.3
**Files**: `scripts/lib/tracked-jobs.mjs`, `scripts/lib/codex.mjs`

## Repro patterns

- **Cancel path**: start a long job, `cancel` it. Sometimes the AppServer never emits a `turn/completed (status=interrupted)` or the worker exits before reading it. Job file remains `running/finalizing` forever.
- **Spark path**: run a short spark task. Some fraction of turns appear to never emit `final_answer`. `state.finalAnswerSeen` stays false → `scheduleInferredCompletion` doesn't fire → `captureTurn`'s promise stays pending.

## Root cause hypothesis

`scripts/lib/tracked-jobs.mjs:142` `runTrackedJob` does:

```js
const execution = await runner();   // <-- no timeout
```

`runner()` resolves only when `captureTurn` (in `scripts/lib/codex.mjs`) sees a terminal AppServer event (`turn/completed`, or an inferred completion via `final_answer`). If neither arrives, the await never resolves and no `catch` runs, so the job file is never written to a terminal status. There is no idle/heartbeat fallback either.
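
To illustrate why nothing terminal is ever written, here is a minimal standalone sketch of the same shape. It is not the plugin's code: `runJobSketch`, `neverSettles`, and the `observed` log are hypothetical stand-ins for `runTrackedJob`, `runner()`, and the job file.

```js
// Hypothetical stand-in for runner(): a promise that never settles,
// as when neither turn/completed nor final_answer arrives.
const neverSettles = () => new Promise(() => {});

// Stand-in for the job file: records each state the job reaches.
const observed = [];

async function runJobSketch(runner) {
  observed.push("finalizing");
  try {
    await runner();            // parks here forever
    observed.push("completed");
  } catch (err) {
    observed.push("failed");   // never reached: nothing rejected
  } finally {
    observed.push("cleanup");  // never reached either
  }
}

runJobSketch(neverSettles);
```

Only `"finalizing"` is ever recorded. Neither `catch` nor `finally` runs, because a pending promise is not an error; something has to reject it.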

## Suggested fix (any of, ideally all)

1. **Hard timeout in `runTrackedJob`**: add `taskTimeoutMs` option (default ~15 min, configurable). On expiry, write `failed` + `phase: "timeout"` and reject the runner promise.
2. **Idle watchdog in `captureTurn`**: track the last received progress event timestamp; if `now - lastProgress > N` (e.g. 90 s) reject with `no notifications`.
3. **Forced reject after interrupt**: `interruptAppServerTurn` should arm a short timer; if the worker hasn't observed an `interrupted` turn within T seconds, reject `captureTurn` with `cancel-not-acknowledged`.
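
A rough sketch of what (1) could look like, using `Promise.race` against a timer. Everything here except `taskTimeoutMs`, `failed`, and `phase: "timeout"` is an assumption for illustration: `withTimeout`, `runTrackedJobSketch`, the `writeJob` callback, and the `completed`/`done`/`error` values are hypothetical names, not the plugin's actual API.

```js
// Rejects if `promise` has not settled within `ms`. The timer is cleared
// on either outcome so it cannot keep the process alive.
function withTimeout(promise, ms, code = "timeout") {
  let timer;
  const expiry = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${code} after ${ms} ms`);
      err.code = code;
      reject(err);
    }, ms);
  });
  return Promise.race([promise, expiry]).finally(() => clearTimeout(timer));
}

// Hypothetical shape of runTrackedJob; `writeJob` stands in for whatever
// persists the job file.
async function runTrackedJobSketch(runner, { taskTimeoutMs = 15 * 60 * 1000, writeJob }) {
  writeJob({ status: "running", phase: "finalizing" });
  try {
    const execution = await withTimeout(runner(), taskTimeoutMs);
    writeJob({ status: "completed", phase: "done" }); // assumed terminal values
    return execution;
  } catch (err) {
    // A terminal state is now written on every failure path, timeout included.
    const phase = err.code === "timeout" ? "timeout" : "error"; // "error" is assumed
    writeJob({ status: "failed", phase });
    throw err;
  }
}
```

Note that the race only stops waiting; it does not stop the underlying turn. The same wrapper with a short `ms` and a `cancel-not-acknowledged` code could plausibly serve for (3).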

(1) alone is enough to stop the zombies; (2) catches subtler hang modes; (3) makes cancel reliable.
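
For (2), a possible shape is a timer that is re-armed on every notification, which is equivalent to the `now - lastProgress > N` check without polling. This is a sketch under assumptions: `createIdleWatchdog`, `captureTurnSketch`, the `subscribe(handler)` callback returning an unsubscribe function, and the `event.method` field are all hypothetical; I haven't checked how `captureTurn` actually receives events.

```js
// Calls onIdle if touch() is not called for idleMs milliseconds.
function createIdleWatchdog({ idleMs, onIdle }) {
  let timer = null;
  const arm = () => {
    clearTimeout(timer);
    timer = setTimeout(() => onIdle(new Error("no notifications")), idleMs);
  };
  arm();
  return { touch: arm, stop: () => clearTimeout(timer) };
}

// Hypothetical captureTurn: resolves on a terminal event, rejects when the
// notification stream goes quiet for too long.
function captureTurnSketch(subscribe, { idleMs = 90 * 1000 } = {}) {
  return new Promise((resolve, reject) => {
    let unsubscribe = () => {};
    const watchdog = createIdleWatchdog({
      idleMs,
      onIdle: (err) => {
        unsubscribe();
        reject(err);
      },
    });
    unsubscribe = subscribe((event) => {
      watchdog.touch(); // any notification counts as progress
      if (event.method === "turn/completed") {
        watchdog.stop();
        unsubscribe();
        resolve(event);
      }
    });
  });
}
```

The watchdog complements the hard timeout rather than replacing it: a turn that keeps emitting progress but never completes would pass the idle check indefinitely.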

## Workaround

None clean. My MCP wrapper can't reach into the companion's job store without coupling to its on-disk format (unstable across upgrades). Operationally I just `--fresh` past the zombie and document manual job-store wipes.
