clauck work hangs indefinitely after inner claude -p emits terminal JSON #136

@CoreyRDean

Summary

clauck work <text> can hang forever after the work has actually completed. The spawned claude -p subprocess emits its terminal result JSON envelope but never exits, and the wrapper's subprocess.run(..., capture_output=True) (no timeout=) blocks on communicate() waiting for pipe EOF that never comes.

User-visible symptom: the spinner keeps incrementing (executing (sonnet, medium effort)... (1455s) and counting) long after the work it described is done and durably committed to disk. There is no log file for clauck work itself, so a casual observer can't tell whether work is in flight or whether the wrapper is zombie-waiting.

Repro (observed)

  1. clauck work "<long natural-language install prompt>" invoked at 11:35 local.
  2. Stage 1 interpreter classified, stage 2 dispatched claude -p ... --output-format json (lib/clauck:4337–4350).
  3. Inner claude -p ran the requested work cleanly: job file written at 11:36, clauck fire dream-pass ran 11:37–11:41 to exit_code=0, dream summary posted to Slack. Fire log ~/.clauck/dream-pass-<ts>-<pid>.log ends with the full success envelope: terminal_reason: completed, duration_ms: 274989, total_cost_usd: 0.77.
  4. 75 minutes later, the wrapper is still spinning. ps: inner claude -p (PID 27870) idle at 0% CPU, STAT S+, no clauck-mcp children, no live TCP connections (lsof). Process is alive but not doing anything.
  5. The spinner is the only signal the user has, and it is positively misleading — it implies ongoing execution.

Diagnosis

lib/clauck:4337:

result = subprocess.run(
    [claude_bin, "-p", enhanced_prompt,
     ...
     "--output-format", "json",
     "--setting-sources", ""],
    capture_output=True, text=True,
)
  • capture_output=True uses Popen.communicate() under the hood, which drains stdout and stderr until EOF and then waits for the child to exit, not just for terminal output.
  • No timeout= argument.
  • If claude -p emits its result JSON to stdout but doesn't close its file descriptors / exit (a known failure mode where a spawned MCP server or worker keeps fds alive after the agent loop ends), the wrapper waits indefinitely. The terminal envelope is sitting in the captured-stdout buffer, never delivered to the user, never parsed.
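
The mechanism above can be reproduced without claude at all. The following is a minimal sketch, assuming a hypothetical stand-in child (CHILD) that prints a terminal envelope, flushes, and then lingers; the function names are illustrative and not part of clauck. It shows that the envelope is readable immediately while the child is still alive, and that subprocess.run(..., capture_output=True) only returns once a timeout forces it.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the inner `claude -p`: emits a terminal
# envelope, flushes it, then stays alive without closing its pipes.
CHILD = r'''
import time
print('{"type": "result", "terminal_reason": "completed"}', flush=True)
time.sleep(30)
'''


def first_line_while_alive():
    """Read one line from the child and report whether it is still running."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE, text=True,
    )
    try:
        line = proc.stdout.readline()        # returns once the line is flushed
        still_running = proc.poll() is None  # child has not exited yet
        return json.loads(line), still_running
    finally:
        proc.kill()
        proc.wait()
        proc.stdout.close()


def run_times_out(timeout_s=1.0):
    """Return True if subprocess.run is still blocked when the timeout fires."""
    try:
        subprocess.run(
            [sys.executable, "-c", CHILD],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

With the stand-in child, first_line_while_alive() yields the parsed envelope while the process is still running, and run_times_out() returns True because run() was still waiting for EOF and exit when the timeout expired. Without timeout=, that wait has no upper bound.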

The same subprocess.run(..., capture_output=True) pattern without timeout exists in cmd_semantic's stage-1 interpreter call (lib/clauck:4195) and is the pattern used by _parse_interpreter_result. Stage 1 hangs would be much rarer (no MCP, 3-turn cap, $0.30 budget) but the structural risk is the same.

Why it's worse than a bare hang

There is no log file for clauck work. Fired/scheduled jobs write to ~/.clauck/<name>-<ts>-<pid>.log via run-job.sh, but the work-alias path streams to the terminal and captures via subprocess.run. So when the wrapper hangs:

  • The spinner says executing (sonnet, medium effort)... (Ns) — implies in-flight work.
  • No log to tail. The fire log of any job the work triggered exists (and shows clean completion), but that's two layers down and not obviously connected.
  • The user has no way to discover that work is done short of running ps / lsof on the spawned claude -p.

Proposed fix

Two changes, both small:

  1. Stream + watchdog instead of blocking communicate. Switch the stage-2 (and stage-1) call to subprocess.Popen reading stdout line-by-line. As soon as a JSON line with "type":"result" and "terminal_reason":"completed" (or any of the other terminal reasons) lands, render it, then terminate() the child if it doesn't exit on its own within e.g. 5s, escalating to kill() after another 5s. Same approach run-job.sh already takes for tombstoning. Don't trust the child to close pipes voluntarily.

  2. Write a clauck work log alongside fired-job logs. Today fired/scheduled jobs are observable via ~/.clauck/<name>-<ts>-<pid>.log; clauck work is not. Mirror the pattern: ~/.clauck/_work-<ts>-<pid>.log containing the routing decision, the enhanced prompt, the spawned argv, and the streamed JSON. Then clauck logs --last 1 _work works for diagnosis. (Alternatively, the work alias could route through run-job.sh for free, but that's a bigger refactor.)

The watchdog change is the load-bearing one — it prevents hangs entirely and surfaces results immediately. The log change is an observability backstop: even if a future failure mode escapes the watchdog, the user has somewhere to look.
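
For the log change, a minimal sketch of what writing the work log might look like. The timestamp format, the one-JSON-object-per-line layout, and the helper names open_work_log and log_event are all assumptions; the issue only specifies the ~/.clauck/_work-<ts>-<pid>.log naming and the four things to record.

```python
import json
import os
import time
from pathlib import Path


def open_work_log(log_dir=None):
    """Create and open a _work-<ts>-<pid>.log file, returning (path, handle)."""
    log_dir = Path(log_dir) if log_dir else Path.home() / ".clauck"
    log_dir.mkdir(parents=True, exist_ok=True)
    ts = time.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    path = log_dir / f"_work-{ts}-{os.getpid()}.log"
    return path, path.open("a", encoding="utf-8")


def log_event(fh, kind, payload):
    """Append one JSON record and flush so a tail sees it immediately."""
    fh.write(json.dumps({"kind": kind, "payload": payload}) + "\n")
    fh.flush()
```

A caller would record the routing decision, the enhanced prompt, and the spawned argv before starting the child, then call log_event for each streamed line. Flushing per record means the log stays useful even if the wrapper itself is later killed.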

Workaround until fixed

If you see clauck work spinning past ~5× the expected duration, in another terminal:

ps -ef | grep "claude -p" | grep -v grep        # find the inner claude PID
ls -lt ~/.clauck/*.log | head                   # check whether a fire it triggered finished
kill <inner-pid>                                # collapse the wrapper

The work the inner claude -p did is durable; killing the parent doesn't roll it back.

Environment

  • clauck v1.5.x (main, recent)
  • Claude CLI 2.1.129
  • macOS Darwin 25.4.0, Python 3.14.4 (homebrew), /usr/bin/python3 for scheduler
