Skip to content

Stop hook: skip read-only turns and run CI poll asynchronously#17

Open
gnguralnick wants to merge 1 commit into
mainfrom
mngr/stop-hook-noise-tuning
Open

Stop hook: skip read-only turns and run CI poll asynchronously#17
gnguralnick wants to merge 1 commit into
mainfrom
mngr/stop-hook-noise-tuning

Conversation

@gnguralnick
Copy link
Copy Markdown

Summary

Two stop-hook noise sources that make the lead-proxy / multi-worker pattern painful:

  1. Synchronous CI poll blocks the message slot for up to 10 minutes.
    The orchestrator currently waits on poll_pr_checks.sh (timeout 600s, interval 15s)
    in parallel with the gates. While it waits, the agent's tmux slot is occupied and
    cannot process queued user messages -- in the lead-proxy pattern, that turns 30s
    gate-approval iterations into 10-minute ones. The only workaround so far has been
    to manually kill <pid> the poll, which silently fails the hook.

    Fix: make the poll fire-and-forget. The orchestrator now spawns the poll
    detached via nohup + disown + stdio redirection, records the SHA + PID it was
    launched for, and surfaces the poll's outcome to the next turn that lands on
    the same commit. If HEAD has moved between turns, the stale poll is terminated
    and a fresh one is started. The pending reset previously written by ensure-pr
    is removed so that the prior-turn pr_status survives across ensure-pr calls.

  2. Stop hook fires even on Read-only turns.
    In a recent session, a stop-hook ran the full pipeline after a turn that contained
    ~21 tool calls -- all Read. Read-only turns produce no code-affecting work; the
    pipeline (fetch+merge, push, ensure-pr, review gates) is wasted churn.

    Fix: parse transcript_path from the hook's stdin JSON, enumerate tool_use
    names since the last human user turn via a small Python helper, and exit 0 if
    they're all in {Read, Glob, Grep, LS} (or empty). Gated on a new
    stop_hook.skip_readonly_turns config knob (default true). Falls through to
    the full pipeline if the transcript is missing or unparseable, so this is safe.

Test plan

Manually exercised each branch with a constructed transcript and a sandboxed copy
of the orchestrator:

  • Read-only transcript (Read/Glob/Grep) on a real repo: orchestrator exits 0
    immediately with Read-only turn (...); skipping all gates log entry.
  • Substantive transcript (Read/Edit/Bash): orchestrator skips the read-only
    branch and proceeds into the rest of the pipeline.
  • CI poll fire-and-forget: orchestrator returns in 0s, background poll process
    reparented to PID 1, state files written (pr_status_pid, pr_status_sha,
    pr_status=pending).
  • Prior-turn failure for current SHA: orchestrator exits 2, surfaces the failure
    with the original error text, no new poll launched.
  • Prior-turn success for current SHA: skipped, no new poll launched.
  • Already-running poll for current SHA: skipped, no duplicate poll launched.
  • HEAD changed: stale poll terminated, fresh poll launched for the new SHA.
  • On exit, the background poll's EXIT trap clears pr_status_pid (only if
    the file still matches its own PID).

Two related sources of noise in the stop-hook orchestrator make the
lead-proxy / multi-worker pattern painful:

1. Synchronous CI poll. The orchestrator waits up to ci.timeout (default
   600s) for CI checks to complete. While it waits, the agent's tmux
   slot is occupied and cannot process queued user messages.

   Fix: launch the CI poll detached (nohup + disown + stdio redirection)
   and return immediately. The orchestrator records the SHA + PID it
   launched the poll for, and surfaces the poll's outcome to the *next*
   turn that lands on the same commit. If HEAD has moved, a stale poll
   is terminated and a fresh one is launched. Removes the 'pending'
   reset from ensure-pr so prior-turn results survive the next ensure-pr.

2. Read-only turns trigger the full pipeline. A turn that only used
   Read/Glob/Grep/LS produces no code-affecting work, but the orchestrator
   still runs fetch+merge, push, ensure-pr, and review gates.

   Fix: parse transcript_path from the hook's stdin JSON, walk the
   transcript with a small Python helper to enumerate tool_use names
   used since the last human user turn, and exit 0 immediately if all
   of them are in {Read, Glob, Grep, LS} (or empty). Gated on the new
   stop_hook.skip_readonly_turns config knob (default true). Falls
   through to the normal pipeline if the transcript is missing or
   unparseable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant