Skip to content

fix(harnesses): reap the runner subprocess when spawn is cancelled mid-bind#1982

Open
tomsen-ai wants to merge 1 commit into
omnigent-ai:mainfrom
tomsen-ai:fix/mid-spawn-cancel-leak
Open

fix(harnesses): reap the runner subprocess when spawn is cancelled mid-bind#1982
tomsen-ai wants to merge 1 commit into
omnigent-ai:mainfrom
tomsen-ai:fix/mid-spawn-cancel-leak

Conversation

@tomsen-ai

@tomsen-ai tomsen-ai commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Related issue

N/A

Summary

  • Cancelling a turn task while _spawn_entry is still waiting in _wait_for_bind (session delete during cold start, sub-agent teardown, AP shutdown) leaked the just-spawned runner subprocess: it exists from create_subprocess_exec onward but is only registered in _entries after _spawn_entry returns, so release() no-ops on the conversation and the idle reaper — which only walks _entries — never sees it. The orphan (~100 MB per the regression test's own peak-RSS meter) lives until the AP daemon itself exits.
  • Fix: wrap everything after the spawn in try/except BaseException and reap on any unwind — kill, shield the corpse-wait against a second cancellation, close the subprocess transport, remove the socket file, re-raise. The bind-timeout path already kills before raising; this extends the same ownership discipline to cancellation and any other unwind.

ELI5: the manager phones up a chef (spawns a runner) and only writes him into the staff register once he walks in. If the customer cancels while the chef is still on his way, the manager used to just walk off — the chef still arrives, is on nobody's register, and stands in the kitchen forever. Now the manager waits at the door and politely sends him home.

create_subprocess_exec ──► _wait_for_bind ──► register in _entries
        │                       │ (1-30s of await points)
        │                       └── CancelledError lands here
        └── process is ALIVE but on no register:
            release() no-ops · reaper can't see it · leaks until AP exits

Why the guard is airtight: between _wait_for_bind returning and registration in get_client there is no await point, so cancellation can only land inside the guarded region. Already-dead arrivals (exited-during-spawn, bind timeout) skip the kill via the returncode check.

Test Plan

  • New regression test test_mid_spawn_cancellation_reaps_subprocess: hangs _wait_for_bind, cancels get_client mid-spawn, asserts the captured subprocess is reaped and the session is unregistered. Red on main (times out waiting for the process to die; pytest's peak-RSS meter shows the leaked runner at ~109 MB), green with the fix.
  • test_mid_spawn_double_cancellation_still_reaps: a second cancellation landing during cleanup must not abandon the reap (covers the asyncio.shield on the corpse-wait).
  • Full tests/runtime/harnesses/ suite: 126 passed.

Demo

Red → green on the regression test (the ~109 MB peak-RSS delta in the red run is the leaked runner itself):
omnigent-1982-demo

# main @ b9332cc6 — RED: cancelled mid-spawn, the subprocess survives
$ python -m pytest tests/runtime/harnesses/test_process_manager.py -k mid_spawn -q
E               TimeoutError          # process.wait() times out — leaked runner still alive
+  109.0 MB  test_mid_spawn_cancellation_reaps_subprocess   # peak-RSS: the orphan's footprint
2 failed, 30 deselected in 10.15s

# with this PR — GREEN: the corpse-wait completes instantly
$ python -m pytest tests/runtime/harnesses/test_process_manager.py -k mid_spawn -q
2 passed, 30 deselected in 0.04s
$ python -m pytest tests/runtime/harnesses/ -q
126 passed, 1 warning in 44.65s

Type of change

  • Bug fix
  • Feature
  • UI / frontend change
  • Refactor / chore
  • Docs
  • Test / CI
  • Breaking change

Test coverage

  • Unit tests added / updated
  • Integration tests added / updated
  • E2E tests added / updated
  • Manual verification completed
  • Existing tests cover this change
  • Not applicable

Coverage notes

The two new tests drive the cancellation path end-to-end against a real spawned runner; the existing process-manager suite covers the untouched success/timeout paths.

Changelog

Cancelling or deleting a session during agent cold start no longer leaks the harness subprocess.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the size/L Pull request size: L label Jul 5, 2026
@github-actions github-actions Bot requested a review from dhruv0811 July 5, 2026 10:01
…d-bind

A turn-task cancellation (session delete, sub-agent teardown, AP
shutdown) landing inside _wait_for_bind leaks the just-spawned runner:
the subprocess exists from create_subprocess_exec onward but is only
registered in _entries after _spawn_entry returns, so release() no-ops
on the conversation and the idle reaper — which only walks _entries —
never sees it. The orphaned runner (a full FastAPI + SDK import,
~100 MB by the regression test's own peak-RSS meter) lives until the
AP daemon itself exits.

Wrap everything after the spawn in try/except BaseException and reap
on any unwind: kill (the bind-timeout path at _wait_for_bind already
kills before raising — this extends the same ownership discipline to
cancellation), shield the corpse-wait against a second cancellation,
close the subprocess transport, remove the socket file, then re-raise
so cancellation semantics are unchanged. Bind-timeout and
exited-during-spawn arrivals are already dead and skip the kill.

The window is airtight by construction: between _wait_for_bind
returning and registration in get_client there is no await point, so
cancellation can only land inside the guarded region.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@tomsen-ai tomsen-ai force-pushed the fix/mid-spawn-cancel-leak branch from 5c855c8 to 5c525e1 Compare July 5, 2026 10:02
@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

@tomsen-ai This PR is a Bug fix, Feature, or UI / frontend change but the Demo section is missing or only contains a placeholder.

These change types require a screenshot or screen recording so reviewers can see the new behaviour without checking out the branch. Please update the Demo section with:

  • A screenshot or screen recording of the change, or
  • A link to a hosted video or GIF showing the new behaviour.

Use N/A only when the change has no user-visible effect whatsoever (e.g. a pure refactor or test-only change). If that's the case, uncheck the relevant type box and check Refactor / chore or Test / CI instead.

@github-actions github-actions Bot added the needs-demo PR needs a demo screenshot or recording label Jul 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-demo PR needs a demo screenshot or recording size/L Pull request size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants