Skip to content

Abort in-flight requests when the client disconnects#136

Open
merceod wants to merge 5 commits into
mainfrom
fix/cancelled-request-gpu-leak
Open

Abort in-flight requests when the client disconnects#136
merceod wants to merge 5 commits into
mainfrom
fix/cancelled-request-gpu-leak

Conversation

@merceod

@merceod merceod commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator

Propagate client cancellation from the API server to the conductor, which removes the request from its workers and frees KV-cache pages and the concurrency slot instead of running cancelled requests to completion.

What does this PR do?

Closes Issue #135.

When a client cancels a request, M* now aborts it instead of running it to EOS/max_tokens. The API server detects the disconnect - streaming via the response generator's teardown, non-streaming via Request.is_disconnected() - and signals the conductor, which removes the request from its workers and admits the next queued request. The worker-side teardown is unchanged; this adds the missing trigger.

How was it tested?

  • ruff check . and test/modular/test_openai_router.py pass.
  • Live Orpheus server with max_concurrent_requests=1: a cancelled request previously held the only slot until natural EOS (~7s); now the slot frees within ~2ms of disconnect and the next request starts immediately - verified for streaming, non-streaming, and timeout paths.
  • Normal outputs unchanged across Orpheus (audio), BAGEL (text + image), Pi0.5 (actions), and V-JEPA 2 (rollout).

Checklist

  • ruff check . passes
  • Added or updated tests / docs where relevant

Propagate client cancellation from the API server to the conductor, which
removes the request from its workers and frees KV-cache pages and the
concurrency slot instead of running cancelled requests to completion.

@NSagan271 NSagan271 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (just some nitpicks). I'd do some more testing with removes to make sure that our worker / speculation path is robust to in-flight removal (since REMOVE_REQUEST has only previously been sent after all workers are done).

finished = False
try:
while True:
if time.time() - start > self.timeout_seconds:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably for another PR (I can raise an issue), but I think it could be worthwhile to be able to raise/lower timeout_seconds per request? e.g., to fail fast if we know the request should be short, or to bump it up for a longer request. Thoughts?

if not active:
return
logger.info("Client cancelled request %s; releasing resources", request_id)
self.preprocess_worker.abort_request(request_id)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it cleaner to have preprocess_worker.abort_request also do the cleanup path, instead of the caller having to remember to call both?

return

request_data = self.requests.get(request_id)
if request_data is None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: have a logger.info / warning here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants