Abort in-flight requests when the client disconnects#136
Conversation
Propagate client cancellation from the API server to the conductor, which removes the request from its workers and frees KV-cache pages and the concurrency slot instead of running cancelled requests to completion.
NSagan271
left a comment
There was a problem hiding this comment.
Looks good to me (just some nitpicks). I'd do some more testing with removes to make sure that our worker / speculation path is robust to in-flight removal (since REMOVE_REQUEST has only previously been sent after all workers are done).
| finished = False | ||
| try: | ||
| while True: | ||
| if time.time() - start > self.timeout_seconds: |
There was a problem hiding this comment.
Probably for another PR (I can raise an issue), but I think it could be worthwhile to be able to raise/lower timeout_seconds per request? e.g., to fail fast if we know the request should be short, or to bump it up for a longer request. Thoughts?
| if not active: | ||
| return | ||
| logger.info("Client cancelled request %s; releasing resources", request_id) | ||
| self.preprocess_worker.abort_request(request_id) |
There was a problem hiding this comment.
nit: is it cleaner to have preprocess_worker.abort_request also do the cleanup path, instead of the caller having to remember to call both?
| return | ||
|
|
||
| request_data = self.requests.get(request_id) | ||
| if request_data is None: |
There was a problem hiding this comment.
nit: have a logger.info / warning here
* fix race condition * Guard schedule of removed rids
Propagate client cancellation from the API server to the conductor, which removes the request from its workers and frees KV-cache pages and the concurrency slot instead of running cancelled requests to completion.
What does this PR do?
Closes Issue #135.
When a client cancels a request, M* now aborts it instead of running it to EOS/
max_tokens. The API server detects the disconnect - streaming via the response generator's teardown, non-streaming viaRequest.is_disconnected()- and signals the conductor, which removes the request from its workers and admits the next queued request. The worker-side teardown is unchanged; this adds the missing trigger.How was it tested?
ruff check .andtest/modular/test_openai_router.pypass.max_concurrent_requests=1: a cancelled request previously held the only slot until natural EOS (~7s); now the slot frees within ~2ms of disconnect and the next request starts immediately - verified for streaming, non-streaming, and timeout paths.Checklist
ruff check .passes