Fix decode load tracking for streaming PD responses#176
Conversation
Signed-off-by: qcy615 <qin_changyan@163.com>
7fda228 to
cdd0a8a
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cdd0a8a918
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: qcy615 <qin_changyan@163.com>
991a83c to
c0deda6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0deda6769
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .await | ||
| .map_err(|e| PDRouterError::NetworkError { | ||
| // Read the full decode response before releasing decode load. | ||
| let decode_body = match decode_response.bytes().await { |
There was a problem hiding this comment.
Add a drop guard while buffering logprobs responses
For non-streaming requests with logprobs or echo, the handler now keeps the decode worker load incremented while awaiting the full upstream body, but this branch has no drop guard like LoadTrackedDecodeStream. If the client disconnects or the axum service future is otherwise cancelled while decode_response.bytes().await is still pending, execution never reaches either explicit decrement_load() and the decode worker remains permanently over-counted. Wrap this phase in an RAII guard (or reuse the tracked stream pattern) so cancellation releases the load as well as EOF/errors.
Useful? React with 👍 / 👎.
Purpose
Fix decode worker load accounting for vLLM PD streaming responses.
Before this change,
process_vllm_two_stage_requestdecremented decode load immediately after receiving decode response headers. For streaming responses, the decode backend continues generating and forwarding tokens after headers are returned, so decode load was released too early. In cache-aware routing, this made decode workers appear idle even while they still had active running requests, preventing the load-balance policy from detecting imbalance and redistributing traffic.This PR keeps decode load active until the decode response body is actually finished or dropped without buffering pass-through decode responses:
LoadTrackedDecodeStreamfor both streaming and non-streaming requests that do not need logprobs merging.Test Plan
Send streaming requests through the router and compare cache-aware load-balance logs plus the running-requests dashboard before and after the change.
Run focused unit and compile checks in Docker because the local host does not have Cargo installed:
cargo fmt --check cargo test test_load_tracked_decode_stream --lib cargo check --lib --bin vllm-routerTest Result
Automated validation passed in Docker with
--network=host:max_load=0, min_load=0in cache-aware logs even while the scheduler dashboard showed running decode requests concentrated on one worker. Because the recorded decode load stayed at zero, the load-balance policy sawis_unbalanced=falseand did not trigger redistribution.max_load=39, min_load=2, is_unbalanced=true, and the running-requests dashboard showed traffic spread across decode workers. The load-balance policy could trigger normally.Essential Elements of an Effective PR Description Checklist