
Fix decode load tracking for streaming PD responses#176

Open
qcy615 wants to merge 2 commits into vllm-project:main from qcy615:feat/fix-vllm-pd-stream-load

qcy615 commented May 26, 2026


Purpose

Fix decode worker load accounting for vLLM PD streaming responses.

Before this change, process_vllm_two_stage_request decremented decode load immediately after receiving decode response headers. For streaming responses, the decode backend continues generating and forwarding tokens after headers are returned, so decode load was released too early. In cache-aware routing, this made decode workers appear idle even while they still had active running requests, preventing the load-balance policy from detecting imbalance and redistributing traffic.

This PR keeps decode load active until the decode response body has actually finished or been dropped, and it does so without buffering pass-through decode responses:

  • Wrap pass-through decode bodies in LoadTrackedDecodeStream for both streaming and non-streaming requests that do not need logprobs merging.
  • Decrement decode load once on upstream stream EOF.
  • Decrement decode load on body drop, covering client disconnects and early response disposal.
  • Keep full-body buffering only for non-streaming logprobs merging, where the JSON body must be modified before returning it.
  • Add focused tests for EOF and drop behavior.
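The EOF and drop behavior listed above can be sketched with the standard library alone. This is a simplified, synchronous analog and not the PR's `LoadTrackedDecodeStream`: the body is modelled as an `Iterator` instead of an async stream, and `WorkerLoad` and `LoadTrackedBody` are hypothetical names introduced only for illustration. The caller is assumed to have incremented the load before wrapping the body.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a decode worker's load counter.
#[derive(Default)]
pub struct WorkerLoad(AtomicUsize);

impl WorkerLoad {
    pub fn increment(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
    pub fn decrement(&self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
    pub fn get(&self) -> usize {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical wrapper: forwards chunks unchanged and releases one unit
/// of load exactly once, either at end-of-body or when dropped early.
pub struct LoadTrackedBody<I> {
    inner: I,
    load: Arc<WorkerLoad>,
    released: bool,
}

impl<I> LoadTrackedBody<I> {
    /// Assumes the caller already incremented `load` for this request.
    pub fn new(inner: I, load: Arc<WorkerLoad>) -> Self {
        Self { inner, load, released: false }
    }

    /// Idempotent release: the flag prevents a second decrement.
    fn release(&mut self) {
        if !self.released {
            self.released = true;
            self.load.decrement();
        }
    }
}

impl<I: Iterator> Iterator for LoadTrackedBody<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_none() {
            // Upstream EOF: the decode worker is done with this request.
            self.release();
        }
        item
    }
}

impl<I> Drop for LoadTrackedBody<I> {
    fn drop(&mut self) {
        // Covers client disconnects and early disposal of the response.
        self.release();
    }
}
```

The `released` flag is what makes EOF followed by drop safe: the drop that always follows a fully consumed body finds the load already released and does nothing.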

Test Plan

  1. Start the router in vLLM PD disaggregation mode with cache-aware routing:
vllm-router \
  --vllm-pd-disaggregation \
  --host 0.0.0.0 \
  --port 29100 \
  --policy cache_aware \
  --cache-threshold 0.5 \
  --balance-abs-threshold 32 \
  --balance-rel-threshold 1.1 \
  --intra-node-data-parallel-size 4 \
  --prefill http://0.0.0.0:18000 \
  --decode http://0.0.0.0:28000 \
  --log-level debug
  2. Send streaming requests through the router and compare cache-aware load-balance logs plus the running-requests dashboard before and after the change.

  3. Run focused unit and compile checks in Docker because the local host does not have Cargo installed:

cargo fmt --check
cargo test test_load_tracked_decode_stream --lib
cargo check --lib --bin vllm-router

Test Result

Automated validation passed in Docker with --network=host:

cargo fmt --check
passed

cargo test test_load_tracked_decode_stream --lib
2 passed; 0 failed

cargo check --lib --bin vllm-router
Finished `dev` profile
  • Before the fix, decode load stayed at max_load=0, min_load=0 in cache-aware logs even while the scheduler dashboard showed running decode requests concentrated on one worker. Because the recorded decode load stayed at zero, the load-balance policy saw is_unbalanced=false and did not trigger redistribution.
  • After the fix, decode load was recorded during streaming. Cache-aware logs showed decode imbalance such as max_load=39, min_load=2, is_unbalanced=true, and the running-requests dashboard showed traffic spread across decode workers. The load-balance policy could trigger normally.
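To make the logged values concrete, here is a minimal sketch of an imbalance check that is consistent with them. The formula is an assumption, not taken from this PR: it treats the workers as unbalanced only when the absolute gap exceeds `--balance-abs-threshold` and the maximum load exceeds the minimum load scaled by `--balance-rel-threshold`. The function name `is_unbalanced` is hypothetical.

```rust
/// Hypothetical imbalance check (assumed formula, not confirmed by the PR):
/// unbalanced when BOTH the absolute gap and the relative ratio between the
/// most and least loaded workers exceed their thresholds.
pub fn is_unbalanced(loads: &[usize], abs_threshold: usize, rel_threshold: f32) -> bool {
    let (Some(&max_load), Some(&min_load)) = (loads.iter().max(), loads.iter().min()) else {
        // No workers means there is nothing to rebalance.
        return false;
    };
    let abs_exceeded = max_load.saturating_sub(min_load) > abs_threshold;
    let rel_exceeded = (max_load as f32) > (min_load as f32) * rel_threshold;
    abs_exceeded && rel_exceeded
}
```

With the thresholds from the test plan (32 and 1.1), loads recorded as all zero can never satisfy the absolute condition, which matches the `is_unbalanced=false` seen before the fix. With `max_load=39` and `min_load=2`, the gap is 37 and 39 exceeds 2.2, so both conditions hold.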
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.

Signed-off-by: qcy615 <qin_changyan@163.com>
qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 7fda228 to cdd0a8a on May 26, 2026 10:38

qcy615 commented May 30, 2026


@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdd0a8a918

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/routers/http/vllm_pd_router.rs Outdated
Signed-off-by: qcy615 <qin_changyan@163.com>
qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 991a83c to c0deda6 on June 1, 2026 02:46

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0deda6769


.await
.map_err(|e| PDRouterError::NetworkError {
// Read the full decode response before releasing decode load.
let decode_body = match decode_response.bytes().await {


[P2] Add a drop guard while buffering logprobs responses

For non-streaming requests with logprobs or echo, the handler now keeps the decode worker load incremented while awaiting the full upstream body, but this branch has no drop guard like LoadTrackedDecodeStream. If the client disconnects or the axum service future is otherwise cancelled while decode_response.bytes().await is still pending, execution never reaches either explicit decrement_load() and the decode worker remains permanently over-counted. Wrap this phase in an RAII guard (or reuse the tracked stream pattern) so cancellation releases the load as well as EOF/errors.
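The RAII guard suggested here could look roughly like the following sketch. It uses only the standard library, and every name in it (`DecodeLoadGuard`, `buffer_decode_body`, `cancel_while_buffering`) is hypothetical. The load counter is a plain `AtomicUsize` standing in for the worker's real accounting, and `std::future::pending` stands in for a `decode_response.bytes().await` that never completes.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical RAII guard: releases one unit of decode load when dropped,
/// whether the owning future completes, fails, or is cancelled.
pub struct DecodeLoadGuard {
    load: Arc<AtomicUsize>,
}

impl DecodeLoadGuard {
    /// Assumes the caller already incremented `load` for this request.
    pub fn new(load: Arc<AtomicUsize>) -> Self {
        Self { load }
    }
}

impl Drop for DecodeLoadGuard {
    fn drop(&mut self) {
        self.load.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Stand-in for the buffering branch: the guard lives across the await,
/// so dropping this future mid-read still releases the load.
pub async fn buffer_decode_body<F>(load: Arc<AtomicUsize>, read_body: F) -> Vec<u8>
where
    F: Future<Output = Vec<u8>>,
{
    let _guard = DecodeLoadGuard::new(load);
    read_body.await
}

/// Waker that does nothing; enough to poll a future by hand.
pub struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the buffering future once, then drops it to simulate a client
/// disconnect. Returns (load while pending, load after cancellation).
pub fn cancel_while_buffering() -> (usize, usize) {
    let load = Arc::new(AtomicUsize::new(1)); // request already counted
    let mut fut = Box::pin(buffer_decode_body(
        Arc::clone(&load),
        std::future::pending::<Vec<u8>>(),
    ));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let still_pending = matches!(fut.as_mut().poll(&mut cx), Poll::Pending);
    assert!(still_pending);
    let while_pending = load.load(Ordering::SeqCst);
    drop(fut); // cancellation: the guard inside the future is dropped too
    (while_pending, load.load(Ordering::SeqCst))
}
```

Because the guard is an ordinary local inside the async function, the explicit `decrement_load()` calls on the success and error paths would become unnecessary in this shape. Keeping both would double-count the release.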

