Fix decode load tracking for streaming PD responses by qcy615 · Pull Request #176 · vllm-project/router

qcy615 · 2026-05-26T10:27:53Z

Purpose

Fix decode worker load accounting for vLLM PD streaming responses.

Before this change, process_vllm_two_stage_request decremented decode load immediately after receiving decode response headers. For streaming responses, the decode backend continues generating and forwarding tokens after headers are returned, so decode load was released too early. In cache-aware routing, this made decode workers appear idle even while they still had active running requests, preventing the load-balance policy from detecting imbalance and redistributing traffic.

This PR keeps decode load active until the decode response body has actually finished or been dropped, and does so without buffering pass-through decode responses:

Wrap pass-through decode bodies in LoadTrackedDecodeStream for both streaming and non-streaming requests that do not need logprobs merging.
Decrement decode load once on upstream stream EOF.
Decrement decode load on body drop, covering client disconnects and early response disposal.
Keep full-body buffering only for non-streaming logprobs merging, where the JSON body must be modified before returning it.
Add focused tests for EOF and drop behavior.
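
The release-once behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual LoadTrackedDecodeStream: it uses a synchronous Iterator as a stand-in for the async response body so that it needs only the standard library, and WorkerLoad, LoadTracked, and the choice to increment inside new() are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a decode worker's in-flight request counter.
#[derive(Default)]
struct WorkerLoad(AtomicUsize);

impl WorkerLoad {
    fn increment(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
    fn decrement(&self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
    fn get(&self) -> usize {
        self.0.load(Ordering::SeqCst)
    }
}

/// Wraps a source of body chunks and releases one unit of load exactly once,
/// either when the source reports end-of-stream or when the wrapper is dropped.
struct LoadTracked<I> {
    inner: I,
    load: Arc<WorkerLoad>,
    released: bool,
}

impl<I> LoadTracked<I> {
    /// Assumption: the wrapper takes the load unit itself. In a real router the
    /// increment may already have happened when the worker was selected.
    fn new(inner: I, load: Arc<WorkerLoad>) -> Self {
        load.increment();
        Self { inner, load, released: false }
    }

    /// Idempotent release: the flag prevents a double decrement when EOF is
    /// followed by drop.
    fn release(&mut self) {
        if !self.released {
            self.released = true;
            self.load.decrement();
        }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Iterator for LoadTracked<I> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        let chunk = self.inner.next();
        if chunk.is_none() {
            // Upstream EOF: the decode worker is done with this request.
            self.release();
        }
        chunk
    }
}

impl<I> Drop for LoadTracked<I> {
    fn drop(&mut self) {
        // Covers client disconnects and early disposal of the response body.
        self.release();
    }
}
```

The released flag is what keeps the two release paths from double-counting: a body that reaches EOF and is then dropped still decrements only once.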

Test Plan

Start the router in vLLM PD disaggregation mode with cache-aware routing:

vllm-router \
  --vllm-pd-disaggregation \
  --host 0.0.0.0 \
  --port 29100 \
  --policy cache_aware \
  --cache-threshold 0.5 \
  --balance-abs-threshold 32 \
  --balance-rel-threshold 1.1 \
  --intra-node-data-parallel-size 4 \
  --prefill http://0.0.0.0:18000 \
  --decode http://0.0.0.0:28000 \
  --log-level debug

Send streaming requests through the router and compare cache-aware load-balance logs plus the running-requests dashboard before and after the change.
Run focused unit and compile checks in Docker because the local host does not have Cargo installed:

cargo fmt --check
cargo test test_load_tracked_decode_stream --lib
cargo check --lib --bin vllm-router

Test Result

Automated validation passed in Docker with --network=host:

cargo fmt --check
passed

cargo test test_load_tracked_decode_stream --lib
2 passed; 0 failed

cargo check --lib --bin vllm-router
Finished `dev` profile

Before the fix, decode load stayed at max_load=0, min_load=0 in cache-aware logs even while the scheduler dashboard showed running decode requests concentrated on one worker. Because the recorded decode load stayed at zero, the load-balance policy saw is_unbalanced=false and did not trigger redistribution.

After the fix, decode load was recorded during streaming. Cache-aware logs showed decode imbalance such as max_load=39, min_load=2, is_unbalanced=true, and the running-requests dashboard showed traffic spread across decode workers. The load-balance policy could trigger normally.
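
To make the before/after log values concrete, here is a hedged sketch of how an imbalance check of this kind is commonly written. It assumes the policy requires both the absolute gap and the relative ratio to be exceeded; the router's actual condition is not shown in this PR, and BalanceThresholds and is_unbalanced are illustrative names only.

```rust
/// Hypothetical thresholds mirroring --balance-abs-threshold and
/// --balance-rel-threshold from the test plan.
struct BalanceThresholds {
    abs: usize,
    rel: f32,
}

/// Assumed rule: the cluster is unbalanced only if the gap between the most
/// and least loaded worker exceeds `abs` AND the max exceeds `rel` times the min.
fn is_unbalanced(loads: &[usize], t: &BalanceThresholds) -> bool {
    let max = loads.iter().copied().max().unwrap_or(0);
    let min = loads.iter().copied().min().unwrap_or(0);
    max.saturating_sub(min) > t.abs && (max as f32) > (min as f32) * t.rel
}
```

Under this assumed rule with abs = 32 and rel = 1.1, loads that are all zero can never be flagged, while max_load=39 and min_load=2 gives a gap of 37 and is flagged, which is consistent with the logs reported above.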

Signed-off-by: qcy615 <qin_changyan@163.com>

qcy615 · 2026-05-30T08:40:20Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdd0a8a918

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0deda6769

chatgpt-codex-connector · 2026-06-01T02:49:22Z

-                    .await
-                    .map_err(|e| PDRouterError::NetworkError {
+            // Read the full decode response before releasing decode load.
+            let decode_body = match decode_response.bytes().await {


Add a drop guard while buffering logprobs responses

For non-streaming requests with logprobs or echo, the handler now keeps the decode worker load incremented while awaiting the full upstream body, but this branch has no drop guard like LoadTrackedDecodeStream. If the client disconnects or the axum service future is otherwise cancelled while decode_response.bytes().await is still pending, execution never reaches either explicit decrement_load() and the decode worker remains permanently over-counted. Wrap this phase in an RAII guard (or reuse the tracked stream pattern) so cancellation releases the load as well as EOF/errors.
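
The suggested guard can be sketched as below. This is an illustration under stated assumptions rather than the router's code: DecodeLoadGuard, buffer_body, and poll_once are hypothetical names, the load counter is a bare AtomicUsize, and the body read is modeled as a generic future instead of decode_response.bytes(). Cancellation is simulated by polling the future once and then dropping it, which is what happens to a handler future when the client disconnects.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// RAII guard (hypothetical): takes one unit of decode load on creation and
/// releases it when dropped, whether by normal return, error, or cancellation.
struct DecodeLoadGuard {
    load: Arc<AtomicUsize>,
}

impl DecodeLoadGuard {
    fn acquire(load: Arc<AtomicUsize>) -> Self {
        load.fetch_add(1, Ordering::SeqCst);
        Self { load }
    }
}

impl Drop for DecodeLoadGuard {
    fn drop(&mut self) {
        self.load.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Stand-in for the buffering branch: the guard lives across the await, so
/// dropping this future while the read is pending still releases the load.
async fn buffer_body<F>(load: Arc<AtomicUsize>, read_body: F) -> Vec<u8>
where
    F: Future<Output = Vec<u8>>,
{
    let _guard = DecodeLoadGuard::acquire(load);
    read_body.await
}

/// Waker that does nothing; enough to poll a future by hand in this demo.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a boxed future exactly once without a runtime.
fn poll_once<F: Future>(fut: &mut Pin<Box<F>>) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    fut.as_mut().poll(&mut cx)
}
```

Polling buffer_body with a never-ready read such as std::future::pending() leaves the counter at 1; dropping the boxed future then returns it to 0 without any explicit decrement call on that path.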

Fix decode load tracking for streaming PD responses

cdd0a8a

Signed-off-by: qcy615 <qin_changyan@163.com>

qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 7fda228 to cdd0a8a Compare May 26, 2026 10:38

chatgpt-codex-connector Bot reviewed May 30, 2026

Comment thread src/routers/http/vllm_pd_router.rs Outdated

Avoid buffering proxied decode responses

c0deda6

Signed-off-by: qcy615 <qin_changyan@163.com>

qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 991a83c to c0deda6 Compare June 1, 2026 02:46

chatgpt-codex-connector Bot reviewed Jun 1, 2026

qcy615 wants to merge 2 commits into vllm-project:main from qcy615:feat/fix-vllm-pd-stream-load
