[codex] Fix decode load tracking for streaming PD responses by qcy615 · Pull Request #1 · qcy615/router

qcy615 · 2026-05-26T09:54:46Z

Purpose

Fix decode worker load accounting for vLLM PD streaming responses.

Before this change, process_vllm_two_stage_request decremented decode load immediately after receiving decode response headers. For streaming responses, the decode backend continues generating and forwarding tokens after headers are returned, so decode load was released too early. In cache-aware routing, this made decode workers appear idle even while they still had active running requests, preventing the load-balance policy from detecting imbalance and redistributing traffic.

This PR keeps decode load active until the decode response body is actually finished or dropped:

Wrap streaming decode bodies in LoadTrackedDecodeStream.
Decrement decode load once on upstream stream EOF.
Decrement decode load on body drop, covering client disconnects and early response disposal.
Move non-streaming decode load release to after the full body has been read.
Add focused tests for EOF and drop behavior.
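
As a rough illustration of the "decrement once, on EOF or on drop" behavior listed above, here is a minimal std-only sketch. It is not the PR's `LoadTrackedDecodeStream`: the names `LoadGuard` and `LoadTrackedBody` are hypothetical, and a synchronous `Iterator` stands in for the async response body so the example needs no external crates.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard: bumps a shared load counter on creation and
/// releases it exactly once, whether released explicitly or via Drop.
pub struct LoadGuard {
    load: Arc<AtomicUsize>,
    released: bool,
}

impl LoadGuard {
    pub fn acquire(load: Arc<AtomicUsize>) -> Self {
        load.fetch_add(1, Ordering::SeqCst);
        Self { load, released: false }
    }

    /// Idempotent: a second call (for example Drop after EOF) does nothing.
    pub fn release(&mut self) {
        if !self.released {
            self.released = true;
            self.load.fetch_sub(1, Ordering::SeqCst);
        }
    }
}

impl Drop for LoadGuard {
    fn drop(&mut self) {
        // Covers client disconnects and early disposal of the body.
        self.release();
    }
}

/// Hypothetical body wrapper: an Iterator stands in for the async stream.
pub struct LoadTrackedBody<I> {
    inner: I,
    guard: LoadGuard,
}

impl<I: Iterator> LoadTrackedBody<I> {
    pub fn new(inner: I, load: Arc<AtomicUsize>) -> Self {
        Self { inner, guard: LoadGuard::acquire(load) }
    }
}

impl<I: Iterator> Iterator for LoadTrackedBody<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_none() {
            // Upstream EOF: release the load now rather than waiting for Drop.
            self.guard.release();
        }
        item
    }
}
```

Keeping the "released" flag inside the guard means the EOF path and the drop path can both call `release` without coordinating, which is one simple way to avoid a double decrement.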

Test Plan

Run the focused load-tracking unit tests in Docker because the local host does not have Cargo installed:

docker run --rm --network=host \
  -v "${PWD}:/app" \
  -v vllm-router-cargo-registry:/usr/local/cargo/registry \
  -v vllm-router-cargo-git:/usr/local/cargo/git \
  -v vllm-router-target:/app/target \
  -w /app rustlang/rust:nightly-bullseye \
  bash -lc 'export PATH=/usr/local/cargo/bin:/usr/local/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin:$PATH; cargo test test_load_tracked_decode_stream --lib'

Run a compile check for the library and router binary in the same Docker environment:

docker run --rm --network=host \
  -v "${PWD}:/app" \
  -v vllm-router-cargo-registry:/usr/local/cargo/registry \
  -v vllm-router-cargo-git:/usr/local/cargo/git \
  -v vllm-router-target:/app/target \
  -w /app rustlang/rust:nightly-bullseye \
  bash -lc 'export PATH=/usr/local/cargo/bin:/usr/local/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin:$PATH; cargo check --lib --bin vllm-router'

Start the router in vLLM PD disaggregation mode with cache-aware routing:

vllm-router \
  --vllm-pd-disaggregation \
  --host 0.0.0.0 \
  --port 29100 \
  --policy cache_aware \
  --cache-threshold 0.5 \
  --balance-abs-threshold 32 \
  --balance-rel-threshold 1.1 \
  --intra-node-data-parallel-size 4 \
  --prefill http://0.0.0.0:18000 \
  --decode http://0.0.0.0:28000 \
  --log-level debug

Send streaming requests through the router and compare cache-aware load-balance logs plus the running-requests dashboard before and after the change.

Test Result

Automated validation passed:

running 2 tests
test routers::http::vllm_pd_router::tests::test_load_tracked_decode_stream_decrements_on_drop ... ok
test routers::http::vllm_pd_router::tests::test_load_tracked_decode_stream_decrements_after_eof ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 476 filtered out; finished in 0.00s

cargo check --lib --bin vllm-router
Finished `dev` profile [unoptimized + debuginfo]

Manual streaming validation:

Before the fix, decode load stayed at max_load=0, min_load=0 in cache-aware logs even while the scheduler dashboard showed running decode requests concentrated on one worker. Because the recorded decode load stayed at zero, the load-balance policy saw is_unbalanced=false and did not trigger redistribution.
After the fix, decode load was recorded during streaming. Cache-aware logs showed decode imbalance such as max_load=39, min_load=2, is_unbalanced=true, and the running-requests dashboard showed traffic spread across decode workers. The load-balance policy could trigger normally.
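
To see why a recorded load of zero can never trip the balancer, here is a small sketch of an imbalance check. It assumes the policy requires both an absolute gap above `--balance-abs-threshold` and a relative gap above `--balance-rel-threshold`; that combination is inferred from the flag names and the logged values, not taken from the router source, and the function names are hypothetical.

```rust
/// Hypothetical imbalance check: both the absolute and the relative
/// gap between the busiest and idlest worker must exceed their thresholds.
pub fn is_unbalanced(
    max_load: usize,
    min_load: usize,
    abs_threshold: usize,
    rel_threshold: f32,
) -> bool {
    let abs_gap = max_load.saturating_sub(min_load) > abs_threshold;
    let rel_gap = (max_load as f32) > (min_load as f32) * rel_threshold;
    abs_gap && rel_gap
}

/// Returns (max_load, min_load) over the workers' recorded loads,
/// or (0, 0) when there are no workers.
pub fn load_spread(loads: &[usize]) -> (usize, usize) {
    let max = loads.iter().copied().max().unwrap_or(0);
    let min = loads.iter().copied().min().unwrap_or(0);
    (max, min)
}
```

Under this assumption, with the thresholds from the test plan (32 and 1.1), the pre-fix values `max_load=0, min_load=0` give a gap of 0 and the check is false, while the post-fix values `max_load=39, min_load=2` give a gap of 37 and a ratio well above 1.1, so the check is true. Both outcomes match the `is_unbalanced` values reported in the logs.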

Signed-off-by: qcy615 <qin_changyan@163.com>

Fix decode load tracking for streaming PD responses

cdd0a8a

qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 7fda228 to cdd0a8a on May 26, 2026 10:38

Avoid buffering proxied decode responses

c0deda6

qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 991a83c to c0deda6 on June 1, 2026 02:46

qcy615 wants to merge 2 commits into main from feat/fix-vllm-pd-stream-load
