[codex] Fix decode load tracking for streaming PD responses#1

Draft
qcy615 wants to merge 2 commits into main from feat/fix-vllm-pd-stream-load

@qcy615 qcy615 commented May 26, 2026

Owner

Purpose

Fix decode worker load accounting for vLLM PD streaming responses.

Before this change, process_vllm_two_stage_request decremented decode load immediately after receiving decode response headers. For streaming responses, the decode backend continues generating and forwarding tokens after headers are returned, so decode load was released too early. In cache-aware routing, this made decode workers appear idle even while they still had active running requests, preventing the load-balance policy from detecting imbalance and redistributing traffic.
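The effect of the release point can be illustrated with a toy model. This is only a sketch: `load_seen_while_streaming` is a hypothetical helper, not code from the router, and it assumes a single in-flight decode request whose load value is sampled once per streamed chunk.

```rust
/// Hypothetical model of one streaming decode request.
/// Returns the load value a routing policy would read while each chunk is
/// being forwarded, depending on when the load is released.
pub fn load_seen_while_streaming(release_at_headers: bool, chunks: usize) -> Vec<usize> {
    // Old ordering: the load is dropped to 0 as soon as headers arrive.
    // New ordering: the load stays at 1 until the body has finished.
    let load: usize = if release_at_headers { 0 } else { 1 };

    let mut seen = Vec::with_capacity(chunks);
    for _ in 0..chunks {
        // The worker is still generating tokens here, whatever `load` says.
        seen.push(load);
    }
    seen
}
```

With `release_at_headers = true` every sample is 0 even though the request is still running, which is the behavior described above.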

This PR keeps decode load active until the decode response body is actually finished or dropped:

  • Wrap streaming decode bodies in LoadTrackedDecodeStream.
  • Decrement decode load once on upstream stream EOF.
  • Decrement decode load on body drop, covering client disconnects and early response disposal.
  • Move non-streaming decode load release to after the full body has been read.
  • Add focused tests for EOF and drop behavior.
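The release-once behavior listed above can be sketched with the standard library alone. This is not the PR's `LoadTrackedDecodeStream`: the real wrapper presumably implements an async stream, while this sketch models the body as a plain `Iterator`. `DecodeLoad` and `LoadTrackedBody` are hypothetical names, and the sketch assumes the caller has already incremented the load before wrapping the body.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a decode worker's load counter.
#[derive(Clone, Default)]
pub struct DecodeLoad(Arc<AtomicUsize>);

impl DecodeLoad {
    pub fn increment(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }

    /// Decrement without underflowing if the counter is already 0.
    pub fn decrement(&self) {
        let _ = self
            .0
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| Some(v.saturating_sub(1)));
    }

    pub fn current(&self) -> usize {
        self.0.load(Ordering::SeqCst)
    }
}

/// Wraps a response body (modeled as an iterator of chunks) and releases
/// the decode load exactly once, on EOF or on drop, whichever comes first.
pub struct LoadTrackedBody<I> {
    inner: I,
    load: DecodeLoad,
    released: bool,
}

impl<I> LoadTrackedBody<I> {
    /// Assumes the caller incremented `load` when the request was routed.
    pub fn new(inner: I, load: DecodeLoad) -> Self {
        Self { inner, load, released: false }
    }

    fn release_once(&mut self) {
        if !self.released {
            self.released = true;
            self.load.decrement();
        }
    }
}

impl<I: Iterator> Iterator for LoadTrackedBody<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_none() {
            // Upstream EOF: the decode worker has finished this request.
            self.release_once();
        }
        item
    }
}

impl<I> Drop for LoadTrackedBody<I> {
    fn drop(&mut self) {
        // Covers client disconnects and early disposal; no-op after EOF.
        self.release_once();
    }
}
```

The `released` flag is what keeps the EOF path and the drop path from decrementing twice for the same request.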

Test Plan

  1. Run the focused load-tracking unit tests in Docker because the local host does not have Cargo installed:
docker run --rm --network=host \
  -v "${PWD}:/app" \
  -v vllm-router-cargo-registry:/usr/local/cargo/registry \
  -v vllm-router-cargo-git:/usr/local/cargo/git \
  -v vllm-router-target:/app/target \
  -w /app rustlang/rust:nightly-bullseye \
  bash -lc 'export PATH=/usr/local/cargo/bin:/usr/local/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin:$PATH; cargo test test_load_tracked_decode_stream --lib'
  2. Run a compile check for the library and router binary in the same Docker environment:
docker run --rm --network=host \
  -v "${PWD}:/app" \
  -v vllm-router-cargo-registry:/usr/local/cargo/registry \
  -v vllm-router-cargo-git:/usr/local/cargo/git \
  -v vllm-router-target:/app/target \
  -w /app rustlang/rust:nightly-bullseye \
  bash -lc 'export PATH=/usr/local/cargo/bin:/usr/local/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin:$PATH; cargo check --lib --bin vllm-router'
  3. Start the router in vLLM PD disaggregation mode with cache-aware routing:
vllm-router \
  --vllm-pd-disaggregation \
  --host 0.0.0.0 \
  --port 29100 \
  --policy cache_aware \
  --cache-threshold 0.5 \
  --balance-abs-threshold 32 \
  --balance-rel-threshold 1.1 \
  --intra-node-data-parallel-size 4 \
  --prefill http://0.0.0.0:18000 \
  --decode http://0.0.0.0:28000 \
  --log-level debug
  4. Send streaming requests through the router and compare cache-aware load-balance logs plus the running-requests dashboard before and after the change.

Test Result

Automated validation passed:

running 2 tests
test routers::http::vllm_pd_router::tests::test_load_tracked_decode_stream_decrements_on_drop ... ok
test routers::http::vllm_pd_router::tests::test_load_tracked_decode_stream_decrements_after_eof ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 476 filtered out; finished in 0.00s
cargo check --lib --bin vllm-router
Finished `dev` profile [unoptimized + debuginfo]

Manual streaming validation:

  • Before the fix, decode load stayed at max_load=0, min_load=0 in cache-aware logs even while the scheduler dashboard showed running decode requests concentrated on one worker. Because the recorded decode load stayed at zero, the load-balance policy saw is_unbalanced=false and did not trigger redistribution.
  • After the fix, decode load was recorded during streaming. Cache-aware logs showed decode imbalance such as max_load=39, min_load=2, is_unbalanced=true, and the running-requests dashboard showed traffic spread across decode workers. The load-balance policy could trigger normally.
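The logged values above are consistent with a two-part imbalance check. The sketch below is an assumption about how the thresholds from the test plan (`--balance-abs-threshold 32`, `--balance-rel-threshold 1.1`) might combine: the gap between max and min load must exceed the absolute threshold, and max must exceed min times the relative threshold. The function name and the exact predicate are hypothetical, and the router's actual check may differ.

```rust
/// Hypothetical imbalance predicate for cache-aware routing.
/// Assumption: both the absolute and the relative condition must hold.
pub fn is_unbalanced(
    max_load: usize,
    min_load: usize,
    abs_threshold: usize,
    rel_threshold: f32,
) -> bool {
    // Absolute gap between the busiest and the idlest decode worker.
    let gap_exceeded = max_load.saturating_sub(min_load) > abs_threshold;
    // Relative gap: the busiest worker compared with a scaled minimum.
    let ratio_exceeded = (max_load as f32) > (min_load as f32) * rel_threshold;
    gap_exceeded && ratio_exceeded
}
```

Under this assumed predicate, `is_unbalanced(0, 0, 32, 1.1)` is false, because a gap of 0 cannot exceed 32, while `is_unbalanced(39, 2, 32, 1.1)` is true, because the gap of 37 exceeds 32 and 39 exceeds 2.2.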

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.

Signed-off-by: qcy615 <qin_changyan@163.com>
@qcy615 qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 7fda228 to cdd0a8a Compare May 26, 2026 10:38
Signed-off-by: qcy615 <qin_changyan@163.com>
@qcy615 qcy615 force-pushed the feat/fix-vllm-pd-stream-load branch from 991a83c to c0deda6 Compare June 1, 2026 02:46