fix(router): track worker load in route_transparent for cache_aware by aoshen02 · Pull Request #180 · vllm-project/router

aoshen02 · 2026-05-31T08:56:59Z

Fixes #179.

Problem

route_transparent (the path that serves endpoints without a first-class typed route, e.g. /inference/v1/generate) selects a worker via the load-balancing policy but never calls increment_load/decrement_load, unlike route_typed_request. For the default cache_aware policy this means every worker stays at load 0, the imbalance check (max-min > balance_abs_threshold) never fires, and the policy stays in pure prefix-affinity mode — so every request sharing a prefix collapses onto a single worker while the others sit idle. Observed in an 8-engine deployment: one engine served all 480 concurrent requests and eventually OOM'd its KV cache; 7 engines idle.
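To make the failure mode concrete, here is a minimal sketch of the imbalance check as described above. Only `balance_abs_threshold` and the `max-min` comparison come from the PR text; the function name and signature are hypothetical, and the real policy may apply additional conditions.

```rust
/// Toy version of the imbalance check (max - min > balance_abs_threshold).
/// `is_imbalanced` is a hypothetical name, not the router's actual API.
fn is_imbalanced(loads: &[usize], balance_abs_threshold: usize) -> bool {
    // An empty worker list is treated as balanced.
    let max = loads.iter().copied().max().unwrap_or(0);
    let min = loads.iter().copied().min().unwrap_or(0);
    // max >= min always holds, so this subtraction cannot underflow.
    max - min > balance_abs_threshold
}
```

If load is never incremented, every entry in `loads` is 0, so `max - min` is 0 and the check is false for any non-negative threshold. The policy therefore never leaves prefix-affinity mode.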

Fix

Mirror route_typed_request's load handling in route_transparent:

After selecting a worker, increment_load() when policy.name() == "cache_aware".
decrement_load() when the proxied response stream completes (forward the upstream bytes_stream through a task that decrements on end — same approach as send_typed_request's streaming branch) or when the upstream request fails.

With load tracked, cache_aware's shortest-queue rebalancing engages and requests spread across workers. No behavior change for policies other than cache_aware (gated on policy.name()), matching the existing typed-route gate.
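The pairing of increment and decrement can be sketched as follows. This is not the PR's implementation: the PR decrements from a task that forwards the upstream byte stream, whereas this sketch swaps in a drop guard to show the same invariant (one decrement per increment, on success or failure). The `Worker` and `LoadGuard` types are hypothetical stand-ins.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the router's worker type.
struct Worker {
    load: AtomicUsize,
}

impl Worker {
    fn increment_load(&self) {
        self.load.fetch_add(1, Ordering::Relaxed);
    }
    fn decrement_load(&self) {
        self.load.fetch_sub(1, Ordering::Relaxed);
    }
    fn load(&self) -> usize {
        self.load.load(Ordering::Relaxed)
    }
}

/// Holds one unit of in-flight load and releases it when dropped.
struct LoadGuard {
    worker: Arc<Worker>,
}

impl LoadGuard {
    /// Increments load only for cache_aware, mirroring the gate on policy.name().
    fn acquire(worker: &Arc<Worker>, policy_name: &str) -> Option<LoadGuard> {
        if policy_name == "cache_aware" {
            worker.increment_load();
            Some(LoadGuard { worker: Arc::clone(worker) })
        } else {
            None
        }
    }
}

impl Drop for LoadGuard {
    fn drop(&mut self) {
        // Runs when the stream finishes or when an error path drops the guard.
        self.worker.decrement_load();
    }
}
```

In this sketch, moving the guard into whatever owns the response stream would release the load when the stream ends. Dropping it on an early return covers the failure case.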

Test

Added test_transparent_load_tracking_spreads_shared_prefix_for_cache_aware to tests/test_transparent_proxy_routing.rs, following the file's established policy-simulation style. It pins the contract: two loops send identical (shared-prefix) requests under a low balance threshold, and the only difference between them is whether load is incremented after selection, which is exactly what route_transparent now does. Without the increment, all 30 requests concentrate on one worker; with it, they spread across ≥2 workers.
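The shape of that two-loop comparison can be sketched with a toy policy. This is an illustrative simulation, not the test in the PR: `select` and `simulate` are hypothetical names, the worker count of 4 and threshold of 1 are arbitrary, and the prefix-affinity choice is hard-coded to worker 0.

```rust
/// Toy cache_aware selection: shortest queue when imbalanced,
/// otherwise the prefix-affinity worker (assumed to be index 0).
/// `loads` must be non-empty.
fn select(loads: &[usize], balance_abs_threshold: usize) -> usize {
    let max = *loads.iter().max().unwrap();
    let min = *loads.iter().min().unwrap();
    if max - min > balance_abs_threshold {
        // Pick the least-loaded worker (first one on ties).
        loads
            .iter()
            .enumerate()
            .min_by_key(|(_, l)| **l)
            .map(|(i, _)| i)
            .unwrap()
    } else {
        0
    }
}

/// Routes 30 identical requests and returns per-worker hit counts.
/// `track_load` toggles the increment after selection.
fn simulate(track_load: bool) -> Vec<usize> {
    let mut loads = vec![0usize; 4];
    let mut hits = vec![0usize; 4];
    for _ in 0..30 {
        let i = select(&loads, 1);
        hits[i] += 1;
        if track_load {
            loads[i] += 1;
        }
    }
    hits
}
```

Under these assumptions, `simulate(false)` sends every request to worker 0, while `simulate(true)` distributes them across several workers once the load gap exceeds the threshold.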

cargo test --release --test test_transparent_proxy_routing
running 12 tests
...
test transparent_proxy_routing_tests::test_transparent_load_tracking_spreads_shared_prefix_for_cache_aware ... ok
test result: ok. 12 passed; 0 failed

cargo check --release and cargo fmt clean. AI assistance (Claude Code) was used to diagnose and implement this change; I have reviewed every changed line.

🤖 Generated with Claude Code

The transparent-proxy path (route_transparent) selected a worker via the load-balancing policy but never called increment_load/decrement_load, unlike the typed-route path (route_typed_request). For the cache_aware policy this means every worker's load stays 0, the imbalance check (max-min > balance_abs_threshold) never fires, and the policy stays in pure prefix-affinity mode -- so every request sharing a prefix is routed to a single worker while the others sit idle. This is hit by any endpoint served through the transparent proxy (e.g. /inference/v1/generate), and can OOM the lone worker's KV cache under load.

Increment worker load after selection when the policy is cache_aware (mirroring route_typed_request), and decrement it when the proxied response stream completes or the upstream request fails. With load tracked, cache_aware's shortest-queue rebalancing engages and requests spread across workers.

Adds a regression test pinning the contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39428bd69b

chatgpt-codex-connector · 2026-05-31T08:59:54Z

+        let load_incremented = if policy.name() == "cache_aware" {
+            worker.increment_load();


Validate methods before incrementing load

In transparent-proxy mode with the cache_aware policy, unsupported methods (for example OPTIONS/TRACE hitting the fallback handler) now increment the selected worker's load and then return 405 from the method match without hitting any of the new decrement paths. Each such request leaves a permanent in-flight count on that worker, so repeated preflights or bad-method probes can skew cache-aware routing and running_requests metrics until the process restarts; validate the method before incrementing or decrement on this early return.
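One way to address this is to reject unsupported methods before touching the load counter, sketched below. The `Method` enum, the `route` function, the set of supported methods (GET and POST here), and the 200 return value are all assumptions for illustration; the router's actual handler and method set are not shown in this PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical subset of HTTP methods.
#[derive(Clone, Copy)]
enum Method {
    Get,
    Post,
    Options,
    Trace,
}

/// Hypothetical stand-in for the router's worker type.
struct Worker {
    load: AtomicUsize,
}

/// Returns an illustrative status code: 405 for unsupported methods, 200 otherwise.
fn route(method: Method, policy_name: &str, worker: &Worker) -> u16 {
    // Validate first, so this early return happens before any increment.
    if !matches!(method, Method::Get | Method::Post) {
        return 405;
    }
    let load_incremented = policy_name == "cache_aware";
    if load_incremented {
        worker.load.fetch_add(1, Ordering::Relaxed);
    }
    // ... forward the request upstream (omitted in this sketch) ...
    if load_incremented {
        worker.load.fetch_sub(1, Ordering::Relaxed);
    }
    200
}
```

With this ordering, an OPTIONS or TRACE request returns 405 and leaves the worker's load unchanged. The alternative named in the comment, decrementing on the early return, would also keep the count balanced.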

aoshen02 wants to merge 1 commit into vllm-project:main from aoshen02:fix/transparent-proxy-load-tracking
