Skip to content

fix(router): track worker load in route_transparent for cache_aware#180

Open
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:fix/transparent-proxy-load-tracking
Open

fix(router): track worker load in route_transparent for cache_aware#180
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:fix/transparent-proxy-load-tracking

Conversation

@aoshen02

Copy link
Copy Markdown

Fixes #179.

Problem

route_transparent (the path that serves endpoints without a first-class typed route, e.g. /inference/v1/generate) selects a worker via the load-balancing policy but never calls increment_load/decrement_load, unlike route_typed_request. For the default cache_aware policy this means every worker stays at load 0, the imbalance check (max-min > balance_abs_threshold) never fires, and the policy stays in pure prefix-affinity mode — so every request sharing a prefix collapses onto a single worker while the others sit idle. Observed in an 8-engine deployment: one engine served all 480 concurrent requests and eventually OOM'd its KV cache; 7 engines idle.

Fix

Mirror route_typed_request's load handling in route_transparent:

  • After selecting a worker, increment_load() when policy.name() == "cache_aware".
  • decrement_load() when the proxied response stream completes (forward the upstream bytes_stream through a task that decrements on end — same approach as send_typed_request's streaming branch) or when the upstream request fails.

With load tracked, cache_aware's shortest-queue rebalancing engages and requests spread across workers. No behavior change for policies other than cache_aware (gated on policy.name()), matching the existing typed-route gate.

Test

Added test_transparent_load_tracking_spreads_shared_prefix_for_cache_aware to tests/test_transparent_proxy_routing.rs (the file's established policy-simulation style). It pins the contract: with identical (shared-prefix) requests and a low balance threshold, the only difference between the two loops is whether load is incremented after selection (exactly what route_transparent now does) — without it all 30 requests concentrate on one worker; with it they spread across ≥2.

cargo test --release --test test_transparent_proxy_routing
running 12 tests
...
test transparent_proxy_routing_tests::test_transparent_load_tracking_spreads_shared_prefix_for_cache_aware ... ok
test result: ok. 12 passed; 0 failed

cargo check --release and cargo fmt clean. AI assistance (Claude Code) was used to diagnose and implement this change; I have reviewed every changed line.

🤖 Generated with Claude Code

The transparent-proxy path (route_transparent) selected a worker via the
load-balancing policy but never called increment_load/decrement_load, unlike
the typed-route path (route_typed_request). For the cache_aware policy this
means every worker's load stays 0, the imbalance check
(max-min > balance_abs_threshold) never fires, and the policy stays in pure
prefix-affinity mode -- so every request sharing a prefix is routed to a single
worker while the others sit idle. This is hit by any endpoint served through the
transparent proxy (e.g. /inference/v1/generate), and can OOM the lone worker's
KV cache under load.

Increment worker load after selection when the policy is cache_aware (mirroring
route_typed_request), and decrement it when the proxied response stream
completes or the upstream request fails. With load tracked, cache_aware's
shortest-queue rebalancing engages and requests spread across workers.

Adds a regression test pinning the contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39428bd69b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1696 to +1697
let load_incremented = if policy.name() == "cache_aware" {
worker.increment_load();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate methods before incrementing load

In transparent-proxy mode with the cache_aware policy, unsupported methods (for example OPTIONS/TRACE hitting the fallback handler) now increment the selected worker's load and then return 405 from the method match without hitting any of the new decrement paths. Each such request leaves a permanent in-flight count on that worker, so repeated preflights or bad-method probes can skew cache-aware routing and running_requests metrics until the process restarts; validate the method before incrementing or decrement on this early return.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cache_aware degenerates to a single worker on transparent-proxy endpoints (route_transparent never tracks worker load)

1 participant