fix(router): track worker load in route_transparent for cache_aware#180
fix(router): track worker load in route_transparent for cache_aware#180aoshen02 wants to merge 1 commit into
Conversation
The transparent-proxy path (route_transparent) selected a worker via the load-balancing policy but never called increment_load/decrement_load, unlike the typed-route path (route_typed_request). For the cache_aware policy this means every worker's load stays 0, the imbalance check (max-min > balance_abs_threshold) never fires, and the policy stays in pure prefix-affinity mode -- so every request sharing a prefix is routed to a single worker while the others sit idle. This is hit by any endpoint served through the transparent proxy (e.g. /inference/v1/generate), and can OOM the lone worker's KV cache under load. Increment worker load after selection when the policy is cache_aware (mirroring route_typed_request), and decrement it when the proxied response stream completes or the upstream request fails. With load tracked, cache_aware's shortest-queue rebalancing engages and requests spread across workers. Adds a regression test pinning the contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39428bd69b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let load_incremented = if policy.name() == "cache_aware" { | ||
| worker.increment_load(); |
There was a problem hiding this comment.
Validate methods before incrementing load
In transparent-proxy mode with the cache_aware policy, unsupported methods (for example OPTIONS/TRACE hitting the fallback handler) now increment the selected worker's load and then return 405 from the method match without hitting any of the new decrement paths. Each such request leaves a permanent in-flight count on that worker, so repeated preflights or bad-method probes can skew cache-aware routing and running_requests metrics until the process restarts; validate the method before incrementing or decrement on this early return.
Useful? React with 👍 / 👎.
Fixes #179.
Problem
route_transparent(the path that serves endpoints without a first-class typed route, e.g./inference/v1/generate) selects a worker via the load-balancing policy but never callsincrement_load/decrement_load, unlikeroute_typed_request. For the defaultcache_awarepolicy this means every worker stays at load 0, the imbalance check (max-min > balance_abs_threshold) never fires, and the policy stays in pure prefix-affinity mode — so every request sharing a prefix collapses onto a single worker while the others sit idle. Observed in an 8-engine deployment: one engine served all 480 concurrent requests and eventually OOM'd its KV cache; 7 engines idle.Fix
Mirror
route_typed_request's load handling inroute_transparent:increment_load()whenpolicy.name() == "cache_aware".decrement_load()when the proxied response stream completes (forward the upstreambytes_streamthrough a task that decrements on end — same approach assend_typed_request's streaming branch) or when the upstream request fails.With load tracked,
cache_aware's shortest-queue rebalancing engages and requests spread across workers. No behavior change for policies other thancache_aware(gated onpolicy.name()), matching the existing typed-route gate.Test
Added
test_transparent_load_tracking_spreads_shared_prefix_for_cache_awaretotests/test_transparent_proxy_routing.rs(the file's established policy-simulation style). It pins the contract: with identical (shared-prefix) requests and a low balance threshold, the only difference between the two loops is whether load is incremented after selection (exactly whatroute_transparentnow does) — without it all 30 requests concentrate on one worker; with it they spread across ≥2.cargo check --releaseandcargo fmtclean. AI assistance (Claude Code) was used to diagnose and implement this change; I have reviewed every changed line.🤖 Generated with Claude Code