[Bugfix] Support external KV connectors on hybrid (mamba) models with prefix caching by arthurrasmusson-lb · Pull Request #7 · LightBitsLabs/vllm

arthurrasmusson-lb · 2026-06-12T06:38:00Z

Summary

Hybrid models (Qwen3.5/3.6: GDN linear attention + full attention) currently cannot be used with an external KV connector when prefix caching is on (mamba align mode), for three reasons: the aligned split asserts that no externally computed tokens are present, async external loads wedge the scheduler in an infinite re-claim loop, and KV-load-failure recovery crashes on the multi-group block table. This PR fixes all three in vllm/v1/core/sched/scheduler.py:

1. _mamba_block_aligned_split: replace the blanket assert with the real contract. The split arithmetic already composes num_external_computed_tokens; what it needs is the same invariant MambaManager.find_longest_cache_hit imposes on local hits: claims aligned to the unified token-block size. The assert now enforces alignment instead of forbidding external connectors entirely. (The connector remains responsible for restoring the GDN/mamba boundary state snapshot alongside the attention KV. Loading attention KV alone corrupts output; a comment documents this.)
2. Skip the aligned split for load_kv_async. Async loads schedule zero new tokens by design; the split floors 0→0 and the if num_new_tokens == 0: break exits before allocate_slots/update_state_after_alloc, so the request never enters WAITING_FOR_REMOTE_KVS and the scheduler re-claims the same prefix every step, forever. Diagnosed with py-spy (engine MainThread healthy in the parked-wait loop) + DEBUG trail (match_prefix ... matched=10336 repeating, KV cache usage 0.0%, Waiting: 1).
3. Multi-group _update_requests_with_invalid_blocks (the TODO(davidb) hybrid-allocator gap): iterate every KV cache group's block list instead of (req_block_ids,) = ... A failed external load on a 2-group hybrid currently dies with ValueError: too many values to unpack. Per-request truncation bookkeeping is shared across groups, matching single-group semantics when only one group carries externally loaded blocks.
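The interaction between the first two changes can be illustrated with a toy model. This is a sketch under assumptions, not the scheduler code: `schedule_step`, `Status`, `skip_split_for_async`, and the block size of 32 are hypothetical, and the split is reduced to a plain floor to a block multiple.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    WAITING_FOR_REMOTE_KVS = auto()
    RUNNING = auto()


def schedule_step(
    num_new_tokens: int,
    num_external_computed_tokens: int,
    block_size: int,
    load_kv_async: bool,
    skip_split_for_async: bool,
) -> Status:
    """Toy model of one scheduling attempt for a single waiting request."""
    if load_kv_async:
        # Async loads schedule zero new tokens while remote KV streams in.
        num_new_tokens = 0

    if not (load_kv_async and skip_split_for_async):
        # Contract: external claims must be block-aligned, like local hits.
        assert num_external_computed_tokens % block_size == 0, (
            "external claim must be a multiple of the block size"
        )
        # Simplified split: floor the new tokens to a block multiple.
        num_new_tokens = (num_new_tokens // block_size) * block_size
        if num_new_tokens == 0:
            # Models the `break` that exits before allocation.
            return Status.WAITING

    # Models allocate_slots / update_state_after_alloc having run.
    if load_kv_async:
        return Status.WAITING_FOR_REMOTE_KVS
    return Status.RUNNING


# Old behaviour: the async request never leaves WAITING, so it is re-claimed.
print(schedule_step(512, 10336, 32, load_kv_async=True, skip_split_for_async=False))
# New behaviour: the split is skipped and the request parks on the remote load.
print(schedule_step(512, 10336, 32, load_kv_async=True, skip_split_for_async=True))
```

In this model the only difference between the wedge and the fix is whether the zero-token floor is reached before allocation; the alignment assert still guards the non-async path.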

Notably, no allocator changes are needed: MambaManager.get_num_blocks_to_allocate/allocate_new_blocks in align mode already produce the null-padded 1-real-block shape for the claimed span, and allocate_slots explicitly supports num_new_tokens == 0 with external tokens.

Validation

Validated end-to-end on Qwen3.6-35B-A3B (TP=8, 2× 8×H100, vLLM v0.22.1 + these changes applied at deploy time) through the LightBits LCF KV connector (LCF PR LightBitsLabs/Light-Coretex-Fabric#543 / vllm-project#545):

External claim → WAITING_FOR_REMOTE_KVS → async load → resume: no scheduler wedge (previously: infinite re-claim loop on the first claimed request).
Forced load-miss path: invalid-blocks recovery truncates to recompute, engine stays healthy, output correct (previously: ValueError engine death).
Store path with prefix caching on: GDN boundary state snapshots materialize and offload (put_ops=560/step = 80 attention + 480 state objects on TP=8, put_errors=0).
Without these changes, the same deployment reproduces the assert kill (engine fatal on first claim) and the re-claim wedge deterministically.
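The ValueError on the load-miss path comes from a one-element tuple unpack applied to a per-group tuple of block lists. The sketch below shows that failure and a per-group iteration that replaces it. `first_invalid_token` and its arguments are hypothetical names, and it assumes that index i in every group's list covers tokens [i * block_size, (i + 1) * block_size), which may not match the real block-table layout.

```python
from typing import Optional, Sequence, Set


def first_invalid_token(
    block_ids_per_group: Sequence[Sequence[int]],
    invalid_block_ids: Set[int],
    block_size: int,
) -> Optional[int]:
    """Return the earliest token offset backed by an invalid block, if any.

    Every KV cache group is scanned, and the result is shared by the request
    as a whole (one truncation point, not one per group).
    """
    earliest: Optional[int] = None
    for group_block_ids in block_ids_per_group:
        for idx, block_id in enumerate(group_block_ids):
            if block_id in invalid_block_ids:
                offset = idx * block_size
                if earliest is None or offset < earliest:
                    earliest = offset
                break  # later blocks in this group cannot be earlier
    return earliest


# Two groups (e.g. attention and mamba state); 0 stands in for a null block.
groups = ([1, 2, 3], [0, 0, 7])

# The single-group unpack fails as soon as there is a second group.
try:
    (req_block_ids,) = groups
except ValueError as exc:
    print(f"single-group unpack: {exc}")

# The per-group scan handles any number of groups.
print(first_invalid_token(groups, {3, 7}, block_size=16))
```

With a single group the scan reduces to the original single-list behaviour, which is the compatibility property the description relies on.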

Full investigation notes: https://gist.github.com/arthurrasmusson-lb/fc4b89e27056c3013f48e0d21bf44064

Jira: LCF-871

… prefix caching

Three scheduler changes that let an external KV connector claim and load prefixes on hybrid models (e.g. Qwen3.5/3.6 GDN + full attention) when prefix caching (mamba align mode) is enabled:

1. _mamba_block_aligned_split: replace the blanket `num_external_computed_tokens == 0` assert with the actual contract: external claims must be block-aligned, mirroring what MambaManager.find_longest_cache_hit imposes on local hits. The split arithmetic already composes external tokens correctly.
2. Skip the aligned split when load_kv_async: the scheduler sets num_new_tokens=0 by design while remote KV streams in, and flooring 0 to 0 broke out of scheduling before allocate_slots / update_state_after_alloc ran. The request stayed WAITING and re-claimed the same prefix every step, forever (verified with py-spy + DEBUG traces on an 8xH100 TP=8 deployment).
3. _update_requests_with_invalid_blocks: iterate every KV cache group's block list instead of the single-group unpack (the TODO(davidb) hybrid-allocator gap), so a failed external load on a hybrid model recovers to recompute instead of dying with "too many values to unpack".

Validated end-to-end on Qwen3.6-35B-A3B (TP=8, 2-node H100 cluster) through the LightBits LCF KV connector: claim -> WAITING_FOR_REMOTE_KVS -> async load -> miss recovery -> recompute all function; no scheduler wedge, engine stable across restarts.

Jira: LCF-871
Signed-off-by: Arthur Hanson Rasmusson <arthur.r@lightbitslabs.com>

…nt after load-failure recovery

A KV-load-failure recovery truncates a hybrid request's computed tokens; the request's mamba block table still holds the blocks allocated for the original (longer) claim, so the next allocate computes fewer required blocks than the table length and the strict `assert num_required_blocks > len(req_blocks)` kills the engine ("num_required_blocks 39 < len(req_blocks) 40"). Nothing needs allocating in that case, so return the empty list; the surplus blocks are freed by remove_skipped_blocks as the recompute progresses.

Reproduced live on Qwen3.6-35B (TP=8) via a transient external-KV load failure followed by the recovery resched.

Jira: LCF-871
Signed-off-by: Arthur Hanson Rasmusson <arthur.r@lightbitslabs.com>
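The no-op case described in this commit can be sketched as follows. This is a simplified illustration rather than the allocator code: `allocate_new_blocks` here is a hypothetical free function, and the block table and free pool are plain lists of integer IDs.

```python
def allocate_new_blocks(
    req_blocks: list[int],
    num_required_blocks: int,
    free_block_ids: list[int],
) -> list[int]:
    """Grow a request's block table up to num_required_blocks.

    If the table already holds at least that many blocks (for example after
    a load-failure recovery truncated the request), nothing is allocated and
    an empty list is returned instead of tripping an assert.
    """
    if num_required_blocks <= len(req_blocks):
        return []
    num_new = num_required_blocks - len(req_blocks)
    # Sketch only: assumes the free pool holds at least num_new IDs.
    new_blocks = [free_block_ids.pop() for _ in range(num_new)]
    req_blocks.extend(new_blocks)
    return new_blocks


# The case from the log line: 39 blocks required, 40 already in the table.
table = list(range(40))
print(allocate_new_blocks(table, 39, free_block_ids=[100, 101]))
print(len(table))
```

The surplus entry stays in the table in this sketch; per the commit message, freeing it is left to remove_skipped_blocks as the recompute advances.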

arthurrasmusson-lb added 2 commits June 12, 2026 02:37

arthurrasmusson-lb wants to merge 2 commits into main from lcf/hybrid-external-kv-connector
