[Bugfix] Support external KV connectors on hybrid (mamba) models with prefix caching#7

Open: arthurrasmusson-lb wants to merge 2 commits into main from lcf/hybrid-external-kv-connector

@arthurrasmusson-lb

Collaborator

Summary

Hybrid models (Qwen3.5/3.6: GDN linear attention + full attention) currently cannot be used with an external KV connector when prefix caching is on (mamba align mode): the scheduler asserts external connectors out of the aligned split, async external loads wedge the scheduler in an infinite re-claim loop, and KV-load-failure recovery crashes on the multi-group block table. This PR fixes all three in vllm/v1/core/sched/scheduler.py:

  1. `_mamba_block_aligned_split`: replace the blanket assert with the real contract. The split arithmetic already composes `num_external_computed_tokens`; what it needs is the same invariant `MambaManager.find_longest_cache_hit` imposes on local hits: claims aligned to the unified token-block size. The assert now enforces alignment instead of forbidding external connectors entirely. (The connector remains responsible for restoring the GDN/mamba boundary state snapshot alongside the attention KV; loading attention KV alone corrupts output, and a code comment documents this.)

  2. Skip the aligned split for `load_kv_async`. Async loads schedule zero new tokens by design; the split floors 0→0 and the `if num_new_tokens == 0: break` check exits before `allocate_slots`/`update_state_after_alloc`, so the request never enters `WAITING_FOR_REMOTE_KVS` and the scheduler re-claims the same prefix every step, forever. Diagnosed with py-spy (engine MainThread healthy in the parked-wait loop) plus the DEBUG trail (`match_prefix ... matched=10336` repeating, KV cache usage 0.0%, Waiting: 1).

  3. Multi-group `_update_requests_with_invalid_blocks` (the `TODO(davidb)` hybrid-allocator gap): iterate every KV cache group's block list instead of `(req_block_ids,) = ...`. A failed external load on a 2-group hybrid currently dies with `ValueError: too many values to unpack`. Per-request truncation bookkeeping is shared across groups, matching single-group semantics when only one group carries externally loaded blocks.
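
The first two changes can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the vLLM implementation: `check_external_claim`, `aligned_split`, `schedule_step`, and the block size of 16 are hypothetical, and the real `_mamba_block_aligned_split` handles more cases. It assumes the split floors the scheduled token count so that the chunk ends on a block boundary.

```python
BLOCK_SIZE = 16  # hypothetical unified token-block size


def check_external_claim(num_external_computed_tokens: int,
                         block_size: int = BLOCK_SIZE) -> None:
    """Enforce alignment of an external claim instead of forbidding it."""
    assert num_external_computed_tokens % block_size == 0, (
        f"external claim of {num_external_computed_tokens} tokens is not "
        f"aligned to block size {block_size}"
    )


def aligned_split(num_computed_tokens: int, num_new_tokens: int,
                  block_size: int = BLOCK_SIZE) -> int:
    """Floor num_new_tokens so the scheduled chunk ends on a block boundary."""
    end = num_computed_tokens + num_new_tokens
    aligned_end = (end // block_size) * block_size
    return max(aligned_end - num_computed_tokens, 0)


def schedule_step(num_computed_tokens: int, num_new_tokens: int,
                  load_kv_async: bool, skip_split_for_async: bool) -> str:
    """Return the state of one waiting request after one scheduler step."""
    if not (load_kv_async and skip_split_for_async):
        num_new_tokens = aligned_split(num_computed_tokens, num_new_tokens)
        if num_new_tokens == 0:
            # Early exit: allocation and the connector callback never run,
            # so the request is still plain WAITING on the next step.
            return "WAITING"
    # Allocation runs; an async load parks the request until the KV arrives.
    return "WAITING_FOR_REMOTE_KVS" if load_kv_async else "RUNNING"
```

With `skip_split_for_async=False`, an async load with zero new tokens returns `"WAITING"` on every step, which models the re-claim loop; with `skip_split_for_async=True` the same request reaches `"WAITING_FOR_REMOTE_KVS"`.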

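Notably, the claim path needs no allocator changes: `MambaManager.get_num_blocks_to_allocate`/`allocate_new_blocks` in align mode already produce the null-padded 1-real-block shape for the claimed span, and `allocate_slots` explicitly supports `num_new_tokens == 0` with external tokens. (The second commit in this PR separately relaxes an assert on the mamba allocate path that is reached only after load-failure recovery.)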

Validation

Validated end-to-end on Qwen3.6-35B-A3B (TP=8, 2× 8×H100, vLLM v0.22.1 + these changes applied at deploy time) through the LightBits LCF KV connector (LCF PR LightBitsLabs/Light-Coretex-Fabric#543 / vllm-project#545):

  • External claim → WAITING_FOR_REMOTE_KVS → async load → resume: no scheduler wedge (previously: infinite re-claim loop on the first claimed request).
  • Forced load-miss path: invalid-blocks recovery truncates to recompute, engine stays healthy, output correct (previously: ValueError engine death).
  • Store path with prefix caching on: GDN boundary state snapshots materialize and offload (put_ops=560/step = 80 attention + 480 state objects on TP=8, put_errors=0).
  • Without these changes, the same deployment reproduces the assert kill (engine fatal on first claim) and the re-claim wedge deterministically.

Full investigation notes: https://gist.github.com/arthurrasmusson-lb/fc4b89e27056c3013f48e0d21bf44064

Jira: LCF-871

… prefix caching

Three scheduler changes that let an external KV connector claim and load
prefixes on hybrid models (e.g. Qwen3.5/3.6 GDN + full attention) when
prefix caching (mamba align mode) is enabled:

1. _mamba_block_aligned_split: replace the blanket
   `num_external_computed_tokens == 0` assert with the actual contract —
   external claims must be block-aligned, mirroring what
   MambaManager.find_longest_cache_hit imposes on local hits. The split
   arithmetic already composes external tokens correctly.

2. Skip the aligned split when load_kv_async: the scheduler sets
   num_new_tokens=0 by design while remote KV streams in, and flooring 0
   to 0 broke out of scheduling before allocate_slots /
   update_state_after_alloc ran — the request stayed WAITING and
   re-claimed the same prefix every step, forever (verified with py-spy
   + DEBUG traces on an 8xH100 TP=8 deployment).

3. _update_requests_with_invalid_blocks: iterate every KV cache group's
   block list instead of the single-group unpack (the TODO(davidb)
   hybrid-allocator gap), so a failed external load on a hybrid model
   recovers to recompute instead of dying with "too many values to
   unpack".
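
A minimal sketch of the multi-group iteration described in item 3, for illustration only. `find_first_invalid_token` is a hypothetical helper, not a vLLM function, and it assumes every KV cache group uses the same block size, so that block index `i` covers tokens starting at `i * block_size`.

```python
from __future__ import annotations

from typing import Optional


def find_first_invalid_token(
    block_ids_per_group: tuple[list[int], ...],
    invalid_block_ids: set[int],
    block_size: int,
) -> Optional[int]:
    """Return the earliest token offset covered by an invalid block.

    Every group's block list is scanned, so a request with two or more
    groups no longer fails on a single-group unpack such as
    `(req_block_ids,) = block_ids_per_group`.
    """
    first_invalid_idx: Optional[int] = None
    for group_block_ids in block_ids_per_group:
        for idx, block_id in enumerate(group_block_ids):
            if block_id in invalid_block_ids:
                if first_invalid_idx is None or idx < first_invalid_idx:
                    first_invalid_idx = idx
                break  # later blocks in this group cannot be earlier
    if first_invalid_idx is None:
        return None  # no group holds an invalid block
    return first_invalid_idx * block_size
```

The returned offset is shared across groups: the request would be truncated to the earliest invalid position found in any group and recomputed from there.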

Validated end-to-end on Qwen3.6-35B-A3B (TP=8, 2-node H100 cluster)
through the LightBits LCF KV connector: the claim ->
WAITING_FOR_REMOTE_KVS -> async load -> miss recovery -> recompute
path works end to end, with no scheduler wedge and a stable engine
across restarts.

Jira: LCF-871
Signed-off-by: Arthur Hanson Rasmusson <arthur.r@lightbitslabs.com>
…nt after load-failure recovery

A KV-load-failure recovery truncates a hybrid request's computed
tokens; the request's mamba block table still holds the blocks
allocated for the original (longer) claim, so the next allocate
computes fewer required blocks than the table length and the strict
`assert num_required_blocks > len(req_blocks)` kills the engine
("num_required_blocks 39 < len(req_blocks) 40"). Nothing needs
allocating in that case — return the empty list; the surplus blocks
are freed by remove_skipped_blocks as the recompute progresses.
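
The relaxed behavior can be sketched as follows. This is an illustrative toy, not the `MambaManager` code: `allocate_new_blocks_sketch` and its free-list argument are hypothetical, and block IDs are plain integers.

```python
def allocate_new_blocks_sketch(req_blocks: list[int],
                               num_required_blocks: int,
                               free_block_ids: list[int]) -> list[int]:
    """Allocate only the blocks missing from the request's block table."""
    if num_required_blocks <= len(req_blocks):
        # After a load-failure recovery the table can be longer than the
        # truncated request needs. Nothing to allocate; the surplus is
        # assumed to be released elsewhere as the recompute progresses.
        return []
    num_new = num_required_blocks - len(req_blocks)
    new_blocks = [free_block_ids.pop() for _ in range(num_new)]
    req_blocks.extend(new_blocks)
    return new_blocks
```

For the case in the assert message (39 required, 40 held), the function returns an empty list and leaves the table untouched instead of raising.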

Reproduced live on Qwen3.6-35B (TP=8) via a transient external-KV
load failure followed by the recovery resched.

Jira: LCF-871
Signed-off-by: Arthur Hanson Rasmusson <arthur.r@lightbitslabs.com>