[Bugfix] Support external KV connectors on hybrid (mamba) models with prefix caching #7
Open
arthurrasmusson-lb wants to merge 2 commits into
… prefix caching

Three scheduler changes that let an external KV connector claim and
load prefixes on hybrid models (e.g. Qwen3.5/3.6 GDN + full attention)
when prefix caching (mamba align mode) is enabled:

1. _mamba_block_aligned_split: replace the blanket
   `num_external_computed_tokens == 0` assert with the actual contract:
   external claims must be block-aligned, mirroring what
   MambaManager.find_longest_cache_hit imposes on local hits. The split
   arithmetic already composes external tokens correctly.

2. Skip the aligned split when load_kv_async: the scheduler sets
   num_new_tokens=0 by design while remote KV streams in, and flooring
   0 to 0 broke out of scheduling before allocate_slots /
   update_state_after_alloc ran. The request stayed WAITING and
   re-claimed the same prefix every step, forever (verified with
   py-spy + DEBUG traces on an 8xH100 TP=8 deployment).

3. _update_requests_with_invalid_blocks: iterate every KV cache group's
   block list instead of the single-group unpack (the TODO(davidb)
   hybrid-allocator gap), so a failed external load on a hybrid model
   recovers to recompute instead of dying with "too many values to
   unpack".

Validated end-to-end on Qwen3.6-35B-A3B (TP=8, 2-node H100 cluster)
through the LightBits LCF KV connector: claim -> WAITING_FOR_REMOTE_KVS
-> async load -> miss recovery -> recompute all function; no scheduler
wedge, engine stable across restarts.

Jira: LCF-871
Signed-off-by: Arthur Hanson Rasmusson <arthur.r@lightbitslabs.com>
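
The first two changes can be illustrated with a minimal, standalone sketch. It is not the scheduler code: `aligned_new_tokens` and `plan_step` are hypothetical helpers, the flooring rule is simplified to "end the step on a block boundary", and the alignment contract is raised as a `ValueError` here rather than asserted.

```python
def aligned_new_tokens(
    num_computed_tokens: int,
    num_external_computed_tokens: int,
    num_new_tokens: int,
    block_size: int,
) -> int:
    """Floor the scheduled span so it ends on a block boundary (simplified)."""
    # Change 1: external claims are allowed, but must cover whole blocks,
    # the same constraint a local prefix-cache hit has to satisfy.
    if num_external_computed_tokens % block_size != 0:
        raise ValueError("external claim must be block-aligned")
    start = num_computed_tokens + num_external_computed_tokens
    end = start + num_new_tokens
    aligned_end = (end // block_size) * block_size
    return max(aligned_end - start, 0)


def plan_step(
    num_computed_tokens: int,
    num_external_computed_tokens: int,
    num_new_tokens: int,
    block_size: int,
    load_kv_async: bool,
) -> str:
    """Return "allocate" if the step proceeds to slot allocation, else "break"."""
    # Change 2: an async load schedules zero new tokens by design, so the
    # aligned split (and its zero-token early exit) is skipped entirely.
    if not load_kv_async:
        num_new_tokens = aligned_new_tokens(
            num_computed_tokens,
            num_external_computed_tokens,
            num_new_tokens,
            block_size,
        )
        if num_new_tokens == 0:
            return "break"
    return "allocate"
```

With a block size of 16, a claim of 10336 external tokens and zero new tokens, `plan_step(..., load_kv_async=True)` reaches allocation, whereas the same call with `load_kv_async=False` breaks out before allocating, which is the shape of the wedge described in item 2.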
…nt after load-failure recovery
A KV-load-failure recovery truncates a hybrid request's computed
tokens; the request's mamba block table still holds the blocks
allocated for the original (longer) claim, so the next allocate
computes fewer required blocks than the table length and the strict
`assert num_required_blocks > len(req_blocks)` kills the engine
("num_required_blocks 39 < len(req_blocks) 40"). Nothing needs
allocating in that case — return the empty list; the surplus blocks
are freed by remove_skipped_blocks as the recompute progresses.
Reproduced live on Qwen3.6-35B (TP=8) via a transient external-KV
load failure followed by the recovery reschedule.
Jira: LCF-871
Signed-off-by: Arthur Hanson Rasmusson <arthur.r@lightbitslabs.com>
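
The relaxed check described in this commit can be sketched as follows. This is an illustration under assumptions, not the manager code: `blocks_to_allocate` and the plain-list free pool are hypothetical stand-ins for the real block table and allocator.

```python
def blocks_to_allocate(
    num_required_blocks: int,
    req_blocks: list,
    free_block_ids: list,
) -> list:
    """Return the newly allocated block ids, or [] when nothing is needed."""
    # After a load-failure recovery the table can be longer than what the
    # truncated request needs (e.g. 39 required vs. 40 held). That is not
    # an error: there is simply nothing to allocate this step.
    if num_required_blocks <= len(req_blocks):
        return []
    num_new = num_required_blocks - len(req_blocks)
    # Hypothetical free pool: take ids from the end of a plain list.
    new_blocks = [free_block_ids.pop() for _ in range(num_new)]
    req_blocks.extend(new_blocks)
    return new_blocks
```

The surplus blocks stay in the table in this sketch; per the commit message, releasing them is left to `remove_skipped_blocks` as the recompute progresses.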
Summary
Hybrid models (Qwen3.5/3.6: GDN linear attention + full attention) currently cannot be used with an external KV connector when prefix caching is on (mamba align mode): the scheduler asserts external connectors out of the aligned split, async external loads wedge the scheduler in an infinite re-claim loop, and KV-load-failure recovery crashes on the multi-group block table. This PR fixes all three in `vllm/v1/core/sched/scheduler.py`:

1. `_mamba_block_aligned_split`: replace the blanket assert with the real contract. The split arithmetic already composes `num_external_computed_tokens`; what it needs is the same invariant `MambaManager.find_longest_cache_hit` imposes on local hits: claims aligned to the unified token-block size. The assert now enforces alignment instead of forbidding external connectors entirely. (The connector remains responsible for restoring the GDN/mamba boundary state snapshot alongside the attention KV; loading attention KV alone corrupts output, and a comment documents this.)
2. Skip the aligned split for `load_kv_async`. Async loads schedule zero new tokens by design; the split floors 0→0 and the `if num_new_tokens == 0: break` exits before `allocate_slots` / `update_state_after_alloc`, so the request never enters `WAITING_FOR_REMOTE_KVS` and the scheduler re-claims the same prefix every step, forever. Diagnosed with py-spy (engine MainThread healthy in the parked-wait loop) + DEBUG trail (`match_prefix ... matched=10336` repeating, `KV cache usage 0.0%`, `Waiting: 1`).
3. Multi-group `_update_requests_with_invalid_blocks` (the `TODO (davidb)` hybrid-allocator gap): iterate every KV cache group's block list instead of `(req_block_ids,) = ...`; a failed external load on a 2-group hybrid currently dies with `ValueError: too many values to unpack`. Per-request truncation bookkeeping is shared across groups, matching single-group semantics when only one group carries externally loaded blocks.

Notably, no allocator changes are needed: `MambaManager.get_num_blocks_to_allocate` / `allocate_new_blocks` in align mode already produce the null-padded 1-real-block shape for the claimed span, and `allocate_slots` explicitly supports `num_new_tokens == 0` with external tokens.

Validation
Validated end-to-end on Qwen3.6-35B-A3B (TP=8, 2× 8×H100, vLLM v0.22.1 + these changes applied at deploy time) through the LightBits LCF KV connector (LCF PR LightBitsLabs/Light-Coretex-Fabric#543 / vllm-project#545):

- `WAITING_FOR_REMOTE_KVS` → async load → resume: no scheduler wedge (previously: infinite re-claim loop on the first claimed request).
- … `ValueError` engine death).
- … `put_ops=560/step` = 80 attention + 480 state objects on TP=8, `put_errors=0`).

Full investigation notes: https://gist.github.com/arthurrasmusson-lb/fc4b89e27056c3013f48e0d21bf44064
Jira: LCF-871
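
For reviewers, the multi-group recovery in item 3 of the summary can be sketched roughly as below. The helper names and the plain-list representation of per-group block ids are assumptions made for illustration, and null-block padding in the mamba group is ignored; the point is only that every group is scanned and a single per-request truncation point is derived.

```python
from typing import Optional, Sequence, Set


def first_invalid_block_index(
    block_ids_per_group: Sequence[Sequence[int]],
    invalid_block_ids: Set[int],
) -> Optional[int]:
    """Earliest block index, across all KV cache groups, holding an invalid block."""
    earliest = None
    # Iterate every group instead of `(req_block_ids,) = block_ids_per_group`,
    # which raises ValueError as soon as there is more than one group.
    for group_block_ids in block_ids_per_group:
        for idx, block_id in enumerate(group_block_ids):
            if block_id in invalid_block_ids:
                if earliest is None or idx < earliest:
                    earliest = idx
                break  # only the first invalid block of a group matters
    return earliest


def truncated_num_computed_tokens(
    num_computed_tokens: int,
    block_ids_per_group: Sequence[Sequence[int]],
    invalid_block_ids: Set[int],
    block_size: int,
) -> int:
    """Shared per-request truncation: keep only tokens before the first bad block."""
    idx = first_invalid_block_index(block_ids_per_group, invalid_block_ids)
    if idx is None:
        return num_computed_tokens
    return min(num_computed_tokens, idx * block_size)
```

When only one group carries externally loaded blocks, the result is the same as scanning that group alone, which matches the single-group behaviour the description refers to.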