[WIP] feat(lora): LoRA adapter serving #83
Draft
qywu wants to merge 19 commits into
Full LoRA adapter serving implementation for tokenspeed, including:
## Scheduler (C++)
- Per-adapter prefix cache namespacing: lora_id threaded through KVPrefixCache::Match, HybridPrefixCache::Match, and InsertHybridCache so each adapter gets its own radix-tree root for prefix reuse
- EvictLoraNamespace: evicts KV pages and removes the virtual root on adapter unload
## LoraManager (Python)
- GPU weight pool with LRU eviction and TP-aware weight sharding
- Tiered GPU ↔ CPU ↔ disk pool with async prefetch
- CUDA-graph support: separate no-LoRA and with-LoRA graphs captured; segment-grouped Triton kernels for decode
- Attention LoRA: QKV, O-proj with TP sharding and head-dim awareness
- MLP LoRA: gate_proj / up_proj / down_proj targets
- MoE LoRA: sglang_shared_outer and per_expert formats with flat Triton kernels that eliminate gather copies; multi-stream prefetch overlaps A-shrink with base MoE GEMMs
- LM-head LoRA support
## MoE LoRA kernels (tokenspeed-kernel)
- shared_a_shrink, gate_up_b_expand: sglang_shared gate/up path
- per_expert_a_shrink, per_expert_gate_up_b_expand, per_expert_b_down_expand: per-expert format without buffer copies
- shared_b_down_expand: shared-B down projection
- sorted_gate_up_b_expand, sorted_a_down_shrink: TMA prefill path
- Multi-stream prefetch: flat_a_gemm / flat_down_shrink launched on a secondary CUDA stream concurrent with base MoE GEMMs
## HTTP / serving
- lora_path accepted on /v1/completions and /v1/chat/completions
- lora_path propagated through GenerateReqInput.__getitem__
- Pack scheduling policy + cold/warm latency benchmark
## Performance (Qwen3.5-35B-A3B TP=2 BS=8)
- sglang_shared_outer n=1: ~962 tok/s (vs 1325 baseline, overhead ~2.25ms)
- per_expert n=1: ~871 tok/s (vs 624 before flat-kernel optimization)
- self_attn n=1: ~988 tok/s
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
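The per-adapter prefix cache namespacing in the Scheduler section can be pictured with a small Python sketch. This is not the C++ implementation: the class `NamespacedPrefixCache`, its method names, and the token-level trie are assumptions made for illustration (the real cache is described as a radix tree over KV pages). It only shows the idea that each `lora_id` owns a separate root, so a prefix cached under one adapter never matches a request for another, and that dropping the root removes the whole namespace.

```python
class TrieNode:
    """One node of a toy token-level prefix tree."""

    def __init__(self):
        self.children = {}


class NamespacedPrefixCache:
    """Hypothetical stand-in: one independent prefix tree per lora_id."""

    def __init__(self):
        # lora_id -> root node; None is used here for the base model (assumption).
        self._roots = {}

    def insert(self, lora_id, tokens):
        node = self._roots.setdefault(lora_id, TrieNode())
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match(self, lora_id, tokens):
        """Return how many leading tokens are cached for this adapter."""
        node = self._roots.get(lora_id)
        if node is None:
            return 0
        matched = 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched

    def evict_lora_namespace(self, lora_id):
        """Drop the adapter's root; returns True if a namespace existed."""
        return self._roots.pop(lora_id, None) is not None
```

Keying the root by `lora_id` keeps the lookup path unchanged for the base model while preventing KV reuse across adapters, whose activations differ even for identical token prefixes.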
255da36 to
1c84488
Single commit with valid Signed-off-by makes the remediation config unnecessary. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Removes all benchmark scripts and result files from the PR branch. They remain on qywu/lora-dev for development use. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
These files existed before our branch — they were mistakenly removed along with the LoRA-specific additions. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
- tokenspeed_kernel/_triton.py: restored to upstream (no modifications)
- moe_lora.py: remove unused imports of fused_a_b_down_expand and fused_shared_a_b_gate_up_expand (experimental kernels not in hot path)
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Lazy-import refactor is unrelated to this LoRA PR. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
HIP/ROCm gluon conditional import change is unrelated to this LoRA PR. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
- tokenspeed_scheduler/__init__.py: restore PagedCacheGroupFamily and PrefixCacheAdjunctSpec exports (both are bound in python_module.cpp; our branch incorrectly removed them)
- tokenspeed-kernel/test/ops/test_lora_triton.py: move to qywu/lora-dev (LoRA test missed in previous sweep)
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…rom Python exports
The pre-installed tokenspeed_scheduler binary in CI was built before these types were added to the C++ extension, so importing them from the .so raises ImportError. Remove from __init__ until the installed binary is updated.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
The single-pass approach had a correctness bug: when a batch required
more adapters than could be evicted without touching batch adapters,
_find_free_slot would evict an adapter that was already assigned a slot
in per_request_slots. Those requests would then receive NO_LORA_SLOT and
silently run as the base model — wrong outputs with no error.
Fix with a two-phase approach:
Phase 1 — promote all unique adapters upfront:
- Early check: if n_unique > max_loras, raise RuntimeError immediately
instead of producing wrong outputs silently.
- Call _ensure_in_gpu for all batch adapters before assigning any slot.
- After each promotion, move_to_end (MRU) to prevent a subsequent
iteration from evicting an already-promoted batch adapter that
happens to be LRU in _gpu_lru.
- LRU eviction during this phase only targets adapters NOT in the batch.
Phase 2 — assign per_request_slots from the stable _name_to_slot map:
- All needed adapters are already on GPU; no evictions occur.
- Use _name_to_slot[name] directly (guaranteed present after phase 1).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
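The two-phase approach in the commit message above can be sketched in plain Python. This is a minimal model under stated assumptions, not the LoraManager code: `SlotPool`, the `-1` value of `NO_LORA_SLOT`, the `_free_slots` list, and the `batch_names` argument are hypothetical, and "loading to GPU" is reduced to bookkeeping. The names `prepare_loras`, `_ensure_in_gpu`, `_name_to_slot`, and `_gpu_lru` follow the commit message.

```python
from collections import OrderedDict

NO_LORA_SLOT = -1  # hypothetical sentinel value for "no adapter"


class SlotPool:
    """Toy GPU slot pool: adapter name -> slot index, tracked in LRU order."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self._name_to_slot = {}
        self._gpu_lru = OrderedDict()  # least recently used entry first
        self._free_slots = list(range(max_loras))

    def _ensure_in_gpu(self, name, batch_names):
        if name not in self._name_to_slot:
            if not self._free_slots:
                # Evict the LRU adapter that is NOT part of the current batch.
                victim = next(n for n in self._gpu_lru if n not in batch_names)
                del self._gpu_lru[victim]
                self._free_slots.append(self._name_to_slot.pop(victim))
            self._name_to_slot[name] = self._free_slots.pop(0)
            self._gpu_lru[name] = None
        # Mark as most recently used so later promotions do not pick it.
        self._gpu_lru.move_to_end(name)

    def prepare_loras(self, request_adapters):
        # Unique adapter names in first-seen order; None means base model.
        unique = list(dict.fromkeys(n for n in request_adapters if n is not None))
        if len(unique) > self.max_loras:
            raise RuntimeError(
                f"batch needs {len(unique)} adapters, max_loras={self.max_loras}"
            )
        batch_names = set(unique)
        # Phase 1: promote every batch adapter before assigning any slot.
        for name in unique:
            self._ensure_in_gpu(name, batch_names)
        # Phase 2: the map is now stable, so lookups cannot miss.
        return [
            NO_LORA_SLOT if n is None else self._name_to_slot[n]
            for n in request_adapters
        ]
```

Because the early check bounds the batch at `max_loras` unique adapters, a non-batch victim always exists whenever the pool is full and a batch adapter still needs a slot.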
When unload_adapter() is called while an adapter is still potentially
in-flight (used in the most recent prepare_loras batch), zeroing the
GPU slot immediately causes ongoing decode steps to produce wrong
outputs (zero LoRA delta = silent base-model behaviour).
Fix with a two-field deferred eviction mechanism:
_active_names — adapters used in the most recent prepare_loras call
_pending_eviction — names queued for eviction when no longer active
unload_adapter():
- Removes identity mappings immediately (blocks new requests)
- If adapter is in _active_names: adds to _pending_eviction + warning,
keeps CPU weights alive so retracted requests can still reload
- If adapter is not active: evicts GPU slot and CPU weights immediately
prepare_loras() (at the top of phase 1):
- Previous forward step is complete at this point
- Flushes _pending_eviction for adapters not in the current batch
- Updates _active_names to the current batch's unique adapter names
This also preserves correctness for retracted requests: if the scheduler
pauses a decode and later resumes it, _ensure_in_gpu reloads the weights
from the CPU copy, which is kept alive until the deferred eviction fires.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
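A minimal sketch of the deferred eviction mechanism described above, assuming a simplified manager. `DeferredUnload`, its `load` helper, the `gpu_slots` and `cpu_weights` dicts, and the string return values are hypothetical; the identity mappings are left out. Only `_active_names`, `_pending_eviction`, `unload_adapter`, and `prepare_loras` come from the commit message.

```python
class DeferredUnload:
    """Toy model: unloading an active adapter is postponed to a later step."""

    def __init__(self):
        self.gpu_slots = {}    # name -> slot index (stand-in for GPU weights)
        self.cpu_weights = {}  # name -> CPU copy of the weights
        self._active_names = set()
        self._pending_eviction = set()

    def load(self, name, slot):
        self.gpu_slots[name] = slot
        self.cpu_weights[name] = object()

    def _evict(self, name):
        self.gpu_slots.pop(name, None)
        self.cpu_weights.pop(name, None)

    def unload_adapter(self, name):
        if name in self._active_names:
            # Possibly in flight: keep GPU slot and CPU weights for now.
            self._pending_eviction.add(name)
            return "deferred"
        self._evict(name)
        return "evicted"

    def prepare_loras(self, batch_names):
        current = set(batch_names)
        # The previous forward step is complete here, so adapters that the
        # new batch does not use can be flushed safely.
        for name in list(self._pending_eviction):
            if name not in current:
                self._evict(name)
                self._pending_eviction.discard(name)
        self._active_names = current
```

In this model an adapter that keeps appearing in batches stays resident until the first step that no longer needs it.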
When unload_adapter() defers GPU eviction (mid-decode safety), the slot stays occupied until a batch without that adapter arrives. If the server goes idle with no further batches, the slot is never freed.
Add flush_pending_evictions() that immediately zeroes all deferred slots. Call this when the server is confirmed idle (no in-flight requests) to reclaim GPU capacity. Calling it mid-decode has the same unsafe semantics as the original immediate eviction, so the caller must ensure quiescence first.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…cted-request failures
Three bugs in the initial deferred eviction design:
1. _id_to_name cleared too early: unload_adapter deleted _id_to_name[lora_id]
immediately, so retracted requests that resume later saw lora_id=None and
silently ran as base model. Fix: keep _id_to_name alive until _flush_one_pending.
2. Re-registration overwrites pending eviction slot: if the same adapter name
is reloaded before the pending eviction fires, _evict_by_name("A") would zero
the NEW adapter's slot. Fix: _pending_eviction now stores (name, lora_id)
tuples; _flush_one_pending skips GPU eviction if _name_to_id[name] exists
(name was re-registered with a new id).
3. Double-eviction safety: LRU pressure may evict the GPU slot before the
deferred flush fires. _evict_by_name is already idempotent so this is safe,
but _flush_one_pending now explicitly handles the case (no-op if slot gone).
Add _flush_one_pending(name, lora_id) as the canonical flush helper, used by
both flush_pending_evictions() and the per-step flush in prepare_loras().
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
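The three fixes above can be illustrated with a small registry model. This is a sketch under assumptions: the `Registry` class, `register`, `unload_deferred`, and the integer id counter are hypothetical, and slot handling on re-registration is simplified to overwriting the map entry. `_id_to_name`, `_name_to_id`, `_pending_eviction`, `_evict_by_name`, and `_flush_one_pending` follow the commit message.

```python
class Registry:
    """Toy identity maps plus a pending-eviction set of (name, lora_id)."""

    def __init__(self):
        self._name_to_id = {}
        self._id_to_name = {}
        self._name_to_slot = {}
        self._pending_eviction = set()
        self._next_id = 0  # hypothetical id allocator

    def register(self, name, slot):
        lora_id = self._next_id
        self._next_id += 1
        self._name_to_id[name] = lora_id
        self._id_to_name[lora_id] = name
        self._name_to_slot[name] = slot
        return lora_id

    def unload_deferred(self, name):
        # Block new requests by name, but keep _id_to_name so retracted
        # requests that resume later can still resolve their lora_id.
        lora_id = self._name_to_id.pop(name)
        self._pending_eviction.add((name, lora_id))
        return lora_id

    def _evict_by_name(self, name):
        # Idempotent: a slot already taken by LRU pressure is a no-op.
        self._name_to_slot.pop(name, None)

    def _flush_one_pending(self, name, lora_id):
        self._pending_eviction.discard((name, lora_id))
        self._id_to_name.pop(lora_id, None)
        if name in self._name_to_id:
            # Name was re-registered under a new id; leave its slot alone.
            return
        self._evict_by_name(name)
```

Storing the id next to the name is what lets the flush tell the unloaded adapter apart from a newer adapter that reuses the same name.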
…am race
_reset_slot was calling zero_slot (dense) and clear_slot (MoE) which both issue GPU tensor.zero_() operations: potentially hundreds of kernel launches per eviction, one per buffer per layer.
More importantly, these GPU zeros have a correctness race:
- graph.replay() runs on a dedicated stream (cuda_graph_wrapper.self.stream)
- tensor.zero_() runs on the default PyTorch CUDA stream
Without explicit inter-stream synchronisation, a GPU zero can race with an in-flight graph kernel still reading the old weights on the other stream.
The zeros are defensive but not required: prepare_loras assigns weight_indices[i] only to slots in _name_to_slot. _evict_by_name removes the slot from _name_to_slot before _reset_slot runs, so no kernel ever reads from an evicted slot. Stale GPU values are overwritten when _load_to_slot reuses the slot for a new adapter.
Changes:
- _reset_slot: keep CPU metadata zeros (scalings, ranks); skip GPU zeros
- MoeLoraBuffers: add clear_slot_cpu_only() that removes the slot from the weights_by_layer dict (needed for the eager non-buffer path) without any GPU operations
- flush_pending_evictions: update docstring; now safe to call at any time since no GPU operations are involved in the eviction path
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…uler
Previously, the C++ scheduler was unaware of max_loras and could build batches requiring more unique LoRA adapter ids than the Python GPU pool could hold simultaneously. prepare_loras() then raised RuntimeError, or worse, silently produced wrong outputs when _find_free_slot evicted an already-assigned adapter.
Fix: thread max_loras through to the scheduler so the batch-building loop enforces the cap directly.
Changes:
- scheduler/types.h: add max_loras field (0 = LoRA disabled, no cap)
- scheduler/operations/forward.cpp: track batch_lora_ids (unordered_set) in newForwardOperation(); skip any request whose lora_id would push the count past max_loras; the request is deferred to the next step
- bindings/python_module.cpp: expose max_loras on SchedulerConfig
- scheduler_utils.py make_config(): add max_loras parameter
- event_loop.py: pass server_args.max_loras (0 when LoRA disabled)
With this change the prepare_loras() RuntimeError for n_unique > max_loras becomes unreachable in normal operation. The deferred requests are picked up in subsequent scheduling rounds, naturally co-scheduling same-adapter requests (Gap 1 from the Open Gaps doc section).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
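The batch-building cap can be sketched as follows. The real change lives in the C++ scheduler; this Python version is only an illustration, and the `build_batch` function, the `(request_id, lora_id)` tuple shape, and the use of `None` for base-model requests are assumptions. `batch_lora_ids` and the meaning of `max_loras` (0 means no cap) follow the commit message.

```python
def build_batch(pending, max_loras):
    """Split pending requests into (scheduled, deferred).

    pending: list of (request_id, lora_id) pairs, where lora_id is None for
    requests that use the base model (an assumption of this sketch).
    """
    batch_lora_ids = set()
    scheduled, deferred = [], []
    for request_id, lora_id in pending:
        is_new_adapter = lora_id is not None and lora_id not in batch_lora_ids
        if max_loras > 0 and is_new_adapter and len(batch_lora_ids) >= max_loras:
            # Admitting this request would exceed the cap; try next step.
            deferred.append((request_id, lora_id))
            continue
        if lora_id is not None:
            batch_lora_ids.add(lora_id)
        scheduled.append((request_id, lora_id))
    return scheduled, deferred
```

Requests for an adapter that is already in the batch are still admitted after the cap is reached, which is how same-adapter requests end up co-scheduled.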
Summary (WIP)
End-to-end LoRA adapter serving for tokenspeed. Branch is not yet rebased on current main — many test files appear as deletions because the last merge from main predates several recent PRs (#18, #51, etc.). Will refresh before un-drafting.
What's in this PR
feat(lora): scaffold LoRA adapter serving infrastructure.
lora_id through hybrid cache paths.
lora_path accepted on /v1/completions and /v1/chat/completions; propagated through GenerateReqInput.__getitem__.
--enable-lora works without CUDA graphs.
Status
This is an early draft — opening for visibility and review of the overall shape. Next steps before un-drafting:
main (resolve stale deletions of perf(eviction): O(k log N) eviction via persistent LRU set #18 / feat(deepseek-v4): add scheduler-managed sliding-window cache groups #51 test files).
--enable-lora (currently only C++ unit test test_lora_prefix_cache.cpp).
lora_path in the OpenAI-compat docs.
Test plan
test_lora_prefix_cache.cpp.