[WIP] feat(lora): LoRA adapter serving by qywu · Pull Request #83 · lightseekorg/tokenspeed

qywu · 2026-05-11T23:13:18Z

Summary (WIP)

End-to-end LoRA adapter serving for tokenspeed. Branch is not yet rebased on current main — many test files appear as deletions because the last merge from main predates several recent PRs (#18, #51, etc.). Will refresh before un-drafting.

What's in this PR

Scaffolding: feat(lora): scaffold LoRA adapter serving infrastructure.
Prefix-cache namespacing (C++): per-adapter namespacing in the scheduler so two adapters with the same prompt don't collide on cached KV.
HiCache wiring: thread lora_id through hybrid cache paths.
LoraManager: GPU weight pool with LRU eviction, TP-aware adapter application.
HTTP plumbing: lora_path accepted on /v1/completions and /v1/chat/completions; propagated through GenerateReqInput.__getitem__.
MLP target support: gate_proj / up_proj / down_proj LoRA targets in addition to attention QKV/output.
CUDA-graph support: segment-grouped Triton kernels; separate no-LoRA graph variant captured so base-only batches skip the LoRA path.
Tiered pool: GPU ↔ CPU ↔ disk pool with async prefetch.
Pack scheduling: pack policy + cold/warm latency benchmark.
Eager-mode fixes: --enable-lora works without CUDA graphs.
Misc perf: drop pure-PyTorch RMSNorm fallback in qk_norm; evict adapter namespace on unload.
Docs: HTML references for the LoRA implementation and the broader tokenspeed codebase structure.
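
To illustrate the prefix-cache namespacing item above, here is a minimal Python sketch, not the C++ scheduler code from this PR. It assumes a toy token trie with one root per adapter id; `NamespacedPrefixCache`, `insert`, `match`, and `evict_namespace` are hypothetical names chosen for the example, and `None` stands in for the base model.

```python
from typing import Dict, Optional, Sequence


class _Node:
    """One trie node; children are keyed by token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class NamespacedPrefixCache:
    """Toy prefix cache with a separate trie root per adapter id (hypothetical)."""

    def __init__(self) -> None:
        # lora_id -> root node; None is used here for base-model requests.
        self._roots: Dict[Optional[int], _Node] = {}

    def insert(self, tokens: Sequence[int], lora_id: Optional[int] = None) -> None:
        node = self._roots.setdefault(lora_id, _Node())
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def match(self, tokens: Sequence[int], lora_id: Optional[int] = None) -> int:
        """Return how many leading tokens are cached under this adapter's root."""
        node = self._roots.get(lora_id)
        matched = 0
        while node is not None and matched < len(tokens):
            node = node.children.get(tokens[matched])
            if node is not None:
                matched += 1
        return matched

    def evict_namespace(self, lora_id: Optional[int]) -> None:
        """Drop the whole per-adapter root, e.g. when the adapter is unloaded."""
        self._roots.pop(lora_id, None)


cache = NamespacedPrefixCache()
cache.insert([1, 2, 3], lora_id=7)
hit_same_adapter = cache.match([1, 2, 3, 4], lora_id=7)
hit_other_adapter = cache.match([1, 2, 3, 4], lora_id=8)
```

Because each adapter id has its own root, the same prompt under a different adapter never reuses KV computed with other weights, and unloading an adapter can drop its namespace in one step.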

Status

This is an early draft — opening for visibility and review of the overall shape. Next steps before un-drafting:

Rebase on current main (resolve stale deletions of test files from #18, perf(eviction): O(k log N) eviction via persistent LRU set, and #51, feat(deepseek-v4): add scheduler-managed sliding-window cache groups).
Add Python-level integration tests for --enable-lora (currently only C++ unit test test_lora_prefix_cache.cpp).
Benchmark numbers: cold-load, warm hit, pack vs no-pack throughput.
Document the HTTP API surface for lora_path in the OpenAI-compat docs.
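
Until the API docs exist, the request shape can be sketched as follows. This is an assumption-laden example: it supposes that `lora_path` is a top-level field of an otherwise OpenAI-style `/v1/completions` body, which this PR description does not spell out. The helper name, model name, and adapter path are hypothetical placeholders.

```python
import json
from typing import Optional


def build_completion_request(
    prompt: str,
    lora_path: Optional[str] = None,
    model: str = "base-model",  # hypothetical model name
    max_tokens: int = 32,
) -> str:
    """Serialize a completions request body, optionally selecting an adapter."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if lora_path is not None:
        # Assumed placement: top-level field next to the standard parameters.
        payload["lora_path"] = lora_path
    return json.dumps(payload)


body_with_adapter = build_completion_request("Hello", lora_path="/adapters/example")
body_base_only = build_completion_request("Hello")
```

Omitting `lora_path` in this sketch corresponds to a base-only request.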

Test plan

C++ unit test: test_lora_prefix_cache.cpp.
Python E2E: load base + 2 adapters, verify per-adapter outputs, prefix-cache namespace isolation.
TP=2 sanity check once this branch is rebased to pick up the dense-MLP TP fix from PR #80 (fix(qwen3): plumb tensor-parallel info through MLP layers), which is already in main.

Full LoRA adapter serving implementation for tokenspeed, including:

## Scheduler (C++)
- Per-adapter prefix cache namespacing: lora_id threaded through KVPrefixCache::Match, HybridPrefixCache::Match, and InsertHybridCache so each adapter gets its own radix-tree root for prefix reuse
- EvictLoraNamespace: evicts KV pages and removes the virtual root on adapter unload

## LoraManager (Python)
- GPU weight pool with LRU eviction and TP-aware weight sharding
- Tiered GPU ↔ CPU ↔ disk pool with async prefetch
- CUDA-graph support: separate no-LoRA and with-LoRA graphs captured; segment-grouped Triton kernels for decode
- Attention LoRA: QKV, O-proj with TP sharding and head-dim awareness
- MLP LoRA: gate_proj / up_proj / down_proj targets
- MoE LoRA: sglang_shared_outer and per_expert formats with flat Triton kernels that eliminate gather copies; multi-stream prefetch overlaps A-shrink with base MoE GEMMs
- LM-head LoRA support

## MoE LoRA kernels (tokenspeed-kernel)
- shared_a_shrink, gate_up_b_expand: sglang_shared gate/up path
- per_expert_a_shrink, per_expert_gate_up_b_expand, per_expert_b_down_expand: per-expert format without buffer copies
- shared_b_down_expand: shared-B down projection
- sorted_gate_up_b_expand, sorted_a_down_shrink: TMA prefill path
- Multi-stream prefetch: flat_a_gemm / flat_down_shrink launched on a secondary CUDA stream concurrent with base MoE GEMMs

## HTTP / serving
- lora_path accepted on /v1/completions and /v1/chat/completions
- lora_path propagated through GenerateReqInput.__getitem__
- Pack scheduling policy + cold/warm latency benchmark

## Performance (Qwen3.5-35B-A3B TP=2 BS=8)
- sglang_shared_outer n=1: ~962 tok/s (vs 1325 baseline, overhead ~2.25ms)
- per_expert n=1: ~871 tok/s (vs 624 before flat-kernel optimization)
- self_attn n=1: ~988 tok/s

Signed-off-by: Qingyang Wu <willqywu@gmail.com>


- tokenspeed_kernel/_triton.py: restored to upstream (no modifications)
- moe_lora.py: remove unused imports of fused_a_b_down_expand and fused_shared_a_b_gate_up_expand (experimental kernels not in hot path)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>


- tokenspeed_scheduler/__init__.py: restore PagedCacheGroupFamily and PrefixCacheAdjunctSpec exports (both are bound in python_module.cpp; our branch incorrectly removed them)
- tokenspeed-kernel/test/ops/test_lora_triton.py: move to qywu/lora-dev (LoRA test missed in previous sweep)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…rom Python exports

The pre-installed tokenspeed_scheduler binary in CI was built before these types were added to the C++ extension, so importing them from the .so raises ImportError. Remove from __init__ until the installed binary is updated.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

The single-pass approach had a correctness bug: when a batch required more adapters than could be evicted without touching batch adapters, _find_free_slot would evict an adapter that was already assigned a slot in per_request_slots. Those requests would then receive NO_LORA_SLOT and silently run as the base model: wrong outputs with no error.

Fix with a two-phase approach:

Phase 1 (promote all unique adapters upfront):
- Early check: if n_unique > max_loras, raise RuntimeError immediately instead of producing wrong outputs silently.
- Call _ensure_in_gpu for all batch adapters before assigning any slot.
- After each promotion, move_to_end (MRU) to prevent a subsequent iteration from evicting an already-promoted batch adapter that happens to be LRU in _gpu_lru.
- LRU eviction during this phase only targets adapters NOT in the batch.

Phase 2 (assign per_request_slots from the stable _name_to_slot map):
- All needed adapters are already on GPU; no evictions occur.
- Use _name_to_slot[name] directly (guaranteed present after phase 1).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
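
The two-phase idea can be sketched in a few lines of Python. This is a simplified illustration rather than the LoraManager code: `ToyLoraPool` and its `prepare` method are hypothetical, a single `OrderedDict` plays the role of both `_gpu_lru` and `_name_to_slot`, and weight loading is omitted entirely.

```python
from collections import OrderedDict
from typing import List, Optional, Set

NO_LORA_SLOT = -1  # sentinel for base-model requests in this sketch


class ToyLoraPool:
    def __init__(self, max_loras: int) -> None:
        self.max_loras = max_loras
        self._gpu_lru: "OrderedDict[str, int]" = OrderedDict()  # name -> slot, LRU first
        self._free: List[int] = list(range(max_loras))

    def _ensure_in_gpu(self, name: str, batch: Set[str]) -> None:
        if name in self._gpu_lru:
            return
        if not self._free:
            # Evict the least recently used adapter that is NOT in this batch.
            victim = next(n for n in self._gpu_lru if n not in batch)
            self._free.append(self._gpu_lru.pop(victim))
        self._gpu_lru[name] = self._free.pop()

    def prepare(self, request_adapters: List[Optional[str]]) -> List[int]:
        # Unique adapter names in first-seen order (None means base model).
        unique = list(dict.fromkeys(n for n in request_adapters if n is not None))
        if len(unique) > self.max_loras:
            raise RuntimeError("batch needs more adapters than GPU slots")
        batch = set(unique)
        # Phase 1: promote every adapter before any slot is handed out.
        for name in unique:
            self._ensure_in_gpu(name, batch)
            self._gpu_lru.move_to_end(name)  # mark as most recently used
        # Phase 2: the mapping is now stable, so a plain lookup is enough.
        return [NO_LORA_SLOT if n is None else self._gpu_lru[n] for n in request_adapters]


pool = ToyLoraPool(max_loras=2)
pool.prepare(["a", "b"])
slots = pool.prepare(["b", "c", None, "b"])
```

In the second call, adapter "a" is the only eviction candidate because "b" belongs to the current batch, so no request in the batch loses its slot.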

When unload_adapter() is called while an adapter is still potentially in-flight (used in the most recent prepare_loras batch), zeroing the GPU slot immediately causes ongoing decode steps to produce wrong outputs (zero LoRA delta = silent base-model behaviour).

Fix with a two-field deferred eviction mechanism:
- _active_names: adapters used in the most recent prepare_loras call
- _pending_eviction: names queued for eviction when no longer active

unload_adapter():
- Removes identity mappings immediately (blocks new requests)
- If adapter is in _active_names: adds to _pending_eviction + warning, keeps CPU weights alive so retracted requests can still reload
- If adapter is not active: evicts GPU slot and CPU weights immediately

prepare_loras() (at the top of phase 1):
- Previous forward step is complete at this point
- Flushes _pending_eviction for adapters not in the current batch
- Updates _active_names to the current batch's unique adapter names

This also preserves correctness for retracted requests: if the scheduler pauses a decode and later resumes it, _ensure_in_gpu reloads the weights from the CPU copy, which is kept alive until the deferred eviction fires.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
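
A minimal sketch of the deferred-eviction bookkeeping, assuming plain dicts in place of real GPU slots and CPU weights. `DeferredUnload`, `load`, and `_evict` are hypothetical names, and the removal of identity mappings is left out; only the `_active_names` / `_pending_eviction` interplay follows the description above.

```python
import itertools
from typing import Dict, Iterable, Set


class DeferredUnload:
    def __init__(self) -> None:
        self.gpu_slots: Dict[str, int] = {}    # name -> slot (stand-in for GPU state)
        self.cpu_weights: Dict[str, str] = {}  # name -> weights (stand-in)
        self._active_names: Set[str] = set()
        self._pending_eviction: Set[str] = set()
        self._next_slot = itertools.count()

    def load(self, name: str, weights: str) -> None:
        self.cpu_weights[name] = weights
        self.gpu_slots[name] = next(self._next_slot)

    def _evict(self, name: str) -> None:
        self.gpu_slots.pop(name, None)
        self.cpu_weights.pop(name, None)

    def unload_adapter(self, name: str) -> None:
        if name in self._active_names:
            # Possibly still in flight: defer, keep GPU slot and CPU weights.
            self._pending_eviction.add(name)
        else:
            self._evict(name)

    def prepare_loras(self, batch_names: Iterable[str]) -> None:
        current = set(batch_names)
        # The previous forward step has finished, so adapters that are
        # pending and absent from this batch can be evicted now.
        for name in self._pending_eviction - current:
            self._evict(name)
        self._pending_eviction &= current
        self._active_names = current


mgr = DeferredUnload()
mgr.load("A", "weights-A")
mgr.prepare_loras(["A"])
mgr.unload_adapter("A")           # deferred: "A" was in the last batch
still_resident = "A" in mgr.gpu_slots
mgr.prepare_loras([])             # next step without "A" flushes it
evicted_later = "A" not in mgr.gpu_slots
```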

When unload_adapter() defers GPU eviction (mid-decode safety), the slot stays occupied until a batch without that adapter arrives. If the server goes idle with no further batches, the slot is never freed.

Add flush_pending_evictions() that immediately zeroes all deferred slots. Call this when the server is confirmed idle (no in-flight requests) to reclaim GPU capacity. Calling it mid-decode has the same unsafe semantics as the original immediate eviction, so the caller must ensure quiescence first.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…cted-request failures

Three bugs in the initial deferred eviction design:

1. _id_to_name cleared too early: unload_adapter deleted _id_to_name[lora_id] immediately, so retracted requests that resume later saw lora_id=None and silently ran as base model. Fix: keep _id_to_name alive until _flush_one_pending.
2. Re-registration overwrites pending eviction slot: if the same adapter name is reloaded before the pending eviction fires, _evict_by_name("A") would zero the NEW adapter's slot. Fix: _pending_eviction now stores (name, lora_id) tuples; _flush_one_pending skips GPU eviction if _name_to_id[name] exists (name was re-registered with a new id).
3. Double-eviction safety: LRU pressure may evict the GPU slot before the deferred flush fires. _evict_by_name is already idempotent so this is safe, but _flush_one_pending now explicitly handles the case (no-op if slot gone).

Add _flush_one_pending(name, lora_id) as the canonical flush helper, used by both flush_pending_evictions() and the per-step flush in prepare_loras().

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
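
The three fixes combine into one small helper. The sketch below is an illustration under stated assumptions: it is a free function taking the maps as arguments rather than a LoraManager method, the dict arguments mirror the `_name_to_id`, `_id_to_name`, and `_name_to_slot` fields named above, and the boolean return value is added here only to make the behaviour observable.

```python
from typing import Dict


def flush_one_pending(
    name: str,
    lora_id: int,
    name_to_id: Dict[str, int],
    id_to_name: Dict[int, str],
    name_to_slot: Dict[str, int],
) -> bool:
    """Finish a deferred unload; return True only if a GPU slot was released."""
    # Fix 1: the id -> name mapping is dropped here, not in unload_adapter,
    # so resumed requests could still resolve the old id until now.
    id_to_name.pop(lora_id, None)
    # Fix 2: the name was re-registered under a new id; its slot now belongs
    # to the new adapter and must not be touched.
    if name in name_to_id:
        return False
    # Fix 3: idempotent; a no-op if LRU pressure already freed the slot.
    return name_to_slot.pop(name, None) is not None


# Case 1: plain deferred unload of adapter "A" with id 1.
n2i, i2n, n2s = {}, {1: "A"}, {"A": 0}
released = flush_one_pending("A", 1, n2i, i2n, n2s)

# Case 2: "A" was re-registered with id 2 before the flush fired.
n2i2, i2n2, n2s2 = {"A": 2}, {1: "A", 2: "A"}, {"A": 0}
released_after_reload = flush_one_pending("A", 1, n2i2, i2n2, n2s2)
```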

…am race

_reset_slot was calling zero_slot (dense) and clear_slot (MoE) which both issue GPU tensor.zero_() operations, potentially hundreds of kernel launches per eviction, one per buffer per layer. More importantly, these GPU zeros have a correctness race:
- graph.replay() runs on a dedicated stream (cuda_graph_wrapper.self.stream)
- tensor.zero_() runs on the default PyTorch CUDA stream

Without explicit inter-stream synchronisation, a GPU zero can race with an in-flight graph kernel still reading the old weights on the other stream.

The zeros are defensive but not required: prepare_loras assigns weight_indices[i] only to slots in _name_to_slot. _evict_by_name removes the slot from _name_to_slot before _reset_slot runs, so no kernel ever reads from an evicted slot. Stale GPU values are overwritten when _load_to_slot reuses the slot for a new adapter.

Changes:
- _reset_slot: keep CPU metadata zeros (scalings, ranks); skip GPU zeros
- MoeLoraBuffers: add clear_slot_cpu_only() that removes the slot from the weights_by_layer dict (needed for the eager non-buffer path) without any GPU operations
- flush_pending_evictions: update docstring; now safe to call at any time since no GPU operations are involved in the eviction path

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
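
The argument that the GPU zeros are unnecessary rests on the name-to-slot map being the only way a slot is ever referenced. A pure-Python sketch of that invariant follows. It assumes a list standing in for the GPU buffers, and `SlotTable` with its methods is hypothetical; no CUDA streams are involved here.

```python
from typing import Dict, List, Optional


class SlotTable:
    def __init__(self, num_slots: int) -> None:
        self.gpu_weights: List[Optional[str]] = [None] * num_slots  # stand-in for GPU buffers
        self.scalings: List[float] = [0.0] * num_slots              # CPU-side metadata
        self._name_to_slot: Dict[str, int] = {}
        self._free: List[int] = list(range(num_slots))

    def load(self, name: str, weights: str, scaling: float) -> int:
        slot = self._free.pop()
        self.gpu_weights[slot] = weights  # overwrites any stale contents
        self.scalings[slot] = scaling
        self._name_to_slot[name] = slot
        return slot

    def evict(self, name: str) -> None:
        slot = self._name_to_slot.pop(name, None)  # unmap first
        if slot is None:
            return
        self.scalings[slot] = 0.0  # CPU metadata reset only
        self._free.append(slot)    # the "GPU" buffer is left untouched

    def weight_indices(self, batch_names: List[str]) -> List[int]:
        # Only mapped slots can ever be handed to a kernel.
        return [self._name_to_slot[n] for n in batch_names]


table = SlotTable(num_slots=1)
slot_a = table.load("A", "weights-A", scaling=2.0)
table.evict("A")
stale_value = table.gpu_weights[slot_a]   # stale data remains after eviction
slot_b = table.load("B", "weights-B", scaling=1.0)
```

The stale value is harmless in this model because nothing maps to the slot until `load` overwrites it.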

…uler

Previously, the C++ scheduler was unaware of max_loras and could build batches requiring more unique LoRA adapter ids than the Python GPU pool could hold simultaneously. prepare_loras() then raised RuntimeError, or worse, silently produced wrong outputs when _find_free_slot evicted an already-assigned adapter.

Fix: thread max_loras through to the scheduler so the batch-building loop enforces the cap directly.

Changes:
- scheduler/types.h: add max_loras field (0 = LoRA disabled, no cap)
- scheduler/operations/forward.cpp: track batch_lora_ids (unordered_set) in newForwardOperation(); skip any request whose lora_id would push the count past max_loras; the request is deferred to the next step
- bindings/python_module.cpp: expose max_loras on SchedulerConfig
- scheduler_utils.py make_config(): add max_loras parameter
- event_loop.py: pass server_args.max_loras (0 when LoRA disabled)

With this change the prepare_loras() RuntimeError for n_unique > max_loras becomes unreachable in normal operation. The deferred requests are picked up in subsequent scheduling rounds, naturally co-scheduling same-adapter requests (Gap 1 from the Open Gaps doc section).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
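
The capped batch-building loop can be sketched in Python, although the actual change lives in the C++ scheduler. Here `build_batch` is a hypothetical stand-in for the loop in newForwardOperation(), requests are modelled as `(request_id, lora_id)` pairs, and representing base-model requests with `lora_id=None` is an assumption of this example.

```python
from typing import List, Optional, Set, Tuple

Request = Tuple[str, Optional[int]]  # (request_id, lora_id); None = base model


def build_batch(waiting: List[Request], max_loras: int) -> Tuple[List[Request], List[Request]]:
    """Split waiting requests into (scheduled, deferred) under the adapter cap."""
    scheduled: List[Request] = []
    deferred: List[Request] = []
    batch_lora_ids: Set[int] = set()
    for req in waiting:
        lora_id = req[1]
        is_new_adapter = lora_id is not None and lora_id not in batch_lora_ids
        # max_loras == 0 means LoRA is disabled, so no cap is applied.
        if max_loras > 0 and is_new_adapter and len(batch_lora_ids) >= max_loras:
            deferred.append(req)  # retried in the next scheduling round
            continue
        if lora_id is not None:
            batch_lora_ids.add(lora_id)
        scheduled.append(req)
    return scheduled, deferred


waiting = [("r1", 1), ("r2", 2), ("r3", 3), ("r4", 1), ("r5", None)]
scheduled, deferred = build_batch(waiting, max_loras=2)
```

With a cap of two, the request for adapter 3 is deferred while the later request for adapter 1 still joins the batch, which is the co-scheduling effect described above.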

qywu force-pushed the feat/lora-adapter-serving branch from 255da36 to 1c84488 on May 25, 2026 03:18

qywu added 18 commits May 25, 2026 03:19

chore: remove dco.yml — no longer needed after squash

7f0e675

Single commit with valid Signed-off-by makes the remediation config unnecessary. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: move benchmark files to qywu/lora-dev branch

98adfca

Removes all benchmark scripts and result files from the PR branch. They remain on qywu/lora-dev for development use. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: move LoRA doc files to qywu/lora-dev branch

de4cf46

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: move LoRA test files to qywu/lora-dev branch

dc7a35b

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: move scheduler LoRA test to qywu/lora-dev branch

0646be2

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: revert CMakeLists.txt LoRA test entry (moved to qywu/lora-dev)

3477a66

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: restore docs/index.md and test/runners.py from upstream

ddbd79a

These files existed before our branch — they were mistakenly removed along with the LoRA-specific additions. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: revert tokenspeed_kernel/__init__.py to upstream

d6e442b

Lazy-import refactor is unrelated to this LoRA PR. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: revert attention/__init__.py to upstream

3ad51aa

HIP/ROCm gluon conditional import change is unrelated to this LoRA PR. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu wants to merge 19 commits into lightseekorg:main from qywu:feat/lora-adapter-serving
