
[WIP] feat(lora): LoRA adapter serving#83

Draft
qywu wants to merge 19 commits into lightseekorg:main from qywu:feat/lora-adapter-serving


@qywu qywu commented May 11, 2026

Summary (WIP)

End-to-end LoRA adapter serving for tokenspeed. Branch is not yet rebased on current main — many test files appear as deletions because the last merge from main predates several recent PRs (#18, #51, etc.). Will refresh before un-drafting.

What's in this PR

  • Scaffolding: feat(lora): scaffold LoRA adapter serving infrastructure.
  • Prefix-cache namespacing (C++): per-adapter namespacing in the scheduler so two adapters with the same prompt don't collide on cached KV.
  • HiCache wiring: thread lora_id through hybrid cache paths.
  • LoraManager: GPU weight pool with LRU eviction, TP-aware adapter application.
  • HTTP plumbing: lora_path accepted on /v1/completions and /v1/chat/completions; propagated through GenerateReqInput.__getitem__.
  • MLP target support: gate_proj / up_proj / down_proj LoRA targets in addition to attention QKV/output.
  • CUDA-graph support: segment-grouped Triton kernels; separate no-LoRA graph variant captured so base-only batches skip the LoRA path.
  • Tiered pool: GPU ↔ CPU ↔ disk pool with async prefetch.
  • Pack scheduling: pack policy + cold/warm latency benchmark.
  • Eager-mode fixes: --enable-lora works without CUDA graphs.
  • Misc perf: drop pure-PyTorch RMSNorm fallback in qk_norm; evict adapter namespace on unload.
  • Docs: HTML references for the LoRA implementation and the broader tokenspeed codebase structure.

Status

This is an early draft — opening for visibility and review of the overall shape. Next steps before un-drafting:

Test plan

  • C++ unit test: test_lora_prefix_cache.cpp.
  • Python E2E: load base + 2 adapters, verify per-adapter outputs, prefix-cache namespace isolation.
  • TP=2 sanity check once this branch picks up the dense-MLP TP fix from PR #80 (fix(qwen3): plumb tensor-parallel info through MLP layers). That fix is already in main, so this branch needs a rebase.

Full LoRA adapter serving implementation for tokenspeed, including:

## Scheduler (C++)
- Per-adapter prefix cache namespacing: lora_id threaded through
  KVPrefixCache::Match, HybridPrefixCache::Match, and InsertHybridCache
  so each adapter gets its own radix-tree root for prefix reuse
- EvictLoraNamespace: evicts KV pages and removes the virtual root on
  adapter unload
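
A minimal Python sketch of the namespacing idea described above, for readers who want the shape without reading the C++. It is not the scheduler's code: `NamespacedPrefixCache`, `_Node`, and the token-level trie are hypothetical stand-ins for the radix tree, and only the "one root per lora_id" and "drop the root on unload" behaviours are taken from the description.

```python
from typing import Dict, List, Optional


class _Node:
    """One trie node; children are keyed by token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class NamespacedPrefixCache:
    """Toy prefix cache with a separate root per adapter (None = base model)."""

    def __init__(self) -> None:
        self._roots: Dict[Optional[int], _Node] = {}

    def insert(self, tokens: List[int], lora_id: Optional[int] = None) -> None:
        # Each lora_id gets its own root, so identical prompts never share nodes.
        node = self._roots.setdefault(lora_id, _Node())
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def match(self, tokens: List[int], lora_id: Optional[int] = None) -> int:
        """Return the length of the longest cached prefix in this namespace."""
        node = self._roots.get(lora_id)
        if node is None:
            return 0
        matched = 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched

    def evict_namespace(self, lora_id: int) -> None:
        # Analogue of EvictLoraNamespace: remove the adapter's root entirely.
        self._roots.pop(lora_id, None)
```

With this layout a prompt cached under adapter 7 is a miss for adapter 8 and for the base model, which is the isolation property the E2E test plan checks.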

## LoraManager (Python)
- GPU weight pool with LRU eviction and TP-aware weight sharding
- Tiered GPU ↔ CPU ↔ disk pool with async prefetch
- CUDA-graph support: separate no-LoRA and with-LoRA graphs captured;
  segment-grouped Triton kernels for decode
- Attention LoRA: QKV, O-proj with TP sharding and head-dim awareness
- MLP LoRA: gate_proj / up_proj / down_proj targets
- MoE LoRA: sglang_shared_outer and per_expert formats with flat Triton
  kernels that eliminate gather copies; multi-stream prefetch overlaps
  A-shrink with base MoE GEMMs
- LM-head LoRA support

## MoE LoRA kernels (tokenspeed-kernel)
- shared_a_shrink, gate_up_b_expand: sglang_shared gate/up path
- per_expert_a_shrink, per_expert_gate_up_b_expand,
  per_expert_b_down_expand: per-expert format without buffer copies
- shared_b_down_expand: shared-B down projection
- sorted_gate_up_b_expand, sorted_a_down_shrink: TMA prefill path
- Multi-stream prefetch: flat_a_gemm / flat_down_shrink launched on a
  secondary CUDA stream concurrent with base MoE GEMMs

## HTTP / serving
- lora_path accepted on /v1/completions and /v1/chat/completions
- lora_path propagated through GenerateReqInput.__getitem__
- Pack scheduling policy + cold/warm latency benchmark

## Performance (Qwen3.5-35B-A3B TP=2 BS=8)
- sglang_shared_outer n=1: ~962 tok/s (vs 1325 baseline, overhead ~2.25ms)
- per_expert n=1: ~871 tok/s (vs 624 before flat-kernel optimization)
- self_attn n=1: ~988 tok/s
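
One way to relate the throughput figures to the quoted per-step overhead, as a rough sanity check. This assumes the tok/s numbers are aggregate across the batch of 8 and that each decode step emits one token per sequence; the PR does not state how the overhead was measured, so treat the helper `step_ms` and this reading as assumptions.

```python
def step_ms(tokens_per_s: float, batch_size: int = 8) -> float:
    """Milliseconds per decode step if every step emits batch_size tokens."""
    return 1000.0 * batch_size / tokens_per_s


# Assumed reading: per-step LoRA overhead = LoRA step time - baseline step time.
shared_outer_overhead_ms = step_ms(962) - step_ms(1325)

# Relative gain of the flat-kernel optimization for the per_expert format.
per_expert_speedup = 871 / 624

print(f"overhead per step: {shared_outer_overhead_ms:.2f} ms")
print(f"per_expert speedup: {per_expert_speedup:.2f}x")
```

Under these assumptions the overhead comes out at roughly 2.3 ms per step, in the same range as the ~2.25 ms quoted above, and the per_expert path is about 1.4x faster than before the flat kernels.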

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the feat/lora-adapter-serving branch from 255da36 to 1c84488 Compare May 25, 2026 03:18
qywu added 18 commits May 25, 2026 03:19
Single commit with valid Signed-off-by makes the remediation config
unnecessary.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Removes all benchmark scripts and result files from the PR branch.
They remain on qywu/lora-dev for development use.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
These files existed before our branch — they were mistakenly removed
along with the LoRA-specific additions.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
- tokenspeed_kernel/_triton.py: restored to upstream (no modifications)
- moe_lora.py: remove unused imports of fused_a_b_down_expand and
  fused_shared_a_b_gate_up_expand (experimental kernels not in hot path)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Lazy-import refactor is unrelated to this LoRA PR.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
HIP/ROCm gluon conditional import change is unrelated to this LoRA PR.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
- tokenspeed_scheduler/__init__.py: restore PagedCacheGroupFamily and
  PrefixCacheAdjunctSpec exports (both are bound in python_module.cpp;
  our branch incorrectly removed them)
- tokenspeed-kernel/test/ops/test_lora_triton.py: move to qywu/lora-dev
  (LoRA test missed in previous sweep)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…rom Python exports

The pre-installed tokenspeed_scheduler binary in CI was built before
these types were added to the C++ extension, so importing them from
the .so raises ImportError. Remove from __init__ until the installed
binary is updated.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
The single-pass approach had a correctness bug: when a batch required
more adapters than could be evicted without touching batch adapters,
_find_free_slot would evict an adapter that was already assigned a slot
in per_request_slots. Those requests would then receive NO_LORA_SLOT and
silently run as the base model — wrong outputs with no error.

Fix with a two-phase approach:

Phase 1 — promote all unique adapters upfront:
  - Early check: if n_unique > max_loras, raise RuntimeError immediately
    instead of producing wrong outputs silently.
  - Call _ensure_in_gpu for all batch adapters before assigning any slot.
  - After each promotion, move_to_end (MRU) to prevent a subsequent
    iteration from evicting an already-promoted batch adapter that
    happens to be LRU in _gpu_lru.
  - LRU eviction during this phase only targets adapters NOT in the batch.

Phase 2 — assign per_request_slots from the stable _name_to_slot map:
  - All needed adapters are already on GPU; no evictions occur.
  - Use _name_to_slot[name] directly (guaranteed present after phase 1).
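
The two phases above can be sketched as follows. This is an illustration under simplifying assumptions, not the LoraManager code: `SlotPool` is a hypothetical class, a single `OrderedDict` plays the role of both `_gpu_lru` and `_name_to_slot`, the value of `NO_LORA_SLOT` is assumed, and no weights are actually copied.

```python
from collections import OrderedDict
from typing import List, Optional, Set

NO_LORA_SLOT = -1  # sentinel for base-model requests (value is an assumption)


class SlotPool:
    def __init__(self, max_loras: int) -> None:
        self.max_loras = max_loras
        # name -> slot, ordered from least to most recently used.
        self._gpu_lru: "OrderedDict[str, int]" = OrderedDict()
        self._free_slots: List[int] = list(range(max_loras))

    def _ensure_in_gpu(self, name: str, protected: Set[str]) -> None:
        if name in self._gpu_lru:
            return
        if not self._free_slots:
            # Evict the LRU adapter that the current batch does NOT need.
            victim = next(n for n in self._gpu_lru if n not in protected)
            self._free_slots.append(self._gpu_lru.pop(victim))
        self._gpu_lru[name] = self._free_slots.pop()

    def prepare(self, batch: List[Optional[str]]) -> List[int]:
        unique = list(dict.fromkeys(n for n in batch if n is not None))
        if len(unique) > self.max_loras:
            raise RuntimeError(
                f"batch needs {len(unique)} adapters, pool holds {self.max_loras}"
            )
        protected = set(unique)
        # Phase 1: promote every batch adapter before assigning any slot.
        for name in unique:
            self._ensure_in_gpu(name, protected)
            self._gpu_lru.move_to_end(name)  # mark as most recently used
        # Phase 2: all adapters are resident, so lookups cannot miss or evict.
        return [NO_LORA_SLOT if n is None else self._gpu_lru[n] for n in batch]
```

Because the early check guarantees the batch fits, the victim search in phase 1 always finds an adapter outside the batch.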

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
When unload_adapter() is called while an adapter is still potentially
in-flight (used in the most recent prepare_loras batch), zeroing the
GPU slot immediately causes ongoing decode steps to produce wrong
outputs (zero LoRA delta = silent base-model behaviour).

Fix with a two-field deferred eviction mechanism:
  _active_names  — adapters used in the most recent prepare_loras call
  _pending_eviction — names queued for eviction when no longer active

unload_adapter():
  - Removes identity mappings immediately (blocks new requests)
  - If adapter is in _active_names: adds to _pending_eviction + warning,
    keeps CPU weights alive so retracted requests can still reload
  - If adapter is not active: evicts GPU slot and CPU weights immediately

prepare_loras() (at the top of phase 1):
  - Previous forward step is complete at this point
  - Flushes _pending_eviction for adapters not in the current batch
  - Updates _active_names to the current batch's unique adapter names

This also preserves correctness for retracted requests: if the scheduler
pauses a decode and later resumes it, _ensure_in_gpu reloads the weights
from the CPU copy, which is kept alive until the deferred eviction fires.
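
A compact sketch of the two-field mechanism as described in this commit message (the initial design, before the follow-up fixes). `DeferredUnloader` and its plain-dict stand-ins for the GPU pool and CPU weights are hypothetical; only `_active_names`, `_pending_eviction`, `unload_adapter`, and `prepare_loras` are names from the description.

```python
from typing import Dict, Iterable, Set


class DeferredUnloader:
    def __init__(self) -> None:
        self.gpu_slots: Dict[str, int] = {}        # name -> slot (stand-in for the GPU pool)
        self.cpu_weights: Dict[str, object] = {}   # name -> host copy of the weights
        self._active_names: Set[str] = set()       # adapters in the most recent batch
        self._pending_eviction: Set[str] = set()   # unloads waiting for a safe point

    def _evict(self, name: str) -> None:
        self.gpu_slots.pop(name, None)
        self.cpu_weights.pop(name, None)

    def unload_adapter(self, name: str) -> None:
        if name in self._active_names:
            # Possibly still read by an in-flight decode step: defer.
            self._pending_eviction.add(name)
        else:
            self._evict(name)

    def prepare_loras(self, batch_names: Iterable[str]) -> None:
        current = set(batch_names)
        # The previous forward step has completed, so adapters that are
        # pending and absent from this batch can be released now.
        for name in list(self._pending_eviction - current):
            self._evict(name)
            self._pending_eviction.discard(name)
        self._active_names = current
```

An adapter unloaded while active keeps both its slot and its CPU copy until the first batch that no longer references it.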

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
When unload_adapter() defers GPU eviction (mid-decode safety), the slot
stays occupied until a batch without that adapter arrives.  If the server
goes idle with no further batches, the slot is never freed.

Add flush_pending_evictions() that immediately zeroes all deferred slots.
Call this when the server is confirmed idle (no in-flight requests) to
reclaim GPU capacity.  Calling it mid-decode has the same unsafe
semantics as the original immediate eviction, so the caller must ensure
quiescence first.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…cted-request failures

Three bugs in the initial deferred eviction design:

1. _id_to_name cleared too early: unload_adapter deleted _id_to_name[lora_id]
   immediately, so retracted requests that resume later saw lora_id=None and
   silently ran as base model. Fix: keep _id_to_name alive until _flush_one_pending.

2. Re-registration overwrites pending eviction slot: if the same adapter name
   is reloaded before the pending eviction fires, _evict_by_name("A") would zero
   the NEW adapter's slot. Fix: _pending_eviction now stores (name, lora_id)
   tuples; _flush_one_pending skips GPU eviction if _name_to_id[name] exists
   (name was re-registered with a new id).

3. Double-eviction safety: LRU pressure may evict the GPU slot before the
   deferred flush fires. _evict_by_name is already idempotent so this is safe,
   but _flush_one_pending now explicitly handles the case (no-op if slot gone).

Add _flush_one_pending(name, lora_id) as the canonical flush helper, used by
both flush_pending_evictions() and the per-step flush in prepare_loras().
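
The three fixes can be illustrated with a small sketch of the flush helper. The class `PendingFlush` and its plain dictionaries are hypothetical; the field names follow the commit message, and the body is a guess at the control flow rather than the actual implementation.

```python
from typing import Dict, Set, Tuple


class PendingFlush:
    def __init__(self) -> None:
        self._name_to_id: Dict[str, int] = {}
        self._id_to_name: Dict[int, str] = {}
        self._name_to_slot: Dict[str, int] = {}
        self._pending_eviction: Set[Tuple[str, int]] = set()

    def _flush_one_pending(self, name: str, lora_id: int) -> None:
        self._pending_eviction.discard((name, lora_id))
        # Fix 1: the id -> name mapping survives until the flush, so a
        # retracted request can still resolve its adapter before this point.
        self._id_to_name.pop(lora_id, None)
        # Fix 2: if the name was registered again, the slot now belongs to
        # the new adapter and must not be evicted.
        if name in self._name_to_id:
            return
        # Fix 3: LRU pressure may already have freed the slot; pop with a
        # default makes that case a no-op.
        self._name_to_slot.pop(name, None)

    def flush_pending_evictions(self) -> None:
        for name, lora_id in list(self._pending_eviction):
            self._flush_one_pending(name, lora_id)
```

Storing `(name, lora_id)` rather than the bare name is what lets the helper tell the unloaded adapter apart from a later registration under the same name.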

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…am race

_reset_slot was calling zero_slot (dense) and clear_slot (MoE) which both
issue GPU tensor.zero_() operations — potentially hundreds of kernel launches
per eviction, one per buffer per layer.

More importantly, these GPU zeros have a correctness race:
  graph.replay() runs on a dedicated stream (cuda_graph_wrapper.self.stream)
  tensor.zero_() runs on the default PyTorch CUDA stream
Without explicit inter-stream synchronisation, a GPU zero can race with an
in-flight graph kernel still reading the old weights on the other stream.

The zeros are defensive but not required: prepare_loras assigns
weight_indices[i] only to slots in _name_to_slot.  _evict_by_name removes
the slot from _name_to_slot before _reset_slot runs, so no kernel ever
reads from an evicted slot.  Stale GPU values are overwritten when
_load_to_slot reuses the slot for a new adapter.

Changes:
- _reset_slot: keep CPU metadata zeros (scalings, ranks); skip GPU zeros
- MoeLoraBuffers: add clear_slot_cpu_only() that removes the slot from the
  weights_by_layer dict (needed for the eager non-buffer path) without any
  GPU operations
- flush_pending_evictions: update docstring — now safe to call at any time
  since no GPU operations are involved in the eviction path
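
The argument that stale slot contents are unreachable can be shown with a toy model. Here `SlotTable` is hypothetical and Python lists stand in for GPU buffers; the point is only the ordering (unmap first, reset CPU metadata, leave the buffer alone) described above.

```python
from typing import Dict, List


class SlotTable:
    def __init__(self, num_slots: int) -> None:
        # Stand-in for per-slot GPU weight buffers.
        self.weights: List[List[float]] = [[0.0] for _ in range(num_slots)]
        # CPU-side metadata, cheap to reset.
        self.scalings: List[float] = [0.0] * num_slots
        self._name_to_slot: Dict[str, int] = {}

    def load(self, name: str, slot: int, values: List[float], scaling: float) -> None:
        self.weights[slot] = list(values)  # overwrites whatever was left behind
        self.scalings[slot] = scaling
        self._name_to_slot[name] = slot

    def evict(self, name: str) -> None:
        slot = self._name_to_slot.pop(name)  # unmap first: slot is now unreachable
        self.scalings[slot] = 0.0            # CPU metadata only, no buffer writes

    def weight_indices(self, batch: List[str]) -> List[int]:
        # Indices are only ever produced for mapped adapters.
        return [self._name_to_slot[n] for n in batch]
```

After `evict`, the old values are still sitting in the buffer, but no index can point at them until `load` reuses the slot and overwrites them.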

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…uler

Previously, the C++ scheduler was unaware of max_loras and could build
batches requiring more unique LoRA adapter ids than the Python GPU pool
could hold simultaneously.  prepare_loras() then raised RuntimeError,
or worse, silently produced wrong outputs when _find_free_slot evicted
an already-assigned adapter.

Fix: thread max_loras through to the scheduler so the batch-building
loop enforces the cap directly.

Changes:
- scheduler/types.h: add max_loras field (0 = LoRA disabled, no cap)
- scheduler/operations/forward.cpp: track batch_lora_ids (unordered_set)
  in newForwardOperation(); skip any request whose lora_id would push
  the count past max_loras — the request is deferred to the next step
- bindings/python_module.cpp: expose max_loras on SchedulerConfig
- scheduler_utils.py make_config(): add max_loras parameter
- event_loop.py: pass server_args.max_loras (0 when LoRA disabled)

With this change the prepare_loras() RuntimeError for n_unique > max_loras
becomes unreachable in normal operation.  The deferred requests are picked
up in subsequent scheduling rounds, naturally co-scheduling same-adapter
requests (Gap 1 from the Open Gaps doc section).
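
The cap in the batch-building loop can be sketched in Python, for illustration only. The real change lives in the C++ `newForwardOperation()`; `build_batch` and the `(request_id, lora_id)` tuples here are hypothetical, and only `batch_lora_ids`, the `max_loras` cap, and the "0 means no cap" convention come from the description.

```python
from typing import List, Optional, Set, Tuple

Request = Tuple[str, Optional[int]]  # (request id, lora_id or None for base model)


def build_batch(
    waiting: List[Request], max_loras: int
) -> Tuple[List[Request], List[Request]]:
    """Split waiting requests into (scheduled, deferred) under the adapter cap."""
    batch_lora_ids: Set[int] = set()
    scheduled: List[Request] = []
    deferred: List[Request] = []
    for req_id, lora_id in waiting:
        is_new = lora_id is not None and lora_id not in batch_lora_ids
        # max_loras == 0 means LoRA is disabled, so no cap is applied.
        if max_loras > 0 and is_new and len(batch_lora_ids) >= max_loras:
            deferred.append((req_id, lora_id))  # retried on the next step
            continue
        if lora_id is not None:
            batch_lora_ids.add(lora_id)
        scheduled.append((req_id, lora_id))
    return scheduled, deferred
```

Requests for an adapter already in the batch, and base-model requests, are never deferred by this rule, which is why same-adapter requests tend to end up scheduled together.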

Signed-off-by: Qingyang Wu <willqywu@gmail.com>