Investigate TurboQuant / RotorQuant KV cache compression for Qwen3.5 MoE #1

@jsboige

Context

Two new KV cache compression techniques, TurboQuant and RotorQuant, have emerged from the community and could significantly improve our concurrent capacity on Qwen3.5-35B-A3B.

Both compress the KV cache at inference time (not model weights), so they are complementary to AWQ 4-bit quantization.

Potential Impact on Our Setup

| Metric | Current (FP8 KV) | TurboQuant 4-bit (estimated) | TurboQuant 2-bit (estimated) |
| --- | --- | --- | --- |
| KV cache capacity | 335K tokens | ~640K tokens | ~1.2M+ tokens |
| Compression vs FP16 | 2x | 3.8x | 7.5x |
| Quality loss | None | Near-zero (claimed) | Minimal (claimed) |
| Latency overhead | None | None (fused Triton kernels) | None (claimed) |
| Throughput (batch=16) | Baseline | +21% (claimed) | TBD |

The main benefit is more concurrent users at the same GPU memory — critical for our multi-agent workloads.
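The capacity estimates in the table can be reproduced with simple scaling. This is a minimal sketch, assuming capacity grows linearly with the compression ratio at a fixed memory budget; the function name is hypothetical, and the inputs are the table's own figures.

```python
def estimated_capacity(baseline_tokens: float,
                       baseline_ratio: float,
                       target_ratio: float) -> float:
    """Scale KV cache token capacity by the relative compression ratio.

    Both ratios are expressed against FP16. Assumes the memory budget is
    unchanged and that capacity is proportional to compression.
    """
    return baseline_tokens * target_ratio / baseline_ratio


fp8_tokens = 335_000   # current FP8 KV capacity (from the table)
fp8_ratio = 2.0        # FP8 vs FP16

cap_4bit = estimated_capacity(fp8_tokens, fp8_ratio, 3.8)
cap_2bit = estimated_capacity(fp8_tokens, fp8_ratio, 7.5)

print(f"4-bit estimate: {cap_4bit:,.0f} tokens")
print(f"2-bit estimate: {cap_2bit:,.0f} tokens")
```

These values line up with the ~640K and ~1.2M+ rows above. Real capacity will also depend on per-vector metadata and allocator overhead, which this sketch ignores.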

Key Technical Points

TurboQuant

  • Two-stage pipeline: random orthogonal rotation + per-coordinate Lloyd-Max scalar quantization
  • Distortion is provably within ~2.7x of the information-theoretic lower bound
  • 12/12 exact match in vLLM integration tests, perfect needle-in-a-haystack
  • vLLM PR vllm-project/vllm#38280 ("[Quantization] Add TurboQuant dynamic kv cache compression") status: Phase 1 merged (CacheDType, TurboQuantConfig, Triton kernels, 43/43 tests). Phase 2 (packed uint8 storage for real memory savings) is in progress
  • Standalone pip plugin available: turboquant-vllm (uses --attention-backend CUSTOM)
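The two-stage pipeline described above can be illustrated in NumPy. This is a rough sketch under stated assumptions, not the code from the vLLM PR or the plugin: all function names are hypothetical, the codebook is fitted to Gaussian samples as an approximation of the rotated coordinate distribution, and indices are stored one per byte (no bit packing, so no real memory saving is shown).

```python
import numpy as np


def random_orthogonal(d, rng):
    # QR of a Gaussian matrix; the sign fix makes the result Haar-distributed.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def lloyd_max_codebook(samples, bits, iters=30):
    # 1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update.
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2  # decision boundaries
        idx = np.searchsorted(edges, samples)
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids


def quantize(x, rotation, centroids):
    # Store the norm separately, rotate the unit vector, then quantize
    # each coordinate independently. Assumes no zero vectors.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    rotated = (x / norms) @ rotation.T
    edges = (centroids[:-1] + centroids[1:]) / 2
    idx = np.searchsorted(edges, rotated).astype(np.uint8)
    return idx, norms.astype(np.float32)


def dequantize(idx, norms, rotation, centroids):
    # Look up centroids, undo the rotation, restore the norm.
    return (centroids[idx] @ rotation) * norms


rng = np.random.default_rng(0)
d, bits = 64, 4  # illustrative sizes, not taken from the issue

rotation = random_orthogonal(d, rng)
# After rotation, unit-vector coordinates are roughly N(0, 1/d).
centroids = lloyd_max_codebook(rng.standard_normal(20_000) / np.sqrt(d), bits)

x = rng.standard_normal((256, d))
idx, norms = quantize(x, rotation, centroids)
x_hat = dequantize(idx, norms, rotation, centroids)

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error at {bits} bits: {rel_err:.3f}")
```

The rotation spreads energy evenly across coordinates, which is what lets a single scalar codebook serve every dimension.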

RotorQuant

  • Clifford algebra approach: splits vectors into groups of 3 dimensions and applies a 4-parameter rotor to each group
  • Same quality as TurboQuant, but 44x fewer parameters, 7.9x fewer FMAs
  • Triton fused pipeline 128-652x faster than PyTorch reference
  • No vLLM integration yet — standalone library only
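To make the rotor idea concrete, here is a small pure-Python sketch. It is not taken from the RotorQuant library: the function names are hypothetical, each rotor is written as the equivalent unit quaternion (one scalar plus three bivector coefficients, with the sign convention being an assumption), and zero-padding the last group is also an assumption.

```python
import math


def normalize_rotor(rotor):
    # A rotor must have unit norm to act as a pure rotation.
    n = math.sqrt(sum(c * c for c in rotor))
    return tuple(c / n for c in rotor)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate3(rotor, v):
    # Sandwich product R v R~ expanded for a unit rotor (w, u):
    # v' = v + 2w (u x v) + 2 u x (u x v)
    w, u = rotor[0], rotor[1:]
    t = cross(u, v)
    tt = cross(u, t)
    return tuple(v[i] + 2 * w * t[i] + 2 * tt[i] for i in range(3))


def rotate_chunks(rotors, vec):
    # Pad to a multiple of 3, then rotate each 3-dim group with its own rotor.
    padded = list(vec) + [0.0] * (-len(vec) % 3)
    out = []
    for g, rotor in enumerate(rotors):
        out.extend(rotate3(rotor, padded[3 * g:3 * g + 3]))
    return out


# 90-degree rotation about the z axis, as a sanity check.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
x_axis_rotated = rotate3(quarter_turn_z, (1.0, 0.0, 0.0))

# A 7-dim vector needs 3 groups, hence 3 rotors (4 parameters each).
rotors = [normalize_rotor(r) for r in
          [(1.0, 2.0, 3.0, 4.0), (0.5, -1.0, 0.0, 2.0), (3.0, 0.0, 1.0, -1.0)]]
vec = [0.3, -1.2, 0.7, 2.0, 0.1, -0.4, 1.5]
rotated = rotate_chunks(rotors, vec)

print(x_axis_rotated)
print(rotated)
```

Each group costs 4 stored parameters instead of a dense matrix block, and the rotation preserves vector norms, so a scalar quantizer can follow it in the same way as after a full orthogonal rotation.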

Risks / Unknowns for Our Setup

  1. MoE compatibility: Neither has been tested on MoE models (all benchmarks on dense models: Qwen2.5-7B, Mistral-7B)
  2. AWQ + TurboQuant interaction: No testing of AWQ 4-bit weights + TurboQuant KV cache combined
  3. RTX 4090 (SM89): Not explicitly benchmarked (results on H200, RTX 5090)
  4. FlashInfer MoE conflict: --attention-backend CUSTOM (plugin) may conflict with our FlashInfer MoE setup
  5. Marlin MoE headroom: the gpu-util 0.85 constraint remains (variable temporary allocations; see RFC vllm-project/vllm#27951, "[RFC]: Fixing the inaccurate memory profiling")
  6. FP16 norms fail silently at >11K tokens — FP32 norms needed (+2 bytes/vector overhead)
  7. K/V asymmetry: K vectors have 6-182x larger norms than V — uniform bit allocation is suboptimal
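The cost of the FP32 norms in item 6 can be estimated with a little arithmetic. This sketch assumes one stored norm per vector and a head dimension of 128; both are assumptions for illustration and are not stated in this issue.

```python
def compression_vs_fp16(head_dim: int, bits: int, norm_bytes: int) -> float:
    """Compression ratio of one quantized vector relative to FP16 storage.

    Assumes bit-packed indices plus a single norm per vector.
    """
    fp16_bytes = head_dim * 2
    packed_bytes = head_dim * bits / 8 + norm_bytes
    return fp16_bytes / packed_bytes


HEAD_DIM = 128  # assumed head dimension

for bits in (4, 2):
    for norm_bytes, label in ((2, "FP16 norm"), (4, "FP32 norm")):
        ratio = compression_vs_fp16(HEAD_DIM, bits, norm_bytes)
        print(f"{bits}-bit, {label}: {ratio:.2f}x vs FP16")
```

Under these assumptions the extra 2 bytes per vector cost proportionally more at 2-bit than at 4-bit, because the norm becomes a larger share of each stored vector.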

Action Plan

  • Watch vLLM PR vllm-project/vllm#38280 ("[Quantization] Add TurboQuant dynamic kv cache compression") for the Phase 2 merge (packed storage = real memory savings)
  • Test TurboQuant plugin (pip install turboquant-vllm) on Qwen3.5 MoE once Phase 2 lands in nightly
  • Benchmark KV cache capacity, decode speed, concurrent throughput, quality (GSM8K, MME) at 4-bit and 2-bit KV
  • Evaluate RotorQuant if/when vLLM backend integration appears
  • Document results vs current FP8 KV baseline

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
