Context
Two new KV cache compression techniques, TurboQuant and RotorQuant, have emerged from the community that could significantly improve our concurrent capacity on Qwen3.5-35B-A3B.
Both compress the KV cache at inference time (not the model weights), so they are complementary to AWQ 4-bit weight quantization.
Potential Impact on Our Setup
| Metric | Current (FP8 KV) | TurboQuant 4-bit (estimated) | TurboQuant 2-bit (estimated) |
| --- | --- | --- | --- |
| KV cache capacity | 335K tokens | ~640K tokens | ~1.2M+ tokens |
| Compression vs FP16 | 2x | 3.8x | 7.5x |
| Quality loss | None | Near-zero (claimed) | Minimal (claimed) |
| Latency overhead | None | None (fused Triton kernels) | None (claimed) |
| Throughput (batch=16) | Baseline | +21% (claimed) | TBD |
The main benefit is more concurrent users in the same GPU memory, which is critical for our multi-agent workloads.
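The capacity estimates can be sanity-checked with simple arithmetic. The sketch below assumes token capacity scales linearly with the compression ratio relative to FP16, starting from the FP8 baseline (335K tokens at 2x). The function name `estimated_capacity` is hypothetical, and the calculation ignores any per-vector metadata overhead.

```python
# Baseline: FP8 KV cache holds ~335K tokens at 2x compression vs FP16.
BASELINE_TOKENS = 335_000
BASELINE_RATIO = 2.0


def estimated_capacity(ratio_vs_fp16: float) -> float:
    """Scale token capacity linearly with the compression ratio (assumption)."""
    return BASELINE_TOKENS * ratio_vs_fp16 / BASELINE_RATIO


if __name__ == "__main__":
    for name, ratio in [("TurboQuant 4-bit", 3.8), ("TurboQuant 2-bit", 7.5)]:
        tokens = estimated_capacity(ratio)
        print(f"{name}: ~{tokens / 1000:.1f}K tokens")
```

Under that linear assumption, 3.8x and 7.5x give roughly 640K and 1.2M+ tokens, which matches the estimated capacities above.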
Key Technical Points
TurboQuant
Two-stage pipeline: random orthogonal rotation followed by per-coordinate Lloyd-Max scalar quantization
Provably near-optimal: distortion within ~2.7x of the information-theoretic lower bound
12/12 exact matches in vLLM integration tests and a perfect needle-in-a-haystack result
PR status ("[Quantization] Add TurboQuant dynamic kv cache compression", vllm-project/vllm#38280): Phase 1 merged (CacheDType, TurboQuantConfig, Triton kernels, 43/43 tests); Phase 2 (packed uint8 storage for real memory savings) in progress
Standalone pip plugin available: turboquant-vllm (uses --attention-backend CUSTOM)
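To make the two-stage pipeline concrete, here is a minimal NumPy sketch of the general idea: rotate each unit-normalized vector with a random orthogonal matrix, then quantize every coordinate against a Lloyd-Max codebook. This is not the TurboQuant or vLLM implementation. All function names are hypothetical, the codebook is fitted on Gaussian samples as an assumption about the post-rotation coordinate distribution, and the codes are left as unpacked uint8 values for readability.

```python
import numpy as np


def random_orthogonal(d, rng):
    # QR of a Gaussian matrix yields an orthogonal Q; the sign fix
    # removes the bias introduced by the QR sign convention.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def lloyd_max_codebook(samples, bits, iters=50):
    # 1-D Lloyd iteration: assign samples to the nearest level,
    # then move each level to the mean of its assigned samples.
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                centroids[k] = members.mean()
    return np.sort(centroids)


def quantize(x, rot, codebook):
    # Keep one FP32 norm per vector; quantize the rotated unit vector.
    norms = np.linalg.norm(x, axis=-1, keepdims=True).astype(np.float32)
    y = (x / norms) @ rot.T
    codes = np.abs(y[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, norms


def dequantize(codes, norms, rot, codebook):
    # Look up the levels, undo the rotation, restore the norm.
    return (codebook[codes] @ rot) * norms


def relative_error(bits, d=128, seed=0):
    rng = np.random.default_rng(seed)
    rot = random_orthogonal(d, rng)
    # Assumption: rotated unit-vector coordinates are roughly N(0, 1/d).
    train = rng.standard_normal(20_000) / np.sqrt(d)
    codebook = lloyd_max_codebook(train, bits)
    x = rng.standard_normal((64, d)).astype(np.float32)
    codes, norms = quantize(x, rot, codebook)
    x_hat = dequantize(codes, norms, rot, codebook)
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))


if __name__ == "__main__":
    for bits in (4, 2):
        print(f"{bits}-bit relative reconstruction error: {relative_error(bits):.3f}")
```

The rotation spreads energy evenly across coordinates, so a single scalar codebook can be shared by every coordinate. Actual memory savings would additionally require packing the codes into bytes, which this sketch does not do.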
RotorQuant
Clifford algebra approach: chunks each vector into groups of 3 dimensions and applies a 4-parameter rotor to each group
Same quality as TurboQuant, with 44x fewer parameters and 7.9x fewer FMAs
Fused Triton pipeline is 128-652x faster than the PyTorch reference
No vLLM integration yet; standalone library only
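The rotor idea can be illustrated without a geometric algebra library, because a rotor in 3-D Clifford algebra is equivalent to a unit quaternion (one scalar plus three bivector components, hence 4 parameters). The sketch below is not the RotorQuant code. It assumes each group of 3 dimensions gets its own rotor and that leftover dimensions (when the vector length is not a multiple of 3) are left unrotated. Both choices, and all function names, are hypothetical.

```python
import numpy as np


def rotor_to_matrix(q):
    # q = (w, x, y, z): normalize, then build the equivalent 3x3 rotation.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def apply_rotors(v, rotors, inverse=False):
    # Rotate each consecutive group of 3 dims with its own rotor.
    d = v.shape[-1]
    groups = d // 3
    mats = np.stack([rotor_to_matrix(q) for q in rotors[:groups]])
    if inverse:
        mats = mats.transpose(0, 2, 1)  # inverse of a rotation is its transpose
    out = v.astype(np.float64, copy=True)
    chunks = out[..., : 3 * groups].reshape(*v.shape[:-1], groups, 3)
    rotated = np.einsum("gij,...gj->...gi", mats, chunks)
    out[..., : 3 * groups] = rotated.reshape(*v.shape[:-1], 3 * groups)
    return out  # trailing d % 3 dims are passed through unchanged (assumption)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 128
    rotors = rng.standard_normal((d // 3, 4))  # 4 parameters per group
    v = rng.standard_normal((8, d))
    r = apply_rotors(v, rotors)
    back = apply_rotors(r, rotors, inverse=True)
    print("norm preserved:", np.allclose(np.linalg.norm(v, axis=-1),
                                         np.linalg.norm(r, axis=-1)))
    print("round trip ok:", np.allclose(v, back))
```

Compared with a dense d x d rotation, this block-wise form stores only 4 numbers per group of 3 dimensions, which is where the parameter and FMA savings come from.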
Risks / Unknowns for Our Setup
MoE compatibility: Neither has been tested on MoE models (all benchmarks are on dense models: Qwen2.5-7B, Mistral-7B)
AWQ + TurboQuant interaction: No testing of AWQ 4-bit weights combined with a TurboQuant KV cache
RTX 4090 (SM89): Not explicitly benchmarked (results are on H200 and RTX 5090)
FlashInfer MoE conflict: --attention-backend CUSTOM (plugin) may conflict with our FlashInfer MoE setup
Marlin MoE headroom: gpu-util 0.85 constraint remains (variable temporary allocations; see "[RFC]: Fixing the inaccurate memory profiling", vllm-project/vllm#27951)
FP16 norms: fail silently at >11K tokens; FP32 norms are needed (+2 bytes/vector overhead)
K/V asymmetry: K vectors have 6-182x larger norms than V, so uniform bit allocation is suboptimal
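The last two risks both affect the real compression ratio, which a per-vector byte budget makes visible. The sketch below assumes a head dimension of 128 (illustrative, not confirmed for our model), one stored norm per vector, and tightly packed codes. The helper names are hypothetical, and the mixed K/V split is only an example of non-uniform allocation, not a tested recommendation.

```python
HEAD_DIM = 128             # assumption for illustration
FP16_BYTES = HEAD_DIM * 2  # uncompressed FP16 vector


def bytes_per_vector(bits: int, norm_bytes: int, head_dim: int = HEAD_DIM) -> float:
    """Packed code bytes plus one stored norm per vector."""
    return head_dim * bits / 8 + norm_bytes


def ratio_vs_fp16(bits: int, norm_bytes: int) -> float:
    return FP16_BYTES / bytes_per_vector(bits, norm_bytes)


def mixed_ratio(k_bits: int, v_bits: int, norm_bytes: int) -> float:
    """Combined ratio when K and V use different bit widths."""
    total = bytes_per_vector(k_bits, norm_bytes) + bytes_per_vector(v_bits, norm_bytes)
    return 2 * FP16_BYTES / total


if __name__ == "__main__":
    for bits in (4, 2):
        print(f"{bits}-bit, FP16 norm: {ratio_vs_fp16(bits, 2):.2f}x  "
              f"FP32 norm: {ratio_vs_fp16(bits, 4):.2f}x")
    print(f"K 4-bit + V 2-bit, FP32 norms: {mixed_ratio(4, 2, 4):.2f}x")
```

Under these assumptions the extra 2 bytes per vector cost proportionally more at 2-bit than at 4-bit, so the FP32 norm requirement should be included when estimating capacity at low bit widths.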
Action Plan
… (pip install turboquant-vllm) on Qwen3.5 MoE once Phase 2 lands in nightly
References
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com