Investigate TurboQuant / RotorQuant KV cache compression for Qwen3.5 MoE

## Context

Two recent KV cache compression techniques could significantly increase our concurrent capacity on Qwen3.5-35B-A3B:

- **TurboQuant** (Google, ICLR 2026): Online vector quantization of KV cache to 2-4 bits. [vllm PR #38280](https://github.com/vllm-project/vllm/pull/38280), [issue #38171](https://github.com/vllm-project/vllm/issues/38171), [standalone plugin](https://github.com/Alberto-Codes/turboquant-vllm)
- **RotorQuant** (Clifford algebra reformulation of TurboQuant): claims 44x fewer parameters and 7.9x fewer FMAs than TurboQuant at the same quality. [scrya-com/rotorquant](https://github.com/scrya-com/rotorquant). No vLLM integration yet.

Both compress the **KV cache at inference time** (not model weights), so they are **complementary to AWQ 4-bit** quantization.

## Potential Impact on Our Setup

| Metric | Current (FP8 KV) | TurboQuant 4-bit (estimated) | TurboQuant 2-bit (estimated) |
|--------|:-:|:-:|:-:|
| KV cache capacity | 335K tokens | ~640K tokens | ~1.2M+ tokens |
| Compression vs FP16 | 2x | 3.8x | 7.5x |
| Quality loss | None | Near-zero (claimed) | Minimal (claimed) |
| Latency overhead | None | None (claimed; fused Triton kernels) | None (claimed) |
| Throughput (batch=16) | Baseline | +21% (claimed) | TBD |

The main benefit is **more concurrent users** in the same GPU memory, which is critical for our multi-agent workloads.
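
The capacity estimates in the table can be sanity-checked by scaling the measured FP8 capacity by the ratio of compression factors. This is a rough sketch that assumes capacity scales linearly with compression and ignores any per-block or metadata overhead; the function name is illustrative.

```python
def scaled_capacity(baseline_tokens: int, baseline_ratio: float, target_ratio: float) -> int:
    """Scale a measured KV cache capacity by the ratio of two compression factors (both vs FP16)."""
    return round(baseline_tokens * target_ratio / baseline_ratio)


FP8_TOKENS = 335_000  # current capacity with FP8 KV (2x vs FP16), from the table

tq4_tokens = scaled_capacity(FP8_TOKENS, baseline_ratio=2.0, target_ratio=3.8)
tq2_tokens = scaled_capacity(FP8_TOKENS, baseline_ratio=2.0, target_ratio=7.5)
print(f"4-bit: {tq4_tokens:,} tokens, 2-bit: {tq2_tokens:,} tokens")
```

The results land near the ~640K and ~1.2M+ figures above, so those rows are consistent with the stated compression ratios rather than independent measurements.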

## Key Technical Points

### TurboQuant
- Two-stage pipeline: random orthogonal rotation + per-coordinate Lloyd-Max scalar quantization
- Provably near-optimal: distortion within ~2.7x of the information-theoretic lower bound
- Reported 12/12 exact matches in vLLM integration tests and perfect needle-in-a-haystack retrieval
- **PR #38280 status**: Phase 1 merged (CacheDType, TurboQuantConfig, Triton kernels, 43/43 tests). Phase 2 (packed uint8 storage for real memory savings) in progress
- Standalone pip plugin available: `turboquant-vllm` (uses `--attention-backend CUSTOM`)
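
The two-stage pipeline can be illustrated with a small NumPy sketch. This is not the vLLM or plugin implementation: the helper names, the Gaussian approximation used to fit the codebook, and storing one norm per vector are assumptions made for illustration.

```python
import numpy as np


def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # fix column signs; the result stays orthogonal


def lloyd_max(samples, bits, iters=50):
    """Fit a 1-D Lloyd-Max codebook: alternate nearest-level assignment and centroid update."""
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2  # decision boundaries between levels
        idx = np.searchsorted(edges, samples)
        for k in range(levels.size):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return levels


def quantize(x, rot, levels):
    """Stage 1: rotate the normalized vector. Stage 2: per-coordinate scalar quantization."""
    norm = np.linalg.norm(x)
    y = rot @ (x / norm)
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, y).astype(np.uint8), norm


def dequantize(codes, norm, rot, levels):
    """Look up the levels, undo the rotation, and restore the norm."""
    return norm * (rot.T @ levels[codes])


rng = np.random.default_rng(0)
dim, bits = 128, 4  # illustrative head dimension and bit width
rot = random_rotation(dim, rng)
# Assumption: rotated unit-vector coordinates are roughly N(0, 1/dim).
levels = lloyd_max(rng.standard_normal(100_000) / np.sqrt(dim), bits)

x = rng.standard_normal(dim)
codes, norm = quantize(x, rot, levels)
x_hat = dequantize(codes, norm, rot, levels)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error at {bits} bits: {rel_err:.3f}")
```

The rotation spreads energy evenly across coordinates, which is what lets a single shared scalar codebook work for every coordinate.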

### RotorQuant
- Clifford algebra approach: chunks vectors into groups of 3 dims, applies 4-parameter rotor
- Claimed to match TurboQuant quality with 44x fewer parameters and 7.9x fewer FMAs
- Fused Triton pipeline reported as 128-652x faster than the PyTorch reference
- **No vLLM integration yet**: standalone library only
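
To make the "groups of 3 dims, 4-parameter rotor" idea concrete: a rotor in 3-D Clifford algebra has one scalar and three bivector components and acts on a vector exactly like a unit quaternion. The sketch below shows only that rotation step. The zero padding, the random choice of rotors, and the function names are assumptions for illustration and are not taken from the RotorQuant code.

```python
import numpy as np


def to_chunks(vec):
    """Split a vector into rows of 3, zero-padding the tail (padding scheme is an assumption)."""
    pad = (-vec.size) % 3
    return np.pad(vec, (0, pad)).reshape(-1, 3)


def rotor_rotate(chunks, rotors):
    """Rotate each 3-D chunk by its unit rotor (w, x, y, z), i.e. the sandwich product R v R~."""
    w, u = rotors[:, :1], rotors[:, 1:]
    t = 2.0 * np.cross(u, chunks)
    return chunks + w * t + np.cross(u, t)


rng = np.random.default_rng(0)
vec = rng.standard_normal(128)  # illustrative head dimension
chunks = to_chunks(vec)

# One 4-parameter rotor per chunk, normalized to unit length.
rotors = rng.standard_normal((chunks.shape[0], 4))
rotors /= np.linalg.norm(rotors, axis=1, keepdims=True)

rotated = rotor_rotate(chunks, rotors)
# The reverse rotor (negated bivector part) undoes the rotation.
reverse = rotors * np.array([1.0, -1.0, -1.0, -1.0])
restored = rotor_rotate(rotated, reverse)
print(np.allclose(restored, chunks))
```

Each chunk needs only a cross product or two instead of a full matrix-vector product, which is where the reduction in parameters and FMAs comes from.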

## Risks / Unknowns for Our Setup

1. **MoE compatibility**: Neither has been tested on MoE models (all benchmarks on dense models: Qwen2.5-7B, Mistral-7B)
2. **AWQ + TurboQuant interaction**: No testing of AWQ 4-bit weights + TurboQuant KV cache combined
3. **RTX 4090 (SM89)**: Not explicitly benchmarked (results on H200, RTX 5090)
4. **FlashInfer MoE conflict**: `--attention-backend CUSTOM` (plugin) may conflict with our FlashInfer MoE setup
5. **Marlin MoE headroom**: the 0.85 GPU memory utilization cap remains (variable temporary allocations, RFC #27951)
6. **FP16 norms**: fail silently beyond 11K tokens, so FP32 norms are needed (+2 bytes/vector overhead)
7. **K/V asymmetry**: K vectors have 6-182x larger norms than V, so uniform bit allocation is suboptimal
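
The cost of the FP32 norm in item 6 can be estimated per vector. This is a back-of-the-envelope sketch that assumes a head dimension of 128 and a storage layout of packed codes plus one norm per vector; neither is confirmed for our model or for the PR. The K/V bit split at the end is hypothetical.

```python
def bytes_per_vector(head_dim, bits, norm_bytes):
    """Packed code bytes plus one stored norm per vector."""
    return head_dim * bits / 8 + norm_bytes


def compression_vs_fp16(head_dim, bits, norm_bytes):
    """Compression ratio relative to an FP16 vector (2 bytes per element)."""
    return (head_dim * 2) / bytes_per_vector(head_dim, bits, norm_bytes)


HEAD_DIM = 128  # assumption, not checked against the Qwen3.5 config

ratios = {}
for bits in (4, 2):
    ratios[bits] = (
        compression_vs_fp16(HEAD_DIM, bits, norm_bytes=2),  # FP16 norm
        compression_vs_fp16(HEAD_DIM, bits, norm_bytes=4),  # FP32 norm
    )
    print(f"{bits}-bit: {ratios[bits][0]:.2f}x (FP16 norm), {ratios[bits][1]:.2f}x (FP32 norm)")

# Hypothetical asymmetric split for item 7: 4-bit K and 2-bit V, both with FP32 norms.
mixed = (2 * HEAD_DIM * 2) / (
    bytes_per_vector(HEAD_DIM, 4, 4) + bytes_per_vector(HEAD_DIM, 2, 4)
)
print(f"K 4-bit + V 2-bit: {mixed:.2f}x")
```

Under these assumptions the FP32 norm costs a few percent of the compression ratio at 4 bits and somewhat more at 2 bits, so it is worth including in the benchmark comparison.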

## Action Plan

- [ ] **Watch** vLLM PR #38280 for Phase 2 merge (packed storage = real memory savings)
- [ ] **Test TurboQuant plugin** (`pip install turboquant-vllm`) on Qwen3.5 MoE once Phase 2 lands in nightly
- [ ] **Benchmark** KV cache capacity, decode speed, concurrent throughput, quality (GSM8K, MME) at 4-bit and 2-bit KV
- [ ] **Evaluate RotorQuant** if/when vLLM backend integration appears
- [ ] **Document results** vs current FP8 KV baseline

## References

- Paper: [TurboQuant (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- vLLM issue: https://github.com/vllm-project/vllm/issues/38171
- vLLM PR: https://github.com/vllm-project/vllm/pull/38280
- Plugin: https://github.com/Alberto-Codes/turboquant-vllm
- RotorQuant: https://github.com/scrya-com/rotorquant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
