Unified KV cache compression toolkit for LLM inference
10 methods. 16 presets. GPU-validated. One API.
A Python toolkit that compresses the KV cache in large language models. The KV cache is the #1 memory bottleneck during inference — a 32B model at 32K context uses 8+ GB just for the cache. This library gives you 10 different ways to compress it, all under one API.
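The "8+ GB" figure is easy to reproduce with back-of-envelope arithmetic. A minimal sketch, assuming illustrative dimensions for a 32B-class GQA model (64 layers, 8 KV heads, head dim 128; these are assumptions for illustration, not values read from this library):

```python
# Back-of-envelope fp16 KV cache size for a hypothetical 32B-class model.
# Dimensions are illustrative assumptions, not values from this library.
layers, kv_heads, head_dim = 64, 8, 128
seq_len, bytes_fp16 = 32_768, 2

# K and V each store layers * kv_heads * head_dim values per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
print(f"{kv_bytes / 2**30:.1f} GiB")  # 8.0 GiB, matching the "8+ GB" claim
```

At 7.1x compression (turbo2), that same cache fits in just over 1 GiB.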
Install it, pick a preset, and get the exact launch command for llama.cpp or vLLM with optimal compression. Or use it directly in your own inference code.
```bash
git clone https://github.com/rookiemann/multi-turboquant
cd multi-turboquant
pip install -e .
python run_ui.py
```

Four lines. Opens a browser dashboard. See your GPUs, benchmark methods, plan deployments, generate commands.
| Method | Family | Transform | Bits | Compression | Calibration | Speed Impact |
|---|---|---|---|---|---|---|
| turbo2 | TurboQuant | Walsh-Hadamard 128-d | 2.25 | 7.1x | Required | -3% |
| turbo3 | TurboQuant | Walsh-Hadamard 128-d | 3.25 | 4.9x | Required | -5% |
| turbo4 | TurboQuant | Walsh-Hadamard 128-d | 4.25 | 3.8x | Required | -4% |
| turbo2_tcq | TCQ | WHT + Viterbi trellis | 2.25 | 7.1x | Required | -3% |
| turbo3_tcq | TCQ | WHT + Viterbi trellis | 3.25 | 4.9x | Required | -5% |
| iso3 | IsoQuant | Quaternion 4D rotation | 3.25 | 4.9x | No | ~0% |
| iso4 | IsoQuant | Quaternion 4D rotation | 4.25 | 3.8x | No | ~0% |
| planar3 | PlanarQuant | Givens 2D rotation | 3.25 | 4.9x | No | -1% |
| planar4 | PlanarQuant | Givens 2D rotation | 4.25 | 3.8x | No | ~0% |
| triattention | TriAttention | DFT token eviction | 16 | 10-16x | Required | Varies |
Combined mode (unique to this repo): Token eviction + quantization together. Evict unimportant tokens, compress the survivors. ~80x total KV reduction.
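The ~80x figure follows from the two stages composing multiplicatively. A quick sketch using representative ratios from the table above:

```python
# Eviction and quantization compose multiplicatively: first drop tokens,
# then quantize the survivors. Ratios taken from the method table.
eviction_ratio = 16.0  # TriAttention at its most aggressive (10-16x range)
quant_ratio = 4.9      # turbo3-class quantization of the surviving tokens

total = eviction_ratio * quant_ratio
print(f"~{total:.0f}x total KV reduction")  # ~78x, i.e. the ~80x ballpark
```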
All 10 methods run on GPU through our code. No upstream forks needed.
Every method was tested on an RTX 3090 with real CUDA tensors, running our own code:
| Method | Cosine Similarity | Compression | GPU Verified |
|---|---|---|---|
| turbo2 | 0.9420 | 5.8x | ✅ |
| turbo3 | 0.9817 | 4.0x | ✅ |
| turbo4 | 0.9947 | 3.2x | ✅ |
| turbo3_tcq | 0.9817 | 4.0x | ✅ |
| iso3 | 0.9783 | 4.7x | ✅ |
| iso4 | 0.9951 | 3.7x | ✅ |
| planar3 | 0.9783 | 4.7x | ✅ |
| planar4 | 0.9952 | 3.7x | ✅ |
| TriAttn + iso3 | 0.9782 | 9.5x | ✅ |
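The quality column is the cosine similarity between the original and reconstructed tensors. For reference, here is the metric itself as a pure-Python sketch (independent of this library; the vectors are made-up examples):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means a perfect reconstruction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

original = [0.5, -1.2, 3.0, 0.1]
reconstructed = [0.48, -1.25, 2.9, 0.12]  # after a hypothetical lossy round-trip
sim = cosine_similarity(original, reconstructed)
print(f"{sim:.4f}")  # close to 1.0 for a good reconstruction
```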
77 automated tests: 68 CPU + 9 GPU.
| Suite | Tests | What It Proves |
|---|---|---|
| test_methods.py | 37 | All 10 methods encode/decode, config, presets, integration |
| test_integration.py | 31 | Vectorized kernels, paged KV cache, dispatch, TriAttention composition |
| test_gpu.py | 9 | Real GPU inference, calibration generation, hardware detection |
```bash
pytest tests/                              # all 77 tests
pytest tests/ --ignore=tests/test_gpu.py   # CPU only (68 tests)
```

```python
from multi_turboquant import get_preset

config = get_preset("balanced")      # turbo3_tcq symmetric, 5x
config = get_preset("k_only_iso")    # ISO3 K-only, zero speed cost, no calibration
config = get_preset("extreme")       # TriAttention + turbo3_tcq, ~80x
config = get_preset("agents_8x16k")  # 8 agents at 16K context
```

```python
from multi_turboquant.integration import get_llamacpp_command

cmd = get_llamacpp_command(
    config,
    model_path="/opt/models/model.gguf",
    port=8080,
    tensor_split="24,12",  # dual GPU
    parallel_slots=8,      # 8 concurrent agents
)
# llama-server --model ... --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq
#   -fa on -c 131072 --tensor-split 24,12 --parallel 8
```

```python
from multi_turboquant import plan_agents

result = plan_agents(
    gpus=[{"name": "RTX 3090", "vram_gb": 24}, {"name": "RTX 3060", "vram_gb": 12}],
    model_params_b=32,
    model_quant="Q4_K_M",
    desired_agents=8,
    desired_context=16384,
)
result.print_report()
# Preset: turbo4 | 8 agents at 16K | KV: 8.5 GB | Headroom: 9 GB
```

```python
import torch
from multi_turboquant import compress, decompress, CacheConfig, CacheMethod

config = CacheConfig(k_method=CacheMethod.ISO3, v_method=CacheMethod.FP16)
keys = torch.randn(32, 8, 128, device="cuda")
compressed = compress(keys, config, which="k")
reconstructed = decompress(compressed)
# cosine similarity > 0.97
```

```python
from multi_turboquant.hardware import detect_platform
from multi_turboquant.compatibility import check_config, get_recommended_config

platform = detect_platform()
print(platform.summary())
# NVIDIA: all 10 methods | AMD: iso/planar only | Mac: iso/planar only

config = get_recommended_config(platform)
issues = check_config(config, platform)
```

| Preset | Config | Use Case |
|---|---|---|
| k_only_iso | K=iso3, V=f16 | Zero speed cost, no calibration |
| balanced | turbo3_tcq symmetric | Best quality at 5x |
| speed | turbo3 symmetric | Fastest on Ampere |
| quality | turbo4 symmetric | Near-lossless 3.8x |
| max_compression | turbo2_tcq symmetric | Maximum 7x |
| extreme | turbo3_tcq + TriAttention | ~80x total reduction |
| agents_8x16k | turbo4 symmetric | 8 agents at 16K context |
| agents_4x8k_70b | turbo4 symmetric | 4 agents on 70B model |
| no_calibration_symmetric | iso3 symmetric | No setup needed |
```bash
python scripts/plan_and_launch.py --model 32 --agents 8 --context 16384 --gpus 24 12
```

Works with any number of GPUs. Auto-detects NVIDIA, AMD, and Apple Silicon. Generates the exact launch command with tensor-split and parallel flags.
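The planner's reported KV footprint (8.5 GB for 8 agents at 16K under turbo4) can be cross-checked by hand. A sketch assuming illustrative dimensions for a 32B-class GQA model (64 layers, 8 KV heads, head dim 128; assumptions for illustration, not values the planner reads):

```python
# Cross-check of the planner's KV estimate for 8 agents at 16K context
# under turbo4 (3.8x). Model dimensions are illustrative assumptions.
layers, kv_heads, head_dim, bytes_fp16 = 64, 8, 128, 2
context, agents, turbo4_ratio = 16_384, 8, 3.8

# fp16 baseline: K and V, across all layers, agents, and tokens.
fp16_kv_gib = 2 * layers * kv_heads * head_dim * context * bytes_fp16 * agents / 2**30
print(f"{fp16_kv_gib / turbo4_ratio:.1f} GiB")  # 8.4 GiB, near the planner's 8.5 GB
```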
TurboQuant/TCQ methods need a one-time calibration from the model's safetensors weights:
```bash
mtq-calibrate /path/to/model-safetensors --recipe turbo3
# Generates turboquant_kv.json (~200 KB, ~30 seconds)
```

IsoQuant and PlanarQuant need no calibration; they just work.
| Platform | Methods Available | Engine |
|---|---|---|
| Linux + NVIDIA | All 10 | llama.cpp + vLLM |
| Windows + NVIDIA | All 10 | llama.cpp + vLLM |
| Linux + AMD (ROCm) | iso/planar (4) | llama.cpp |
| macOS + Apple Silicon | iso/planar (4) | llama.cpp (Metal) |
| Any (CPU) | All 10 | Library only |
```bash
python run_ui.py
```

Browser-based UI for exploring methods, running benchmarks, planning deployments, and generating commands. No dependencies beyond the library itself.
```
multi_turboquant/
  config.py         CacheConfig, CacheMethod, 12 cache types
  registry.py       Method registration and discovery
  presets.py        16 named presets + auto-recommend
  planner.py        Multi-agent capacity planning, any GPU count
  hardware.py       GPU auto-detection (NVIDIA, AMD, Metal)
  compatibility.py  Method/platform compatibility checks
  methods/          5 method families, all with encode/decode
  kernels/triton/   Attention backend, vectorized encode, dispatch
  calibration/      Weight-norm analysis, frequency stats, auto-calibrate
  integration/      llama.cpp flags, vLLM patch, bridge adapter
  benchmark/        Head-to-head comparison, perplexity, VRAM profiling
```
Full manual with 23 chapters: docs/manual.md
This project reimplements algorithms from published research. All original repos are MIT or Apache-2.0 licensed:
| Contribution | Source |
|---|---|
| Walsh-Hadamard KV compression | TheTom/llama-cpp-turboquant |
| Trellis Coded Quantization | spiritbuun/buun-llama-cpp |
| IsoQuant / PlanarQuant | scrya-com/rotorquant (ParaMind2025) |
| CUDA + Metal kernels | johndpope/llama-cpp-turboquant |
| TriAttention token eviction | WeianMao/triattention |
We reimplemented the algorithms in Python. Credit goes to these authors for the mathematical ideas.
MIT