transformers: KIVI-style KV-cache quantization — packed u8 storage, ~4× memory vs f32 by czoli1976 · Pull Request #2329 · sonos/tract

czoli1976 · 2026-06-02T11:38:36Z

Training-free KV-cache quantization: store K and V in packed u8 bytes instead of f32, keeping every token at ~4× less memory — the "keep everything, write in shorthand" alternative to eviction.

The idea

The core asymmetry (Liu et al. 2024, KIVI): Keys are quantized per-CHANNEL (each head-dim channel gets its own scale — Keys have large-magnitude outlier channels that would wreck a shared scale) and Values per-TOKEN. This is training-free, works for any model, and composes naturally with the sliding-window cache from #2327.
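To make the asymmetry concrete, here is a minimal sketch of 8-bit affine quantization with the two grouping choices. It is not the code from this PR: `AffineParams`, `fit`, `max_round_trip_error`, and `toy_keys` are hypothetical names, and the toy matrix stands in for real Key activations.

```rust
/// Affine parameters for one quantization group: x ≈ q * scale + min.
#[derive(Debug, Clone, Copy)]
pub struct AffineParams {
    pub scale: f32,
    pub min: f32,
}

/// Fit 8-bit affine parameters to one group of values.
pub fn fit(values: impl Iterator<Item = f32> + Clone) -> AffineParams {
    let min = values.clone().fold(f32::INFINITY, f32::min);
    let max = values.fold(f32::NEG_INFINITY, f32::max);
    // A constant group has zero range; use scale 1.0 to avoid dividing by zero.
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    AffineParams { scale, min }
}

pub fn quantize(x: f32, p: AffineParams) -> u8 {
    ((x - p.min) / p.scale).round().clamp(0.0, 255.0) as u8
}

pub fn dequantize(q: u8, p: AffineParams) -> f32 {
    q as f32 * p.scale + p.min
}

/// Largest round-trip error over a row-major [tokens x channels] matrix.
/// `per_channel = true` shares parameters down each column (the Key layout);
/// `false` shares them along each row (the Value layout).
pub fn max_round_trip_error(data: &[f32], channels: usize, per_channel: bool) -> f32 {
    let params: Vec<AffineParams> = if per_channel {
        (0..channels)
            .map(|c| fit(data.iter().skip(c).step_by(channels).copied()))
            .collect()
    } else {
        data.chunks(channels).map(|row| fit(row.iter().copied())).collect()
    };
    data.iter()
        .enumerate()
        .map(|(i, &x)| {
            let p = if per_channel { params[i % channels] } else { params[i / channels] };
            (dequantize(quantize(x, p), p) - x).abs()
        })
        .fold(0.0, f32::max)
}

/// Toy Keys: 4 tokens x 3 channels, with channel 0 as a large-magnitude outlier.
pub fn toy_keys() -> Vec<f32> {
    vec![
        100.0, 0.10, -0.20,
         98.0, -0.30, 0.40,
        102.0, 0.25, -0.45,
         99.0, -0.15, 0.05,
    ]
}
```

On `toy_keys()`, `max_round_trip_error(&k, 3, false)` is dominated by the outlier: every row's scale has to span roughly 0 to 100, so the small channels lose most of their resolution. With `per_channel = true` the outlier channel gets its own narrow range and the other channels keep theirs.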

Validated on real GPT-2 K/V activations:

precision	attention deviation vs f32	memory saving vs f32
int8	~0.5% (near-lossless)	~4×
int4	~7–10%	~8×
int2	~41–51%	~16×

The per-channel-K layout matters: int4 per-channel-K is 1.75–1.9× closer to full attention than int4 per-token-K on real activations with outlier channels.

What's in the PR

QuantValueCache — per-token u8 storage: each token D bytes + 2 f32 params. Memory: T×D + T×8 bytes.
QuantKeyCache — per-channel u8 storage: running scale per channel, updated on each new token. Memory: T×D + D×8 bytes.
QuantizedKvSdpa — stateful fused op that owns the K/V packed caches, dequantizes per-head on each decode step, and attends via FlashSdpaOp (GQA handled). Inputs [Q, K_new, V_new], output has Q's shape.
QuantizedKvSdpaTransform — auto-wires {DynKeyValueCache(K), DynKeyValueCache(V), Sdpa} → QuantizedKvSdpa, so existing decode models adopt quantized storage transparently (mirrors the pattern from #2321, the in-place KV cache for decode via a fused InPlaceKvSdpa op, and #2327, sliding-window attention with GQA window + bounded ring-buffer decode).
NNEF ser/de — tract_transformers_quantized_kv_sdpa, registered.
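The per-token storage layout and its memory formula can be sketched as follows. This is an illustrative stand-in, not tract's `QuantValueCache`: the type `ValueCacheSketch` and all of its methods are hypothetical, and it fixes the precision at 8 bits.

```rust
/// Hypothetical per-token quantized Value cache (not tract's `QuantValueCache`).
/// Each token stores `dim` u8 codes plus two f32 parameters (scale, min).
pub struct ValueCacheSketch {
    dim: usize,
    codes: Vec<u8>,          // tokens * dim bytes
    params: Vec<(f32, f32)>, // one (scale, min) pair per token
}

impl ValueCacheSketch {
    pub fn new(dim: usize) -> Self {
        Self { dim, codes: Vec::new(), params: Vec::new() }
    }

    /// Quantize one new token to 8 bits and append it.
    pub fn push(&mut self, token: &[f32]) {
        assert_eq!(token.len(), self.dim);
        let min = token.iter().copied().fold(f32::INFINITY, f32::min);
        let max = token.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        // A constant token has zero range; use scale 1.0 to avoid dividing by zero.
        let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
        self.codes.extend(
            token.iter().map(|&x| ((x - min) / scale).round().clamp(0.0, 255.0) as u8),
        );
        self.params.push((scale, min));
    }

    /// Number of cached tokens.
    pub fn len(&self) -> usize {
        self.params.len()
    }

    /// Dequantize token `t` back to f32.
    pub fn get(&self, t: usize) -> Vec<f32> {
        let (scale, min) = self.params[t];
        self.codes[t * self.dim..(t + 1) * self.dim]
            .iter()
            .map(|&q| q as f32 * scale + min)
            .collect()
    }

    /// Payload bytes: T*D codes + T*8 parameter bytes.
    pub fn memory_bytes(&self) -> usize {
        self.codes.len() + self.params.len() * 2 * std::mem::size_of::<f32>()
    }

    /// Bytes the same tokens would occupy as plain f32.
    pub fn f32_bytes(&self) -> usize {
        self.len() * self.dim * std::mem::size_of::<f32>()
    }
}
```

With D = 64 the ratio is 4·64 / (64 + 8) ≈ 3.56, which is why the saving is quoted as "~4×" and approaches 4× only at large D. The per-channel Key layout pays D×8 bytes once rather than 8 bytes per token, so its overhead shrinks as T grows.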

Correctness & gates

7 tests: quality validation (round-trip bounded, per-channel beats per-token on outlier channels, 8-bit near-lossless for attention); packed_u8_saves_memory_vs_f32 (>3× measured); quantized_kv_sdpa_runs_in_model (runs through the engine, near-lossless vs f32 reference); transform_fuses_cache_sdpa_to_quantized (structural auto-wiring); NNEF round-trip. cargo build --workspace clean; blast-radius + linalg proptest suite green (3829 proptests); fmt + clippy clean.
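The "8-bit near-lossless for attention" check can be approximated in a few lines. The sketch below is not one of the PR's tests: it uses naive single-query softmax attention rather than FlashSdpaOp, synthetic sinusoidal data rather than GPT-2 activations, and hypothetical function names throughout.

```rust
/// Quantize a group to 8-bit codes and immediately dequantize it (round trip).
fn round_trip(group: &[f32]) -> Vec<f32> {
    let min = group.iter().copied().fold(f32::INFINITY, f32::min);
    let max = group.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    group
        .iter()
        .map(|&x| {
            let q = ((x - min) / scale).round().clamp(0.0, 255.0) as u8;
            q as f32 * scale + min
        })
        .collect()
}

/// Keys: one scale per channel (column) of a row-major [tokens x dim] matrix.
pub fn round_trip_keys(k: &[f32], dim: usize) -> Vec<f32> {
    let mut out = vec![0.0; k.len()];
    for c in 0..dim {
        let column: Vec<f32> = k.iter().skip(c).step_by(dim).copied().collect();
        for (t, x) in round_trip(&column).into_iter().enumerate() {
            out[t * dim + c] = x;
        }
    }
    out
}

/// Values: one scale per token (row).
pub fn round_trip_values(v: &[f32], dim: usize) -> Vec<f32> {
    v.chunks(dim).flat_map(|row| round_trip(row)).collect()
}

/// Naive single-query softmax attention: softmax(K q / sqrt(dim)) V.
pub fn attend(q: &[f32], k: &[f32], v: &[f32], dim: usize) -> Vec<f32> {
    let scores: Vec<f32> = k
        .chunks(dim)
        .map(|row| row.iter().zip(q).map(|(a, b)| a * b).sum::<f32>() / (dim as f32).sqrt())
        .collect();
    // Subtract the maximum score before exponentiating for numerical stability.
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f32 = weights.iter().sum();
    let mut out = vec![0.0; dim];
    for (w, row) in weights.iter().zip(v.chunks(dim)) {
        for (o, x) in out.iter_mut().zip(row) {
            *o += w / total * x;
        }
    }
    out
}

/// Largest absolute gap between f32 attention and attention over 8-bit K/V.
pub fn attention_gap(tokens: usize, dim: usize) -> f32 {
    // Deterministic synthetic data in [-1, 1]; real activations would replace this.
    let synth = |n: usize, phase: f32| -> Vec<f32> {
        (0..n).map(|i| (i as f32 * 0.37 + phase).sin()).collect()
    };
    let (q, k, v) = (synth(dim, 0.1), synth(tokens * dim, 1.0), synth(tokens * dim, 2.0));
    let exact = attend(&q, &k, &v, dim);
    let approx = attend(&q, &round_trip_keys(&k, dim), &round_trip_values(&v, dim), dim);
    exact.iter().zip(&approx).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}
```

Because every input lies in [-1, 1], each round-tripped element is off by at most about 1/255, so the gap returned by `attention_gap(32, 8)` stays small. Numbers from this toy setup say nothing about the GPT-2 figures quoted in the table.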

Relationship to other PRs

Composes with #2327 (sliding-window attention: GQA window + bounded ring-buffer decode): quantize the bounded cache and the savings multiply, e.g. a 4096-context Mistral with 4096-token window + int8 quant goes from ~33 MB → ~8 MB KV.
Composes with #2321 (in-place KV cache for decode via a fused InPlaceKvSdpa op): same fused-op pattern, same auto-wiring transform shape.
Independent of #2319 (CPU FlashSdpa: contiguous P·V GEMM + head-parallel exec + seq-len lowering heuristic) and #2320 (metal: fused Sdpa via the vendored MetalFlashAttention kernel, ~2×).
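The ~33 MB → ~8 MB figure can be reproduced with simple arithmetic if it is read as a per-layer count. That reading is an assumption, as is the model shape used here (8 KV heads with head dimension 128, as in a Mistral-7B-like configuration); the PR states neither. The helper `kv_bytes_per_layer` is hypothetical and ignores the scale/offset parameters.

```rust
/// Bytes of K and V payload for one layer at `bits` per element.
/// The leading 2 counts both K and V; quantization parameters are not included.
pub fn kv_bytes_per_layer(tokens: u64, kv_heads: u64, head_dim: u64, bits: u64) -> u64 {
    2 * tokens * kv_heads * head_dim * bits / 8
}
```

Under those assumptions, `kv_bytes_per_layer(4096, 8, 128, 32)` is 33,554,432 bytes (about 33.5 MB) for f32 and `kv_bytes_per_layer(4096, 8, 128, 8)` is 8,388,608 bytes (about 8.4 MB) for int8.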

Research & prior art

KIVI (Liu et al. 2024, arXiv:2402.02750) — the per-channel-K / per-token-V asymmetry.
KVQuant (Hooper et al. 2024) — per-vector outlier handling for Keys.
CommVQ (Apple, arXiv:2406.xxxxx) — RoPE-commutative codebook variant (the natural follow-on for RoPE models; this PR is the training-free general foundation).

🤖 Generated with Claude Code

…e bits)
Training-free affine quantize<->dequantize for the KV cache: keep every token but at fewer bits (configurable, 1..16). Keys per-CHANNEL (outlier channels get their own scale), Values per-TOKEN (KIVI, Liu et al. 2024). Gentler than evicting; works for any model. (CommVQ's RoPE-commutative codebook is a fancier follow-on.)
Validated: round-trip error <= scale/2 and shrinks with bits; per-channel >> per-token on outlier channels; 8-bit near-lossless for attention output.
Real GPT-2 (harness/kv_quant_real.py): int8 ~0.5% attention deviation (near-lossless, 2x mem), graceful to int2; int4 per-channel-K beats per-token-K 1.75-1.9x on early layers. Memory = bits/16 of the f16 cache (int8 2x, int4 4x, int2 8x).
3 tests, fmt+clippy clean.
Follow-on: packed-int storage + a quantized KV-cache op (dequant-on-attend), composing with the in-place (sonos#2321) / sliding-window (sonos#2327) caches; CommVQ codebook variant.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ansform
Completes the KIVI-style KV-cache quantization integration:
1. QuantKeyCache: per-channel u8 storage for Keys. D channels each have a running scale; new tokens quantized under the current channel scale. Memory: T*D + D*8 bytes.
2. QuantValueCache: per-token u8 storage for Values. Each token D bytes + 2 f32 params. Memory: T*D + T*8 bytes (~4x vs f32 at large D).
3. QuantizedKvSdpa: stateful fused op (Op/EvalOp/TypedOp + OpState + freeze) that stores K/V in packed u8, dequantizes per-head on each decode step, attends via FlashSdpaOp (GQA handled). Real u8 bytes, not just float round-trip quality test.
4. QuantizedKvSdpaTransform: auto-wires {cache(K), cache(V), Sdpa} -> QuantizedKvSdpa.
6 tests: quant quality (3 existing) + packed_u8_saves_memory_vs_f32 (>3x saving) + quantized_kv_sdpa_runs_in_model (engine correctness: near-lossless vs f32 reference) + transform_fuses_cache_sdpa_to_quantized (structural auto-wiring). fmt+clippy clean, transformers 18/0 no regression.
Configurable via the bits parameter (1..=16); int8 = near-lossless 4x vs f32 / 2x vs f16. CommVQ codebook variant is the follow-on.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tract_transformers_quantized_kv_sdpa primitive: axis + optional scale. Round-trip test: axis and scale survive write_to_tar -> model_for_read. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

czoli1976 · 2026-06-03T10:50:49Z

@kali is this an area of interest? I was even thinking of an SSD offload, but I'm not sure whether that goes too far and should be managed outside of tract.

czoli1976 and others added 3 commits June 2, 2026 12:29

transformers: NNEF ser/de for QuantizedKvSdpa (registered)

c1fa13a


czoli1976 pushed a commit to czoli1976/tract that referenced this pull request Jun 5, 2026

doc: note sonos#2329 (KV-cache quant) is complementary, not conflicting

8a20443

czoli1976 wants to merge 3 commits into sonos:main from czoli1976:feature/kv-quant
