Add KV-cache key compression reproducible comparison by yi-fireworks · Pull Request #303 · fw-ai/cookbook

yi-fireworks · 2026-04-06T19:35:11Z

Summary

Self-contained kernel-level comparison of three KV-cache key compression methods
(Conventional scalar quantization, TurboQuant, RotorQuant) on a B200 GPU with
matched storage budgets and fused Triton scoring kernels.

Measurement harness, library code, and reference results at research/kv-cache-compression/
Covers 2k/4k/32k context, 8/4/3/2 bit widths, quality and latency
README documents setup, reproduction, and interpretation
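For readers unfamiliar with the baseline, conventional scalar quantization of a key vector and its effect on an attention score can be sketched as follows. This is a minimal pure-Python illustration that assumes symmetric per-vector absmax scaling. The names `quantize_key`, `dequantize_key`, and `score` are hypothetical and are not taken from the turboquant/ library, which may use different scaling, grouping, or codebooks.

```python
import random

def quantize_key(key, bits):
    """Symmetric absmax scalar quantization of one key vector (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                      # largest signed code, e.g. 7 for 4 bits
    absmax = max(abs(x) for x in key)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid division by zero for all-zero keys
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in key]
    return codes, scale

def dequantize_key(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

def score(query, key):
    """Plain dot-product attention score between a query and a key."""
    return sum(q * k for q, k in zip(query, key))

# Synthetic Gaussian vectors; head_dim = 128 is an assumption, not a value from the PR.
rng = random.Random(0)
head_dim = 128
query = [rng.gauss(0, 1) for _ in range(head_dim)]
key = [rng.gauss(0, 1) for _ in range(head_dim)]
exact = score(query, key)

for bits in (8, 4, 3, 2):
    codes, scale = quantize_key(key, bits)
    approx = score(query, dequantize_key(codes, scale))
    print(f"{bits}-bit: |score error| = {abs(approx - exact):.4f}")
```

The harness in this PR measures quality and latency with fused Triton kernels on the GPU; the loop above only illustrates why lower bit widths are expected to perturb scores more.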

Directory structure

research/kv-cache-compression/
  README.md                    # Standalone entry point
  requirements.txt             # torch, triton, scipy, transformers
  configs/kernel_table_phase1.json
  scripts/run_unified_table.py # Measurement harness
  turboquant/                  # Library (9 .py files)
  results/unified/             # Reference CSV + markdown

What was adjusted from the internal experimental branch

Removed stale baseline block from config JSON (referenced internal paths
and an outdated note about RotorQuant being over-budget)
Added transformers to requirements.txt (needed by an import in fused_attention.py)
Added .gitignore
Excluded: __pycache__/, patch_test.py (empty stub), data/LLMTest_NeedleInAHaystack/
(vendored third-party, unused by harness), results/niah/ and results/real_keys_eval/
(separate experiments not documented in README)

Reproduce

cd research/kv-cache-compression
pip install -r requirements.txt
python scripts/run_unified_table.py --gpu 0

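Requires any Ampere+ NVIDIA GPU (A100, H100, L4, B200, etc.). Quality numbers
are seed-deterministic and should closely match the reference CSV, but they are
not bitwise-identical across GPU architectures (B200 reference vs H200 smoke test
differs in the 4th-6th decimal place). Timing varies by hardware.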
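One way to check a fresh run against the reference CSV is a tolerance-based comparison rather than an exact diff. The sketch below uses only the standard library. The function name `compare_csv`, the column names, and the sample values are hypothetical placeholders, not the harness's actual schema or results.

```python
import csv
import io
import math

def compare_csv(reference, candidate, columns=None, abs_tol=1e-3):
    """Return (row, column, ref, new) for every cell that differs beyond abs_tol.

    `columns` restricts the check (e.g. to quality columns, skipping timing);
    None compares every column present in the reference.
    """
    ref_rows = list(csv.DictReader(reference))
    new_rows = list(csv.DictReader(candidate))
    if len(ref_rows) != len(new_rows):
        raise ValueError(f"row count differs: {len(ref_rows)} vs {len(new_rows)}")
    mismatches = []
    for i, (ref, new) in enumerate(zip(ref_rows, new_rows)):
        for col in (columns if columns is not None else ref.keys()):
            ref_val, new_val = ref.get(col), new.get(col)
            try:
                same = math.isclose(float(ref_val), float(new_val),
                                    rel_tol=0.0, abs_tol=abs_tol)
            except (TypeError, ValueError):
                same = ref_val == new_val   # non-numeric cells must match exactly
            if not same:
                mismatches.append((i, col, ref_val, new_val))
    return mismatches

# Made-up in-memory data, purely to exercise the function.
reference = io.StringIO("method,bits,quality\nA,4,0.987654\nB,4,0.912345\n")
candidate = io.StringIO("method,bits,quality\nA,4,0.987651\nB,4,0.912399\n")
print(compare_csv(reference, candidate, abs_tol=1e-5))
```

Passing only the quality columns through `columns` keeps hardware-dependent timing fields out of the comparison.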

Test plan

All Python files pass py_compile
Core import chain resolves (TurboQuantProd, RotorQuantMSE, LloydMaxCodebook)
No internal paths, credentials, or infra references (verified via ripgrep)
Reference CSV well-formed (87 rows, 15 columns)
GPU smoke test when hardware access returns
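The two static checks in the test plan (py_compile over all Python files, and a well-formed CSV) could be scripted roughly as below. This is a sketch using only the standard library. The helper names `compile_all` and `csv_shape` are hypothetical, the demo runs on throwaway temporary files rather than the repository, and whether the 87 rows quoted above include the header line is not stated, so `csv_shape` reports data rows and columns separately.

```python
import csv
import os
import py_compile
import tempfile
from pathlib import Path

def compile_all(root):
    """Byte-compile every .py file under root; return the paths that fail."""
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        # Write bytecode to a scratch file so no __pycache__/ is created in the tree.
        cfile = os.path.join(scratch, "out.pyc")
        for path in sorted(Path(root).rglob("*.py")):
            try:
                py_compile.compile(str(path), cfile=cfile, doraise=True)
            except py_compile.PyCompileError:
                failures.append(path)
    return failures

def csv_shape(path):
    """Return (data_rows, columns); raise if any row's width differs from the header."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}")
    return len(data), len(header)

# Demo on temporary files: one valid module, one with a syntax error, one small CSV.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ok.py").write_text("x = 1\n")
    (root / "broken.py").write_text("def f(:\n")
    (root / "table.csv").write_text("a,b,c\n1,2,3\n4,5,6\n")
    print([p.name for p in compile_all(root)])
    print(csv_shape(root / "table.csv"))
```

Pointing `compile_all` at the library and scripts directories and `csv_shape` at the reference CSV would reproduce the first and fourth items of the test plan without a GPU.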

Self-contained kernel-level comparison of three KV-cache key compression methods (Conventional, TurboQuant, RotorQuant) on a B200 GPU with matched storage budgets and fused Triton scoring kernels. Includes the measurement harness, library code, reference results, and full documentation for reproduction. Made-with: Cursor

Quality numbers are seed-deterministic but not bitwise-identical across GPU architectures (verified: B200 reference vs H200 smoke test shows deltas in 4th-6th decimal place). Made-with: Cursor

yi-fireworks added 2 commits April 6, 2026 19:32

Note cross-architecture numerical variation in caveats

708e6cf


yi-fireworks wants to merge 2 commits into main from yi/kv-cache-compression
