Add KV-cache key compression reproducible comparison #303

Open
yi-fireworks wants to merge 2 commits into main from yi/kv-cache-compression

yi-fireworks commented Apr 6, 2026

Summary

Self-contained kernel-level comparison of three KV-cache key compression methods
(Conventional scalar quantization, TurboQuant, RotorQuant) on a B200 GPU with
matched storage budgets and fused Triton scoring kernels.

  • Measurement harness, library code, and reference results live under research/kv-cache-compression/
  • Covers 2k/4k/32k context lengths and 8/4/3/2-bit widths, reporting both quality and latency
  • README documents setup, reproduction, and interpretation
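
For readers unfamiliar with the baseline, the idea behind conventional scalar quantization at the listed bit widths can be sketched as follows. This is a minimal illustration, not the code in turboquant/: the per-vector min/max scaling, the function names, and the 128-element key vector are all assumptions made for the example.

```python
import random


def quantize_uniform(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using per-vector min/max."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_uniform(codes, lo, scale):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
key = [rng.gauss(0.0, 1.0) for _ in range(128)]  # hypothetical key vector

errors = {}
for bits in (8, 4, 3, 2):
    codes, lo, scale = quantize_uniform(key, bits)
    errors[bits] = mse(key, dequantize_uniform(codes, lo, scale))
    print(f"{bits}-bit: MSE = {errors[bits]:.6f}")
```

Reconstruction error grows as the bit width shrinks, which is the trade-off the comparison measures against storage.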

Directory structure

research/kv-cache-compression/
  README.md                    # Standalone entry point
  requirements.txt             # torch, triton, scipy, transformers
  configs/kernel_table_phase1.json
  scripts/run_unified_table.py # Measurement harness
  turboquant/                  # Library (9 .py files)
  results/unified/             # Reference CSV + markdown

What was adjusted from the internal experimental branch

  • Removed stale baseline block from config JSON (referenced internal paths
    and an outdated note about RotorQuant being over-budget)
  • Added transformers to requirements.txt (needed by an import in fused_attention.py)
  • Added .gitignore
  • Excluded: __pycache__/, patch_test.py (empty stub), data/LLMTest_NeedleInAHaystack/
    (vendored third-party, unused by harness), results/niah/ and results/real_keys_eval/
    (separate experiments not documented in README)

Reproduce

cd research/kv-cache-compression
pip install -r requirements.txt
python scripts/run_unified_table.py --gpu 0

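Requires an NVIDIA GPU of the Ampere generation or newer (A100, H100, L4, B200, etc.).
Quality numbers are seed-deterministic and should match the reference CSV up to small
floating-point differences across GPU architectures (a B200 vs. H200 comparison showed
deltas in the 4th-6th decimal place). Timing varies by hardware.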
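
One way to check a rerun against the reference CSV is to compare numeric columns with an absolute tolerance rather than exact equality. The sketch below is only an illustration: the column name `quality`, the tolerance, and the inline sample rows are placeholders, not the harness's schema or actual results.

```python
import csv
import io
import math


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def numeric_mismatches(ref_rows, new_rows, columns, abs_tol=1e-3):
    """Return (row_index, column, ref_value, new_value) for values outside abs_tol."""
    if len(ref_rows) != len(new_rows):
        raise ValueError("row counts differ")
    mismatches = []
    for i, (ref, new) in enumerate(zip(ref_rows, new_rows)):
        for col in columns:
            a, b = float(ref[col]), float(new[col])
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
                mismatches.append((i, col, a, b))
    return mismatches


# Placeholder rows for illustration only; these are not values from the reference CSV.
reference = "method,bits,quality\nA,4,0.123456\nB,4,0.654321\n"
rerun = "method,bits,quality\nA,4,0.123461\nB,4,0.654398\n"

bad = numeric_mismatches(load_rows(reference), load_rows(rerun), ["quality"])
print(bad)
```

A tolerance around the third decimal place accepts cross-architecture drift of the size reported for this PR, while a much tighter one would flag it.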

Test plan

  • All Python files pass py_compile
  • Core import chain resolves (TurboQuantProd, RotorQuantMSE, LloydMaxCodebook)
  • No internal paths, credentials, or infra references (verified via ripgrep)
  • Reference CSV well-formed (87 rows, 15 columns)
  • GPU smoke test when hardware access returns
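
The first and fourth checks above could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used for this PR: the helper names are hypothetical, and it treats the first CSV line as a header, which the test plan does not specify when it counts 87 rows.

```python
import csv
import py_compile
import tempfile
from pathlib import Path


def compile_all(root):
    """Byte-compile every .py file under root; return the paths that failed."""
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        # Write bytecode to a scratch file so no __pycache__/ appears in the tree.
        target = str(Path(scratch) / "out.pyc")
        for path in sorted(Path(root).rglob("*.py")):
            try:
                py_compile.compile(str(path), cfile=target, doraise=True)
            except py_compile.PyCompileError:
                failures.append(path)
    return failures


def csv_shape(path):
    """Return (data_rows, columns); raise if any row's width differs from the header."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    header, data = rows[0], rows[1:]
    ragged = [i for i, row in enumerate(data, start=2) if len(row) != len(header)]
    if ragged:
        raise ValueError(f"rows with wrong column count at lines {ragged}")
    return len(data), len(header)


# Demo on throwaway files rather than the real repository contents.
with tempfile.TemporaryDirectory() as demo:
    demo_root = Path(demo)
    (demo_root / "ok.py").write_text("x = 1\n")
    (demo_root / "broken.py").write_text("def f(:\n")
    (demo_root / "table.csv").write_text("a,b,c\n1,2,3\n4,5,6\n")
    failed = [p.name for p in compile_all(demo_root)]
    shape = csv_shape(demo_root / "table.csv")

print(failed, shape)
```

In the repository, the same helpers would be pointed at research/kv-cache-compression/ and at the reference CSV under results/unified/.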

Self-contained kernel-level comparison of three KV-cache key
compression methods (Conventional, TurboQuant, RotorQuant) on a
B200 GPU with matched storage budgets and fused Triton scoring
kernels.

Includes the measurement harness, library code, reference results,
and full documentation for reproduction.

Made-with: Cursor

Quality numbers are seed-deterministic but not bitwise-identical
across GPU architectures (verified: B200 reference vs H200 smoke
test shows deltas in 4th-6th decimal place).

Made-with: Cursor