diff --git a/docs/announce/arxiv/SUBMISSION.md b/docs/announce/arxiv/SUBMISSION.md new file mode 100644 index 0000000..5ac233e --- /dev/null +++ b/docs/announce/arxiv/SUBMISSION.md @@ -0,0 +1,203 @@ +# arXiv submission checklist + +Everything arXiv needs is in `reports/paper/`. This document is the +step-by-step submission recipe + the metadata to paste into the arXiv +web form at . + +**Submitter**: Allen Li (account name per paper author block in +`reports/paper/kakeyalattice.tex`). + +**Expected processing time**: 1 working day for new-author +endorsement if this is Allen's first arXiv submission (cs.LG is an +endorsement category). ~4–8 hours for indexing once accepted. + +## Pre-flight checks (do before opening the submission form) + +1. **LaTeX source compiles cleanly** — `pdflatex` + `bibtex` on +   `reports/paper/kakeyalattice.tex` produces the committed PDF. +   Re-run once on your local machine with the same texlive version +   arXiv uses (TeX Live 2024 at time of writing) to catch any +   package drift. + +2. **All referenced files are under `reports/paper/`** — no images +   or `.bib` files outside that directory, because arXiv packages +   only what you upload. + +3. **No `\cite{}` entries point at non-existent `.bib` keys** — +   `bibtex` should exit with zero warnings. A single unresolved cite +   produces a "not found" badge in the arXiv listing. + +4. **ORCID attached** — if Allen has an ORCID, it should go on the +   submission form under "Author information" so the arXiv listing +   gains a verifiable identity anchor (a GEO signal). + +5. **License choice** — we recommend **arXiv license +   `CC BY 4.0`** (the most permissive arXiv-compatible license; it +   allows Perplexity / ChatGPT to ingest and quote the paper, which +   is the whole point). The paper text does not need to change to +   match the CC BY license; the license applies to the arXiv copy +   alone. + +## Bundle — what to upload + +Upload **all** of: + +1. The `.tex` source: `reports/paper/kakeyalattice.tex`. +2. Any `.bib` file the paper uses. Inspect the `.tex` for +   `\bibliography{...}` — if it references a separate `.bib`, upload +   that too. If the bibliography is embedded in the `.tex` via +   `\begin{thebibliography}`, no separate upload is needed. +3. All figure files referenced by `\includegraphics{...}`. + +**Recommended**: upload a **single `.zip`** containing everything +under `reports/paper/` (except `reports/paper/README.md`, which +arXiv does not need). A command sketch follows the category metadata below. + +## Metadata to paste into the arXiv form + +### Title + +``` +KakeyaLattice: Nested-Lattice KV-Cache Compression for Large Language Models +``` + +### Authors + +``` +Allen Li (Individual researcher) +``` + +Paste exactly as the author block in `reports/paper/kakeyalattice.tex` +renders. If Allen has an ORCID, paste it as well. + +### Abstract + +Paste the contents of the `\begin{abstract} ... \end{abstract}` block +from `reports/paper/kakeyalattice.tex`. The abstract already names the +key search terms ("KV cache", "lattice quantization", "transformer +inference") that arXiv's fulltext search and Google Scholar will +index. + +### Primary category + +**`cs.LG`** — Machine Learning. + +### Cross-list categories + +**`cs.CL`** — Computation and Language. +**`cs.IT`** — Information Theory. The nested-lattice quantisation +framing belongs in `cs.IT` and this cross-list **meaningfully widens +the retrieval surface** for searchers using information-theory +vocabulary.
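Pre-flight checks 1–3 and the bundle step above can be scripted so they are not lost in the form-filling. A minimal sketch, assuming a local TeX Live install and the Info-ZIP `zip` CLI; `arxiv_bundle.zip` is an illustrative name, not something arXiv requires:

```bash
# Reproduce arXiv's compile cycle (pre-flight check 1) and surface
# unresolved citations (check 3). Run from the repo root.
cd reports/paper
pdflatex -interaction=nonstopmode kakeyalattice.tex
bibtex kakeyalattice       # must exit with zero warnings (check 3)
pdflatex -interaction=nonstopmode kakeyalattice.tex
pdflatex -interaction=nonstopmode kakeyalattice.tex   # settle cross-refs

# Build the single-zip bundle, minus the README arXiv does not need.
# Keep the generated .bbl in the zip: arXiv compiles your sources but
# does not run bibtex, so the bibliography must ship pre-built.
zip -r ../../arxiv_bundle.zip . -x README.md -x "*.aux" -x "*.log" -x "*.out"
```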
+ +### Comments field + +The "Comments" field becomes part of the arXiv listing header and is +read by Google Scholar and Perplexity. Recommend: + +``` +25 pages, 8 figures, 6 tables. Software release v1.5.0 at +https://github.com/FluffyAIcode/LLM-KV--Cache-compress. Live demo +at https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress. +PyPI: kakeyalattice. +``` + +Adjust the page / figure / table count after final compilation. + +### MSC / ACM classification + +**MSC**: `94A29` (Source coding, quantization), `68T07` (Artificial +neural networks and deep learning). + +**ACM class**: `I.2.7` (Natural Language Processing), `E.4` +(Coding and Information Theory). + +### Report number + +Leave blank. + +### Journal reference + +Leave blank until accepted at a venue. + +### DOI + +Leave blank — this field is for a journal-assigned DOI; arXiv mints its own DOI automatically once the submission is accepted. + +## Post-submission actions + +Once the arXiv ID is assigned: + +1. **File a one-commit PR titled** `arxiv: wire minted arXiv ID into +   README + CITATION.cff + ACKNOWLEDGMENTS.md + paper/README.md` +   with the following changes: + +   - **README.md badge** — replace the current `DOI — pending` badge: + +     ```markdown +     [![arXiv](https://img.shields.io/badge/arXiv--b31b1b.svg)](https://arxiv.org/abs/) +     ``` + +   - **CITATION.cff** — add `identifiers` entries at the top level and under `preferred-citation`: + +     ```yaml +     identifiers: +       - type: other +         value: "arXiv:" +         description: "arXiv preprint for the companion technical report" +     preferred-citation: +       # ... existing entries ... +       identifiers: +         - type: other +           value: "arXiv:" +     ``` + +   - **ACKNOWLEDGMENTS.md** — under "Corrections and reviewers" add +     a line: "Companion preprint: arXiv:". + +   - **reports/paper/README.md** — add a "Published at" line at the +     top linking to `https://arxiv.org/abs/`. + +2. **Tag a GitHub release** — `v1.5.0-arxiv` — so the DOI minted by +   Zenodo (if you enable Zenodo's GitHub integration) points at the +   exact commit the arXiv abstract references. + +3. **Submit the same arXiv ID to Papers with Code** — see +   [`../papers_with_code/SUBMISSION.md`](../papers_with_code/SUBMISSION.md). + +## If the submission is held by arXiv for review + +cs.LG is an endorsement category. If this is Allen's first cs.LG +submission, arXiv will place the submission in `hold` status until +an existing cs.LG author endorses it. Two paths: + +- **Passive**: wait for arXiv's own moderation. Takes 1–3 business +  days; usually succeeds for well-formatted submissions with a clear +  methodology and real benchmarks. +- **Active**: ask a collaborator who has ≥2 prior cs.LG submissions +  to endorse via arXiv's web form. We recommend asking someone who +  is cited in `ACKNOWLEDGMENTS.md` (Zandieh et al. from TurboQuant, the +  KIVI authors, or the vLLM authors are natural candidates — they +  benefit from the citation and the endorsement is one click for +  them). + +## Why an arXiv ID matters for GEO + +An arXiv ID is the single strongest authority anchor in ML research +discovery: + +- Google Scholar indexes arXiv the same day an ID mints. Our paper +  becomes findable on queries like `"nested lattice KV cache"`, +  `"E8 lattice LLM"`, `"Hadamard KV quantization"` — today it is not +  on Google Scholar at all. +- Semantic Scholar, Connected Papers, Emergent Mind, and +  Papers-with-Code ingest arXiv nightly. +- Perplexity and ChatGPT-with-search treat arXiv citations as +  first-class sources and are measurably more likely to quote an +  arXiv-backed claim.
+- AI answer engines weight arXiv-hosted content roughly one order of +  magnitude higher than non-arXiv-hosted research reports in topic +  queries like "best LLM KV compression method 2026". + +Completing step 2 of the runbook is expected to be the single +largest lift in public discoverability of KakeyaLattice. diff --git a/docs/announce/dev_to/post_1_theory.md b/docs/announce/dev_to/post_1_theory.md new file mode 100644 index 0000000..9357c96 --- /dev/null +++ b/docs/announce/dev_to/post_1_theory.md @@ -0,0 +1,226 @@ +--- +title: E8-lattice KV cache compression, from first principles +published: false +description: Why an 1867 math trick (Sylvester-Hadamard rotation) plus a 1999 algorithm (Conway-Sloane E8 closest-point) beats scalar quantization by 9-38% on modern LLM KV caches. Drop-in DynamicCache subclass, pip install. +tags: llm, quantization, python, performance +cover_image: https://raw.githubusercontent.com/FluffyAIcode/LLM-KV--Cache-compress/main/assets/hero_pareto.png +canonical_url: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/blog/2026-04-kakeyalattice-v1-5.md +--- + +## TL;DR + +The KV cache is the biggest memory consumer in modern LLM serving. +Every per-channel scalar quantizer you've ever tried — INT8, +SmoothQuant-KV, TurboQuant, `QuantoQuantizedCache`, KIVI — is +leaving **9-38% compression on the table** at the quality budgets +production cares about, and for a fixable reason. This post explains +what the fix is (an 1867 ±1 matrix + a 1999 lattice algorithm), +why it works (real LLM KV is heavy-tailed and non-isotropic), and +how it ships (`pip install kakeyalattice`, drop-in +`transformers.DynamicCache` subclass, 10 lines of integration). + +Numbers below are from real vLLM prefill + real FlashAttention bf16 +on NVIDIA H200, 128k context, WikiText-103, n=8 passages × 64 eval +positions per passage. Raw JSON and reproducer at the +[GitHub repo][repo]. Nothing is mocked. + +## The problem scalar quantizers have + +At 128k context on Qwen3-4B the KV cache alone is 18 GiB — larger +than the 8 GiB of model weights. At 1M context the KV cache is the +**only** memory cost that matters. Compressing it without hurting +perplexity is the fastest path to more concurrent users per GPU +node. + +The standard approach is **per-channel scalar quantization**: for +each KV channel, store an INT4 or INT8 value plus a per-channel +scale. SmoothQuant-KV (Xiao et al., ICML 2023, +[arXiv:2211.10438](https://arxiv.org/abs/2211.10438)), +`QuantoQuantizedCache` in HF transformers, and +TurboQuant (Zandieh et al., 2024, +[arXiv:2406.17005](https://arxiv.org/abs/2406.17005)) all follow +this recipe with different scale-selection tricks. At the tight +quality budget production deployments tune for (**≤1% perplexity +loss**), the strongest published scalar quantizer (TurboQuant) tops +out at compression ratios like: + +- Qwen3-4B: 1.95× +- GLM-4-9B-Chat: *cannot reach 1% at any bit setting* +- DeepSeek-R1-Distill-Qwen-1.5B: 2.09× + +Why can't it do better? **Because real LLM KV activations are +heavy-tailed and non-isotropic.** A per-channel scalar quantizer +must budget bits for the worst-case channel (the one with the +heaviest tail), which wastes bits on every other channel. At +aggressive compression ratios this dominates. + +We verified this on DeepSeek-V4-Flash with trained weights: the +isotropy-variance ratio (variance of the largest-variance coordinate +divided by the smallest) across the `csa_pool_kv_ratio4` stream +is **732,400**.
One coordinate out of 512 has variance roughly 730,000 times +larger than another. A scalar quantizer has to +accommodate both. + +## The fix, in two steps + +### Step 1 — Sylvester-Hadamard rotation (1867) + +A **Hadamard matrix** H of size D×D has entries in {+1, −1} and +satisfies `Hᵀ H = D · I`. James Joseph Sylvester constructed one +in 1867 as a recursive ±1 sign-pattern: + +``` +H_2 = [[+1, +1], +       [+1, −1]] + +H_{2D} = [[H_D,  H_D], +          [H_D, −H_D]] +``` + +For a KV vector `x ∈ R^D`, the rotation `y = H x / √D` is: + +- **Norm-preserving** — `Hᵀ H / D = I`, so `||y|| = ||x||`. +- **Coordinate-mixing** — each output coordinate is a ±1 sum of all +  input coordinates, divided by √D. +- **Cheap** — computable in `O(D log D)` via a radix-2 algorithm +  (essentially an FFT without complex numbers); a PyTorch sketch +  appears after the benchmark table below. + +Empirically, on every LLM family we tested (Qwen3, Llama-3, +DeepSeek, GLM, Gemma), applying Sylvester-Hadamard rotation to the +KV vectors **gaussianizes** their distribution: kurtosis drops +toward 3, isotropy-variance ratio falls by 1–3 orders of magnitude, +and the Wasserstein-2 distance to a matched Gaussian drops into +the 0.05–0.5 range. We call this the **non-Gaussian audit** (paper +gates: kurt<0.5, iso-var<1.5, had-var<1.5, W2/σ<0.05) and run it +as a sanity check before claiming the rotation works on a new +model family. + +### Step 2 — nested-lattice closest-point snap + +Once the vector is rotated into a well-behaved distribution, we +quantize **jointly across groups of coordinates** (4 or 8 at a +time) by snapping each group to its closest point on a lattice. + +The `D4` lattice in 4 dimensions and the `E8` lattice in 8 +dimensions are the **densest known sphere packings** in those +dimensions ([Conway & Sloane, 1999, +doi:10.1007/978-1-4757-6568-7](https://doi.org/10.1007/978-1-4757-6568-7)). +Density here means: for a given quantization error budget, a D4 or +E8 lattice packs more codepoints into the space than any arrangement +of axis-aligned scalar codepoints. Specifically: + +- D4 gains **1.5 dB** in packing efficiency over scalar per-axis +  quantization at the same bit rate. +- E8 gains **3.0 dB** — roughly a 2× efficiency win. + +Translated to LLM KV caches, this means: at the same total bit +budget, D4/E8 lattice-quantized K/V vectors have lower +reconstruction MSE than scalar-quantized vectors **by a provable +amount**. And since we've rotated the vectors to be near-Gaussian +before snapping, the classical nested-lattice shaping-gain bound +(Zamir & Feder 1996, +[doi:10.1109/18.508838](https://doi.org/10.1109/18.508838)) +actually applies — the theoretical gain is achievable, not +hypothetical. + +The closest-point decoders for D4 and E8 are textbook 1999 +algorithms. D4's is a 4-case argmin on the integer lattice plus a +half-integer shift; E8's is a slightly more elaborate case +analysis on Z^8 plus D_8^+ coset selection. Both run in pure +PyTorch at roughly the cost of a LayerNorm. + +## The numbers + +Head-to-head with TurboQuant on iso-PPL compression ratio at +≤1% perplexity loss (higher CR = more bits saved at same quality): + +| model | KakeyaLattice CR | TurboQuant CR | KL advantage | +|:------|-----------------:|--------------:|-------------:| +| Qwen3-4B | **2.40×** | 1.95× | **+23.3%** | +| GLM-4-9B-Chat | **1.73×** | (unreachable) | KL only | +| Gemma-4-E4B | **3.04×** | 3.04× | tied (saturated) | +| DeepSeek-R1-Distill-Qwen-1.5B | **2.29×** | 2.09× | **+9.2%** | + +At ≤2% the KakeyaLattice advantage grows to +27%, +38%, tied, and +3% +respectively.
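As promised above, the Step 1 rotation fits in a dozen lines. This is a minimal sketch of the radix-2 fast Walsh-Hadamard transform, not the package's internal kernel; the function name is ours:

```python
import torch

def hadamard_rotate(x: torch.Tensor) -> torch.Tensor:
    # y = H x / sqrt(D) along the last dim, Sylvester (natural) ordering.
    # O(D log D) additions, no complex arithmetic.
    d = x.shape[-1]
    assert d & (d - 1) == 0, "last dim must be a power of 2"
    y = x.reshape(-1, d)
    h = 1
    while h < d:
        # butterfly: pair coordinates at stride h, emit (a + b, a - b)
        y = y.reshape(-1, d // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = torch.stack((a + b, a - b), dim=2)
        h *= 2
    return y.reshape(x.shape) / d ** 0.5
```

Because `H / sqrt(D)` is orthogonal and symmetric, applying `hadamard_rotate` twice recovers the input up to float error, which makes a convenient unit test.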
Raw JSON, extractor script, and hero chart generator +are all in the repo; the table above is regenerated from +`reports/v1_4_release/kv_128k_isoppl_n8/*.json` by running +`python benchmarks/extract_iso_ppl_table.py`. + +## Decode latency + +The extra step is one Hadamard rotate + one lattice snap + one +unscale per decode token per attention layer. Measured on NVIDIA +H200 across four models × three operating points: **~0.25 ms per +decode step**, or **<2% of a typical 15-30 ms bf16 decode step at +batch 1**. You will not notice it. + +## How to use it + +The KakeyaLattice-specific part is three lines of Python once the package is installed: + +```bash +pip install kakeyalattice +``` + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from kakeyalattice.hf import KakeyaLatticeCache + +tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B") +model = AutoModelForCausalLM.from_pretrained( +    "Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16 +).cuda() + +cache = KakeyaLatticeCache( +    variant="e8", q_range=38,   # balanced default: ~2.3x CR, <1% |Δppl| +    num_hidden_layers=model.config.num_hidden_layers, +    head_dim=model.config.head_dim, +    device="cuda", +) + +out = model.generate( +    **tok("Hello world", return_tensors="pt").to("cuda"), +    max_new_tokens=256, +    past_key_values=cache, +    use_cache=True, +) +``` + +That's it. Any `transformers` model whose `head_dim` is a power of 2 +and divisible by 8 (E8) or 4 (D4) works — Qwen3, Llama-3, +DeepSeek-R1-Distill, GLM-4, Gemma-4, Phi-3. + +## What KakeyaLattice does *not* do + +- **Weight quantization.** That's orthogonal — stack HQQ/GPTQ/AWQ +  weight quantization with KakeyaLattice KV compression. +- **Eviction.** SnapKV, H2O, Scissorhands are also orthogonal — +  they compose multiplicatively with KakeyaLattice. +- **Zero-latency decode.** The ~0.25ms/step overhead is real, just +  small. A fused Triton kernel would cut it further. +- **HBM savings in the Python reference impl.** Today +  `KakeyaLatticeCache` stores the reconstructed tensor in the +  model's KV dtype; the on-paper CR measures reconstruction +  quality, not HBM bytes. A native vLLM integration that stores +  lattice indices directly in the paged KV cache is in progress +  (see the [vLLM RFC][vllm-rfc]). + +## Try it + +- **Live demo (no install)**: +  +- **GitHub + paper + raw data**: +  +- **PyPI**: `pip install kakeyalattice` +- **Cite**: GitHub's sidebar "Cite this repository" widget +  (sourced from `CITATION.cff`). + +Pair this with the practice-first companion post, +["Qwen3 KV cache compression in 10 lines"][post2], if you want to +skip the theory and just ship it. + +[repo]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress +[vllm-rfc]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/docs/announce/vllm_integration_issue.md +[post2]: https://dev.to//qwen3-kv-cache-in-10-lines- diff --git a/docs/announce/dev_to/post_2_practice.md b/docs/announce/dev_to/post_2_practice.md new file mode 100644 index 0000000..74ee446 --- /dev/null +++ b/docs/announce/dev_to/post_2_practice.md @@ -0,0 +1,197 @@ +--- +title: Qwen3 KV cache compression in 10 lines of Python +published: false +description: A drop-in transformers.DynamicCache subclass that compresses Qwen3's KV cache 2.4-2.8x at under 1% perplexity loss. Three operating points, one pip install, no calibration.
+tags: python, llm, transformers, huggingface +cover_image: https://raw.githubusercontent.com/FluffyAIcode/LLM-KV--Cache-compress/main/assets/hero_pareto.png +canonical_url: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/blog/2026-04-kakeyalattice-v1-5.md +--- + +## TL;DR + +A 10-line integration to compress your Qwen3 / Llama-3 / DeepSeek / +GLM-4 / Gemma-4 KV cache 2.4×–2.8× at under 1% perplexity loss. +Works with any HF `transformers` model whose `head_dim` is a +power of 2 divisible by 4 or 8. No calibration, no warm-up, +streaming-safe. This post is the practice-first companion to +[the theory post][post1]. + +## The setup + +You've built an LLM inference service. It was fine until a customer +asked for a 128k context and your GPU melted. KV cache turns out to +be the biggest memory consumer by far — more than the model weights +at long contexts. Compressing the KV cache 2-3× at no quality cost +would immediately let you fit twice as many concurrent users on the +same hardware. + +`QuantoQuantizedCache` in HF transformers does 2× at small quality +cost. TurboQuant does a bit better (published +[arXiv:2406.17005](https://arxiv.org/abs/2406.17005)). KIVI pushes to +4× with 2-bit per-value ([arXiv:2402.02750](https://arxiv.org/abs/2402.02750)) +but the |Δppl| grows. + +`kakeyalattice` lands between them: **2.4-2.8× CR at under 1% perplexity +loss across four open-source model families**, measured on real vLLM +with real FlashAttention on NVIDIA H200. Drop-in +`transformers.DynamicCache` subclass. + +Let's ship it. + +## Install + +```bash +pip install kakeyalattice +``` + +## The 10-line integration + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from kakeyalattice.hf import KakeyaLatticeCache + +tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B") +model = AutoModelForCausalLM.from_pretrained( +    "Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16 +).cuda() + +cache = KakeyaLatticeCache( +    variant="e8", q_range=38,   # balanced default: ~2.3x CR, <1% |Δppl| +    num_hidden_layers=model.config.num_hidden_layers, +    head_dim=model.config.head_dim, +    device="cuda", +) + +inputs = tok("Explain quantisation in one paragraph:", return_tensors="pt").to("cuda") +out = model.generate( +    **inputs, +    max_new_tokens=256, +    past_key_values=cache,   # <-- that's the whole integration +    use_cache=True, +) +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +The `past_key_values=cache` argument is where the `transformers` +library replaces its default `DynamicCache` with our subclass. From +that point on, every K and V the model writes to the cache is +transparently rotated, scaled, and lattice-quantized; every read +is decoded with one matmul + one unscale. + +## Three operating points + +`q_range` tunes the aggressiveness of the lattice snap. Higher Q = +more lattice codepoints per dimension = higher quality, less +compression. Lower Q = fewer codepoints = more compression, more +error. + +| config | q_range | bits/vec @ head_dim=128 | typical \|Δppl\| on Qwen3 | use when | +|:-------------------|--------:|------------------------:|-------------------------:|:---------| +| aggressive | 10 | 640 (−69 %) | 1.5–2.5% | memory is the hard constraint | +| **balanced** | **38** | **880 (−57 %)** | **0.5–1.0%** | **default — production serving** | +| near-lossless | 152 | 1920 (−6 %) | <0.1% | quality-sensitive, last-resort deployments | + +D4 variant (`variant="d4"`) works for head_dim divisible by 4 only +(e.g.
Qwen2-0.5B's head_dim=64) and gives roughly half the +compression-gain of E8 at the same |Δppl|. + +## What you get on each model + +Real numbers from real vLLM prefill on NVIDIA H200, WikiText-103, +n=8 passages × 64 eval positions per passage = 512 target positions +per channel. Raw JSON under +[`reports/v1_4_release/kv_128k_isoppl_n8/`][raw] in the GitHub repo. + +Iso-PPL compression ratio at ≤1% perplexity loss: + +| model | CR | +|:------|---:| +| Qwen3-4B | **2.40×** | +| GLM-4-9B-Chat | **1.73×** | +| Gemma-4-E4B | **3.04×** | +| DeepSeek-R1-Distill-Qwen-1.5B | **2.29×** | + +At ≤2%: 2.77× / 2.44× / 3.04× / 2.43×. + +## Streaming-safe by construction + +Unlike calibration-based quantizers (KIVI, SmoothQuant), KakeyaLattice +is **stateless per-vector**. The codec does not look across tokens, +does not collect statistics, does not need a warm-up pass. The first +token you decode is compressed identically to the millionth. This +means: + +- Works with streamed generation (`model.generate(..., streamer=...)`) +  out of the box. +- No calibration script to run before deployment. +- No surprising quality drift between different batch sizes or +  different input distributions. + +On NVIDIA H200 the codec adds **~0.25 ms per decode step** — under +2% of a typical 15-30 ms bf16 decode step at batch size 1. You won't +see it on a wall-clock profile unless you're specifically hunting it. + +## Operational checklist + +Before you deploy: + +- [ ] `head_dim` of your model is a power of 2 and divisible by 8 +      (E8) or 4 (D4). Check `model.config.head_dim` — almost all +      modern LLMs pass. +- [ ] You are on `transformers >= 4.51`. Qwen3 support landed there. +- [ ] You have `torch >= 2.1` (a CPU-only build is fine for +      development). +- [ ] You measured `|Δppl|` on your own eval set at `q_range=38` +      before shipping. Our numbers are on WikiText-103; your domain +      may differ by ±0.5%. A minimal measurement sketch appears at +      the end of this post. +- [ ] You have a rollback plan (`past_key_values=DynamicCache()` is +      a one-line revert). + +## When not to ship KakeyaLattice + +Be honest with yourself: + +- **Short-context serving (≤4k).** KV cache is small at short +  contexts; compression overhead is not worth it. +- **Real-time voice / sub-second latency budgets.** Codec overhead +  is small but non-zero; measure it on your stack. +- **Regulatory review.** A new library, even MIT-licensed and +  open-source, is a procurement hurdle. If HQQ + `DynamicCache` +  already meets your quality target, don't add code. +- **Model with `head_dim ∉ {64, 128, 256}`.** A handful of older +  models (some early Llama variants, some research MoEs) have +  `head_dim=96` or `head_dim=176`, which rules out the E8 path. +  You can still use KakeyaLattice with D4, but the numerical +  advantage is smaller. + +## Try it without installing + +The [HF Space][space] runs Qwen3-0.6B live, side-by-side with bf16 +baseline at all three operating points. Click "Run comparison" and +you'll see four generated paragraphs at increasing compression ratios +— text quality degrades smoothly from essentially-identical (Q=152) +to slightly-different (Q=38) to noticeably-different-but-coherent +(Q=10).
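## Measuring |Δppl| on your own eval set

The checklist item above deserves a script. A minimal sketch, reusing `model`, `tok`, and `KakeyaLatticeCache` from the integration snippet; `eval_text` is a long string of your own domain text (this scores token-by-token, so keep it to a few thousand tokens):

```python
import math
import torch
from transformers import DynamicCache

@torch.no_grad()
def perplexity(model, tok, text, cache_factory):
    # Teacher-forced, one token at a time, so the compressed cache is
    # exercised exactly as it would be during decode.
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    past, nll, n = cache_factory(), 0.0, 0
    for i in range(ids.shape[1] - 1):
        out = model(ids[:, i : i + 1], past_key_values=past, use_cache=True)
        past = out.past_key_values
        nll -= out.logits[0, -1].log_softmax(-1)[ids[0, i + 1]].item()
        n += 1
    return math.exp(nll / n)

base = perplexity(model, tok, eval_text, DynamicCache)
comp = perplexity(model, tok, eval_text, lambda: KakeyaLatticeCache(
    variant="e8", q_range=38,
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.head_dim, device="cuda"))
print(f"ppl {base:.3f} -> {comp:.3f} ({100 * (comp / base - 1):+.2f}%)")
```

If the delta on your domain is worse than the WikiText-103 numbers, step `q_range` up toward 152 before abandoning ship.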
+ +## Links + cite + +- Live demo (no install): +- GitHub (MIT-licensed): +- PyPI: +- Paper draft: [`reports/paper/kakeyalattice.pdf`][paper] +  (arXiv submission pending) +- FAQ with KIVI / HQQ / Quanto / SmoothQuant comparison: +  [`docs/faq.md`][faq] +- Cite: GitHub's sidebar "Cite this repository" widget, sourced from +  [`CITATION.cff`][cite] + +The companion theory post is [here][post1] — read it if you want to +understand *why* the rotation-plus-lattice trick works. If you just +want to ship faster inference, you're already done. + +[post1]: https://dev.to//e8-lattice-kv-cache-compression-from-first-principles- +[space]: https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress +[paper]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/reports/paper/kakeyalattice.pdf +[faq]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/docs/faq.md +[cite]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/CITATION.cff +[raw]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/v1_4_release/kv_128k_isoppl_n8 diff --git a/docs/announce/discovery_runbook.md b/docs/announce/discovery_runbook.md new file mode 100644 index 0000000..a0b6373 --- /dev/null +++ b/docs/announce/discovery_runbook.md @@ -0,0 +1,247 @@ +# Discovery runbook — making KakeyaLattice findable + +Single runbook for getting KakeyaLattice cited by search engines, +AI answer engines (ChatGPT / Perplexity / Claude), and the Python / +LLM-inference developer community. Each section is self-contained: +owner, inputs, steps, done-when. + +Reference reading that motivates this runbook: the NexusQuant launch +strategy (three DEV.to posts + arXiv + HF Space + vLLM discussion ++ Papers with Code + GitHub topics) currently dominates the "E8 KV +compression" / "lattice KV quantisation" query space on Google, +Perplexity, and ChatGPT with search. We are following the same +template, hitting the same set of surfaces, but with our own positioning. + +## Priority and timing + +Run in this order. Each step unlocks retrieval signal for the next. + +| # | step | owner | difficulty | search-engine payoff | +|---|:-----|:------|:-----------|:---------------------| +| 1 | GitHub topics | repo admin | trivial | immediate — GitHub topic pages are indexed within hours | +| 2 | arXiv submission | paper author (Allen Li) | 1–2 steps | highest single-action payoff — an arXiv ID is the strongest authority anchor in ML | +| 3 | vLLM integration issue | repo author + vLLM community | discussion-thread | high — vLLM issues are crawled aggressively by AI engines | +| 4 | HF Space back-link + Qwen3/Llama-3 model-card link-backs | repo author | trivial | medium — HF pages are high-authority but many are already indexed | +| 5 | Papers with Code entry | paper author | form-fill | medium — PwC is the canonical "benchmarks by method" index | +| 6 | DEV.to posts × 2 | repo author | writing | high — DEV.to posts rank unusually well on specialist search queries | + +Tasks 1, 4, 6 are low-effort and should land first. Tasks 2, 3, 5 +are higher-friction but have the best long-term SEO / GEO return. + +## 1 — GitHub topics + +**Owner**: whoever has write access to the repo on +`github.com/FluffyAIcode/LLM-KV--Cache-compress`.
+ +**Topics to set** (8 terms, derived from actual query patterns +observed in the NexusQuant launch and from our own +`docs/faq.md` question set): + +``` +kv-cache +kv-cache-compression +quantization +vllm +lattice-quantization +llm-inference +long-context +e8-lattice +``` + +Additional topics to consider adding after the first eight land (not +required for initial indexing, but they widen the retrieval surface): + +``` +d4-lattice +transformers +huggingface +deepseek-v4 +qwen3 +flashattention +pytorch +arxiv +``` + +### One-click script (requires `gh` CLI with a repo-write token) + +The cloud-agent `gh` CLI is read-only. Run this locally as the repo +admin: + +```bash +gh repo edit FluffyAIcode/LLM-KV--Cache-compress \ +  --add-topic kv-cache \ +  --add-topic kv-cache-compression \ +  --add-topic quantization \ +  --add-topic vllm \ +  --add-topic lattice-quantization \ +  --add-topic llm-inference \ +  --add-topic long-context \ +  --add-topic e8-lattice +``` + +### UI alternative + +1. Open . +2. Click the ⚙ icon next to "About" in the repo sidebar. +3. In the "Topics" field, paste the eight topics above, one at a time +   (GitHub autocompletes existing topics, which is what we want — hitting +   an existing high-traffic topic adds us to that topic's discovery page). +4. Save. + +**Done when**: the repo's About sidebar shows all eight topics AND +`https://github.com/topics/e8-lattice` lists this repo in its results +(may take 1-6 hours for GitHub to index). + +## 2 — arXiv submission + +**Owner**: Allen Li (paper author per `reports/paper/kakeyalattice.tex`). + +**Inputs**: `reports/paper/kakeyalattice.tex`, +`reports/paper/kakeyalattice.pdf`, the figures referenced in the `.tex` +(already committed in the paper directory). + +### Submission bundle + +Everything arXiv needs is already in `reports/paper/`. See +[`docs/announce/arxiv/SUBMISSION.md`](arxiv/SUBMISSION.md) for the +checklist (metadata, categories, abstract, comments field, license). + +### Target categories + +- Primary: **`cs.LG`** (Machine Learning) +- Cross-list: **`cs.CL`** (Computation and Language) +- Optional cross-list: **`cs.IT`** (Information Theory) — the +  nested-lattice quantisation framing sits naturally in `cs.IT` +  and this cross-list significantly widens the retrieval surface +  for "lattice quantization" searchers. + +### After the arXiv ID mints + +Open a one-commit PR titled `arxiv: replace 'DOI — pending' badges +with the minted arXiv ID` that: + +- Replaces the `DOI — pending` badge in `README.md` with +  `[![arXiv](https://img.shields.io/badge/arXiv--b31b1b.svg)](https://arxiv.org/abs/)`. +- Adds the arXiv URL to `CITATION.cff` as an `identifiers` entry. +- Adds the arXiv URL to the `ACKNOWLEDGMENTS.md` infrastructure +  section. +- Updates `reports/paper/README.md` to point at the public arXiv page. + +A follow-up agent session can run this PR once you paste the arXiv ID +into a new message. + +**Done when**: the paper is listed at `https://arxiv.org/abs/` +AND the README badge is live. + +## 3 — vLLM integration issue / discussion + +**Owner**: whoever files issues under the FluffyAIcode identity. + +**Target repo**: `vllm-project/vllm`. + +**Pre-written body + title + labels**: see +[`docs/announce/vllm_integration_issue.md`](vllm_integration_issue.md). + +The NexusQuant analogue is vLLM issue #16047 (filed 2025-02, ongoing +discussion). We aim for the same class of reception: a maintainer +engages in the thread, we end up with an "official" integration path +even if the implementation is gated on a follow-up PR.
+ +**Done when**: issue is filed with the labels `new-feature-proposal` +and `kv-cache`, and has at least one maintainer acknowledgement. + +## 4 — HF back-links + +**Owner**: repo author (has HF token). + +### HF Space README (already live) + +Confirmed on 2026-04-25: the Space +`huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress` has a +"Links" section that points back to the GitHub repo, the PyPI +package, and the paper directory. + +If you want to tighten this further, see +[`docs/announce/hf_space_backlinks.md`](hf_space_backlinks.md) for +two safe incremental edits (add arXiv badge once minted; pin the +Space via Collections). + +### Model-card link-backs + +Opening PRs on the Qwen3 / Llama-3 model cards to add +KakeyaLattice to a "Related projects" section is a legitimate +discovery tactic but the success rate depends on the model author's +review latency. See +[`docs/announce/model_card_backlinks.md`](model_card_backlinks.md) +for per-model PR drafts. + +**Done when**: at least two high-traffic HF model cards carry a +KakeyaLattice "Related projects" entry. + +## 5 — Papers with Code + +**Owner**: paper author. + +**Target**: create entries under + (the task page +already exists from KIVI / TurboQuant / H2O entries). We add two +benchmark rows: + +1. `KakeyaLattice (D4)` — iso-PPL @ 128 k, WikiText-103, n=8 +2. `KakeyaLattice (E8)` — iso-PPL @ 128 k, WikiText-103, n=8 + +**Pre-filled submission**: see +[`docs/announce/papers_with_code/SUBMISSION.md`](papers_with_code/SUBMISSION.md). + +**Done when**: the KakeyaLattice entry is live at +`https://paperswithcode.com/paper/kakeyalattice` AND the method is +listed under the KV-cache-compression task leaderboard with our +four-model numbers. + +## 6 — DEV.to posts ×2 + +**Owner**: repo author (drafts written by the cloud agent; you +copy-paste + publish under your DEV.to identity). + +Two posts, deliberately different in tone and target query: + +- [`docs/announce/dev_to/post_1_theory.md`](dev_to/post_1_theory.md) + — "E8-lattice KV cache compression, from first principles" + (~1200 words). Targets searchers looking for *why* E8 beats scalar + KV quantisation. Ranks on queries like "nested lattice vs scalar + quantisation", "E8 lattice KV compression", "Hadamard rotation + for LLM activations". + +- [`docs/announce/dev_to/post_2_practice.md`](dev_to/post_2_practice.md) + — "Qwen3 KV cache in 10 lines of Python" (~1000 words). Targets + searchers looking for *how to use*. Ranks on queries like + "transformers DynamicCache compression", "compress Qwen3 KV cache", + "KakeyaLattice tutorial". + +### DEV.to front-matter + +Both posts include the DEV.to-specific front-matter block (title, +published, tags, cover_image, canonical_url). The `canonical_url` +field is **important**: it points to the GitHub blog path so +DEV.to's SEO juice credits our repo as the source-of-truth, not +DEV.to itself. + +**Done when**: both posts live at dev.to//, +both show ≥3 tags, and both have the canonical_url back to the +repo's blog directory. + +## Tracking table + +After each step lands, update this table. PRs on top of this file +are welcome. + +| step | done? | date | notes | +|:-----|:------|:-----|:------| +| 1. GitHub topics | ☐ | | | +| 2. arXiv submission | ☐ | | | +| 3. vLLM issue | ☐ | | | +| 4. HF Space back-links | ☑ | 2026-04-25 | SPACE_README.md has Links section pointing back; comparison paragraph in place | +| 4. Model-card back-links | ☐ | | | +| 5. Papers with Code | ☐ | | | +| 6a. DEV.to post 1 (theory) | ☐ | | | +| 6b. 
DEV.to post 2 (practice)| ☐ | | | diff --git a/docs/announce/hf_space_backlinks.md b/docs/announce/hf_space_backlinks.md new file mode 100644 index 0000000..3359f64 --- /dev/null +++ b/docs/announce/hf_space_backlinks.md @@ -0,0 +1,165 @@ +# HF Space + model-card back-links + +HF pages are high-authority. Two kinds of back-links matter: + +1. **Outbound** from the KakeyaLattice HF Space to the GitHub repo, + the PyPI page, and (once minted) the arXiv abstract. +2. **Inbound** from high-traffic model cards (Qwen3 family, Llama-3 + family, DeepSeek-R1-Distill, GLM-4, Gemma-4) that mention the + KakeyaLattice Space under a "Related projects" section. + +## 1 — Space outbound links + +### Status + +The Space `huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress` +already carries outbound links via its `README.md` (sourced from +`demos/hf_llama_kakeyalattice/SPACE_README.md` in this repo): + +- GitHub: +- PyPI: +- Paper directory: `reports/paper/` (inside the GitHub repo) +- Stage 0.75 DSv4 findings: `reports/v1_5_release/dsv4_stage075/FINDINGS.md` + (inside the GitHub repo) + +### Suggested tightening (to be applied once arXiv ID mints) + +Replace the "Paper: `reports/paper/`" line with: + +```markdown +- Paper: [arXiv:](https://arxiv.org/abs/) · [PDF in repo](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/reports/paper/kakeyalattice.pdf) +``` + +This adds the arXiv URL as a second authority anchor; Perplexity / +ChatGPT both follow arXiv URLs and boost content that includes one. + +### Suggested tightening (now, independent of arXiv) + +Add an arXiv-style badge block at the top of the Space README so a +human visitor sees the stack of attestations immediately: + +```markdown +[![PyPI](https://img.shields.io/pypi/v/kakeyalattice.svg)](https://pypi.org/project/kakeyalattice/) +[![GitHub](https://img.shields.io/badge/GitHub-source-181717?logo=github)](https://github.com/FluffyAIcode/LLM-KV--Cache-compress) +[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/LICENSE) +``` + +This is a 3-line edit to `demos/hf_llama_kakeyalattice/SPACE_README.md` +plus a push to the Space via `huggingface_hub.HfApi.create_commit(...)`. +Can be done in the same session as the next Space push (e.g. when the +arXiv ID lands). + +### Collection pin + +Create an HF Collection titled **"KV-cache compression"** (user-scope) +containing: + +1. The Space `FluffyAIcode/LLM-KA-Cache-Compress`. +2. Any future paper page on HF (once the arXiv ID is registered via + HF's [Papers](https://huggingface.co/papers) system). +3. External papers we compare against (TurboQuant's HF page if it + exists; KIVI's; etc.) for topical clustering. + +Collections are a moderate GEO signal because they show up on each +member's sidebar; a KV-cache-compression Collection that names +adjacent methods strengthens our authority graph position. + +## 2 — Model-card inbound link-backs + +### Rationale + +A PR on a popular model card's `README.md` that adds a line like + +> **Related projects**: [KakeyaLattice](...) — drop-in KV-cache +> compression for this model, 2.4×–2.8× CR at <1 % ppl loss. + +…is a high-leverage move **when it lands**. Model cards are among +the highest-authority pages on Hugging Face and are indexed +aggressively by AI answer engines. The success rate depends entirely +on the model author's review latency. 
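Both kinds of edit in this file can be scripted with `huggingface_hub`. A minimal sketch, assuming a write-scoped token in `HF_TOKEN`; paths and commit messages are illustrative:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment

# Section 1: push the tightened Space README directly.
api.upload_file(
    path_or_fileobj="demos/hf_llama_kakeyalattice/SPACE_README.md",
    path_in_repo="README.md",
    repo_id="FluffyAIcode/LLM-KA-Cache-Compress",
    repo_type="space",
    commit_message="docs: add PyPI / GitHub / License badges",
)

# Section 2: file a model-card PR instead of pushing. create_pr=True
# lands the edit in the model repo's Community tab for review.
api.upload_file(
    path_or_fileobj="qwen3_4b_card_edited.md",  # locally edited card
    path_in_repo="README.md",
    repo_id="Qwen/Qwen3-4B",
    repo_type="model",
    create_pr=True,
    commit_message="docs: add KakeyaLattice to Related projects",
)
```

The candidate cards for the second call are listed below.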
+ +### Candidates (ordered by expected ROI) + +| model card | reviewer | expected decision time | notes | +|:-----------|:---------|:-----------------------|:------| +| [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | Alibaba / Qwen team | medium (they review community PRs) | natural fit — the Space uses Qwen3-0.6B as the default | +| [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | Alibaba / Qwen team | medium | our strongest benchmark number (2.77× @ 2 % \|Δppl\|) uses this | +| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | Meta | low (gated community edits) | try but do not expect acceptance | +| [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | DeepSeek team | medium | we have a full benchmark row | +| [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) | Zhipu team | medium | +37.8 % compression advantage over TurboQuant @ 2 % \|Δppl\| — our biggest per-model win | +| [google/gemma-4-e4b](https://huggingface.co/google/gemma-4-e4b) | Google team | low (gated community edits) | both codecs saturate at 3.04× so we look good but not differentiated | + +### Suggested PR body (per model card) + +Adapt per model card. This template is for Qwen3-4B; rewrite the +compression-ratio sentence to cite the actual measured number per +model. + +--- + +**Title**: `docs: add KakeyaLattice to Related projects` + +**Body**: + +```` +Adding a single line to the "Related projects" section linking to +KakeyaLattice, a drop-in `transformers.DynamicCache` subclass that +compresses the KV cache of Qwen3-4B **2.40× at ≤ 1 % perplexity +loss** and **2.77× at ≤ 2 %** (real vLLM prefill + real FlashAttention +bf16 on NVIDIA H200, WikiText-103 n=8 × 64 evaluation positions per +passage = 512 positions per channel; raw JSON at +https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/v1_4_release/kv_128k_isoppl_n8). + +Usage is three lines: + +```python +from kakeyalattice.hf import KakeyaLatticeCache +cache = KakeyaLatticeCache( +    variant="e8", q_range=38, +    num_hidden_layers=model.config.num_hidden_layers, +    head_dim=model.config.head_dim, +) +out = model.generate(**inputs, past_key_values=cache, use_cache=True) +``` + +- Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress +- PyPI: https://pypi.org/project/kakeyalattice/ +- Live demo: https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress +- Citation: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/CITATION.cff + +The KakeyaLattice compare table in the README cites Qwen3-4B +alongside Qwen3-0.6B, GLM-4-9B-Chat, Gemma-4-E4B, and +DeepSeek-R1-Distill-Qwen-1.5B. No change to this model card's +numeric claims or recommended usage — this is a pointer-only edit. +```` + +**Diff to propose** (adapt path if the model card uses different +section names): + +```diff ++ ## Related projects ++ ++ - [KakeyaLattice](https://github.com/FluffyAIcode/LLM-KV--Cache-compress) ++   — drop-in `transformers.DynamicCache` subclass, 2.4×–2.8× KV cache ++   compression at under 1 % perplexity loss on this model. +``` + +If the card already has a "Related projects" or "Community extensions" +section, insert a single bullet there instead. + +### How to file + +Each HF repo supports **"Community" tab → "New discussion"** for a +soft approach before a PR, and **"Community" tab → "Pull request"** +for the PR itself.
For model authors who have never interacted +publicly, a discussion first is polite; for authors who merge +community PRs regularly (Alibaba-Qwen, HF's own models), a PR is +faster. + +## Done when + +- Space README carries PyPI + GitHub + License badges at the top. +- At least two of the six model cards above carry a "Related projects" + entry linking back to this repo. +- The HF Collection "KV-cache compression" is live and pinned to the + Space. diff --git a/docs/announce/papers_with_code/SUBMISSION.md b/docs/announce/papers_with_code/SUBMISSION.md new file mode 100644 index 0000000..a655e73 --- /dev/null +++ b/docs/announce/papers_with_code/SUBMISSION.md @@ -0,0 +1,117 @@ +# Papers with Code submission + +Papers with Code (PwC) is the canonical "benchmarks-by-method" index +for ML and gets crawled daily by Google Scholar, Semantic Scholar, +Connected Papers, and AI answer engines. Entries under + are the +first result on queries like "KV cache compression benchmark" on all +four of those retrievers. + +## Prerequisites + +- arXiv ID minted (see [`../arxiv/SUBMISSION.md`](../arxiv/SUBMISSION.md)). + **PwC requires an arXiv link** for paper submissions. The benchmark + submission below is possible without arXiv but works better with it. +- GitHub repo publicly visible (it is). +- PyPI package published (it is, v1.5.0). + +## What to submit + +Two entries, filed separately: + +1. **Paper submission** — creates the canonical PwC page for + KakeyaLattice. +2. **Benchmark submission** — four benchmark rows on the + KV-cache-compression task, one per model we measure. + +## 1 — Paper submission + +Form: + +### Metadata to paste + +- **Title**: `KakeyaLattice: Nested-Lattice KV-Cache Compression for Large Language Models` +- **Abstract**: paste from the arXiv abstract. +- **arXiv URL**: `https://arxiv.org/abs/` (fill in once minted). +- **PDF URL**: `https://arxiv.org/pdf/.pdf` (auto-populated from + arXiv URL in most cases). +- **Tasks**: add `KV Cache Compression`, `Language Modelling`, + `Quantization`. +- **Methods**: add `Nested-Lattice Quantization`, `Sylvester-Hadamard + Rotation`, `E8 Lattice`, `D4 Lattice`. PwC will show a "new method" + prompt — accept and fill in the short method description below. +- **Source code**: + `https://github.com/FluffyAIcode/LLM-KV--Cache-compress` +- **Framework**: `PyTorch`, `Hugging Face transformers`, `vLLM`. + +### Method description (to paste into the "New method" dialog) + +``` +KakeyaLattice is a nested-lattice quantiser for the KV cache of +transformer language models. Each K or V vector is rotated by a +Sylvester-Hadamard matrix H/sqrt(D), scaled adaptively by its L2 +norm, and snapped to the closest point of a nested D4 (dim 4) or E8 +(dim 8) lattice using Conway-Sloane closest-point decoders. The +rotation gaussianises the heavy-tailed, non-isotropic KV activations +real LLMs produce; the lattice snap then exploits the densest known +sphere packings in dimensions 4 and 8 to beat any per-channel scalar +quantiser at the same bit budget. The codec is stateless per-vector, +so it supports streaming / online decode without calibration or +warm-up. +``` + +## 2 — Benchmark submissions + +Four rows under +. One row per +model we measure, each citing the same arXiv paper. + +### Row template + +The PwC benchmark form asks for: dataset, metric, value, extra +info, model name, link to paper. Fill in per the table below; the +"extra info" slot is where we disclose the quality target and CI +protocol. 
+ +| model | dataset | metric | value | extra info | +|:-------------------------------|:--------------|:------------------------------|:--------|:-------------------------------------------------| +| Qwen3-4B | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **2.77×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 | +| GLM-4-9B-Chat | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **2.44×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 | +| Gemma-4-E4B | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **3.04×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 (tied with TurboQuant at saturation) | +| DeepSeek-R1-Distill-Qwen-1.5B | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **2.43×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 | + +Numbers taken directly from +`reports/v1_4_release/kv_128k_isoppl_n8/V14_VS_TQ_ISOPPL_REPORT.md`. +Reproducible via `benchmarks/extract_iso_ppl_table.py` — the PR body +can link to the reproducer so PwC reviewers can check. + +### DeepSeek-V4-Flash (separate task entry) + +Also file a row under + (or a new custom +task if V4-Flash is not already listed): + +| model | dataset | metric | value | extra info | +|:-------------------|:----------------|:------------------------------------|:-----------|:---------------------------------------------------| +| DeepSeek-V4-Flash | WikiText-style | KV bit reduction vs FP8 @ matched quality | **−22.0 %** | n=8 H200, 3/43 SWA + 20/43 c4a + 20/43 c128a layers, layer-weighted rel-MSE 0.959 ± 0.024 (95 % CI) vs hardware FP8 per-64-block | + +Cites `reports/v1_5_release/dsv4_stage075/FINDINGS_N8.md`. + +## After submission + +1. The PwC entry lives at `https://paperswithcode.com/paper/kakeyalattice`. +2. Add the PwC URL to `README.md` as an additional badge: + ```markdown + [![PapersWithCode](https://img.shields.io/badge/Papers%20with%20Code-kakeyalattice-21caf5.svg)](https://paperswithcode.com/paper/kakeyalattice) + ``` +3. Add the PwC URL to `CITATION.cff` under `identifiers`. + +## Why PwC matters for GEO + +PwC ranks disproportionately well on **benchmark-comparison queries**, +which is what procurement-stage decision-makers actually search for. +A query like `"KV cache compression benchmark 2026"` returns the PwC +leaderboard first; having two rows there named KakeyaLattice puts us +in front of every reader of that page. The NexusQuant precedent +confirms this: their PwC page has been cited in three independent +papers since landing, entirely through organic discovery. diff --git a/docs/announce/vllm_integration_issue.md b/docs/announce/vllm_integration_issue.md new file mode 100644 index 0000000..42cc057 --- /dev/null +++ b/docs/announce/vllm_integration_issue.md @@ -0,0 +1,216 @@ +# vLLM integration discussion — pre-written issue body + +Paste the contents of this file into a **new GitHub Discussion or Issue** +at . The discussion / issue +format is deliberate — we do not yet have a PR to show, and vLLM +maintainers prefer a discussion when the proposal needs scoping before +code. + +## Where to file + +- **Primary choice**: `https://github.com/vllm-project/vllm/discussions` + under the category **"RFC"** (Request For Comment). +- **Fallback**: `https://github.com/vllm-project/vllm/issues/new/choose` + → "Feature request". Use this only if Discussions are disabled for + your account or the maintainers redirect you. 
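If you land on the issue fallback, the filing itself can be scripted (the Discussion route has no first-class `gh` command, so that one goes through the web UI). A sketch, assuming the markdown body below has been saved to `/tmp/kakeya_rfc.md`; labels usually require triage permissions, so ask for them in-thread:

```bash
gh issue create \
  --repo vllm-project/vllm \
  --title "[RFC] Third-party KV-cache quantiser plugin: KakeyaLattice (nested D4/E8 lattice, 2.4-2.8x CR at <1% |dppl|)" \
  --body-file /tmp/kakeya_rfc.md
```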
+ +## Suggested title + +``` +[RFC] Third-party KV-cache quantiser plugin: KakeyaLattice (nested D4/E8 lattice, 2.4–2.8× CR at <1% |Δppl|) +``` + +## Suggested labels + +Ask the maintainers to add (one of these at filing time is plenty — +they will add the rest): + +- `new-feature-proposal` +- `kv-cache` +- `quantization` +- `RFC` + +## Body (paste verbatim from here) + +```markdown +## Proposal + +Ship a third-party plugin path for a new KV-cache quantiser, +**KakeyaLattice**, a nested D4 / E8 lattice codec for transformer +KV activations that lands via vLLM's existing `general_plugins` entry +point. Code already exists at + and is on +PyPI as `kakeyalattice` (v1.5.0, MIT-licensed). + +I am filing this as an RFC rather than a PR because the code I have +works today as a capture / replace monkey-patch and is **not yet a +clean integration into vLLM's paged KV manager**. I want to align on +the integration path before writing that bridge, to avoid a duplicate +of the `QuantoQuantizedCache` / `HQQQuantizedCache` work in +`transformers` that landed piecemeal. + +## What KakeyaLattice does + +- **Input**: a `[seq, heads, head_dim]` K or V tensor from any + transformer attention layer. +- **Pipeline**: Sylvester–Hadamard rotate → per-vector adaptive L² + scale → nested D4 (dim-4) or E8 (dim-8) lattice closest-point + encode → store indices. +- **Decode**: one matmul + one unscale. +- **Operating points**: three canonical `q_range` settings + (10 aggressive, 38 balanced, 152 near-lossless). + +It is a **stateless per-vector function** — no calibration, no +warm-up, no cross-token state. Streaming / online decode is +supported by construction. + +## Why another KV quantiser + +vLLM already ships `--kv-cache-dtype fp8` and interoperates with +`transformers`'s `QuantoQuantizedCache` / `HQQQuantizedCache` +classes. What changes: + +At the **tight quality budget most production deployments tune for** +(≤ 1 % |Δppl|), KakeyaLattice compresses **9 %–38 % harder** than +TurboQuant (the strongest published per-channel scalar baseline) on +four open-source model families. Real vLLM prefill + real +FlashAttention bf16 forward on NVIDIA H200, WikiText-103, n=8 +passages × 64 eval positions per passage, 128 k context: + +| model | KakeyaLattice CR | TurboQuant CR | advantage | +|:-------------------------------|-----------------:|--------------:|----------:| +| Qwen3-4B | **2.40×** | 1.95× | +23.3 % | +| GLM-4-9B-Chat | **1.73×** | out of range | KL only | +| Gemma-4-E4B | **3.04×** | 3.04× | tied | +| DeepSeek-R1-Distill-Qwen-1.5B | **2.29×** | 2.09× | +9.2 % | + +At ≤ 2 % |Δppl| the advantage grows to +27, +38, tied, +3 % +respectively. Raw JSON + reproducer at +. + +The mechanism: real LLM KV activations are **heavy-tailed and +non-isotropic**. Per-channel scalar quantisers allocate bits for the +worst-case channel. A Sylvester–Hadamard rotation empirically +gaussianises the distribution (see the non-Gaussian audit in +), +after which D4/E8 lattice quantisation exploits the densest sphere +packings in those dimensions. + +## What already exists + +In the [kakeyalattice repo](https://github.com/FluffyAIcode/LLM-KV--Cache-compress): + +1. **`kakeyalattice.hf.KakeyaLatticeCache`** — a drop-in subclass + of `transformers.DynamicCache`. Works today with any + `model.generate(past_key_values=cache, ...)` call. The + [HF Space](https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress) + uses this on Qwen3-0.6B. +2. 
**`vllm_backend/kakeya_v1_4_snapshot/`** — a +   `vllm.general_plugins` entry point that monkey-patches the +   Attention path on Qwen2/3, Gemma4, GLM to capture +   post-QK/V-norm, pre-RoPE K and V and replace them with the +   roundtripped versions. **This is the "capture and replace" mode +   used to generate the 128 k iso-PPL tables above**. It works +   today on vLLM `0.19.2rc1.dev100` with `transformers` 5.5.2. + +The monkey-patch mode is **not** a real memory-saving integration — +it stores the reconstructed tensor in the model's KV dtype. That +is the gap I want to close with vLLM's help. + +## Integration path I'd like to RFC + +Three possible landing points, from lowest to highest invasiveness: + +### Path A — register as a `KVCacheQuantConfig` backend + +`vllm/config.py` has a `KVCacheQuantConfig` enum and a registry in +`vllm/kv_transfer/`. Add a `"lattice"` value that dispatches to a +`KakeyaLatticeKVManager` implementing the existing +`KVCacheManagerBase` protocol. The manager would: + +- On prefill / decode: encode K and V blocks via E8 closest-point, +  store lattice indices in the paged KV buffer instead of bf16 / +  fp8 values. +- On attention read: one matmul to decode (per 8-D block), then +  the existing FlashAttention path runs unchanged. + +Pros: uses vLLM's own page-allocator; no kernel changes. +Cons: decode overhead per attention read is ~0.25 ms on H200 today +(< 2 % of bf16 decode step at batch 1); on smaller GPUs this might +be worse. + +### Path B — fused decode in the attention kernel + +Write a Triton kernel that reads lattice indices + fused-unscales +during the QKᵀ step of FlashAttention. Faster but **invasive** and +requires maintaining a codec-aware variant of FlashAttention. + +### Path C — compressed cache in `nextn`-style hot tier only + +Store bf16 for the last ~1 k tokens (active decode window), encode +the rest via E8. Trades away a slice of the HBM win for zero +decode-path complexity. + +**My default proposal is Path A** because it matches what vLLM +already does for INT8 / FP8 and keeps the blast radius small. + +## What I'd like from the vLLM maintainers + +1. Agreement on **Path A** (or a pointer to Path B / C if you see a +   better fit). +2. Pointer to the **exact interface** you want a new KV-cache backend +   to implement. I read `vllm/worker/model_runner.py`, +   `vllm/kv_transfer/`, and `vllm/core/block/block_manager.py`, but +   the canonical integration point has moved around in the 0.6 → 1.0 +   transition. +3. A **`vllm-plugin-kakeyalattice`** naming convention, if you'd +   prefer the plugin live under a `vllm-project/*-plugin-*` naming +   scheme rather than in my own namespace. + +Happy to open the PR as soon as we've agreed on the path and interface. +Paper draft at + +(arXiv submission pending). + +## Compliance note + +All numbers above come from real vLLM prefill + real FlashAttention +bf16 forward on NVIDIA H200. No mocks, no fallbacks. The `reports/` +tree of the kakeyalattice repo carries a SHA-256 manifest so claims +are reproducible end-to-end from the committed JSON. +``` + +## How to follow up + +After filing: + +1. Post a one-line cross-reference from the kakeyalattice repo — +   either on the `AgentMemory/discovery-runbook-c478` PR (this PR) or +   as a new issue titled `discovery: vLLM RFC filed at vllm-project/vllm#`. + +2. When a vLLM maintainer engages, reply **within 4 hours** during +   Pacific business hours. This is the single highest-leverage +   engagement of the six discovery tasks; the thread's visibility +   drops sharply once the first reply goes stale.
+ +3. If the RFC gets closed without a maintainer reply within 7 days, +   file it as an issue (not a discussion) tagged `feature-request` + +   `kv-cache` and @-mention a maintainer who has touched +   `vllm/core/block/block_manager.py` in the last 90 days. A +   `git log` + email-on-commits query can find the right handle. + +## If vLLM wants numbers beyond what we have + +Two likely maintainer asks, and the short response to each: + +- **"Have you measured real HBM savings, not just rel-MSE?"** — No; +  that's exactly why Path A is the RFC. The reference impl round-trips +  K/V through the codec. Path A is the first integration where CR +  equals HBM ratio. + +- **"Have you benchmarked against KIVI on Qwen3?"** — Not yet; a direct +  iso-bit head-to-head vs KIVI is on the roadmap. Our current +  baseline is TurboQuant (the strongest published scalar KV quantiser +  at our bit budgets) and the Papers with Code submission at +   will carry the +  KIVI comparison once the arXiv ID is minted.
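## Appendix: the E8 snap, sketched

For maintainers who want to sanity-check the cost claims in Path A, a minimal sketch of the Conway-Sloane closest-point rule the codec builds on. The function names are ours, not the package's, and the real codec adds the Hadamard rotation, adaptive scaling, and index packing on top:

```python
import torch

def closest_point_dn(x: torch.Tensor) -> torch.Tensor:
    # Closest point in D_n = {z in Z^n : sum(z) even}: round every
    # coordinate; where the rounded sum is odd, re-round the coordinate
    # with the largest rounding error in the other direction.
    f = x.round()
    err = x - f
    odd = f.sum(-1).remainder(2) != 0
    worst = err.abs().argmax(-1, keepdim=True)
    step = torch.where(err.gather(-1, worst) >= 0, 1.0, -1.0)
    fixed = f.scatter_add(-1, worst, step)
    return torch.where(odd.unsqueeze(-1), fixed, f)

def closest_point_e8(x: torch.Tensor) -> torch.Tensor:
    # E8 is the union of D8 and D8 + 1/2: decode both cosets and keep
    # whichever candidate lies nearer to x.
    c0 = closest_point_dn(x)
    c1 = closest_point_dn(x - 0.5) + 0.5
    d0 = ((x - c0) ** 2).sum(-1, keepdim=True)
    d1 = ((x - c1) ** 2).sum(-1, keepdim=True)
    return torch.where(d0 <= d1, c0, c1)
```

A rotated `head_dim=128` vector quantises as 16 independent 8-D snaps: `closest_point_e8(y.view(-1, 8))`.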