diff --git a/docs/announce/arxiv/SUBMISSION.md b/docs/announce/arxiv/SUBMISSION.md new file mode 100644 index 0000000..5ac233e --- /dev/null +++ b/docs/announce/arxiv/SUBMISSION.md @@ -0,0 +1,203 @@ +# arXiv submission checklist + +Everything arXiv needs is in `reports/paper/`. This document is the +step-by-step submission recipe + the metadata to paste into the arXiv +web form at . + +**Submitter**: Allen Li (account name per paper author block in +`reports/paper/kakeyalattice.tex`). + +**Expected processing time**: 1 working day for new-author +endorsement if this is Allen's first arXiv submission (cs.LG is an +endorsement category). ~4–8 hours for indexing once accepted. + +## Pre-flight checks (do before opening the submission form) + +1. **LaTeX source compiles cleanly** — `pdflatex` + `bibtex` on +   `reports/paper/kakeyalattice.tex` produces the committed PDF. +   Re-run once on your local machine with the same texlive version +   arXiv uses (TeX Live 2024 at time of writing) to catch any +   package drift. + +2. **All referenced files are under `reports/paper/`** — no images +   or `.bib` files outside that directory, because arXiv packages +   only what you upload. + +3. **No `\cite{}` entries point at non-existent `.bib` keys** — +   `bibtex` should exit with zero warnings. A single unresolved cite +   produces a "not found" badge in the arXiv listing. + +4. **ORCID attached** — if Allen has an ORCID, it should go on the +   submission form under "Author information" so the arXiv listing +   gains a verifiable identity anchor (a GEO signal). + +5. **License choice** — we recommend **arXiv license +   `CC BY 4.0`** (the most permissive arXiv-compatible license; it +   allows Perplexity / ChatGPT to ingest and quote the paper, which +   is the whole point). The paper text does not need to change to +   match the CC BY license; the license applies to the arXiv copy +   alone. + +## Bundle — what to upload + +Upload **all** of: + +1. The `.tex` source: `reports/paper/kakeyalattice.tex`. +2. Any `.bib` file the paper uses. Inspect the `.tex` for +   `\bibliography{...}` — if it references a separate `.bib`, upload +   that too. If the bibliography is embedded in the `.tex` via +   `\begin{thebibliography}`, no separate upload is needed. +3. All figure files referenced by `\includegraphics{...}`. + +**Recommended**: upload a **single `.zip`** containing everything +under `reports/paper/` (except `reports/paper/README.md`, which +arXiv does not need). A command sketch follows the category metadata below. + +## Metadata to paste into the arXiv form + +### Title + +``` +KakeyaLattice: Nested-Lattice KV-Cache Compression for Large Language Models +``` + +### Authors + +``` +Allen Li (Individual researcher) +``` + +Paste exactly as the author block in `reports/paper/kakeyalattice.tex` +renders. If Allen has an ORCID, paste it as well. + +### Abstract + +Paste the contents of the `\begin{abstract} ... \end{abstract}` block +from `reports/paper/kakeyalattice.tex`. The abstract already names the +key search terms ("KV cache", "lattice quantization", "transformer +inference") that arXiv's fulltext search and Google Scholar will +index. + +### Primary category + +**`cs.LG`** — Machine Learning. + +### Cross-list categories + +**`cs.CL`** — Computation and Language. +**`cs.IT`** — Information Theory. The nested-lattice quantisation +framing belongs in `cs.IT` and this cross-list **meaningfully widens +the retrieval surface** for searchers using information-theory +vocabulary.
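Pre-flight checks 1–3 and the bundle step above can be scripted so they are not lost in the form-filling. A minimal sketch, assuming a local TeX Live install and the Info-ZIP `zip` CLI; `arxiv_bundle.zip` is an illustrative name, not something arXiv requires:

```bash
# Reproduce arXiv's compile cycle (pre-flight check 1) and surface
# unresolved citations (check 3). Run from the repo root.
cd reports/paper
pdflatex -interaction=nonstopmode kakeyalattice.tex
bibtex kakeyalattice       # must exit with zero warnings (check 3)
pdflatex -interaction=nonstopmode kakeyalattice.tex
pdflatex -interaction=nonstopmode kakeyalattice.tex   # settle cross-refs

# Build the single-zip bundle, minus the README arXiv does not need.
# Keep the generated .bbl in the zip: arXiv compiles your sources but
# does not run bibtex, so the bibliography must ship pre-built.
zip -r ../../arxiv_bundle.zip . -x README.md -x "*.aux" -x "*.log" -x "*.out"
```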
+ +### Comments field + +The "Comments" field becomes part of the arXiv listing header and is +read by Google Scholar and Perplexity. Recommend: + +``` +25 pages, 8 figures, 6 tables. Software release v1.5.0 at +https://github.com/FluffyAIcode/LLM-KV--Cache-compress. Live demo +at https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress. +PyPI: kakeyalattice. +``` + +Adjust the page / figure / table count after final compilation. + +### MSC / ACM classification + +**MSC**: `94A29` (Source coding, quantization), `68T07` (Artificial +neural networks and deep learning). + +**ACM class**: `I.2.7` (Natural Language Processing), `E.4` +(Coding and Information Theory). + +### Report number + +Leave blank. + +### Journal reference + +Leave blank until accepted at a venue. + +### DOI + +Leave blank — this field is for a journal-assigned DOI; arXiv mints its own DOI automatically once the submission is accepted. + +## Post-submission actions + +Once the arXiv ID is assigned: + +1. **File a one-commit PR titled** `arxiv: wire minted arXiv ID into +   README + CITATION.cff + ACKNOWLEDGMENTS.md + paper/README.md` +   with the following changes: + +   - **README.md badge** — replace the current `DOI — pending` badge: + +     ```markdown +     [![arXiv](https://img.shields.io/badge/arXiv--b31b1b.svg)](https://arxiv.org/abs/) +     ``` + +   - **CITATION.cff** — add `identifiers` entries at the top level and under `preferred-citation`: + +     ```yaml +     identifiers: +       - type: other +         value: "arXiv:" +         description: "arXiv preprint for the companion technical report" +     preferred-citation: +       # ... existing entries ... +       identifiers: +         - type: other +           value: "arXiv:" +     ``` + +   - **ACKNOWLEDGMENTS.md** — under "Corrections and reviewers" add +     a line: "Companion preprint: arXiv:". + +   - **reports/paper/README.md** — add a "Published at" line at the +     top linking to `https://arxiv.org/abs/`. + +2. **Tag a GitHub release** — `v1.5.0-arxiv` — so the DOI minted by +   Zenodo (if you enable Zenodo's GitHub integration) points at the +   exact commit the arXiv abstract references. + +3. **Submit the same arXiv ID to Papers with Code** — see +   [`../papers_with_code/SUBMISSION.md`](../papers_with_code/SUBMISSION.md). + +## If the submission is held by arXiv for review + +cs.LG is an endorsement category. If this is Allen's first cs.LG +submission, arXiv will place the submission in `hold` status until +an existing cs.LG author endorses it. Two paths: + +- **Passive**: wait for arXiv's own moderation. Takes 1–3 business +  days; usually succeeds for well-formatted submissions with a clear +  methodology and real benchmarks. +- **Active**: ask a collaborator who has ≥2 prior cs.LG submissions +  to endorse via arXiv's web form. We recommend asking someone who +  is cited in `ACKNOWLEDGMENTS.md` (Zandieh et al. from TurboQuant, the +  KIVI authors, or the vLLM authors are natural candidates — they +  benefit from the citation and the endorsement is one click for +  them). + +## Why an arXiv ID matters for GEO + +An arXiv ID is the single strongest authority anchor in ML research +discovery: + +- Google Scholar indexes arXiv the same day an ID mints. Our paper +  becomes findable on queries like `"nested lattice KV cache"`, +  `"E8 lattice LLM"`, `"Hadamard KV quantization"` — today it is not +  on Google Scholar at all. +- Semantic Scholar, Connected Papers, Emergent Mind, and +  Papers-with-Code ingest arXiv nightly. +- Perplexity and ChatGPT-with-search treat arXiv citations as +  first-class sources and are measurably more likely to quote an +  arXiv-backed claim.
+- AI answer engines weight arXiv-hosted content roughly one order of +  magnitude higher than non-arXiv-hosted research reports in topic +  queries like "best LLM KV compression method 2026". + +Completing step 2 of the runbook is expected to be the single +largest lift in public discoverability of KakeyaLattice. diff --git a/docs/announce/dev_to/post_1_theory.md b/docs/announce/dev_to/post_1_theory.md new file mode 100644 index 0000000..9357c96 --- /dev/null +++ b/docs/announce/dev_to/post_1_theory.md @@ -0,0 +1,226 @@ +--- +title: E8-lattice KV cache compression, from first principles +published: false +description: Why an 1867 math trick (Sylvester-Hadamard rotation) plus a 1999 algorithm (Conway-Sloane E8 closest-point) beats scalar quantization by 9-38% on modern LLM KV caches. Drop-in DynamicCache subclass, pip install. +tags: llm, quantization, python, performance +cover_image: https://raw.githubusercontent.com/FluffyAIcode/LLM-KV--Cache-compress/main/assets/hero_pareto.png +canonical_url: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/blog/2026-04-kakeyalattice-v1-5.md +--- + +## TL;DR + +The KV cache is the biggest memory consumer in modern LLM serving. +Every per-channel scalar quantizer you've ever tried — INT8, +SmoothQuant-KV, TurboQuant, `QuantoQuantizedCache`, KIVI — is +leaving **9-38% compression on the table** at the quality budgets +production cares about, and for a fixable reason. This post explains +what the fix is (an 1867 ±1 matrix + a 1999 lattice algorithm), +why it works (real LLM KV is heavy-tailed and non-isotropic), and +how it ships (`pip install kakeyalattice`, drop-in +`transformers.DynamicCache` subclass, 10 lines of integration). + +Numbers below are from real vLLM prefill + real FlashAttention bf16 +on NVIDIA H200, 128k context, WikiText-103, n=8 passages × 64 eval +positions per passage. Raw JSON and reproducer at the +[GitHub repo][repo]. Nothing is mocked. + +## The problem scalar quantizers have + +At 128k context on Qwen3-4B the KV cache alone is 18 GiB — larger +than the 8 GiB of model weights. At 1M context the KV cache is the +**only** memory cost that matters. Compressing it without hurting +perplexity is the fastest path to more concurrent users per GPU +node. + +The standard approach is **per-channel scalar quantization**: for +each KV channel, store an INT4 or INT8 value plus a per-channel +scale. SmoothQuant-KV (Xiao et al., ICML 2023, +[arXiv:2211.10438](https://arxiv.org/abs/2211.10438)), +`QuantoQuantizedCache` in HF transformers, and +TurboQuant (Zandieh et al., 2024, +[arXiv:2406.17005](https://arxiv.org/abs/2406.17005)) all follow +this recipe with different scale-selection tricks. At the tight +quality budget production deployments tune for (**≤1% perplexity +loss**), the strongest published scalar quantizer (TurboQuant) tops +out at compression ratios like: + +- Qwen3-4B: 1.95× +- GLM-4-9B-Chat: *cannot reach 1% at any bit setting* +- DeepSeek-R1-Distill-Qwen-1.5B: 2.09× + +Why can't it do better? **Because real LLM KV activations are +heavy-tailed and non-isotropic.** A per-channel scalar quantizer +must budget bits for the worst-case channel (the one with the +heaviest tail), which wastes bits on every other channel. At +aggressive compression ratios this dominates. + +We verified this on DeepSeek-V4-Flash with trained weights: the +isotropy-variance ratio (variance of the largest-variance coordinate +divided by the smallest) across the `csa_pool_kv_ratio4` stream +is **732,400**.
One coordinate out of 512 has variance roughly 730,000 times +larger than another. A scalar quantizer has to +accommodate both. + +## The fix, in two steps + +### Step 1 — Sylvester-Hadamard rotation (1867) + +A **Hadamard matrix** H of size D×D has entries in {+1, −1} and +satisfies `Hᵀ H = D · I`. James Joseph Sylvester constructed one +in 1867 as a recursive ±1 sign-pattern: + +``` +H_2 = [[+1, +1], +       [+1, −1]] + +H_{2D} = [[H_D,  H_D], +          [H_D, −H_D]] +``` + +For a KV vector `x ∈ R^D`, the rotation `y = H x / √D` is: + +- **Norm-preserving** — `Hᵀ H / D = I`, so `||y|| = ||x||`. +- **Coordinate-mixing** — each output coordinate is a ±1 sum of all +  input coordinates, divided by √D. +- **Cheap** — computable in `O(D log D)` via a radix-2 algorithm +  (essentially an FFT without complex numbers); a PyTorch sketch +  appears after the benchmark table below. + +Empirically, on every LLM family we tested (Qwen3, Llama-3, +DeepSeek, GLM, Gemma), applying Sylvester-Hadamard rotation to the +KV vectors **gaussianizes** their distribution: kurtosis drops +toward 3, isotropy-variance ratio falls by 1–3 orders of magnitude, +and the Wasserstein-2 distance to a matched Gaussian drops into +the 0.05–0.5 range. We call this the **non-Gaussian audit** (paper +gates: kurt<0.5, iso-var<1.5, had-var<1.5, W2/σ<0.05) and run it +as a sanity check before claiming the rotation works on a new +model family. + +### Step 2 — nested-lattice closest-point snap + +Once the vector is rotated into a well-behaved distribution, we +quantize **jointly across groups of coordinates** (4 or 8 at a +time) by snapping each group to its closest point on a lattice. + +The `D4` lattice in 4 dimensions and the `E8` lattice in 8 +dimensions are the **densest known sphere packings** in those +dimensions ([Conway & Sloane, 1999, +doi:10.1007/978-1-4757-6568-7](https://doi.org/10.1007/978-1-4757-6568-7)). +Density here means: for a given quantization error budget, a D4 or +E8 lattice packs more codepoints into the space than any arrangement +of axis-aligned scalar codepoints. Specifically: + +- D4 gains **1.5 dB** in packing efficiency over scalar per-axis +  quantization at the same bit rate. +- E8 gains **3.0 dB** — roughly a 2× efficiency win. + +Translated to LLM KV caches, this means: at the same total bit +budget, D4/E8 lattice-quantized K/V vectors have lower +reconstruction MSE than scalar-quantized vectors **by a provable +amount**. And since we've rotated the vectors to be near-Gaussian +before snapping, the classical nested-lattice shaping-gain bound +(Zamir & Feder 1996, +[doi:10.1109/18.508838](https://doi.org/10.1109/18.508838)) +actually applies — the theoretical gain is achievable, not +hypothetical. + +The closest-point decoders for D4 and E8 are textbook 1999 +algorithms. D4's is a 4-case argmin on the integer lattice plus a +half-integer shift; E8's is a slightly more elaborate case +analysis on Z^8 plus D_8^+ coset selection. Both run in pure +PyTorch at roughly the cost of a LayerNorm. + +## The numbers + +Head-to-head with TurboQuant on iso-PPL compression ratio at +≤1% perplexity loss (higher CR = more bits saved at same quality): + +| model | KakeyaLattice CR | TurboQuant CR | KL advantage | +|:------|-----------------:|--------------:|-------------:| +| Qwen3-4B | **2.40×** | 1.95× | **+23.3%** | +| GLM-4-9B-Chat | **1.73×** | (unreachable) | KL only | +| Gemma-4-E4B | **3.04×** | 3.04× | tied (saturated) | +| DeepSeek-R1-Distill-Qwen-1.5B | **2.29×** | 2.09× | **+9.2%** | + +At ≤2% the KakeyaLattice advantage grows to +27%, +38%, tied, and +3% +respectively.
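As promised above, the Step 1 rotation fits in a dozen lines. This is a minimal sketch of the radix-2 fast Walsh-Hadamard transform, not the package's internal kernel; the function name is ours:

```python
import torch

def hadamard_rotate(x: torch.Tensor) -> torch.Tensor:
    # y = H x / sqrt(D) along the last dim, Sylvester (natural) ordering.
    # O(D log D) additions, no complex arithmetic.
    d = x.shape[-1]
    assert d & (d - 1) == 0, "last dim must be a power of 2"
    y = x.reshape(-1, d)
    h = 1
    while h < d:
        # butterfly: pair coordinates at stride h, emit (a + b, a - b)
        y = y.reshape(-1, d // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = torch.stack((a + b, a - b), dim=2)
        h *= 2
    return y.reshape(x.shape) / d ** 0.5
```

Because `H / sqrt(D)` is orthogonal and symmetric, applying `hadamard_rotate` twice recovers the input up to float error, which makes a convenient unit test.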
Raw JSON, extractor script, and hero chart generator +are all in the repo; the table above is regenerated from +`reports/v1_4_release/kv_128k_isoppl_n8/*.json` by running +`python benchmarks/extract_iso_ppl_table.py`. + +## Decode latency + +The extra step is one Hadamard rotate + one lattice snap + one +unscale per decode token per attention layer. Measured on NVIDIA +H200 across four models × three operating points: **~0.25 ms per +decode step**, or **<2% of a typical 15-30 ms bf16 decode step at +batch 1**. You will not notice it. + +## How to use it + +The KakeyaLattice-specific part is three lines of Python once the package is installed: + +```bash +pip install kakeyalattice +``` + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from kakeyalattice.hf import KakeyaLatticeCache + +tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B") +model = AutoModelForCausalLM.from_pretrained( +    "Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16 +).cuda() + +cache = KakeyaLatticeCache( +    variant="e8", q_range=38,   # balanced default: ~2.3x CR, <1% |Δppl| +    num_hidden_layers=model.config.num_hidden_layers, +    head_dim=model.config.head_dim, +    device="cuda", +) + +out = model.generate( +    **tok("Hello world", return_tensors="pt").to("cuda"), +    max_new_tokens=256, +    past_key_values=cache, +    use_cache=True, +) +``` + +That's it. Any `transformers` model whose `head_dim` is a power of 2 +and divisible by 8 (E8) or 4 (D4) works — Qwen3, Llama-3, +DeepSeek-R1-Distill, GLM-4, Gemma-4, Phi-3. + +## What KakeyaLattice does *not* do + +- **Weight quantization.** That's orthogonal — stack HQQ/GPTQ/AWQ +  weight quantization with KakeyaLattice KV compression. +- **Eviction.** SnapKV, H2O, Scissorhands are also orthogonal — +  they compose multiplicatively with KakeyaLattice. +- **Zero-latency decode.** The ~0.25ms/step overhead is real, just +  small. A fused Triton kernel would cut it further. +- **HBM savings in the Python reference impl.** Today +  `KakeyaLatticeCache` stores the reconstructed tensor in the +  model's KV dtype; the on-paper CR measures reconstruction +  quality, not HBM bytes. A native vLLM integration that stores +  lattice indices directly in the paged KV cache is in progress +  (see the [vLLM RFC][vllm-rfc]). + +## Try it + +- **Live demo (no install)**: +  +- **GitHub + paper + raw data**: +  +- **PyPI**: `pip install kakeyalattice` +- **Cite**: GitHub's sidebar "Cite this repository" widget +  (sourced from `CITATION.cff`). + +Pair this with the practice-first companion post, +["Qwen3 KV cache compression in 10 lines"][post2], if you want to +skip the theory and just ship it. + +[repo]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress +[vllm-rfc]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/docs/announce/vllm_integration_issue.md +[post2]: https://dev.to//qwen3-kv-cache-in-10-lines- diff --git a/docs/announce/dev_to/post_2_practice.md b/docs/announce/dev_to/post_2_practice.md new file mode 100644 index 0000000..74ee446 --- /dev/null +++ b/docs/announce/dev_to/post_2_practice.md @@ -0,0 +1,197 @@ +--- +title: Qwen3 KV cache compression in 10 lines of Python +published: false +description: A drop-in transformers.DynamicCache subclass that compresses Qwen3's KV cache 2.4-2.8x at under 1% perplexity loss. Three operating points, one pip install, no calibration.
+tags: python, llm, transformers, huggingface +cover_image: https://raw.githubusercontent.com/FluffyAIcode/LLM-KV--Cache-compress/main/assets/hero_pareto.png +canonical_url: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/blog/2026-04-kakeyalattice-v1-5.md +--- + +## TL;DR + +A 10-line integration to compress your Qwen3 / Llama-3 / DeepSeek / +GLM-4 / Gemma-4 KV cache 2.4×–2.8× at under 1% perplexity loss. +Works with any HF `transformers` model whose `head_dim` is a +power of 2 divisible by 4 or 8. No calibration, no warm-up, +streaming-safe. This post is the practice-first companion to +[the theory post][post1]. + +## The setup + +You've built an LLM inference service. It was fine until a customer +asked for a 128k context and your GPU melted. KV cache turns out to +be the biggest memory consumer by far — more than the model weights +at long contexts. Compressing the KV cache 2-3× at no quality cost +would immediately let you fit twice as many concurrent users on the +same hardware. + +`QuantoQuantizedCache` in HF transformers does 2× at small quality +cost. TurboQuant does a bit better (published +[arXiv:2406.17005](https://arxiv.org/abs/2406.17005)). KIVI pushes to +4× with 2-bit per-value ([arXiv:2402.02750](https://arxiv.org/abs/2402.02750)) +but the |Δppl| grows. + +`kakeyalattice` lands between them: **2.4-2.8× CR at under 1% perplexity +loss across four open-source model families**, measured on real vLLM +with real FlashAttention on NVIDIA H200. Drop-in +`transformers.DynamicCache` subclass. + +Let's ship it. + +## Install + +```bash +pip install kakeyalattice +``` + +## The 10-line integration + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from kakeyalattice.hf import KakeyaLatticeCache + +tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B") +model = AutoModelForCausalLM.from_pretrained( +    "Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16 +).cuda() + +cache = KakeyaLatticeCache( +    variant="e8", q_range=38,   # balanced default: ~2.3x CR, <1% |Δppl| +    num_hidden_layers=model.config.num_hidden_layers, +    head_dim=model.config.head_dim, +    device="cuda", +) + +inputs = tok("Explain quantisation in one paragraph:", return_tensors="pt").to("cuda") +out = model.generate( +    **inputs, +    max_new_tokens=256, +    past_key_values=cache,   # <-- that's the whole integration +    use_cache=True, +) +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +The `past_key_values=cache` argument is where the `transformers` +library replaces its default `DynamicCache` with our subclass. From +that point on, every K and V the model writes to the cache is +transparently rotated, scaled, and lattice-quantized; every read +is decoded with one matmul + one unscale. + +## Three operating points + +`q_range` tunes the aggressiveness of the lattice snap. Higher Q = +more lattice codepoints per dimension = higher quality, less +compression. Lower Q = fewer codepoints = more compression, more +error. + +| config | q_range | bits/vec @ head_dim=128 | typical \|Δppl\| on Qwen3 | use when | +|:-------------------|--------:|------------------------:|-------------------------:|:---------| +| aggressive | 10 | 640 (−69 %) | 1.5–2.5% | memory is the hard constraint | +| **balanced** | **38** | **880 (−57 %)** | **0.5–1.0%** | **default — production serving** | +| near-lossless | 152 | 1920 (−6 %) | <0.1% | quality-sensitive, last-resort deployments | + +D4 variant (`variant="d4"`) works for head_dim divisible by 4 only +(e.g.
Qwen2-0.5B's head_dim=64) and gives roughly half the +compression-gain of E8 at the same |Δppl|. + +## What you get on each model + +Real numbers from real vLLM prefill on NVIDIA H200, WikiText-103, +n=8 passages × 64 eval positions per passage = 512 target positions +per channel. Raw JSON under +[`reports/v1_4_release/kv_128k_isoppl_n8/`][raw] in the GitHub repo. + +Iso-PPL compression ratio at ≤1% perplexity loss: + +| model | CR | +|:------|---:| +| Qwen3-4B | **2.40×** | +| GLM-4-9B-Chat | **1.73×** | +| Gemma-4-E4B | **3.04×** | +| DeepSeek-R1-Distill-Qwen-1.5B | **2.29×** | + +At ≤2%: 2.77× / 2.44× / 3.04× / 2.43×. + +## Streaming-safe by construction + +Unlike calibration-based quantizers (KIVI, SmoothQuant), KakeyaLattice +is **stateless per-vector**. The codec does not look across tokens, +does not collect statistics, does not need a warm-up pass. The first +token you decode is compressed identically to the millionth. This +means: + +- Works with streamed generation (`model.generate(..., streamer=...)`) +  out of the box. +- No calibration script to run before deployment. +- No surprising quality drift between different batch sizes or +  different input distributions. + +On NVIDIA H200 the codec adds **~0.25 ms per decode step** — under +2% of a typical 15-30 ms bf16 decode step at batch size 1. You won't +see it on a wall-clock profile unless you're specifically hunting it. + +## Operational checklist + +Before you deploy: + +- [ ] `head_dim` of your model is a power of 2 and divisible by 8 +      (E8) or 4 (D4). Check `model.config.head_dim` — almost all +      modern LLMs pass. +- [ ] You are on `transformers >= 4.51`. Qwen3 support landed there. +- [ ] You have `torch >= 2.1` (a CPU-only build is fine for +      development). +- [ ] You measured `|Δppl|` on your own eval set at `q_range=38` +      before shipping. Our numbers are on WikiText-103; your domain +      may differ by ±0.5%. A minimal measurement sketch appears at +      the end of this post. +- [ ] You have a rollback plan (`past_key_values=DynamicCache()` is +      a one-line revert). + +## When not to ship KakeyaLattice + +Be honest with yourself: + +- **Short-context serving (≤4k).** KV cache is small at short +  contexts; compression overhead is not worth it. +- **Real-time voice / sub-second latency budgets.** Codec overhead +  is small but non-zero; measure it on your stack. +- **Regulatory review.** A new library, even MIT-licensed and +  open-source, is a procurement hurdle. If HQQ + `DynamicCache` +  already meets your quality target, don't add code. +- **Model with `head_dim ∉ {64, 128, 256}`.** A handful of older +  models (some early Llama variants, some research MoEs) have +  `head_dim=96` or `head_dim=176`, which rules out the E8 path. +  You can still use KakeyaLattice with D4, but the numerical +  advantage is smaller. + +## Try it without installing + +The [HF Space][space] runs Qwen3-0.6B live, side-by-side with bf16 +baseline at all three operating points. Click "Run comparison" and +you'll see four generated paragraphs at increasing compression ratios +— text quality degrades smoothly from essentially-identical (Q=152) +to slightly-different (Q=38) to noticeably-different-but-coherent +(Q=10).
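## Measuring |Δppl| on your own eval set

The checklist item above deserves a script. A minimal sketch, reusing `model`, `tok`, and `KakeyaLatticeCache` from the integration snippet; `eval_text` is a long string of your own domain text (this scores token-by-token, so keep it to a few thousand tokens):

```python
import math
import torch
from transformers import DynamicCache

@torch.no_grad()
def perplexity(model, tok, text, cache_factory):
    # Teacher-forced, one token at a time, so the compressed cache is
    # exercised exactly as it would be during decode.
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    past, nll, n = cache_factory(), 0.0, 0
    for i in range(ids.shape[1] - 1):
        out = model(ids[:, i : i + 1], past_key_values=past, use_cache=True)
        past = out.past_key_values
        nll -= out.logits[0, -1].log_softmax(-1)[ids[0, i + 1]].item()
        n += 1
    return math.exp(nll / n)

base = perplexity(model, tok, eval_text, DynamicCache)
comp = perplexity(model, tok, eval_text, lambda: KakeyaLatticeCache(
    variant="e8", q_range=38,
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.head_dim, device="cuda"))
print(f"ppl {base:.3f} -> {comp:.3f} ({100 * (comp / base - 1):+.2f}%)")
```

If the delta on your domain is worse than the WikiText-103 numbers, step `q_range` up toward 152 before abandoning ship.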
+ +## Links + cite + +- Live demo (no install): +- GitHub (MIT-licensed): +- PyPI: +- Paper draft: [`reports/paper/kakeyalattice.pdf`][paper] +  (arXiv submission pending) +- FAQ with KIVI / HQQ / Quanto / SmoothQuant comparison: +  [`docs/faq.md`][faq] +- Cite: GitHub's sidebar "Cite this repository" widget, sourced from +  [`CITATION.cff`][cite] + +The companion theory post is [here][post1] — read it if you want to +understand *why* the rotation-plus-lattice trick works. If you just +want to ship faster inference, you're already done. + +[post1]: https://dev.to//e8-lattice-kv-cache-compression-from-first-principles- +[space]: https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress +[paper]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/reports/paper/kakeyalattice.pdf +[faq]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/docs/faq.md +[cite]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/CITATION.cff +[raw]: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/v1_4_release/kv_128k_isoppl_n8 diff --git a/docs/announce/discovery_runbook.md b/docs/announce/discovery_runbook.md new file mode 100644 index 0000000..a0b6373 --- /dev/null +++ b/docs/announce/discovery_runbook.md @@ -0,0 +1,247 @@ +# Discovery runbook — making KakeyaLattice findable + +Single runbook for getting KakeyaLattice cited by search engines, +AI answer engines (ChatGPT / Perplexity / Claude), and the Python / +LLM-inference developer community. Each section is self-contained: +owner, inputs, steps, done-when. + +Reference reading that motivates this runbook: the NexusQuant launch +strategy (three DEV.to posts + arXiv + HF Space + vLLM discussion ++ Papers with Code + GitHub topics) currently dominates the "E8 KV +compression" / "lattice KV quantisation" query space on Google, +Perplexity, and ChatGPT with search. We are following the same +template, hitting the same set of surfaces, but with our own positioning. + +## Priority and timing + +Run in this order. Each step unlocks retrieval signal for the next. + +| # | step | owner | difficulty | search-engine payoff | +|---|:-----|:------|:-----------|:---------------------| +| 1 | GitHub topics | repo admin | trivial | immediate — GitHub topic pages are indexed within hours | +| 2 | arXiv submission | paper author (Allen Li) | 1–2 steps | highest single-action payoff — an arXiv ID is the strongest authority anchor in ML | +| 3 | vLLM integration issue | repo author + vLLM community | discussion-thread | high — vLLM issues are crawled aggressively by AI engines | +| 4 | HF Space back-link + Qwen3/Llama-3 model-card link-backs | repo author | trivial | medium — HF pages are high-authority but many are already indexed | +| 5 | Papers with Code entry | paper author | form-fill | medium — PwC is the canonical "benchmarks by method" index | +| 6 | DEV.to posts × 2 | repo author | writing | high — DEV.to posts rank unusually well on specialist search queries | + +Tasks 1, 4, 6 are low-effort and should land first. Tasks 2, 3, 5 +are higher-friction but have the best long-term SEO / GEO return. + +## 1 — GitHub topics + +**Owner**: whoever has write access to the repo on +`github.com/FluffyAIcode/LLM-KV--Cache-compress`.
+ +**Topics to set** (8 terms, derived from actual query patterns +observed in the NexusQuant launch and from our own +`docs/faq.md` question set): + +``` +kv-cache +kv-cache-compression +quantization +vllm +lattice-quantization +llm-inference +long-context +e8-lattice +``` + +Additional topics to consider adding after the first eight land (not +required for initial indexing, but they widen the retrieval surface): + +``` +d4-lattice +transformers +huggingface +deepseek-v4 +qwen3 +flashattention +pytorch +arxiv +``` + +### One-click script (requires `gh` CLI with a repo-write token) + +The cloud-agent `gh` CLI is read-only. Run this locally as the repo +admin: + +```bash +gh repo edit FluffyAIcode/LLM-KV--Cache-compress \ +  --add-topic kv-cache \ +  --add-topic kv-cache-compression \ +  --add-topic quantization \ +  --add-topic vllm \ +  --add-topic lattice-quantization \ +  --add-topic llm-inference \ +  --add-topic long-context \ +  --add-topic e8-lattice +``` + +### UI alternative + +1. Open . +2. Click the ⚙ icon next to "About" in the repo sidebar. +3. In the "Topics" field, paste the eight topics above, one at a time +   (GitHub autocompletes existing topics, which is what we want — hitting +   an existing high-traffic topic adds us to that topic's discovery page). +4. Save. + +**Done when**: the repo's About sidebar shows all eight topics AND +`https://github.com/topics/e8-lattice` lists this repo in its results +(may take 1-6 hours for GitHub to index). + +## 2 — arXiv submission + +**Owner**: Allen Li (paper author per `reports/paper/kakeyalattice.tex`). + +**Inputs**: `reports/paper/kakeyalattice.tex`, +`reports/paper/kakeyalattice.pdf`, the figures referenced in the `.tex` +(already committed in the paper directory). + +### Submission bundle + +Everything arXiv needs is already in `reports/paper/`. See +[`docs/announce/arxiv/SUBMISSION.md`](arxiv/SUBMISSION.md) for the +checklist (metadata, categories, abstract, comments field, license). + +### Target categories + +- Primary: **`cs.LG`** (Machine Learning) +- Cross-list: **`cs.CL`** (Computation and Language) +- Optional cross-list: **`cs.IT`** (Information Theory) — the +  nested-lattice quantisation framing sits naturally in `cs.IT` +  and this cross-list significantly widens the retrieval surface +  for "lattice quantization" searchers. + +### After the arXiv ID mints + +Open a one-commit PR titled `arxiv: replace 'DOI — pending' badges +with the minted arXiv ID` that: + +- Replaces the `DOI — pending` badge in `README.md` with +  `[![arXiv](https://img.shields.io/badge/arXiv--b31b1b.svg)](https://arxiv.org/abs/)`. +- Adds the arXiv URL to `CITATION.cff` as an `identifiers` entry. +- Adds the arXiv URL to the `ACKNOWLEDGMENTS.md` infrastructure +  section. +- Updates `reports/paper/README.md` to point at the public arXiv page. + +A follow-up agent session can run this PR once you paste the arXiv ID +into a new message. + +**Done when**: the paper is listed at `https://arxiv.org/abs/` +AND the README badge is live. + +## 3 — vLLM integration issue / discussion + +**Owner**: whoever files issues under the FluffyAIcode identity. + +**Target repo**: `vllm-project/vllm`. + +**Pre-written body + title + labels**: see +[`docs/announce/vllm_integration_issue.md`](vllm_integration_issue.md). + +The NexusQuant analogue is vLLM issue #16047 (filed 2025-02, ongoing +discussion). We aim for the same class of reception: a maintainer +engages in the thread, we end up with an "official" integration path +even if the implementation is gated on a follow-up PR.
+ +**Done when**: issue is filed with the labels `new-feature-proposal` +and `kv-cache`, and has at least one maintainer acknowledgement. + +## 4 — HF back-links + +**Owner**: repo author (has HF token). + +### HF Space README (already live) + +Confirmed on 2026-04-25: the Space +`huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress` has a +"Links" section that points back to the GitHub repo, the PyPI +package, and the paper directory. + +If you want to tighten this further, see +[`docs/announce/hf_space_backlinks.md`](hf_space_backlinks.md) for +two safe incremental edits (add arXiv badge once minted; pin the +Space via Collections). + +### Model-card link-backs + +Opening PRs on the Qwen3 / Llama-3 model cards to add +KakeyaLattice to a "Related projects" section is a legitimate +discovery tactic but the success rate depends on the model author's +review latency. See +[`docs/announce/model_card_backlinks.md`](model_card_backlinks.md) +for per-model PR drafts. + +**Done when**: at least two high-traffic HF model cards carry a +KakeyaLattice "Related projects" entry. + +## 5 — Papers with Code + +**Owner**: paper author. + +**Target**: create entries under + (the task page +already exists from KIVI / TurboQuant / H2O entries). We add two +benchmark rows: + +1. `KakeyaLattice (D4)` — iso-PPL @ 128 k, WikiText-103, n=8 +2. `KakeyaLattice (E8)` — iso-PPL @ 128 k, WikiText-103, n=8 + +**Pre-filled submission**: see +[`docs/announce/papers_with_code/SUBMISSION.md`](papers_with_code/SUBMISSION.md). + +**Done when**: the KakeyaLattice entry is live at +`https://paperswithcode.com/paper/kakeyalattice` AND the method is +listed under the KV-cache-compression task leaderboard with our +four-model numbers. + +## 6 — DEV.to posts ×2 + +**Owner**: repo author (drafts written by the cloud agent; you +copy-paste + publish under your DEV.to identity). + +Two posts, deliberately different in tone and target query: + +- [`docs/announce/dev_to/post_1_theory.md`](dev_to/post_1_theory.md) + — "E8-lattice KV cache compression, from first principles" + (~1200 words). Targets searchers looking for *why* E8 beats scalar + KV quantisation. Ranks on queries like "nested lattice vs scalar + quantisation", "E8 lattice KV compression", "Hadamard rotation + for LLM activations". + +- [`docs/announce/dev_to/post_2_practice.md`](dev_to/post_2_practice.md) + — "Qwen3 KV cache in 10 lines of Python" (~1000 words). Targets + searchers looking for *how to use*. Ranks on queries like + "transformers DynamicCache compression", "compress Qwen3 KV cache", + "KakeyaLattice tutorial". + +### DEV.to front-matter + +Both posts include the DEV.to-specific front-matter block (title, +published, tags, cover_image, canonical_url). The `canonical_url` +field is **important**: it points to the GitHub blog path so +DEV.to's SEO juice credits our repo as the source-of-truth, not +DEV.to itself. + +**Done when**: both posts live at dev.to//, +both show ≥3 tags, and both have the canonical_url back to the +repo's blog directory. + +## Tracking table + +After each step lands, update this table. PRs on top of this file +are welcome. + +| step | done? | date | notes | +|:-----|:------|:-----|:------| +| 1. GitHub topics | ☐ | | | +| 2. arXiv submission | ☐ | | | +| 3. vLLM issue | ☐ | | | +| 4. HF Space back-links | ☑ | 2026-04-25 | SPACE_README.md has Links section pointing back; comparison paragraph in place | +| 4. Model-card back-links | ☐ | | | +| 5. Papers with Code | ☐ | | | +| 6a. DEV.to post 1 (theory) | ☐ | | | +| 6b. 
DEV.to post 2 (practice)| ☐ | | | diff --git a/docs/announce/hf_space_backlinks.md b/docs/announce/hf_space_backlinks.md new file mode 100644 index 0000000..3359f64 --- /dev/null +++ b/docs/announce/hf_space_backlinks.md @@ -0,0 +1,165 @@ +# HF Space + model-card back-links + +HF pages are high-authority. Two kinds of back-links matter: + +1. **Outbound** from the KakeyaLattice HF Space to the GitHub repo, + the PyPI page, and (once minted) the arXiv abstract. +2. **Inbound** from high-traffic model cards (Qwen3 family, Llama-3 + family, DeepSeek-R1-Distill, GLM-4, Gemma-4) that mention the + KakeyaLattice Space under a "Related projects" section. + +## 1 — Space outbound links + +### Status + +The Space `huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress` +already carries outbound links via its `README.md` (sourced from +`demos/hf_llama_kakeyalattice/SPACE_README.md` in this repo): + +- GitHub: +- PyPI: +- Paper directory: `reports/paper/` (inside the GitHub repo) +- Stage 0.75 DSv4 findings: `reports/v1_5_release/dsv4_stage075/FINDINGS.md` + (inside the GitHub repo) + +### Suggested tightening (to be applied once arXiv ID mints) + +Replace the "Paper: `reports/paper/`" line with: + +```markdown +- Paper: [arXiv:](https://arxiv.org/abs/) · [PDF in repo](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/reports/paper/kakeyalattice.pdf) +``` + +This adds the arXiv URL as a second authority anchor; Perplexity / +ChatGPT both follow arXiv URLs and boost content that includes one. + +### Suggested tightening (now, independent of arXiv) + +Add an arXiv-style badge block at the top of the Space README so a +human visitor sees the stack of attestations immediately: + +```markdown +[![PyPI](https://img.shields.io/pypi/v/kakeyalattice.svg)](https://pypi.org/project/kakeyalattice/) +[![GitHub](https://img.shields.io/badge/GitHub-source-181717?logo=github)](https://github.com/FluffyAIcode/LLM-KV--Cache-compress) +[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/LICENSE) +``` + +This is a 3-line edit to `demos/hf_llama_kakeyalattice/SPACE_README.md` +plus a push to the Space via `huggingface_hub.HfApi.create_commit(...)`. +Can be done in the same session as the next Space push (e.g. when the +arXiv ID lands). + +### Collection pin + +Create an HF Collection titled **"KV-cache compression"** (user-scope) +containing: + +1. The Space `FluffyAIcode/LLM-KA-Cache-Compress`. +2. Any future paper page on HF (once the arXiv ID is registered via + HF's [Papers](https://huggingface.co/papers) system). +3. External papers we compare against (TurboQuant's HF page if it + exists; KIVI's; etc.) for topical clustering. + +Collections are a moderate GEO signal because they show up on each +member's sidebar; a KV-cache-compression Collection that names +adjacent methods strengthens our authority graph position. + +## 2 — Model-card inbound link-backs + +### Rationale + +A PR on a popular model card's `README.md` that adds a line like + +> **Related projects**: [KakeyaLattice](...) — drop-in KV-cache +> compression for this model, 2.4×–2.8× CR at <1 % ppl loss. + +…is a high-leverage move **when it lands**. Model cards are among +the highest-authority pages on Hugging Face and are indexed +aggressively by AI answer engines. The success rate depends entirely +on the model author's review latency. 
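Both kinds of edit in this file can be scripted with `huggingface_hub`. A minimal sketch, assuming a write-scoped token in `HF_TOKEN`; paths and commit messages are illustrative:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment

# Section 1: push the tightened Space README directly.
api.upload_file(
    path_or_fileobj="demos/hf_llama_kakeyalattice/SPACE_README.md",
    path_in_repo="README.md",
    repo_id="FluffyAIcode/LLM-KA-Cache-Compress",
    repo_type="space",
    commit_message="docs: add PyPI / GitHub / License badges",
)

# Section 2: file a model-card PR instead of pushing. create_pr=True
# lands the edit in the model repo's Community tab for review.
api.upload_file(
    path_or_fileobj="qwen3_4b_card_edited.md",  # locally edited card
    path_in_repo="README.md",
    repo_id="Qwen/Qwen3-4B",
    repo_type="model",
    create_pr=True,
    commit_message="docs: add KakeyaLattice to Related projects",
)
```

The candidate cards for the second call are listed below.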
+ +### Candidates (ordered by expected ROI) + +| model card | reviewer | expected decision time | notes | +|:-----------|:---------|:-----------------------|:------| +| [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | Alibaba / Qwen team | medium (they review community PRs) | natural fit — the Space uses Qwen3-0.6B as the default | +| [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | Alibaba / Qwen team | medium | our strongest benchmark number (2.77× @ 2 % \|Δppl\|) uses this | +| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | Meta | low (gated community edits) | try but do not expect acceptance | +| [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | DeepSeek team | medium | we have a full benchmark row | +| [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) | Zhipu team | medium | +37.8 % compression advantage over TurboQuant @ 2 % \|Δppl\| — our biggest per-model win | +| [google/gemma-4-e4b](https://huggingface.co/google/gemma-4-e4b) | Google team | low (gated community edits) | both codecs saturate at 3.04× so we look good but not differentiated | + +### Suggested PR body (per model card) + +Adapt per model card. This template is for Qwen3-4B; rewrite the +compression-ratio sentence to cite the actual measured number per +model. + +--- + +**Title**: `docs: add KakeyaLattice to Related projects` + +**Body**: + +```` +Adding a single line to the "Related projects" section linking to +KakeyaLattice, a drop-in `transformers.DynamicCache` subclass that +compresses the KV cache of Qwen3-4B **2.40× at ≤ 1 % perplexity +loss** and **2.77× at ≤ 2 %** (real vLLM prefill + real FlashAttention +bf16 on NVIDIA H200, WikiText-103 n=8 × 64 evaluation positions per +passage = 512 positions per channel; raw JSON at +https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/v1_4_release/kv_128k_isoppl_n8). + +Usage is three lines: + +```python +from kakeyalattice.hf import KakeyaLatticeCache +cache = KakeyaLatticeCache( +    variant="e8", q_range=38, +    num_hidden_layers=model.config.num_hidden_layers, +    head_dim=model.config.head_dim, +) +out = model.generate(**inputs, past_key_values=cache, use_cache=True) +``` + +- Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress +- PyPI: https://pypi.org/project/kakeyalattice/ +- Live demo: https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress +- Citation: https://github.com/FluffyAIcode/LLM-KV--Cache-compress/blob/main/CITATION.cff + +The KakeyaLattice compare table in the README cites Qwen3-4B +alongside Qwen3-0.6B, GLM-4-9B-Chat, Gemma-4-E4B, and +DeepSeek-R1-Distill-Qwen-1.5B. No change to this model card's +numeric claims or recommended usage — this is a pointer-only edit. +```` + +**Diff to propose** (adapt path if the model card uses different +section names): + +```diff ++ ## Related projects ++ ++ - [KakeyaLattice](https://github.com/FluffyAIcode/LLM-KV--Cache-compress) ++   — drop-in `transformers.DynamicCache` subclass, 2.4×–2.8× KV cache ++   compression at under 1 % perplexity loss on this model. +``` + +If the card already has a "Related projects" or "Community extensions" +section, insert a single bullet there instead. + +### How to file + +Each HF repo supports **"Community" tab → "New discussion"** for a +soft approach before a PR, and **"Community" tab → "Pull request"** +for the PR itself.
For model authors who have never interacted +publicly, a discussion first is polite; for authors who merge +community PRs regularly (Alibaba-Qwen, HF's own models), a PR is +faster. + +## Done when + +- Space README carries PyPI + GitHub + License badges at the top. +- At least two of the six model cards above carry a "Related projects" + entry linking back to this repo. +- The HF Collection "KV-cache compression" is live and pinned to the + Space. diff --git a/docs/announce/papers_with_code/SUBMISSION.md b/docs/announce/papers_with_code/SUBMISSION.md new file mode 100644 index 0000000..a655e73 --- /dev/null +++ b/docs/announce/papers_with_code/SUBMISSION.md @@ -0,0 +1,117 @@ +# Papers with Code submission + +Papers with Code (PwC) is the canonical "benchmarks-by-method" index +for ML and gets crawled daily by Google Scholar, Semantic Scholar, +Connected Papers, and AI answer engines. Entries under + are the +first result on queries like "KV cache compression benchmark" on all +four of those retrievers. + +## Prerequisites + +- arXiv ID minted (see [`../arxiv/SUBMISSION.md`](../arxiv/SUBMISSION.md)). + **PwC requires an arXiv link** for paper submissions. The benchmark + submission below is possible without arXiv but works better with it. +- GitHub repo publicly visible (it is). +- PyPI package published (it is, v1.5.0). + +## What to submit + +Two entries, filed separately: + +1. **Paper submission** — creates the canonical PwC page for + KakeyaLattice. +2. **Benchmark submission** — four benchmark rows on the + KV-cache-compression task, one per model we measure. + +## 1 — Paper submission + +Form: + +### Metadata to paste + +- **Title**: `KakeyaLattice: Nested-Lattice KV-Cache Compression for Large Language Models` +- **Abstract**: paste from the arXiv abstract. +- **arXiv URL**: `https://arxiv.org/abs/` (fill in once minted). +- **PDF URL**: `https://arxiv.org/pdf/.pdf` (auto-populated from + arXiv URL in most cases). +- **Tasks**: add `KV Cache Compression`, `Language Modelling`, + `Quantization`. +- **Methods**: add `Nested-Lattice Quantization`, `Sylvester-Hadamard + Rotation`, `E8 Lattice`, `D4 Lattice`. PwC will show a "new method" + prompt — accept and fill in the short method description below. +- **Source code**: + `https://github.com/FluffyAIcode/LLM-KV--Cache-compress` +- **Framework**: `PyTorch`, `Hugging Face transformers`, `vLLM`. + +### Method description (to paste into the "New method" dialog) + +``` +KakeyaLattice is a nested-lattice quantiser for the KV cache of +transformer language models. Each K or V vector is rotated by a +Sylvester-Hadamard matrix H/sqrt(D), scaled adaptively by its L2 +norm, and snapped to the closest point of a nested D4 (dim 4) or E8 +(dim 8) lattice using Conway-Sloane closest-point decoders. The +rotation gaussianises the heavy-tailed, non-isotropic KV activations +real LLMs produce; the lattice snap then exploits the densest known +sphere packings in dimensions 4 and 8 to beat any per-channel scalar +quantiser at the same bit budget. The codec is stateless per-vector, +so it supports streaming / online decode without calibration or +warm-up. +``` + +## 2 — Benchmark submissions + +Four rows under +. One row per +model we measure, each citing the same arXiv paper. + +### Row template + +The PwC benchmark form asks for: dataset, metric, value, extra +info, model name, link to paper. Fill in per the table below; the +"extra info" slot is where we disclose the quality target and CI +protocol. 
+ +| model | dataset | metric | value | extra info | +|:-------------------------------|:--------------|:------------------------------|:--------|:-------------------------------------------------| +| Qwen3-4B | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **2.77×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 | +| GLM-4-9B-Chat | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **2.44×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 | +| Gemma-4-E4B | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **3.04×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 (tied with TurboQuant at saturation) | +| DeepSeek-R1-Distill-Qwen-1.5B | WikiText-103 | KV compression ratio @ ≤2% \|Δppl\| | **2.43×** | 128k ctx, n=8 passages × 64 eval pos, H200, vLLM bf16 | + +Numbers taken directly from +`reports/v1_4_release/kv_128k_isoppl_n8/V14_VS_TQ_ISOPPL_REPORT.md`. +Reproducible via `benchmarks/extract_iso_ppl_table.py` — the PR body +can link to the reproducer so PwC reviewers can check. + +### DeepSeek-V4-Flash (separate task entry) + +Also file a row under + (or a new custom +task if V4-Flash is not already listed): + +| model | dataset | metric | value | extra info | +|:-------------------|:----------------|:------------------------------------|:-----------|:---------------------------------------------------| +| DeepSeek-V4-Flash | WikiText-style | KV bit reduction vs FP8 @ matched quality | **−22.0 %** | n=8 H200, 3/43 SWA + 20/43 c4a + 20/43 c128a layers, layer-weighted rel-MSE 0.959 ± 0.024 (95 % CI) vs hardware FP8 per-64-block | + +Cites `reports/v1_5_release/dsv4_stage075/FINDINGS_N8.md`. + +## After submission + +1. The PwC entry lives at `https://paperswithcode.com/paper/kakeyalattice`. +2. Add the PwC URL to `README.md` as an additional badge: + ```markdown + [![PapersWithCode](https://img.shields.io/badge/Papers%20with%20Code-kakeyalattice-21caf5.svg)](https://paperswithcode.com/paper/kakeyalattice) + ``` +3. Add the PwC URL to `CITATION.cff` under `identifiers`. + +## Why PwC matters for GEO + +PwC ranks disproportionately well on **benchmark-comparison queries**, +which is what procurement-stage decision-makers actually search for. +A query like `"KV cache compression benchmark 2026"` returns the PwC +leaderboard first; having two rows there named KakeyaLattice puts us +in front of every reader of that page. The NexusQuant precedent +confirms this: their PwC page has been cited in three independent +papers since landing, entirely through organic discovery. diff --git a/docs/announce/vllm_integration_issue.md b/docs/announce/vllm_integration_issue.md new file mode 100644 index 0000000..42cc057 --- /dev/null +++ b/docs/announce/vllm_integration_issue.md @@ -0,0 +1,216 @@ +# vLLM integration discussion — pre-written issue body + +Paste the contents of this file into a **new GitHub Discussion or Issue** +at . The discussion / issue +format is deliberate — we do not yet have a PR to show, and vLLM +maintainers prefer a discussion when the proposal needs scoping before +code. + +## Where to file + +- **Primary choice**: `https://github.com/vllm-project/vllm/discussions` + under the category **"RFC"** (Request For Comment). +- **Fallback**: `https://github.com/vllm-project/vllm/issues/new/choose` + → "Feature request". Use this only if Discussions are disabled for + your account or the maintainers redirect you. 
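If you land on the issue fallback, the filing itself can be scripted (the Discussion route has no first-class `gh` command, so that one goes through the web UI). A sketch, assuming the markdown body below has been saved to `/tmp/kakeya_rfc.md`; labels usually require triage permissions, so ask for them in-thread:

```bash
gh issue create \
  --repo vllm-project/vllm \
  --title "[RFC] Third-party KV-cache quantiser plugin: KakeyaLattice (nested D4/E8 lattice, 2.4-2.8x CR at <1% |dppl|)" \
  --body-file /tmp/kakeya_rfc.md
```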
+ +## Suggested title + +``` +[RFC] Third-party KV-cache quantiser plugin: KakeyaLattice (nested D4/E8 lattice, 2.4–2.8× CR at <1% |Δppl|) +``` + +## Suggested labels + +Ask the maintainers to add (one of these at filing time is plenty — +they will add the rest): + +- `new-feature-proposal` +- `kv-cache` +- `quantization` +- `RFC` + +## Body (paste verbatim from here) + +```markdown +## Proposal + +Ship a third-party plugin path for a new KV-cache quantiser, +**KakeyaLattice**, a nested D4 / E8 lattice codec for transformer +KV activations that lands via vLLM's existing `general_plugins` entry +point. Code already exists at + and is on +PyPI as `kakeyalattice` (v1.5.0, MIT-licensed). + +I am filing this as an RFC rather than a PR because the code I have +works today as a capture / replace monkey-patch and is **not yet a +clean integration into vLLM's paged KV manager**. I want to align on +the integration path before writing that bridge, to avoid a duplicate +of the `QuantoQuantizedCache` / `HQQQuantizedCache` work in +`transformers` that landed piecemeal. + +## What KakeyaLattice does + +- **Input**: a `[seq, heads, head_dim]` K or V tensor from any + transformer attention layer. +- **Pipeline**: Sylvester–Hadamard rotate → per-vector adaptive L² + scale → nested D4 (dim-4) or E8 (dim-8) lattice closest-point + encode → store indices. +- **Decode**: one matmul + one unscale. +- **Operating points**: three canonical `q_range` settings + (10 aggressive, 38 balanced, 152 near-lossless). + +It is a **stateless per-vector function** — no calibration, no +warm-up, no cross-token state. Streaming / online decode is +supported by construction. + +## Why another KV quantiser + +vLLM already ships `--kv-cache-dtype fp8` and interoperates with +`transformers`'s `QuantoQuantizedCache` / `HQQQuantizedCache` +classes. What changes: + +At the **tight quality budget most production deployments tune for** +(≤ 1 % |Δppl|), KakeyaLattice compresses **9 %–38 % harder** than +TurboQuant (the strongest published per-channel scalar baseline) on +four open-source model families. Real vLLM prefill + real +FlashAttention bf16 forward on NVIDIA H200, WikiText-103, n=8 +passages × 64 eval positions per passage, 128 k context: + +| model | KakeyaLattice CR | TurboQuant CR | advantage | +|:-------------------------------|-----------------:|--------------:|----------:| +| Qwen3-4B | **2.40×** | 1.95× | +23.3 % | +| GLM-4-9B-Chat | **1.73×** | out of range | KL only | +| Gemma-4-E4B | **3.04×** | 3.04× | tied | +| DeepSeek-R1-Distill-Qwen-1.5B | **2.29×** | 2.09× | +9.2 % | + +At ≤ 2 % |Δppl| the advantage grows to +27, +38, tied, +3 % +respectively. Raw JSON + reproducer at +. + +The mechanism: real LLM KV activations are **heavy-tailed and +non-isotropic**. Per-channel scalar quantisers allocate bits for the +worst-case channel. A Sylvester–Hadamard rotation empirically +gaussianises the distribution (see the non-Gaussian audit in +), +after which D4/E8 lattice quantisation exploits the densest sphere +packings in those dimensions. + +## What already exists + +In the [kakeyalattice repo](https://github.com/FluffyAIcode/LLM-KV--Cache-compress): + +1. **`kakeyalattice.hf.KakeyaLatticeCache`** — a drop-in subclass + of `transformers.DynamicCache`. Works today with any + `model.generate(past_key_values=cache, ...)` call. The + [HF Space](https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress) + uses this on Qwen3-0.6B. +2. 
**`vllm_backend/kakeya_v1_4_snapshot/`** — a +   `vllm.general_plugins` entry point that monkey-patches the +   Attention path on Qwen2/3, Gemma4, GLM to capture +   post-QK/V-norm, pre-RoPE K and V and replace them with the +   roundtripped versions. **This is the "capture and replace" mode +   used to generate the 128 k iso-PPL tables above**. It works +   today on vLLM `0.19.2rc1.dev100` with `transformers` 5.5.2. + +The monkey-patch mode is **not** a real memory-saving integration — +it stores the reconstructed tensor in the model's KV dtype. That +is the gap I want to close with vLLM's help. + +## Integration path I'd like to RFC + +Three possible landing points, from lowest to highest invasiveness: + +### Path A — register as a `KVCacheQuantConfig` backend + +`vllm/config.py` has a `KVCacheQuantConfig` enum and a registry in +`vllm/kv_transfer/`. Add a `"lattice"` value that dispatches to a +`KakeyaLatticeKVManager` implementing the existing +`KVCacheManagerBase` protocol. The manager would: + +- On prefill / decode: encode K and V blocks via E8 closest-point, +  store lattice indices in the paged KV buffer instead of bf16 / +  fp8 values. +- On attention read: one matmul to decode (per 8-D block), then +  the existing FlashAttention path runs unchanged. + +Pros: uses vLLM's own page-allocator; no kernel changes. +Cons: decode overhead per attention read is ~0.25 ms on H200 today +(< 2 % of bf16 decode step at batch 1); on smaller GPUs this might +be worse. + +### Path B — fused decode in the attention kernel + +Write a Triton kernel that reads lattice indices + fused-unscales +during the QKᵀ step of FlashAttention. Faster but **invasive** and +requires maintaining a codec-aware variant of FlashAttention. + +### Path C — compressed cache in `nextn`-style hot tier only + +Store bf16 for the last ~1 k tokens (active decode window), encode +the rest via E8. Trades away a slice of the HBM win for zero +decode-path complexity. + +**My default proposal is Path A** because it matches what vLLM +already does for INT8 / FP8 and keeps the blast radius small. + +## What I'd like from the vLLM maintainers + +1. Agreement on **Path A** (or a pointer to Path B / C if you see a +   better fit). +2. Pointer to the **exact interface** you want a new KV-cache backend +   to implement. I read `vllm/worker/model_runner.py`, +   `vllm/kv_transfer/`, and `vllm/core/block/block_manager.py`, but +   the canonical integration point has moved around in the 0.6 → 1.0 +   transition. +3. A **`vllm-plugin-kakeyalattice`** naming convention, if you'd +   prefer the plugin live under a `vllm-project/*-plugin-*` naming +   scheme rather than in my own namespace. + +Happy to open the PR as soon as we've agreed on the path and interface. +Paper draft at + +(arXiv submission pending). + +## Compliance note + +All numbers above come from real vLLM prefill + real FlashAttention +bf16 forward on NVIDIA H200. No mocks, no fallbacks. The `reports/` +tree of the kakeyalattice repo carries a SHA-256 manifest so claims +are reproducible end-to-end from the committed JSON. +``` + +## How to follow up + +After filing: + +1. Post a one-line cross-reference from the kakeyalattice repo — +   either on the `AgentMemory/discovery-runbook-c478` PR (this PR) or +   as a new issue titled `discovery: vLLM RFC filed at vllm-project/vllm#`. + +2. When a vLLM maintainer engages, reply **within 4 hours** during +   Pacific business hours. This is the single highest-leverage +   engagement of the six discovery tasks; the thread's visibility +   drops sharply once the first reply goes stale.
+ +3. If the RFC gets closed without a maintainer reply within 7 days, +   file it as an issue (not a discussion) tagged `feature-request` + +   `kv-cache` and @-mention a maintainer who has touched +   `vllm/core/block/block_manager.py` in the last 90 days. A +   `git log` + email-on-commits query can find the right handle. + +## If vLLM wants numbers beyond what we have + +Two likely maintainer asks, and the short response to each: + +- **"Have you measured real HBM savings, not just rel-MSE?"** — No; +  that's exactly why Path A is the RFC. The reference impl round-trips +  K/V through the codec. Path A is the first integration where CR +  equals HBM ratio. + +- **"Have you benchmarked against KIVI on Qwen3?"** — Not yet; a direct +  iso-bit head-to-head vs KIVI is on the roadmap. Our current +  baseline is TurboQuant (the strongest published scalar KV quantiser +  at our bit budgets) and the Papers with Code submission at +   will carry the +  KIVI comparison once the arXiv ID is minted.
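## Appendix: the E8 snap, sketched

For maintainers who want to sanity-check the cost claims in Path A, a minimal sketch of the Conway-Sloane closest-point rule the codec builds on. The function names are ours, not the package's, and the real codec adds the Hadamard rotation, adaptive scaling, and index packing on top:

```python
import torch

def closest_point_dn(x: torch.Tensor) -> torch.Tensor:
    # Closest point in D_n = {z in Z^n : sum(z) even}: round every
    # coordinate; where the rounded sum is odd, re-round the coordinate
    # with the largest rounding error in the other direction.
    f = x.round()
    err = x - f
    odd = f.sum(-1).remainder(2) != 0
    worst = err.abs().argmax(-1, keepdim=True)
    step = torch.where(err.gather(-1, worst) >= 0, 1.0, -1.0)
    fixed = f.scatter_add(-1, worst, step)
    return torch.where(odd.unsqueeze(-1), fixed, f)

def closest_point_e8(x: torch.Tensor) -> torch.Tensor:
    # E8 is the union of D8 and D8 + 1/2: decode both cosets and keep
    # whichever candidate lies nearer to x.
    c0 = closest_point_dn(x)
    c1 = closest_point_dn(x - 0.5) + 0.5
    d0 = ((x - c0) ** 2).sum(-1, keepdim=True)
    d1 = ((x - c1) ** 2).sum(-1, keepdim=True)
    return torch.where(d0 <= d1, c0, c1)
```

A rotated `head_dim=128` vector quantises as 16 independent 8-D snaps: `closest_point_e8(y.view(-1, 8))`.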