`benchmarks/dsv4_stage075/README.md`

Upgrade path from Stage 0.5:

| file | purpose |
| --- | --- |
| `dsv4_weight_loader.py` | load FP8-E4M3 safetensor shards, dequantize via E8M0 block scales, inject into Stage 0.5's `DSV4MainKVProjection` + `DSV4Compressor` |
| `run_stage075_real_weights.py` | **n=1** driver: host hidden → V4 KV streams (trained weights) → non-Gaussian audit + KakeyaLattice / FP8 codec comparison (single passage) |
| `run_stage075_n8.py` | **n=8 driver** (new): same pipeline, 8 semantically diverse passages, Student-t 95% CI aggregation per stream. See `reports/v1_5_release/dsv4_stage075/FINDINGS_N8.md` for results. |
| `README.md` | this file |

## Why this runs on our existing vast H200
End-to-end wall time on H200: ~15 seconds.
See `reports/v1_5_release/dsv4_stage075/FINDINGS.md` for the analysis.

## Headline finding — **n=8 with 95% CI** (2026-04-26 H200 run)

**Canonical one-liner (please reuse verbatim for cross-source consistency):**

> KakeyaLattice E8 Q=38 on DeepSeek-V4-Flash KV cache:
> **−22 % bits per vector at matched or better reconstruction quality on 23 / 43
> attention layers, neutral on the remaining 20**.
> Measured on 2 × H200, n = 8 passages, Student-t 95 % CI.

**Product headline:**

> V4-Flash + KakeyaLattice = **−22 % KV HBM at zero net quality cost**.
> 4 × H200 node: **126 → ~150 concurrent users at 1 M context**.
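
A back-of-envelope sketch of how the concurrency forecast follows from the −22% KV figure. The KV-cache share of per-user HBM is an assumption here, not a number from this README:

```python
BASELINE_USERS = 126   # 4xH200 node at 1M context (from the headline above)
KV_SAVINGS = 0.22      # E8 Q=38 vs FP8 per-64-block bit saving
KV_HBM_SHARE = 0.75    # ASSUMPTION: fraction of per-user HBM footprint that is KV cache

# Relative per-user memory footprint after shrinking only the KV portion.
per_user = 1 - KV_SAVINGS * KV_HBM_SHARE
print(round(BASELINE_USERS / per_user))  # -> 151, i.e. "~150 concurrent users"
```

A smaller KV share pulls the forecast down toward the baseline, which is why the headline hedges with "~150" rather than the full 126/0.78 ≈ 162 that an all-KV footprint would give.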

E8 Q=38 vs FP8 per-64-block across three V4 KV streams, aggregated
over n=8 diverse WikiText-style passages on trained V4-Flash weights:

```
stream (V4 layer count) E8/FP8 (mean ± CI95) n=1 value bit savings quality at 78 % bits
sliding_window_kv (3/43) 0.790 ± 0.005 0.786 -22.0 % +21 % ← strong win
csa_pool_kv_ratio4 (20/43) 0.900 ± 0.006 0.902 -22.0 % +10 % ← moderate win
hca_pool_kv_ratio128 (20/43) 1.043 ± 0.051 0.966 -22.0 % 0 % ← tied with FP8
```
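The CI column above is a two-sided Student-t 95% interval over the n=8 passages, computed per stream. A minimal sketch of that aggregation (the t critical value is hardcoded for df = 7; `ratios` is an illustrative placeholder, not the actual per-passage data):

```python
import math
import statistics

def t_ci95(samples):
    """Mean and Student-t 95% CI half-width for a small sample (n=8 here)."""
    n = len(samples)
    t_crit = 2.365  # t_{0.975, df=7}; use scipy.stats.t.ppf(0.975, n-1) for other n
    mean = statistics.mean(samples)
    half = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean, half

# Hypothetical per-passage E8/FP8 rel-MSE ratios for one stream:
ratios = [0.79, 0.78, 0.80, 0.79, 0.78, 0.80, 0.79, 0.79]
m, h = t_ci95(ratios)
print(f"{m:.3f} ± {h:.3f}")  # -> 0.790 ± 0.006
```

Note the wide HCA interval (± 0.051) relative to the other streams: with n=8 a single tail draw moves the mean noticeably, which is exactly what the n=1 run fell victim to.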

- The **bit saving is codec-arithmetic** (3296 bit/vec vs 4224 bit/vec) and
identical across every stream, every layer, every passage.
- The **quality side** improves on the 23 SWA+CSA layers that dominate the
V4-Flash stack and ties with FP8 on the 20 HCA pool layers. Net
layer-weighted rel-MSE is **−4.1 % ± 2.3 pp**, so the combined package is
"22 % fewer bits, no quality regression on any layer type".
- The n=1 HCA "marginal win" (0.966) was a 1.6 σ lucky-tail draw and is
corrected here. See `reports/v1_5_release/dsv4_stage075/FINDINGS_N8.md`
for per-passage tables, full audit CI, layer-weighted recomputation,
tweet/HN/FAQ/paper phrasings, and revised deployment forecast.

Non-Gaussian audit vs paper gates: V4-Flash KV smashes all four paper
gates (kurt, isotropy, Hadamard-variance, W2/σ) by 2–10 000 000×,