7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes since v0.6. Format loosely follows [Keep a Changelog](https:

## Unreleased

- **Gemma-4 FP8 prefill carve-out removed** — the -5..-19% prefill gap vs
FP16 measured on Gemma-4 on 2026-05-09 has substantially closed with
intervening prefill work (PRs #177, #181). Re-measured 2026-05-15
on Q4_K_M: pp128 +1.0%, pp512 -0.9%, pp833 -4.2%, pp2048 **+7.3%** —
neutral overall, with a long-context advantage. FP8 also halves the
activation cache. Coherence is bit-exact on chat prompts. Closes the last
entry in the "Gemma-4 remaining carve-outs" roadmap section.
- **Gemma-4 NVFP4 decode cache for Q*_K source weights** — drops the
"per-layer head_dim not yet supported" carve-out at `engine.cpp:864-866`.
The per-tensor convert→quantize loop in `executor_pre_dequant.cu` handles
Expand Down
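
For readers skimming the changelog entry above, here is a minimal C++ sketch of the per-tensor convert→quantize pattern it refers to. Everything in it (`WeightTensor`, `dequant_qk_to_fp16`, `quant_fp16_to_nvfp4`, `build_nvfp4_decode_cache`) is an illustrative assumption, not the actual `executor_pre_dequant.cu` code; the only point it demonstrates is that a loop which converts one tensor at a time carries each tensor's own (N, K) shape, so mixed shapes across layers need no special casing.

```cpp
#include <cstdint>
#include <vector>

// One source weight tensor awaiting conversion; fields are illustrative.
struct WeightTensor {
    int64_t n;                 // rows (output features)
    int64_t k;                 // cols (input features)
    const void* qk_blocks;     // source Q*_K blocks
    void* nvfp4_out;           // destination NVFP4 decode cache
};

// Assumed helpers, stubbed here; names and signatures are made up for this sketch.
static void dequant_qk_to_fp16(const void* /*src*/, uint16_t* /*dst*/, int64_t /*n*/, int64_t /*k*/) {
    // The real pass would dequantize Q*_K blocks to FP16 on device.
}
static void quant_fp16_to_nvfp4(const uint16_t* /*src*/, void* /*dst*/, int64_t /*n*/, int64_t /*k*/) {
    // The real pass would quantize the FP16 scratch buffer to NVFP4.
}

// Per-tensor convert -> quantize: each iteration uses the tensor's own
// (n, k), so a model that mixes shapes across layers needs no special case.
void build_nvfp4_decode_cache(const std::vector<WeightTensor>& tensors,
                              uint16_t* fp16_scratch /* sized for the largest tensor */) {
    for (const WeightTensor& t : tensors) {
        dequant_qk_to_fp16(t.qk_blocks, fp16_scratch, t.n, t.k);
        quant_fp16_to_nvfp4(fp16_scratch, t.nvfp4_out, t.n, t.k);
    }
}
```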
12 changes: 5 additions & 7 deletions docs/roadmap.md
@@ -6,15 +6,13 @@ This is a single-author single-target experiment, so "roadmap" is more "current

## Known limitations

### Gemma-4 remaining carve-out (FP8 prefill)
### ~~Gemma-4 carve-outs~~ — all removed

Earlier Gemma-4 carve-outs removed:
- **FP8 KV cache** — PR #91 (2026-05-01). The "dual head_dim 256/512 needs per-layer-aware kernels" hypothesis was a red herring; the KV write/read kernels handle per-layer head_dim correctly via `Q.shape[3]` template dispatch. Real bugs were (a) FP8 calibration reading the workspace's allocated shape (`max_hd=512`) instead of the live shape (`hd=256` on SWA layers, junk in trailing 256 cols) and (b) warmup-derived absmax poisoning the high-water-mark scale on Gemma-4's `output_norm` outliers (max=588).
- **NVFP4 decode cache for Q*_K source** — 2026-05-15. The per-tensor convert→quantize loop in `executor_pre_dequant.cu` already handled mixed (N, K) shapes correctly; the disable was overly defensive. Removing it on Q4_K_M / UD-Q4_K_M: pp512 1713 → 2394 tok/s (**+40%**), tg256 176 → 197 tok/s (**+12%**).

One Gemma-4 carve-out remains active in `engine.cpp`:
All three Gemma-4 carve-outs are now gone:

- **FP8 prefill** (`config_.use_fp8_prefill = 0` for Gemma-4) — different code path from the KV cache. Documented as a *perf* issue (5-19% slower on prefill vs FP16), not a correctness issue; cuBLASLt FP8 algos for Gemma-4's per-layer head_dim shape (256/512 split) lose to FP16 cuBLAS at the standard tile sizes.
- **FP8 KV cache** — PR #91 (2026-05-01). The "dual head_dim 256/512 needs per-layer-aware kernels" hypothesis was a red herring; the KV write/read kernels handle per-layer head_dim correctly via `Q.shape[3]` template dispatch. Real bugs were (a) FP8 calibration reading the workspace's allocated shape (`max_hd=512`) instead of the live shape (`hd=256` on SWA layers, junk in trailing 256 cols) and (b) warmup-derived absmax poisoning the high-water-mark scale on Gemma-4's `output_norm` outliers (max=588).
- **NVFP4 decode cache for Q*_K source** — PR #186 (2026-05-15). The per-tensor convert→quantize loop in `executor_pre_dequant.cu` already handled mixed (N, K) shapes correctly; the disable was overly defensive. Removing it on Q4_K_M / UD-Q4_K_M: pp512 1713 → 2394 tok/s (**+40%**), tg256 176 → 197 tok/s (**+12%**).
- **FP8 prefill** — 2026-05-15. The 2026-05-09 -5..-19% slowdown was real at the time but has mostly been closed by intervening prefill work (PRs #177, #181). Re-measured on Q4_K_M: pp128 +1.0%, pp512 -0.9%, pp833 -4.2%, pp2048 **+7.3%** — neutral overall, with a long-context advantage. FP8 also halves the activation cache size. Users who want maximum prefill throughput at medium pp can opt out via `[attention] fp8_prefill = "never"`.

Default KV dtype is FP16; FP8 is opt-in via `--kv-fp8` (or `kv_cache.dtype = "fp8"` in `imp.conf`). Coherent on Qwen3 dense, Qwen3.5/3.6 GDN, Llama-3.2, and Gemma-4 (post PR #91).

Expand Down
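
As a reading aid for the two FP8 KV calibration bugs listed in the roadmap bullet above, a short C++ sketch under stated assumptions: `LayerKvView`, `live_absmax`, and `update_high_water_absmax` are invented names, and the real calibration code is not part of this diff. The sketch only illustrates the two fixes as described: absmax taken over the live head_dim rather than the allocated workspace width, and warmup passes excluded from the high-water-mark scale.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct LayerKvView {
    const float* data;   // calibration workspace, laid out rows x max_hd
    int64_t rows;        // tokens written so far
    int64_t max_hd;      // allocated head_dim (512 in the Gemma-4 workspace)
    int64_t live_hd;     // head_dim actually written by this layer (256 on SWA layers)
};

// Fix (a): scan only the live head_dim columns. Scanning the allocated
// max_hd width also reads the trailing columns, which hold junk on SWA
// layers where only 256 of the 512 columns were written.
float live_absmax(const LayerKvView& v) {
    float m = 0.0f;
    for (int64_t r = 0; r < v.rows; ++r)
        for (int64_t c = 0; c < v.live_hd; ++c)          // not v.max_hd
            m = std::max(m, std::fabs(v.data[r * v.max_hd + c]));
    return m;
}

// Fix (b): keep warmup passes out of the high-water-mark scale so an
// outlier-heavy warmup activation cannot pin the FP8 scale for the run.
void update_high_water_absmax(float& high_water, float observed, bool is_warmup) {
    if (is_warmup) return;
    high_water = std::max(high_water, observed);
}
```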
20 changes: 7 additions & 13 deletions src/runtime/engine.cpp
@@ -829,19 +829,13 @@ bool Engine::init(std::shared_ptr<Model> model, const EngineConfig& config) {
// CUDA graphs: enabled for Gemma-4 decode. The MoE decode fast path is fully
// device-side (dp4a GEMV, no D2H memcpy), so graph capture works.
// Only the MoE prefill path uses D2H sync, but prefill is never graph-captured.
if (config_.use_fp8_prefill) {
// FP8 prefill on Gemma-4 is correctness-OK (output matches FP16 path
// bit-for-bit on the smoke "capital of France is Paris" prompt) but
// ~5-19% slower on prefill (measured 2026-05-09 on Q4_K_M: pp=123 was
// 270 vs 334 tok/s, pp=833 was 1086 vs 1141 tok/s — both runs FP8 < FP16).
// Likely cause: cuBLASLt FP8 algos for Gemma-4's per-layer head_dim
// shape (256/512 split) lose to FP16 cuBLAS at our tile sizes. The
// earlier "per-layer head_dim not yet supported" comment was inherited
// from the FP8 KV story and inaccurate — the issue is perf, not
// correctness. Auto-disable for default-perf.
IMP_LOG_INFO("Gemma 4: disabling FP8 prefill (~5-19%% slower than FP16 on this arch)");
config_.use_fp8_prefill = 0;
}
// FP8 prefill carve-out removed 2026-05-15. The 2026-05-09 measurement
// showed -5..-19% prefill on Gemma-4 vs FP16; since then (PRs #177, #181)
// the gap has closed. Re-measured 2026-05-15 on Q4_K_M:
// pp128: +1.0% pp512: -0.9% pp833: -4.2% pp2048: +7.3%
// Net effect is neutral with a long-context advantage. FP8 also halves
// the activation cache, which helps VRAM at long context. Users wanting
// max prefill at medium pp can opt out via [attention] fp8_prefill = "never".
if (config_.use_nvfp4_decode) {
// Prequant SafeTensors NVFP4 weights are already in NVFP4 layout on
// disk. Phase 3a (Q*_K → NVFP4 conversion) and Phase 3b
Expand Down
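
The new comment block above points users at `[attention] fp8_prefill = "never"` as the opt-out. As a hedged sketch only: the enum, the `"auto"`/`"always"` values, and `resolve_use_fp8_prefill` below are assumptions for illustration, not the engine's actual config handling; the diff confirms only the `"never"` spelling.

```cpp
// Hypothetical sketch: how a three-way fp8_prefill setting could resolve
// to a use_fp8_prefill flag once no per-model auto-disable remains.
enum class Fp8PrefillMode { Auto, Always, Never };

bool resolve_use_fp8_prefill(Fp8PrefillMode mode, bool arch_has_fp8_prefill_path) {
    switch (mode) {
        case Fp8PrefillMode::Never:  return false;                       // explicit opt-out
        case Fp8PrefillMode::Always: return arch_has_fp8_prefill_path;   // forced on where possible
        case Fp8PrefillMode::Auto:   return arch_has_fp8_prefill_path;   // no model carve-outs remain
    }
    return false;  // unreachable
}
```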