diff --git a/CHANGELOG.md b/CHANGELOG.md
index c1dafa8..e62aa5d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes since v0.6. Format loosely follows [Keep a Changelog](https:
 ## Unreleased
 
+- **Gemma-4 FP8 prefill carve-out removed** — the -5..-19% prefill gap on
+  Gemma-4 vs FP16 measured 2026-05-09 has substantially closed with
+  intermediate prefill work (PRs #177, #181). Re-measured 2026-05-15 on
+  Q4_K_M: pp128 +1.0%, pp512 -0.9%, pp833 -4.2%, pp2048 **+7.3%** —
+  neutral with a long-context advantage. FP8 also halves the activation
+  cache. Coherence bit-exact on chat prompts. Closes the last entry in
+  the "Gemma-4 remaining carve-outs" roadmap section.
 - **Gemma-4 NVFP4 decode cache for Q*_K source weights** — drops the
   "per-layer head_dim not yet supported" carve-out at `engine.cpp:864-866`.
   The per-tensor convert→quantize loop in `executor_pre_dequant.cu` handles
diff --git a/docs/roadmap.md b/docs/roadmap.md
index 213345f..a43604f 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -6,15 +6,13 @@ This is a single-author single-target experiment, so "roadmap" is more "current
 ## Known limitations
 
-### Gemma-4 remaining carve-out (FP8 prefill)
+### ~~Gemma-4 carve-outs~~ — all removed
 
-Earlier Gemma-4 carve-outs removed:
-- **FP8 KV cache** — PR #91 (2026-05-01). The "dual head_dim 256/512 needs per-layer-aware kernels" hypothesis was a red herring; the KV write/read kernels handle per-layer head_dim correctly via `Q.shape[3]` template dispatch. Real bugs were (a) FP8 calibration reading the workspace's allocated shape (`max_hd=512`) instead of the live shape (`hd=256` on SWA layers, junk in trailing 256 cols) and (b) warmup-derived absmax poisoning the high-water-mark scale on Gemma-4's `output_norm` outliers (max=588).
-- **NVFP4 decode cache for Q*_K source** — 2026-05-15. The per-tensor convert→quantize loop in `executor_pre_dequant.cu` already handled mixed (N, K) shapes correctly; the disable was overly defensive. Removing it on Q4_K_M / UD-Q4_K_M: pp512 1713 → 2394 tok/s (**+40%**), tg256 176 → 197 tok/s (**+12%**).
-
-One Gemma-4 carve-out remains active in `engine.cpp`:
+All three Gemma-4 carve-outs are now gone:
 
-- **FP8 prefill** (`config_.use_fp8_prefill = 0` for Gemma-4) — different code path from the KV cache. Documented as a *perf* issue (5-19% slower on prefill vs FP16), not a correctness issue; cuBLASLt FP8 algos for Gemma-4's per-layer head_dim shape (256/512 split) lose to FP16 cuBLAS at the standard tile sizes.
+- **FP8 KV cache** — PR #91 (2026-05-01). The "dual head_dim 256/512 needs per-layer-aware kernels" hypothesis was a red herring; the KV write/read kernels handle per-layer head_dim correctly via `Q.shape[3]` template dispatch. Real bugs were (a) FP8 calibration reading the workspace's allocated shape (`max_hd=512`) instead of the live shape (`hd=256` on SWA layers, junk in trailing 256 cols) and (b) warmup-derived absmax poisoning the high-water-mark scale on Gemma-4's `output_norm` outliers (max=588).
+- **NVFP4 decode cache for Q*_K source** — PR #186 (2026-05-15). The per-tensor convert→quantize loop in `executor_pre_dequant.cu` already handled mixed (N, K) shapes correctly; the disable was overly defensive. Removing it on Q4_K_M / UD-Q4_K_M: pp512 1713 → 2394 tok/s (**+40%**), tg256 176 → 197 tok/s (**+12%**).
+- **FP8 prefill** — 2026-05-15. The 2026-05-09 -5..-19% slowdown was real at the time but mostly closed by intermediate prefill work (PRs #177, #181). Re-measured Q4_K_M: pp128 +1.0%, pp512 -0.9%, pp833 -4.2%, pp2048 **+7.3%** — neutral with a long-context advantage. FP8 also halves the activation cache size. Users wanting max prefill at medium pp can opt out via `[attention] fp8_prefill = "never"`.
 
 Default KV dtype is FP16; FP8 is opt-in via `--kv-fp8` (or `kv_cache.dtype = "fp8"` in `imp.conf`). Coherent on Qwen3 dense, Qwen3.5/3.6 GDN, Llama-3.2, and Gemma-4 (post PR #91).
diff --git a/src/runtime/engine.cpp b/src/runtime/engine.cpp
index 9c7044b..3379311 100644
--- a/src/runtime/engine.cpp
+++ b/src/runtime/engine.cpp
@@ -829,19 +829,13 @@ bool Engine::init(std::shared_ptr model, const EngineConfig& config) {
   // CUDA graphs: enabled for Gemma-4 decode. The MoE decode fast path is fully
   // device-side (dp4a GEMV, no D2H memcpy), so graph capture works.
   // Only the MoE prefill path uses D2H sync, but prefill is never graph-captured.
-  if (config_.use_fp8_prefill) {
-    // FP8 prefill on Gemma-4 is correctness-OK (output matches FP16 path
-    // bit-for-bit on the smoke "capital of France is Paris" prompt) but
-    // ~5-19% slower on prefill (measured 2026-05-09 on Q4_K_M: pp=123 was
-    // 270 vs 334 tok/s, pp=833 was 1086 vs 1141 tok/s — both runs FP8 < FP16).
-    // Likely cause: cuBLASLt FP8 algos for Gemma-4's per-layer head_dim
-    // shape (256/512 split) lose to FP16 cuBLAS at our tile sizes. The
-    // earlier "per-layer head_dim not yet supported" comment was inherited
-    // from the FP8 KV story and inaccurate — the issue is perf, not
-    // correctness. Auto-disable for default-perf.
-    IMP_LOG_INFO("Gemma 4: disabling FP8 prefill (~5-19%% slower than FP16 on this arch)");
-    config_.use_fp8_prefill = 0;
-  }
+  // FP8 prefill carve-out removed 2026-05-15. The 2026-05-09 measurement
+  // showed -5..-19% prefill on Gemma-4 vs FP16; since then (PRs #177, #181)
+  // the gap has closed. Re-measured 2026-05-15 on Q4_K_M:
+  //   pp128: +1.0%  pp512: -0.9%  pp833: -4.2%  pp2048: +7.3%
+  // Net effect is neutral with a long-context advantage. FP8 also halves
+  // the activation cache, which helps VRAM at long context. Users wanting
+  // max prefill at medium pp can opt out via [attention] fp8_prefill = "never".
   if (config_.use_nvfp4_decode) {
     // Prequant SafeTensors NVFP4 weights are already in NVFP4 layout on
     // disk. Phase 3a (Q*_K → NVFP4 conversion) and Phase 3b
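The FP8 KV cache bullet in the roadmap hunk above describes calibration bug (a) as the absmax being taken over the workspace's allocated width (`max_hd=512`) rather than the layer's live head_dim. A minimal C++ sketch of the corrected loop is below; `calib_absmax`, `k_workspace`, and the other names are illustrative, not the engine's real identifiers.

```cpp
// Sketch of calibration bug (a) from PR #91, hypothetical names throughout.
// Rows stay strided by the allocated width (max_hd), but only the live
// head_dim columns may feed the FP8 scale; on an SWA layer (hd=256,
// max_hd=512) the trailing 256 columns hold junk.
#include <cmath>
#include <cstddef>

float calib_absmax(const float* k_workspace, int n_tokens, int live_hd, int max_hd) {
    float amax = 0.0f;
    for (int t = 0; t < n_tokens; ++t) {
        const float* row = k_workspace + static_cast<std::size_t>(t) * max_hd;  // allocated stride
        for (int d = 0; d < live_hd; ++d) {                                     // live columns only
            amax = std::fmax(amax, std::fabs(row[d]));
        }
    }
    return amax;
}
```

The only point of the sketch is the inner bound: scanning to `max_hd` instead of `live_hd` is what let the junk columns poison the scale.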
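For reference, the two config knobs quoted in the roadmap and in the new engine.cpp comment could sit in `imp.conf` roughly as follows. The key names (`kv_cache.dtype`, `[attention] fp8_prefill`) and the `--kv-fp8` flag come from the text above; everything else about the file's layout is an assumption.

```toml
# imp.conf — sketch only; key names are quoted from the roadmap text above,
# the surrounding layout is assumed.

# Opt in to the FP8 KV cache (default stays FP16). CLI equivalent: --kv-fp8
kv_cache.dtype = "fp8"

[attention]
# Opt out of FP8 prefill if peak prefill at medium pp matters more than the
# pp2048 gain and the smaller activation cache.
fp8_prefill = "never"
```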