7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes since v0.6. Format loosely follows [Keep a Changelog](https:

## Unreleased

- **Gemma-4 FP8 prefill carve-out removed** — the -5..-19% prefill gap vs
FP16 measured on Gemma-4 on 2026-05-09 has substantially closed with
intervening prefill work (PRs #177, #181). Re-measured 2026-05-15
on Q4_K_M: pp128 +1.0%, pp512 -0.9%, pp833 -4.2%, pp2048 **+7.3%** —
neutral overall, with a long-context advantage. FP8 also halves the
activation cache. Coherence is bit-exact on chat prompts. Closes the last
entry in the "Gemma-4 remaining carve-outs" roadmap section.
- **Gemma-4 NVFP4 decode cache for Q*_K source weights** — drops the
"per-layer head_dim not yet supported" carve-out at `engine.cpp:864-866`.
The per-tensor convert→quantize loop in `executor_pre_dequant.cu` handles
Expand Down
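
For readers skimming the changelog entry above, here is a minimal C++ sketch of the per-tensor convert→quantize pattern it refers to. Everything in it (`WeightTensor`, `dequant_qk_to_fp16`, `quant_fp16_to_nvfp4`, `build_nvfp4_decode_cache`) is an illustrative assumption, not the actual `executor_pre_dequant.cu` code; the only point it demonstrates is that a loop which converts one tensor at a time carries each tensor's own (N, K) shape, so mixed shapes across layers need no special casing.

```cpp
#include <cstdint>
#include <vector>

// One source weight tensor awaiting conversion; fields are illustrative.
struct WeightTensor {
    int64_t n;                 // rows (output features)
    int64_t k;                 // cols (input features)
    const void* qk_blocks;     // source Q*_K blocks
    void* nvfp4_out;           // destination NVFP4 decode cache
};

// Assumed helpers, stubbed here; names and signatures are made up for this sketch.
static void dequant_qk_to_fp16(const void* /*src*/, uint16_t* /*dst*/, int64_t /*n*/, int64_t /*k*/) {
    // The real pass would dequantize Q*_K blocks to FP16 on device.
}
static void quant_fp16_to_nvfp4(const uint16_t* /*src*/, void* /*dst*/, int64_t /*n*/, int64_t /*k*/) {
    // The real pass would quantize the FP16 scratch buffer to NVFP4.
}

// Per-tensor convert -> quantize: each iteration uses the tensor's own
// (n, k), so a model that mixes shapes across layers needs no special case.
void build_nvfp4_decode_cache(const std::vector<WeightTensor>& tensors,
                              uint16_t* fp16_scratch /* sized for the largest tensor */) {
    for (const WeightTensor& t : tensors) {
        dequant_qk_to_fp16(t.qk_blocks, fp16_scratch, t.n, t.k);
        quant_fp16_to_nvfp4(fp16_scratch, t.nvfp4_out, t.n, t.k);
    }
}
```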
12 changes: 5 additions & 7 deletions docs/roadmap.md
@@ -6,15 +6,13 @@ This is a single-author single-target experiment, so "roadmap" is more "current

## Known limitations

### Gemma-4 remaining carve-out (FP8 prefill)
### ~~Gemma-4 carve-outs~~ — all removed

Earlier Gemma-4 carve-outs removed:
- **FP8 KV cache** — PR #91 (2026-05-01). The "dual head_dim 256/512 needs per-layer-aware kernels" hypothesis was a red herring; the KV write/read kernels handle per-layer head_dim correctly via `Q.shape[3]` template dispatch. Real bugs were (a) FP8 calibration reading the workspace's allocated shape (`max_hd=512`) instead of the live shape (`hd=256` on SWA layers, junk in trailing 256 cols) and (b) warmup-derived absmax poisoning the high-water-mark scale on Gemma-4's `output_norm` outliers (max=588).
- **NVFP4 decode cache for Q*_K source** — 2026-05-15. The per-tensor convert→quantize loop in `executor_pre_dequant.cu` already handled mixed (N, K) shapes correctly; the disable was overly defensive. Removing it on Q4_K_M / UD-Q4_K_M: pp512 1713 → 2394 tok/s (**+40%**), tg256 176 → 197 tok/s (**+12%**).

One Gemma-4 carve-out remains active in `engine.cpp`:
All three Gemma-4 carve-outs are now gone:

- **FP8 prefill** (`config_.use_fp8_prefill = 0` for Gemma-4) — different code path from the KV cache. Documented as a *perf* issue (5-19% slower on prefill vs FP16), not a correctness issue; cuBLASLt FP8 algos for Gemma-4's per-layer head_dim shape (256/512 split) lose to FP16 cuBLAS at the standard tile sizes.
- **FP8 KV cache** — PR #91 (2026-05-01). The "dual head_dim 256/512 needs per-layer-aware kernels" hypothesis was a red herring; the KV write/read kernels handle per-layer head_dim correctly via `Q.shape[3]` template dispatch. Real bugs were (a) FP8 calibration reading the workspace's allocated shape (`max_hd=512`) instead of the live shape (`hd=256` on SWA layers, junk in trailing 256 cols) and (b) warmup-derived absmax poisoning the high-water-mark scale on Gemma-4's `output_norm` outliers (max=588).
- **NVFP4 decode cache for Q*_K source** — PR #186 (2026-05-15). The per-tensor convert→quantize loop in `executor_pre_dequant.cu` already handled mixed (N, K) shapes correctly; the disable was overly defensive. Removing it on Q4_K_M / UD-Q4_K_M: pp512 1713 → 2394 tok/s (**+40%**), tg256 176 → 197 tok/s (**+12%**).
- **FP8 prefill** — 2026-05-15. The 2026-05-09 -5..-19% slowdown was real at the time but has mostly been closed by intervening prefill work (PRs #177, #181). Re-measured on Q4_K_M: pp128 +1.0%, pp512 -0.9%, pp833 -4.2%, pp2048 **+7.3%** — neutral overall, with a long-context advantage. FP8 also halves the activation cache size. Users who want maximum prefill throughput at medium pp can opt out via `[attention] fp8_prefill = "never"`.

Default KV dtype is FP16; FP8 is opt-in via `--kv-fp8` (or `kv_cache.dtype = "fp8"` in `imp.conf`). Coherent on Qwen3 dense, Qwen3.5/3.6 GDN, Llama-3.2, and Gemma-4 (post PR #91).

Expand Down
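
As a reading aid for the two FP8 KV calibration bugs listed in the roadmap bullet above, a short C++ sketch under stated assumptions: `LayerKvView`, `live_absmax`, and `update_high_water_absmax` are invented names, and the real calibration code is not part of this diff. The sketch only illustrates the two fixes as described: absmax taken over the live head_dim rather than the allocated workspace width, and warmup passes excluded from the high-water-mark scale.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct LayerKvView {
    const float* data;   // calibration workspace, laid out rows x max_hd
    int64_t rows;        // tokens written so far
    int64_t max_hd;      // allocated head_dim (512 in the Gemma-4 workspace)
    int64_t live_hd;     // head_dim actually written by this layer (256 on SWA layers)
};

// Fix (a): scan only the live head_dim columns. Scanning the allocated
// max_hd width also reads the trailing columns, which hold junk on SWA
// layers where only 256 of the 512 columns were written.
float live_absmax(const LayerKvView& v) {
    float m = 0.0f;
    for (int64_t r = 0; r < v.rows; ++r)
        for (int64_t c = 0; c < v.live_hd; ++c)          // not v.max_hd
            m = std::max(m, std::fabs(v.data[r * v.max_hd + c]));
    return m;
}

// Fix (b): keep warmup passes out of the high-water-mark scale so an
// outlier-heavy warmup activation cannot pin the FP8 scale for the run.
void update_high_water_absmax(float& high_water, float observed, bool is_warmup) {
    if (is_warmup) return;
    high_water = std::max(high_water, observed);
}
```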
20 changes: 7 additions & 13 deletions src/runtime/engine.cpp
@@ -829,19 +829,13 @@ bool Engine::init(std::shared_ptr<Model> model, const EngineConfig& config) {
// CUDA graphs: enabled for Gemma-4 decode. The MoE decode fast path is fully
// device-side (dp4a GEMV, no D2H memcpy), so graph capture works.
// Only the MoE prefill path uses D2H sync, but prefill is never graph-captured.
if (config_.use_fp8_prefill) {
// FP8 prefill on Gemma-4 is correctness-OK (output matches FP16 path
// bit-for-bit on the smoke "capital of France is Paris" prompt) but
// ~5-19% slower on prefill (measured 2026-05-09 on Q4_K_M: pp=123 was
// 270 vs 334 tok/s, pp=833 was 1086 vs 1141 tok/s — both runs FP8 < FP16).
// Likely cause: cuBLASLt FP8 algos for Gemma-4's per-layer head_dim
// shape (256/512 split) lose to FP16 cuBLAS at our tile sizes. The
// earlier "per-layer head_dim not yet supported" comment was inherited
// from the FP8 KV story and inaccurate — the issue is perf, not
// correctness. Auto-disable for default-perf.
IMP_LOG_INFO("Gemma 4: disabling FP8 prefill (~5-19%% slower than FP16 on this arch)");
config_.use_fp8_prefill = 0;
}
// FP8 prefill carve-out removed 2026-05-15. The 2026-05-09 measurement
// showed -5..-19% prefill on Gemma-4 vs FP16; since then (PRs #177, #181)
// the gap has closed. Re-measured 2026-05-15 on Q4_K_M:
// pp128: +1.0% pp512: -0.9% pp833: -4.2% pp2048: +7.3%
// Net effect is neutral with a long-context advantage. FP8 also halves
// the activation cache, which helps VRAM at long context. Users wanting
// max prefill at medium pp can opt out via [attention] fp8_prefill = "never".
if (config_.use_nvfp4_decode) {
// Prequant SafeTensors NVFP4 weights are already in NVFP4 layout on
// disk. Phase 3a (Q*_K → NVFP4 conversion) and Phase 3b
Expand Down
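
The new comment block above points users at `[attention] fp8_prefill = "never"` as the opt-out. As a hedged sketch only: the enum, the `"auto"`/`"always"` values, and `resolve_use_fp8_prefill` below are assumptions for illustration, not the engine's actual config handling; the diff confirms only the `"never"` spelling.

```cpp
// Hypothetical sketch: how a three-way fp8_prefill setting could resolve
// to a use_fp8_prefill flag once no per-model auto-disable remains.
enum class Fp8PrefillMode { Auto, Always, Never };

bool resolve_use_fp8_prefill(Fp8PrefillMode mode, bool arch_has_fp8_prefill_path) {
    switch (mode) {
        case Fp8PrefillMode::Never:  return false;                       // explicit opt-out
        case Fp8PrefillMode::Always: return arch_has_fp8_prefill_path;   // forced on where possible
        case Fp8PrefillMode::Auto:   return arch_has_fp8_prefill_path;   // no model carve-outs remain
    }
    return false;  // unreachable
}
```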