
perf(gemma4): drop FP8 prefill carve-out (re-measured neutral / long-ctx +7%) #187

Merged

github-actions[bot] merged 2 commits into main from feat/gemma4-fp8-prefill-retest on May 15, 2026
Conversation

@kekzl kekzl (Owner) commented May 15, 2026

Summary

Removes the last entry in the "Gemma-4 remaining carve-outs" roadmap section and drops the corresponding auto-disable in the engine: 6 lines deleted from src/runtime/engine.cpp:832-844.

The 2026-05-09 measurement showing -5..-19 % prefill on Gemma-4 vs FP16 was real at the time but has since been substantially closed by intervening prefill work (PR #177 device-side ptr-array, PR #181 WMMA cp.async pipeline). Re-measured 2026-05-15 on Gemma-4-26B-A4B-it-Q4_K_M (5 reps, --temperature 0):

| pp   | FP8 OFF (tok/s) | FP8 ON (tok/s) | delta  |
|------|-----------------|----------------|--------|
| 128  |  870            |  879           | +1.0 % |
| 512  | 1732            | 1717           | -0.9 % |
| 833  | 1649            | 1579           | -4.2 % |
| 2048 | 1624            | 1742           | +7.3 % |

Net effect is neutral with a long-context advantage. FP8 prefill also halves the activation cache footprint, which is a real VRAM win at long context.
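For context, here is a minimal sketch of the kind of guard this PR deletes. The symbol names (`ARCH_GEMMA4`, `prefill_fp8`, `apply_gemma4_carveout`) are illustrative placeholders, not the actual identifiers in src/runtime/engine.cpp:

```cpp
#include <cstdio>

// Illustrative types only; the real engine structures differ.
enum Arch { ARCH_GEMMA4, ARCH_OTHER };

struct RuntimeParams {
    bool prefill_fp8 = true;   // FP8 prefill requested (default: on)
};

// Shape of the carve-out removed at engine.cpp:832-844: before this PR,
// Gemma-4 forced FP8 prefill off because the 2026-05-09 numbers showed
// -5..-19 % prefill vs FP16. The gate is now gone; Gemma-4 follows the
// same FP8 prefill policy as every other arch, and the opt-out moves to
// imp.conf ([attention] fp8_prefill = "never").
void apply_gemma4_carveout(Arch arch, RuntimeParams &params) {
    if (arch == ARCH_GEMMA4 && params.prefill_fp8) {
        std::fprintf(stderr, "gemma4: disabling FP8 prefill (known regression vs FP16)\n");
        params.prefill_fp8 = false;   // fall back to the FP16 prefill path
    }
}
```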

Test plan

  • Coherence: chat-template gemma + "What is the capital of France?" → "**Paris**." (bit-exact between FP8 and FP16 paths).
  • Long-prompt + chunked prefill (640 tokens, --prefill-chunk-size 512) → coherent summary at 633 tok/s prefill / 160 tok/s decode; see the chunking sketch after this list.
  • test-attention 77/77 pass.
  • test-kv 34/34 pass.
  • make verify-fast green (re-run after variance: the first run showed a regression on the Qwen3-8B baseline due to cuBLAS algo jitter that was gone on retry; the change is gated behind `if GEMMA4`, so non-Gemma-4 archs are untouched).
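For the chunked-prefill item above, a hedged sketch of the loop being exercised: a 640-token prompt with --prefill-chunk-size 512 is fed to the engine as 512 + 128 tokens. `engine_prefill` is a placeholder name, not the real API:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical chunked-prefill driver: split the prompt into chunk_size
// pieces and prefill each piece at its running KV-cache offset.
void prefill_chunked(const std::vector<int> &prompt_tokens, size_t chunk_size) {
    for (size_t pos = 0; pos < prompt_tokens.size(); pos += chunk_size) {
        size_t n = std::min(chunk_size, prompt_tokens.size() - pos);
        // engine_prefill(prompt_tokens.data() + pos, n, /*kv_offset=*/pos);  // placeholder call
        std::printf("prefill chunk: offset=%zu tokens=%zu\n", pos, n);
    }
}
// A 640-token prompt with chunk_size = 512 yields two chunks: (0, 512) and (512, 128).
```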

Opt-out

Users who want maximum prefill throughput at medium pp (where FP8 is ~4 % slower) can disable it via imp.conf:

[attention]
fp8_prefill = "never"

Gemma-4 chapter closure

This PR removes the last Gemma-4 carve-out. The roadmap "Gemma-4 remaining carve-outs" section now consolidates the three-step history:

| Carve-out                  | Removed    | PR      |
|----------------------------|------------|---------|
| FP8 KV cache               | 2026-05-01 | #91     |
| NVFP4 decode cache (Q*_K)  | 2026-05-15 | #186    |
| FP8 prefill                | 2026-05-15 | this PR |

Remaining Gemma-4 issues are documented separately (Q4_K_M code-gen drift on complex code prompts — use Q5_K_M / Q8_0).

🤖 Generated with Claude Code

kekzl and others added 2 commits May 15, 2026 18:16
…x win)

Removes the auto-disable at engine.cpp:832-844. The 2026-05-09 measurement
showing -5..-19% prefill on Gemma-4 (vs FP16) was real at the time but
substantially closed by intermediate prefill work (PR #177 device-side
ptr-array, PR #181 WMMA cp.async, etc.). Re-measured 2026-05-15 on
Gemma-4-26B-A4B-it-Q4_K_M (5 reps, --bench-pp <N> --temperature 0):

| pp    | FP8 OFF tok/s | FP8 ON tok/s | delta  |
|-------|---------------|--------------|--------|
| 128   |  870          |  879         | +1.0 % |
| 512   | 1732          | 1717         | -0.9 % |
| 833   | 1649          | 1579         | -4.2 % |
| 2048  | 1624          | 1742         | +7.3 % |

Net effect is neutral with a long-context advantage. FP8 prefill also
halves the activation cache size, which is a real VRAM win at long ctx.

Coherence: chat-template gemma + "What is the capital of France?" →
"**Paris**." (bit-exact between FP8 and FP16 paths).

make verify-fast: green (post-variance re-run — first run regressed on
Qwen3-8B baseline, gone on retry; the change is gated `if GEMMA4` so
non-Gemma-4 archs are untouched).

Closes the last entry in the "Gemma-4 remaining carve-outs" roadmap
section (FP8 KV cache, NVFP4 Q*_K decode, FP8 prefill all removed).
Users wanting max prefill at medium pp can opt out via
[attention] fp8_prefill = "never".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roadmap "Gemma-4 remaining carve-outs" section now lists all three as
removed (FP8 KV cache #91, NVFP4 Q*_K decode #186, FP8 prefill here).
CHANGELOG Unreleased entry with measurement table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions[bot] enabled auto-merge (squash) May 15, 2026 16:18
github-actions[bot] merged commit c86aab5 into main May 15, 2026
3 checks passed
kekzl deleted the feat/gemma4-fp8-prefill-retest branch May 16, 2026 11:27