perf(gemma4): drop FP8 prefill carve-out (re-measured neutral / long-ctx +7%) by kekzl · Pull Request #187 · kekzl/imp

kekzl · 2026-05-15T16:18:46Z

Summary

Removes the last entry in the "Gemma-4 remaining carve-outs" roadmap section. 6 lines deleted from src/runtime/engine.cpp:832-844.

The 2026-05-09 measurement showing -5..-19 % prefill on Gemma-4 vs FP16 was real at the time, but the gap has since been substantially closed by intermediate prefill work (PR #177 device-side ptr-array, PR #181 WMMA cp.async pipeline). Re-measured 2026-05-15 on Gemma-4-26B-A4B-it-Q4_K_M (5 reps, --temperature 0):

pp	FP8 OFF	FP8 ON	delta
128	870 tok/s	879 tok/s	+1.0 %
512	1732 tok/s	1717 tok/s	-0.9 %
833	1649 tok/s	1579 tok/s	-4.2 %
2048	1624 tok/s	1742 tok/s	+7.3 %
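The delta column can be reproduced from the two throughput columns. The sketch below assumes delta is the relative change of FP8 ON versus FP8 OFF; the PR does not state the formula, but this definition matches every reported figure. The names `measurements` and `delta_percent` are illustrative only.

```python
# Throughput values (tok/s) copied from the table above: pp -> (FP8 OFF, FP8 ON).
measurements = {
    128: (870, 879),
    512: (1732, 1717),
    833: (1649, 1579),
    2048: (1624, 1742),
}


def delta_percent(fp8_off: float, fp8_on: float) -> float:
    """Relative change of FP8 ON versus FP8 OFF, in percent (assumed definition)."""
    return (fp8_on - fp8_off) / fp8_off * 100.0


# Round to one decimal place, as in the table.
deltas = {pp: round(delta_percent(off, on), 1) for pp, (off, on) in measurements.items()}

for pp, delta in deltas.items():
    print(f"pp={pp:>4}  delta={delta:+.1f} %")
```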

Net effect is roughly neutral at short and medium prompt lengths (+1.0 % to -4.2 %), with a +7.3 % advantage at pp 2048. FP8 prefill also halves the activation cache footprint, which is a real VRAM win at long context.
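The halving follows from element width alone: FP16 stores 2 bytes per element, FP8 stores 1. A minimal sketch of the arithmetic, assuming a single dense buffer of shape tokens × hidden size. The hidden size below is a hypothetical placeholder, not a value from this PR, and the engine's actual cache layout is not shown here.

```python
FP16_BYTES = 2  # bytes per FP16 element
FP8_BYTES = 1   # bytes per FP8 element


def cache_bytes(tokens: int, hidden_size: int, bytes_per_element: int) -> int:
    """Size of one dense tokens x hidden_size buffer, in bytes."""
    return tokens * hidden_size * bytes_per_element


TOKENS = 2048       # longest prompt length in the measurement table
HIDDEN_SIZE = 4096  # hypothetical value, chosen only for illustration

fp16_size = cache_bytes(TOKENS, HIDDEN_SIZE, FP16_BYTES)
fp8_size = cache_bytes(TOKENS, HIDDEN_SIZE, FP8_BYTES)
ratio = fp8_size / fp16_size

print(f"FP16: {fp16_size / 2**20:.0f} MiB, FP8: {fp8_size / 2**20:.0f} MiB, ratio {ratio}")
```

The ratio is independent of the chosen sizes, so the absolute saving grows linearly with prompt length.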

Test plan

Coherence: chat-template gemma + "What is the capital of France?" → "**Paris**." (bit-exact between FP8 and FP16 paths).
Long-prompt + chunked prefill (640 tokens, --prefill-chunk-size 512) → coherent summary at 633 tok/s prefill / 160 tok/s decode.
test-attention 77/77 pass.
test-kv 34/34 pass.
make verify-fast green (post-variance re-run — first run regressed on Qwen3-8B baseline due to cuBLAS algo jitter, gone on retry; the change is gated if GEMMA4 so non-Gemma-4 archs are untouched).
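For the long-prompt case in the test plan, a 640-token prompt with --prefill-chunk-size 512 does not fit in one chunk. The sketch below shows a generic fixed-size split; it illustrates the arithmetic only, and how the engine actually schedules chunks is not described in this PR. `prefill_chunks` is a hypothetical helper.

```python
def prefill_chunks(prompt_tokens: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, prompt_tokens) into consecutive half-open ranges of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_tokens))
        for start in range(0, prompt_tokens, chunk_size)
    ]


chunks = prefill_chunks(640, 512)
print(chunks)
```

Under this split the test prompt yields one full 512-token chunk followed by a 128-token remainder, so it covers both the full-chunk and the partial-chunk path.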

Opt-out

Users who want maximum prefill throughput at medium pp (FP8 is 4.2 % slower at pp 833) can disable it via imp.conf:

[attention]
fp8_prefill = "never"
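To check the opt-out from a script, the setting can be read back with a generic parser. This sketch assumes imp.conf is INI-like enough for Python's `configparser`; it is not the engine's own parser. `fp8_prefill_setting` is a hypothetical helper, and "never" is the only value this PR documents.

```python
import configparser

# The opt-out snippet from this PR, inlined for the example.
CONF_TEXT = '''
[attention]
fp8_prefill = "never"
'''


def fp8_prefill_setting(conf_text):
    """Return the [attention] fp8_prefill value without quotes, or None if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("attention", "fp8_prefill", fallback=None)
    # configparser keeps the surrounding double quotes, so strip them here.
    return None if raw is None else raw.strip().strip('"')


setting = fp8_prefill_setting(CONF_TEXT)
print("FP8 prefill opted out:", setting == "never")
```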

Gemma-4 chapter closure

This PR removes the last Gemma-4 carve-out. The roadmap "Gemma-4 remaining carve-outs" section now consolidates the three-step history:

Carve-out	Removed	PR
FP8 KV cache	2026-05-01	#91
NVFP4 decode cache (Q*_K)	2026-05-15	#186
FP8 prefill	2026-05-15	this

Remaining Gemma-4 issues are documented separately (Q4_K_M code-gen drift on complex code prompts — use Q5_K_M / Q8_0).

🤖 Generated with Claude Code

…x win)

Removes the auto-disable at engine.cpp:832-844. The 2026-05-09 measurement showing -5..-19% prefill on Gemma-4 (vs FP16) was real at the time but substantially closed by intermediate prefill work (PR #177 device-side ptr-array, PR #181 WMMA cp.async, etc.).

Re-measured 2026-05-15 on Gemma-4-26B-A4B-it-Q4_K_M (5 reps, --bench-pp <N> --temperature 0):

| pp    | FP8 OFF tok/s | FP8 ON tok/s | delta  |
|-------|---------------|--------------|--------|
| 128   | 870           | 879          | +1.0 % |
| 512   | 1732          | 1717         | -0.9 % |
| 833   | 1649          | 1579         | -4.2 % |
| 2048  | 1624          | 1742         | +7.3 % |

Net effect is neutral with a long-context advantage. FP8 prefill also halves the activation cache size, which is a real VRAM win at long ctx.

Coherence: chat-template gemma + "What is the capital of France?" → "**Paris**." (bit-exact between FP8 and FP16 paths).

make verify-fast: green (post-variance re-run; first run regressed on Qwen3-8B baseline, gone on retry; the change is gated `if GEMMA4` so non-Gemma-4 archs are untouched).

Closes the last entry in the "Gemma-4 remaining carve-outs" roadmap section (FP8 KV cache, NVFP4 Q*_K decode, FP8 prefill all removed). Users wanting max prefill at medium pp can opt out via [attention] fp8_prefill = "never".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Roadmap "Gemma-4 remaining carve-outs" section now lists all three as removed (FP8 KV cache #91, NVFP4 Q*_K decode #186, FP8 prefill here). CHANGELOG Unreleased entry with measurement table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kekzl and others added 2 commits May 15, 2026 18:16

github-actions Bot enabled auto-merge (squash) May 15, 2026 16:18

github-actions Bot merged commit c86aab5 into main May 15, 2026
3 checks passed

kekzl deleted the feat/gemma4-fp8-prefill-retest branch May 16, 2026 11:27
