perf(prefill): tighten WMMA dispatch guard — graceful fallback for small-N #180
Merged
## Summary
Follow-up to #179. Cross-model validation (Qwen3-Coder-NVFP4, Qwen3.6-NVFP4, Qwen3-30B-Modelopt, Gemma-4-NVFP4) on a warm container revealed the v1 WMMA kernel is significantly slower than cuBLASLt on small-N shapes. This PR tightens the dispatch guard so those shapes fall back to eager execution (no regression) instead of capturing a slow kernel.
## Cross-model A/B (same-session warm container, pp=1024, reps=3)

Measured post-warmup with the permissive guard from #179:

| Model | Baseline | `IMP_PREFILL_GRAPH=1` | Δ |
|---|---:|---:|---:|
| Qwen3-Coder-30B-NVFP4 | 15336 | 14593 | -4.8% |
| Qwen3.6-35B-NVFP4 | 10219 | 8614 | -15.7% |
| Gemma-4-26B-NVFP4 | 26524 | 31277 | +17.9% |

The earlier +27% Qwen3-Coder measurement at #179 came from a cold-container session where cuBLASLt was at its slowest; the warm A/B above shows the WMMA kernel ~5% behind cuBLASLt on Qwen3-Coder shapes, not ahead. Qwen3.6 regresses 15.7% because its Q/K/V projection at M=512, N=32, K=2048 launches only ⌈32/BN⌉ × ⌈512/BM⌉ = 1 × 4 = 4 blocks across 128 SMs (3% SM saturation) and wastes 75% of MMA cycles on zero-padded B fragments.
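To make the arithmetic concrete, here is a small host-side illustration. BM = BN = 128 is an assumption inferred from ⌈512/BM⌉ = 4 and the 75% B-fragment waste at N = 32 ((128 − 32)/128); the actual tile constants live in `gemm_capture_fp16_sm120.cu`.

```cpp
#include <cstdio>

// Assumed tile geometry -- inferred from the numbers above, not read
// from the kernel source.
constexpr int BM = 128;       // block tile rows
constexpr int BN = 128;       // block tile cols
constexpr int NUM_SMS = 128;  // SM count on the test GPU

int main() {
    // Qwen3.6 Q/K/V projection shape from the table above.
    const int M = 512, N = 32;
    const int blocks = ((N + BN - 1) / BN) * ((M + BM - 1) / BM);  // 1 * 4 = 4
    const double saturation = 100.0 * blocks / NUM_SMS;            // ~3%
    const double b_waste = 100.0 * (BN - N) / BN;                  // 75%
    printf("blocks=%d, SM saturation=%.1f%%, zero-padded B fraction=%.1f%%\n",
           blocks, saturation, b_waste);
}
```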
## Fix
**Tight dispatch guard** (`N >= BN && M >= BM`): rejects shapes where the WMMA kernel is uncompetitive. The caller (`gemm.cu`) falls through to cuBLASLt for declined shapes; under stream capture cuBLASLt fails with status 14, so the wrapper aborts the capture and falls back to eager — same as baseline, no regression. Models whose GEMMs all have N ≥ BN (Qwen3-Coder, Modelopt) are unaffected by the guard change. Gemma-4 (SWA, no chunked prefill, so the wrapper doesn't fire) still gets the +17.9% from #179's engine-init prewarms — that path is independent of the WMMA kernel. The sketch below shows the shape of the dispatch chain.
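A minimal sketch of that chain, with hypothetical names (`try_wmma_gemm`, `run_gemm_under_capture`, `gemm_cublaslt`) — only the guard condition reflects the actual one-line change:

```cpp
#include <cublasLt.h>      // cublasStatus_t; 14 == CUBLAS_STATUS_INTERNAL_ERROR
#include <cuda_runtime.h>

constexpr int BM = 128, BN = 128;  // assumed tile geometry

// Declines shapes that would underfill the tile grid so the caller can
// fall through to cuBLASLt instead of capturing an uncompetitive kernel.
bool try_wmma_gemm(int M, int N, int K, cudaStream_t stream) {
    if (!(N >= BN && M >= BM))
        return false;               // <- the tightened guard
    // launch_wmma_fp16<<<grid, block, 0, stream>>>(...);  // unchanged path
    (void)K; (void)stream;
    return true;
}

// Stand-in for the real cublasLtMatmul(...) call in gemm.cu.
cublasStatus_t gemm_cublaslt(int M, int N, int K, cudaStream_t stream);

void run_gemm_under_capture(int M, int N, int K, cudaStream_t stream) {
    if (try_wmma_gemm(M, N, K, stream))
        return;                               // capturable WMMA path
    if (gemm_cublaslt(M, N, K, stream) != CUBLAS_STATUS_SUCCESS) {
        // cuBLASLt returns status 14 while a stream capture is active:
        // abort the capture and let the engine re-run the prefill
        // eagerly, identical to baseline, so no regression.
        cudaGraph_t graph = nullptr;
        cudaStreamEndCapture(stream, &graph);
        if (graph) cudaGraphDestroy(graph);
    }
}
```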
## Default flip status
Deferred. What ships here is a single-line guard tightening in `gemm_capture_fp16_sm120.cu`; `IMP_PREFILL_GRAPH=1` remains opt-in and default behavior is unchanged. A real default-on requires the WMMA kernel to achieve cuBLASLt-warm-state parity across all production NVFP4 MoE shapes — multi-day kernel work (cp.async pipelining, a BN=32 small-N specialization, possibly larger tile geometry). A rough sketch of the small-N direction follows.
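Purely illustrative and not part of this PR: if the kernel were templated on tile geometry, a small-N variant could be picked at dispatch so an N=32 shape stops zero-padding 96 of 128 B-fragment columns. All names below are invented; the real specialization is the deferred multi-day work.

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical kernel templated on tile geometry (not the shipped kernel);
// the WMMA fragment loads and mma_sync loop over K tiles are elided.
template <int BM_, int BN_>
__global__ void wmma_gemm_fp16(const __half* A, const __half* B, __half* C,
                               int M, int N, int K) { /* ... */ }

// Invented dispatch: a BN=32 tile for small-N shapes fills its B
// fragments instead of padding them with zeros.
void dispatch_wmma(const __half* A, const __half* B, __half* C,
                   int M, int N, int K, cudaStream_t stream) {
    const dim3 block(256);
    if (N < 128) {
        const dim3 grid((N + 31) / 32, (M + 127) / 128);
        wmma_gemm_fp16<128, 32><<<grid, block, 0, stream>>>(A, B, C, M, N, K);
    } else {
        const dim3 grid((N + 127) / 128, (M + 127) / 128);
        wmma_gemm_fp16<128, 128><<<grid, block, 0, stream>>>(A, B, C, M, N, K);
    }
}
```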
🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>