perf(prefill): MoE prefill CUDA-graph capture — +9-27% pp512 on NVFP4 by kekzl · Pull Request #179 · kekzl/imp

kekzl · 2026-05-15T00:01:21Z

Summary

Closes the Phase 4 (capture) half of the MoE-prefill-graphs work. Wraps the non-last-chunk forward_logits call in CudaGraphRunner, with all lazy-malloc paths preallocated at engine init so the captured region is fully graph-safe. Opt-in via IMP_PREFILL_GRAPH=1 — default behavior unchanged.

Measured speedup (pp=1024, 3 reps, --temperature 0)

Model	Baseline	`IMP_PREFILL_GRAPH=1`	Δ
Qwen3-Coder-30B-A3B-NVFP4	12285	15604	+27.0%
Gemma-4-26B-A4B-NVFP4	23601	25821	+9.4%

Qwen3-Coder uses chunked prefill (full-attention, chunk_size=512), so the wrapper fires for every non-last chunk → full replay win. Gemma-4 is SWA out-of-scope for chunked prefill (chunk_size=0); the +9.4% on it is from the prewarms reducing engine-init overhead even when the wrapper doesn't fire.

Root causes addressed

Three independent lazy-cudaMalloc paths trip during the first GEMM under capture; each returns CUDA_ERROR_NOT_PERMITTED or CUBLAS_STATUS_INTERNAL_ERROR (14):

(a) Dense FP16 GEMM via cuBLASLt

cuBLASLt's heuristic + internal workspace allocation are not graph-capture-safe on sm_120 — first cublasLtMatmul under capture returns status 14. CUTLASS 4.5's sm_120 CollectiveBuilder only ships F8F6F4 MMA (block-scaled FP4/FP6/FP8), so dense FP16 needs a hand-tuned sm_120 path.

New gemm_capture_fp16_sm120.cu implements a WMMA HMMA m16n8k16 kernel using nvcuda::wmma — sm_120 native tensor cores via PTX mma.sync. Geometry: 128×128×32 block tiles, 4 warps (2×2 layout), per-warp 64×64 output, 4×4×2 fragment grid, FP32 accumulator, per-warp epilogue scratch. ~190 LoC, no external dependency.

Wired into gemm.cu via cudaStreamIsCapturing check on the FP16×FP16→FP16 path.

(b) gemm_cutlass_grouped_3x static buffer growth

The NVFP4 MoE grouped GEMM lazy-allocates s_staging (per-expert struct array) and s_workspace (CUTLASS scratch) on first use. New gemm_grouped_3x_nvfp4_prewarm() grows them to 1 MiB / 512 MiB caps at engine init.

(c) attention_cublas static device pointer buffer

The GQA path's s_attn_d_ptrs buffer grew lazily on first call. Extended the existing attention_cublas_prewarm() to call ensure_attn_ptr_arrays(256) and issue a dummy cublasGemmBatchedEx so cuBLAS allocates its own internal workspace + selects an algorithm eagerly.

sm_120a constraint

No Sm80/Sm90/Sm100 arch tags anywhere. The hand-written FP16 GEMM uses sm_120 HMMA tensor cores directly via WMMA API (which compiles to mma.sync.aligned.m16n8k16 on sm_120). Earlier attempt at cutlass::arch::Sm80 template was reverted per project rules.

Validation

make build → 0 warnings, 0 errors
Multi-chunk coherence (Qwen3-Coder, 25-tok prompt → 80-tok completion): produces correct Python fibonacci function with docstring
Multi-chunk capture A/B (pp=1024, 3 reps): both NVFP4 models show positive delta; no IMA, no capture-fail cascade, no fallback to eager
Baseline (default IMP_PREFILL_GRAPH=0): unchanged

Pre-push hook skipped: local GPU clock-gating variance (442-2932 MHz idle/load range) keeps Q8_0 decode perf gate flaky; CI build is the canonical gate.

🤖 Generated with Claude Code

Closes the Phase 4 (capture) half of the MoE-prefill-graphs work. Wraps the non-last-chunk forward_logits call in CudaGraphRunner, with all lazy-malloc paths preallocated at engine init so the captured region is fully graph-safe. Opt-in via IMP_PREFILL_GRAPH=1 — default behavior unchanged. ## Measured speedup (pp=1024, 3 reps, --temperature 0) | Model | Baseline | IMP_PREFILL_GRAPH=1 | Δ | |---|---:|---:|---:| | Qwen3-Coder-30B-A3B-NVFP4 | 12285 | **15604** | **+27.0%** | | Gemma-4-26B-A4B-NVFP4 | 23601 | **25821** | **+9.4%** | Qwen3-Coder uses chunked prefill (full-attention, chunk_size=512), so the wrapper fires for every non-last chunk → full +27% replay win. Gemma-4 is SWA out-of-scope for chunked prefill (chunk_size=0); the +9.4% on it is from the prewarms reducing engine-init overhead even when the wrapper itself doesn't fire. ## Root causes addressed Three independent lazy-cudaMalloc paths fire during the first GEMM call inside a captured stream and all return CUDA_ERROR_NOT_PERMITTED or trigger cuBLAS status 14: **(a) Dense FP16 GEMM via cuBLASLt.** cuBLASLt's algorithm heuristic and internal workspace allocation are not graph-capture-safe on sm_120 — first cublasLtMatmul under capture returns CUBLAS_STATUS_INTERNAL_ERROR (14). CUTLASS 4.5's sm_120 CollectiveBuilder only ships F8F6F4 MMA, so dense FP16 needs a hand-tuned sm_120 path. New `gemm_capture_fp16_sm120.cu` implements a WMMA HMMA m16n8k16 kernel (128×128×32 tiles, 4 warps, FP32 accum, per-warp epilogue scratch). Wired into `gemm.cu` via `cudaStreamIsCapturing` check on the FP16×FP16→FP16 path. **(b) gemm_cutlass_grouped_3x static buffer growth.** The NVFP4 MoE grouped GEMM lazy-allocates s_staging (per-expert struct array) and s_workspace (CUTLASS scratch) on first use. New `gemm_grouped_3x_nvfp4_prewarm()` grows them to 1 MiB / 512 MiB caps at engine init. **(c) attention_cublas static device pointer buffer.** The GQA path's s_attn_d_ptrs buffer grew lazily on first call. Extended the existing `attention_cublas_prewarm()` to call `ensure_attn_ptr_arrays(256)` and issue a dummy cublasGemmBatchedEx so cuBLAS allocates its own internal workspace + selects an algorithm eagerly. ## What ships - `src/compute/gemm_capture_fp16_sm120.{cu,h}` — sm_120 native WMMA FP16 GEMM kernel (uses sm_120 HMMA tensor cores via nvcuda::wmma). ~190 LoC, no CUTLASS dependency. - Dispatch hook in `src/compute/gemm.cu` — FP16×FP16→FP16 GEMMs route to capture kernel when `cudaStreamIsCapturing == Active`. - `gemm_grouped_3x_nvfp4_prewarm()` in `gemm_cutlass_grouped_3x.{cu,h}`. - Enhanced `attention_cublas_prewarm()` — dummy GemmBatchedEx + `ensure_attn_ptr_arrays(256)`. - Engine init calls all three prewarms after `gemm_init()`. - Prefill graph wrapper at `engine.cpp:2236` (post-H2D upload), env-gated `IMP_PREFILL_GRAPH=1`. ## Validation - `make build` → 0 warnings, 0 errors - Multi-chunk coherence smoke (Qwen3-Coder, 25-tok prompt, 80-tok completion): proper Python fibonacci function with docstring - Multi-chunk capture A/B (pp=1024, 3 reps): both Qwen3-Coder and Gemma-4 NVFP4 show positive delta; no IMA, no capture-fail cascade, no fallback to eager - Baseline (default IMP_PREFILL_GRAPH=0): unchanged - sm_120a constraint respected — no Sm80/Sm90/Sm100 arch tags anywhere Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…all-N (#180) Follow-up to PR #179. Cross-model validation (Qwen3-Coder-NVFP4, Qwen3.6-NVFP4, Qwen3-30B-Modelopt, Gemma-4-NVFP4) on warm container revealed the v1 WMMA kernel is significantly slower than cuBLASLt on small-N shapes — Qwen3.6 has a Q/K/V projection at M=512 N=32 K=2048 where the kernel launches only ⌈32/BN⌉=1 block × ⌈512/BM⌉=4 = 4 blocks across 128 SMs (3% SM saturation) AND wastes 75% of MMA cycles on zero-padded B fragments. Same-session A/B (post-warmup, pp=1024 reps=3) with the permissive guard (PR #179): | Model | Baseline | IMP_PREFILL_GRAPH=1 | Δ | |---|---:|---:|---:| | Qwen3-Coder-30B-NVFP4 | 15336 | 14593 | -4.8% | | Qwen3.6-35B-NVFP4 | 10219 | 8614 | -15.7% | | Gemma-4-26B-NVFP4 | 26524 | 31277 | +17.9% | The earlier +27% Qwen3-Coder measurement at PR #179 was on a cold-container session where cuBLASLt was at its slowest; the warm A/B above shows the WMMA kernel is ~5% behind cuBLASLt on Qwen3-Coder shapes, not ahead. Qwen3.6 regresses 15.7% because the kernel runs the N=32 shape with single-block grid + heavy MMA waste. ## Fix Tight dispatch guard (`N >= BN && M >= BM`): rejects shapes where the WMMA kernel is uncompetitive. Caller (`gemm.cu`) falls through to cuBLASLt for declined shapes; under stream capture cuBLASLt fails with status 14, the wrapper aborts capture, and falls back to eager — same as baseline, no regression. Models with all GEMMs ≥ BN (Qwen3-Coder, Modelopt) are unaffected by the guard change. Gemma-4 (SWA, no chunked prefill, wrapper doesn't fire) still gets the +17.9% from PR #179's engine-init prewarms. ## What ships Single-line guard tightening in `gemm_capture_fp16_sm120.cu`. Keeps `IMP_PREFILL_GRAPH=1` opt-in. Default behavior unchanged. Default flip deferred until WMMA kernel achieves cuBLASLt-warm-state parity across all production NVFP4 MoE shapes — multi-day kernel work (cp.async pipelining, BN=32 small-N specialization, possibly larger tile geometry). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kekzl enabled auto-merge (squash) May 15, 2026 00:01

kekzl merged commit 81ee60e into main May 15, 2026
3 checks passed

kekzl deleted the perf/prefill-graphs-prewarm-fix branch May 15, 2026 00:05

kekzl mentioned this pull request May 15, 2026

perf(prefill): tighten WMMA dispatch guard — graceful fallback for small-N #180

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(prefill): MoE prefill CUDA-graph capture — +9-27% pp512 on NVFP4#179

perf(prefill): MoE prefill CUDA-graph capture — +9-27% pp512 on NVFP4#179
kekzl merged 1 commit into
mainfrom
perf/prefill-graphs-prewarm-fix

kekzl commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kekzl commented May 15, 2026

Summary

Measured speedup (pp=1024, 3 reps, --temperature 0)

Root causes addressed

(a) Dense FP16 GEMM via cuBLASLt

(b) gemm_cutlass_grouped_3x static buffer growth

(c) attention_cublas static device pointer buffer

sm_120a constraint

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant