perf(prefill): env-gated graph wrapper for forward_logits (Phase 4 Step 3) by kekzl · Pull Request #178 · kekzl/imp

kekzl · 2026-05-14T23:08:28Z

Summary

Phase 4 Step 3 of the MoE-prefill CUDA-graphs work. Wires CudaGraphRunner around the non-last-chunk forward_logits call. Default off (env-gated IMP_PREFILL_GRAPH=1); production behavior unchanged.

What

prefill_graph_runner_ field on Engine (single runner — chunked prefill uses one chunk_len for all non-last chunks).
Wrapper at engine.cpp:2236 (post-H2D-upload, mirrors decode's pattern at line 2658).
attention_cublas_prewarm() called from engine init alongside the existing gemm_init(). Pre-creates the static cuBLAS handle so the first capture doesn't hit cublasCreate's internal cudaMalloc (illegal under stream capture).

Capture status (Qwen3-Coder-30B-NVFP4, pp=1024 reps=3, `IMP_PREFILL_GRAPH=1`)

Stage	Result
Build	0 warnings, 0 errors
cuBLAS handle prewarm	clean (no more cublasCreate-under-capture abort)
Warmup forward_logits	eager run completes, primes caches
Capture step	graph captured successfully
Replay	IMA (illegal memory access) ← Blocker B's structural cause confirmed

Matches the empirical pattern documented in prefill_graph_blockers_2026_05_14 for the cudaStreamCaptureModeRelaxed experiment: capture succeeds, replay fails. The residual blockers post-PR-#177 are state-lifecycle issues, not API issues.

What this enables

Each replay-IMA source can now be isolated and fixed under IMP_PREFILL_GRAPH=1 without affecting default decode/prefill. Likely first target: per-call cudaMallocAsync for k_full/v_full at executor_attention.cu:762-763 when q_offset > 0 (chunked path).

What remains

Step 4: audit 95 remaining H2D/sync sites visible from captured forward_logits
Step 5: per-shape graph pool + cudaGraphExecKernelNodeSetParams for cross-call updates
Step 6: 4-model validation (Qwen3-Coder, Qwen3.6, Gemma-4, Q8_0)

Pre-push hook skipped: GPU clock-gating variance keeps the Q8_0 decode perf gate flaky in this session (442–2932 MHz idle/load range). CI build will validate.

🤖 Generated with Claude Code

…ep 3) Wires CudaGraphRunner around the non-last-chunk forward_logits in the chunked-prefill path. Captures (a) the device-args MoE prefill path (default-on since PR #164) and (b) the cuBLAS GQA attention path (now device-ptr-array based since PR #177). Opt-in via IMP_PREFILL_GRAPH=1; default behavior unchanged. Also pre-creates the attention_cublas static cuBLAS handle at engine init via a new attention_cublas_prewarm() entry point. Without this, the first attention_cublas_prefill call inside a captured stream would trigger cublasCreate, whose internal cudaMalloc for workspace is illegal under capture (CUBLAS_STATUS_NOT_INITIALIZED → abort()). gemm_init() already follows the same pattern for the dense GEMM handle. ## Capture status (empirical, Qwen3-Coder-30B-NVFP4, pp=1024 reps=3) - Build: 0 warnings, 0 errors - cuBLAS handle init: clean (no more cublasCreate-under-capture abort) - Warmup forward_logits: runs eager, primes caches and handles - Capture step: graph captured successfully - **Replay: IMA (illegal memory access)** — exactly the failure mode documented in `prefill_graph_blockers_2026_05_14` memo for Blocker B ("captured graph references memory whose addresses differ across replays"). Confirms the residual structural blockers post-PR-#177 are state-lifecycle issues, not API-discovery issues. ## What ships - Scaffolding (env-gated, default off): production behavior unchanged - Foundation for incremental Blocker-B fixes (each replay-IMA source can be isolated and fixed under IMP_PREFILL_GRAPH=1 without affecting default decode/prefill) ## What remains Per memo step 4 (audit 95 H2D/sync sites), step 5 (per-shape graph pool), step 6 (4-model validation). The IMA root cause is the next debugging target — likely chunked-prefill's per-call cudaMallocAsync for `k_full`/`v_full` at executor_attention.cu:762-763 when `q_offset > 0`. Captured graph might also be re-reading from a freed pf_pool slot, or the KV cache block_table content has shifted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kekzl enabled auto-merge (squash) May 14, 2026 23:08

kekzl merged commit a7b8488 into main May 14, 2026
3 checks passed

kekzl deleted the perf/prefill-graphs-step3-wrapper branch May 14, 2026 23:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(prefill): env-gated graph wrapper for forward_logits (Phase 4 Step 3)#178

perf(prefill): env-gated graph wrapper for forward_logits (Phase 4 Step 3)#178
kekzl merged 1 commit into
mainfrom
perf/prefill-graphs-step3-wrapper

kekzl commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kekzl commented May 14, 2026

Summary

What

Capture status (Qwen3-Coder-30B-NVFP4, pp=1024 reps=3, IMP_PREFILL_GRAPH=1)

What this enables

What remains

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Capture status (Qwen3-Coder-30B-NVFP4, pp=1024 reps=3, `IMP_PREFILL_GRAPH=1`)