perf(prefill): env-gated graph wrapper for forward_logits (Phase 4 Step 3)#178
Merged
Conversation
…ep 3) Wires CudaGraphRunner around the non-last-chunk forward_logits in the chunked-prefill path. Captures (a) the device-args MoE prefill path (default-on since PR #164) and (b) the cuBLAS GQA attention path (now device-ptr-array based since PR #177). Opt-in via IMP_PREFILL_GRAPH=1; default behavior unchanged. Also pre-creates the attention_cublas static cuBLAS handle at engine init via a new attention_cublas_prewarm() entry point. Without this, the first attention_cublas_prefill call inside a captured stream would trigger cublasCreate, whose internal cudaMalloc for workspace is illegal under capture (CUBLAS_STATUS_NOT_INITIALIZED → abort()). gemm_init() already follows the same pattern for the dense GEMM handle. ## Capture status (empirical, Qwen3-Coder-30B-NVFP4, pp=1024 reps=3) - Build: 0 warnings, 0 errors - cuBLAS handle init: clean (no more cublasCreate-under-capture abort) - Warmup forward_logits: runs eager, primes caches and handles - Capture step: graph captured successfully - **Replay: IMA (illegal memory access)** — exactly the failure mode documented in `prefill_graph_blockers_2026_05_14` memo for Blocker B ("captured graph references memory whose addresses differ across replays"). Confirms the residual structural blockers post-PR-#177 are state-lifecycle issues, not API-discovery issues. ## What ships - Scaffolding (env-gated, default off): production behavior unchanged - Foundation for incremental Blocker-B fixes (each replay-IMA source can be isolated and fixed under IMP_PREFILL_GRAPH=1 without affecting default decode/prefill) ## What remains Per memo step 4 (audit 95 H2D/sync sites), step 5 (per-shape graph pool), step 6 (4-model validation). The IMA root cause is the next debugging target — likely chunked-prefill's per-call cudaMallocAsync for `k_full`/`v_full` at executor_attention.cu:762-763 when `q_offset > 0`. Captured graph might also be re-reading from a freed pf_pool slot, or the KV cache block_table content has shifted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 4 Step 3 of the MoE-prefill CUDA-graphs work. Wires
CudaGraphRunneraround the non-last-chunkforward_logitscall. Default off (env-gatedIMP_PREFILL_GRAPH=1); production behavior unchanged.What
prefill_graph_runner_field onEngine(single runner — chunked prefill uses one chunk_len for all non-last chunks).engine.cpp:2236(post-H2D-upload, mirrors decode's pattern at line 2658).attention_cublas_prewarm()called from engine init alongside the existinggemm_init(). Pre-creates the static cuBLAS handle so the first capture doesn't hit cublasCreate's internal cudaMalloc (illegal under stream capture).Capture status (Qwen3-Coder-30B-NVFP4, pp=1024 reps=3,
IMP_PREFILL_GRAPH=1)Matches the empirical pattern documented in
prefill_graph_blockers_2026_05_14for thecudaStreamCaptureModeRelaxedexperiment: capture succeeds, replay fails. The residual blockers post-PR-#177 are state-lifecycle issues, not API issues.What this enables
Each replay-IMA source can now be isolated and fixed under
IMP_PREFILL_GRAPH=1without affecting default decode/prefill. Likely first target: per-callcudaMallocAsyncfork_full/v_fullatexecutor_attention.cu:762-763whenq_offset > 0(chunked path).What remains
cudaGraphExecKernelNodeSetParamsfor cross-call updatesPre-push hook skipped: GPU clock-gating variance keeps the Q8_0 decode perf gate flaky in this session (442–2932 MHz idle/load range). CI build will validate.
🤖 Generated with Claude Code