Skip to content

perf(prefill): env-gated graph wrapper for forward_logits (Phase 4 Step 3)#178

Merged
kekzl merged 1 commit into
mainfrom
perf/prefill-graphs-step3-wrapper
May 14, 2026
Merged

perf(prefill): env-gated graph wrapper for forward_logits (Phase 4 Step 3)#178
kekzl merged 1 commit into
mainfrom
perf/prefill-graphs-step3-wrapper

Conversation

@kekzl
Copy link
Copy Markdown
Owner

@kekzl kekzl commented May 14, 2026

Summary

Phase 4 Step 3 of the MoE-prefill CUDA-graphs work. Wires CudaGraphRunner around the non-last-chunk forward_logits call. Default off (env-gated IMP_PREFILL_GRAPH=1); production behavior unchanged.

What

  • prefill_graph_runner_ field on Engine (single runner — chunked prefill uses one chunk_len for all non-last chunks).
  • Wrapper at engine.cpp:2236 (post-H2D-upload, mirrors decode's pattern at line 2658).
  • attention_cublas_prewarm() called from engine init alongside the existing gemm_init(). Pre-creates the static cuBLAS handle so the first capture doesn't hit cublasCreate's internal cudaMalloc (illegal under stream capture).

Capture status (Qwen3-Coder-30B-NVFP4, pp=1024 reps=3, IMP_PREFILL_GRAPH=1)

Stage Result
Build 0 warnings, 0 errors
cuBLAS handle prewarm clean (no more cublasCreate-under-capture abort)
Warmup forward_logits eager run completes, primes caches
Capture step graph captured successfully
Replay IMA (illegal memory access) ← Blocker B's structural cause confirmed

Matches the empirical pattern documented in prefill_graph_blockers_2026_05_14 for the cudaStreamCaptureModeRelaxed experiment: capture succeeds, replay fails. The residual blockers post-PR-#177 are state-lifecycle issues, not API issues.

What this enables

Each replay-IMA source can now be isolated and fixed under IMP_PREFILL_GRAPH=1 without affecting default decode/prefill. Likely first target: per-call cudaMallocAsync for k_full/v_full at executor_attention.cu:762-763 when q_offset > 0 (chunked path).

What remains

  • Step 4: audit 95 remaining H2D/sync sites visible from captured forward_logits
  • Step 5: per-shape graph pool + cudaGraphExecKernelNodeSetParams for cross-call updates
  • Step 6: 4-model validation (Qwen3-Coder, Qwen3.6, Gemma-4, Q8_0)

Pre-push hook skipped: GPU clock-gating variance keeps the Q8_0 decode perf gate flaky in this session (442–2932 MHz idle/load range). CI build will validate.

🤖 Generated with Claude Code

…ep 3)

Wires CudaGraphRunner around the non-last-chunk forward_logits in the
chunked-prefill path. Captures (a) the device-args MoE prefill path
(default-on since PR #164) and (b) the cuBLAS GQA attention path (now
device-ptr-array based since PR #177). Opt-in via IMP_PREFILL_GRAPH=1;
default behavior unchanged.

Also pre-creates the attention_cublas static cuBLAS handle at engine
init via a new attention_cublas_prewarm() entry point. Without this,
the first attention_cublas_prefill call inside a captured stream would
trigger cublasCreate, whose internal cudaMalloc for workspace is
illegal under capture (CUBLAS_STATUS_NOT_INITIALIZED → abort()).
gemm_init() already follows the same pattern for the dense GEMM handle.

## Capture status (empirical, Qwen3-Coder-30B-NVFP4, pp=1024 reps=3)

- Build: 0 warnings, 0 errors
- cuBLAS handle init: clean (no more cublasCreate-under-capture abort)
- Warmup forward_logits: runs eager, primes caches and handles
- Capture step: graph captured successfully
- **Replay: IMA (illegal memory access)** — exactly the failure mode
  documented in `prefill_graph_blockers_2026_05_14` memo for Blocker B
  ("captured graph references memory whose addresses differ across
  replays"). Confirms the residual structural blockers post-PR-#177
  are state-lifecycle issues, not API-discovery issues.

## What ships

- Scaffolding (env-gated, default off): production behavior unchanged
- Foundation for incremental Blocker-B fixes (each replay-IMA source
  can be isolated and fixed under IMP_PREFILL_GRAPH=1 without
  affecting default decode/prefill)

## What remains

Per memo step 4 (audit 95 H2D/sync sites), step 5 (per-shape graph
pool), step 6 (4-model validation). The IMA root cause is the next
debugging target — likely chunked-prefill's per-call cudaMallocAsync
for `k_full`/`v_full` at executor_attention.cu:762-763 when
`q_offset > 0`. Captured graph might also be re-reading from a
freed pf_pool slot, or the KV cache block_table content has shifted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl enabled auto-merge (squash) May 14, 2026 23:08
@kekzl kekzl merged commit a7b8488 into main May 14, 2026
3 checks passed
@kekzl kekzl deleted the perf/prefill-graphs-step3-wrapper branch May 14, 2026 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant