perf(deepseek/v4): qkv_proj_rope tiling + fused rms+rope (decode -56%) #578
## What
Cut DeepSeek-V4 qkv_proj_rope decode latency ~56% (a2a3 L2 swimlane, 5-rep
median: 936us -> 407us) by retiling the projection matmuls to the L2 cache
line and fusing per-head/KV RMSNorm with RoPE. Golden green (decode + prefill).
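
As a rough illustration of what fusing RMSNorm with RoPE means, here is a minimal pure-Python sketch. It is not the kernel code: the function names, the `rope_dim` parameter, the trailing position of the RoPE slice, the adjacent-pair reading of "interleaved", and the omission of a learned RMSNorm weight are all assumptions made for the example. The unfused version materialises the normalised vector before rotating it; the fused version keeps `inv_rms` as a local scalar and applies it while rotating.

```python
import math

def rms_inv(x, eps):
    """Reciprocal RMS of one head vector."""
    return 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)

def rotate_pair(a, b, c, s):
    """Rotate the pair (a, b) by the angle whose cosine/sine are c/s."""
    return a * c - b * s, a * s + b * c

def rms_then_rope_unfused(x, cos, sin, rope_dim, eps=1e-6):
    """Two passes: store the normalised vector, then rotate its RoPE slice."""
    inv_rms = rms_inv(x, eps)
    normed = [v * inv_rms for v in x]   # stand-in for the intermediate buffer
    nope = len(x) - rope_dim            # assumption: RoPE dims are trailing
    out = normed[:nope]
    for j in range(rope_dim // 2):
        i = nope + 2 * j                # assumption: pairs are adjacent lanes
        out.extend(rotate_pair(normed[i], normed[i + 1], cos[j], sin[j]))
    return out

def rms_nope_rope_fused(x, cos, sin, rope_dim, eps=1e-6):
    """One pass: inv_rms stays a local scalar and is applied on the fly."""
    inv_rms = rms_inv(x, eps)
    nope = len(x) - rope_dim
    out = [v * inv_rms for v in x[:nope]]          # NOPE part: scale only
    for j in range(rope_dim // 2):
        i = nope + 2 * j
        a, b = x[i] * inv_rms, x[i + 1] * inv_rms  # scale, then rotate
        out.extend(rotate_pair(a, b, cos[j], sin[j]))
    return out
```

Both functions perform the same arithmetic in the same order, so they agree on any input; the fused form simply never writes the intermediate normalised vector.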
## Changes
- qr_proj / kv_proj: split-K (zero-seed + atomic-add) with N-tile 32 -> 256, so
each wq_a/wkv row-read fills a 512B L2 line instead of a 64B sub-line (was 8x
weight over-fetch). qr_proj / kv_proj occupancy -84% / -75%.
- qproj_matmul: decouple the matmul N-tile from the dequant N-tile; bump matmul
TN 128 -> 256 (256B/row), capped by the L0C Acc limit (TM*TN*4 <= 128KB). TN=512
needs M-split (TM=64) and measured no better end-to-end on device.
- Fuse per-head RMSNorm + NOPE + RoPE into q_head_rms_nope_rope, and KV RMSNorm +
RoPE into kv_rms_norm_rope: inv_rms stays in registers (no GM round-trip via the
old q_head_inv_rms_all / kv_inv_rms_tensor) and each pair collapses to one
dispatch. RoPE keeps the interleaved (CANN A3) swap-gather layout.
- decode_indexer{,_compressor}: point the rope-pattern cross-reference comments at
the fused kernel names.
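
The tile-size figures in the list above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model, not part of the kernels: the helper names are hypothetical, and the element widths (2 bytes for the wq_a/wkv weights, 1 byte for the qproj_matmul weights) are assumptions inferred from the quoted 64B, 512B and 256B/row figures rather than stated in the PR.

```python
L2_LINE_BYTES = 512          # L2 cache line size quoted in the PR
L0C_ACC_BYTES = 128 * 1024   # L0C accumulator limit quoted in the PR
ACC_ELEM_BYTES = 4           # the "*4" in TM*TN*4

def row_read_bytes(tile_n, elem_bytes):
    """Contiguous bytes consumed per weight row for an N-tile of tile_n."""
    return tile_n * elem_bytes

def over_fetch(tile_n, elem_bytes, line_bytes=L2_LINE_BYTES):
    """Bytes fetched (whole L2 lines) divided by bytes actually consumed."""
    used = row_read_bytes(tile_n, elem_bytes)
    lines = -(-used // line_bytes)   # ceiling division
    return lines * line_bytes / used

def max_tm(tile_n):
    """Largest TM that satisfies TM * TN * 4 <= 128KB."""
    return L0C_ACC_BYTES // (tile_n * ACC_ELEM_BYTES)

# Assumed 2-byte weights: N-tile 32 uses 64B of each 512B line.
print(row_read_bytes(32, 2), over_fetch(32, 2))    # 64 8.0
print(row_read_bytes(256, 2), over_fetch(256, 2))  # 512 1.0
# Assumed 1-byte weights for qproj_matmul: TN=256 is 256B/row.
print(row_read_bytes(256, 1))                      # 256
print(max_tm(256), max_tm(512))                    # 128 64
```

Under these assumptions the model reproduces the 8x over-fetch at N-tile 32 and shows why TN=512 only fits the accumulator once TM drops to 64.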
Code Review
This pull request optimizes the DeepSeek v4 decode kernels by decoupling projection matmul tile sizes, implementing split-K parallelization for qr_proj and kv_proj, and fusing RMSNorm, NOPE writeback, and interleaved RoPE rotations to eliminate global memory round-trips. The review feedback suggests peeling the first iteration out of the pipelined loops in both qr_proj_matmul and kv_proj_matmul to avoid conditional checks inside pl.pipeline, which can hinder software pipelining or cause compilation issues on CANN/Ascend hardware.
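
To make the peeling suggestion concrete, here is a minimal pure-Python sketch of the transformation. It does not model `pl.pipeline` or the real kernels: the function names are hypothetical, and the assumption that the in-loop conditional is a first-iteration "assign vs. accumulate" check is a guess at what the review refers to.

```python
def partial_product(a, b, k0, k1):
    """Product of a (M x K) and b (K x N) restricted to the K-slice [k0, k1)."""
    return [[sum(a[m][k] * b[k][n] for k in range(k0, k1))
             for n in range(len(b[0]))]
            for m in range(len(a))]

def add_into(acc, p):
    """Element-wise acc + p, returned as a new matrix."""
    return [[x + y for x, y in zip(ra, rp)] for ra, rp in zip(acc, p)]

def matmul_branchy(a, b, tk):
    """Every iteration pays for a first-iteration check."""
    acc = None
    for k0 in range(0, len(b), tk):
        p = partial_product(a, b, k0, min(k0 + tk, len(b)))
        if k0 == 0:
            acc = p                   # first tile: initialise
        else:
            acc = add_into(acc, p)    # later tiles: accumulate
    return acc

def matmul_peeled(a, b, tk):
    """First iteration peeled out; the loop body is branch-free and uniform."""
    acc = partial_product(a, b, 0, min(tk, len(b)))
    for k0 in range(tk, len(b), tk):
        acc = add_into(acc, partial_product(a, b, k0, min(k0 + tk, len(b))))
    return acc
```

Both variants compute the same result; the peeled form just gives the loop a uniform body, which is the property the review says helps software pipelining.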
## Summary
- Retile the qkv_proj_rope projection matmuls to the 512B L2 cache line and fuse RMSNorm with RoPE. Decode end-to-end −56% (a2a3 L2 swimlane, 5-rep median: 936µs → 407µs); golden green on decode and prefill.
- qr_proj / kv_proj: split-K (zero-seed + atomic-add) with N-tile 32 → 256, so each wq_a / wkv row-read fills a full 512B cache line instead of a 64B sub-line (was 8× weight over-fetch). Kernel occupancy −84% / −75%.
- qproj_matmul: decouple the matmul N-tile from the dequant N-tile and bump matmul TN 128 → 256 (256B/row), capped by the L0C Acc limit (TM*TN*4 ≤ 128KB). TN=512 needs an M-split (TM=64) and measured no faster end-to-end on device.
- Fuse per-head RMSNorm + NOPE + RoPE into q_head_rms_nope_rope, and KV RMSNorm + RoPE into kv_rms_norm_rope: inv_rms stays in registers (no GM round-trip via the old q_head_inv_rms_all / kv_inv_rms_tensor), collapsing each pair of dispatches into one. RoPE keeps the interleaved (CANN A3) swap-gather layout.

## Related Issues