Revert "feat(deepseek/v4): gather-free split-half RoPE for decode + prefill"#575
Conversation
…refill (…" This reverts commit cdb64e0.
📝 WalkthroughWalkthroughAcross the DeepSeek-V4 decode and prefill attention stack, split-half NeoX RoPE frequency tables ( ChangesDeepSeek-V4 RoPE: split-half → interleaved frequency tables
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the Rotary Position Embedding (RoPE) implementation across multiple prefill and decode attention/compressor modules, transitioning from a split-half (NeoX) layout to an interleaved layout (CANN A3 rotate_interleaved). This change eliminates the need for half-width RoPE tables and simplifies the kernel signatures by directly utilizing the full-width freqs_cos and freqs_sin tensors. However, in prefill_sparse_attn.py, the undefined constant ROPE_HALF is used instead of HALF_ROPE in several places, which will cause runtime NameErrors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
models/deepseek/v4/decode_attention_hca.py (1)
152-155:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDerive HCA compressor RoPE halves from even lanes.
cmp_cos/cmp_sinare half-width compressor inputs. With full interleavedfreqs_*, taking0 : ROPE_HEAD_DIM // 2feeds duplicated low-pair frequencies; gather even lanes from the full row instead, and update the golden reference to use0::2.Golden-side fix
- cmp_cos[b] = freqs_cos[cmp_pos_b, :half_rd].float() - cmp_sin[b] = freqs_sin[cmp_pos_b, :half_rd].float() + cmp_cos[b] = freqs_cos[cmp_pos_b, 0::2].float() + cmp_sin[b] = freqs_sin[cmp_pos_b, 0::2].float()Also applies to: 423-424
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/decode_attention_hca.py` around lines 152 - 155, The slicing in the HCA compressor RoPE assignments is using 0 : ROPE_HEAD_DIM // 2 which extracts the first half of duplicated low-pair frequencies from the full interleaved freqs_cos and freqs_sin arrays. Replace the slice syntax 0 : ROPE_HEAD_DIM // 2 with 0::2 in the extraction of cmp_cos_row and cmp_sin_row to gather only the even lanes from the full row instead. Apply the same fix to the corresponding assignments in both the current location and the other occurrence mentioned at lines 423-424 to ensure consistent even-lane extraction across all HCA compressor RoPE operations.models/deepseek/v4/decode_attention_csa.py (1)
202-214:⚠️ Potential issue | 🟠 Major | ⚡ Quick winExtract compressor/indexer RoPE halves from even interleaved lanes.
step_cos/step_sinandcmp_cos/cmp_sinare half-width inputs, but slicingfreqs_*[:, :HALF_ROPE]from a full interleaved row yields[c0, c0, c1, c1, ...]instead of[c0, c1, ...]. Gather lanes0, 2, 4, ...in the PyPTO path and mirror that with0::2in the golden path.Golden-side fix to mirror the expected half layout
- step_cos = freqs_cos[first_pos, :HALF_ROPE].float().contiguous() - step_sin = freqs_sin[first_pos, :HALF_ROPE].float().contiguous() + step_cos = freqs_cos[first_pos, 0::2].float().contiguous() + step_sin = freqs_sin[first_pos, 0::2].float().contiguous() cmp_pos = first_pos + (COMPRESS_RATIO - (first_pos % COMPRESS_RATIO)) - COMPRESS_RATIO - cmp_cos = freqs_cos[cmp_pos, :HALF_ROPE].float().contiguous() - cmp_sin = freqs_sin[cmp_pos, :HALF_ROPE].float().contiguous() + cmp_cos = freqs_cos[cmp_pos, 0::2].float().contiguous() + cmp_sin = freqs_sin[cmp_pos, 0::2].float().contiguous()Also applies to: 521-525
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/decode_attention_csa.py` around lines 202 - 214, The slicing of freqs_cos and freqs_sin in the step_cos/step_sin and cmp_cos/cmp_sin tensor assembly blocks produces interleaved duplicated values [c0, c0, c1, c1, ...] instead of the required half-width format [c0, c1, ...]. Extract only the even-indexed lanes (0, 2, 4, ...) from the sliced results for both the step rope computation block (around the pl.slice calls for step_cos and step_sin) and the compress rope computation block (around the pl.slice calls for cmp_cos and cmp_sin). This ensures the tensors contain the correct unique RoPE values at the required positions without duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@models/deepseek/v4/decode_sparse_attn_hca.py`:
- Around line 314-318: The RoPE indexing in the gather operations for cs_cos and
cs_sin is not maintaining the full interleaved layout contract which should be
[c0, c0, c1, c1, ...] but instead produces [c0, c1, ..., c0, c1, ...]. Fix the
cs_dup_idx construction or adjust how the gather operations use it to ensure the
output follows the proper full interleaved RoPE layout with consecutive pairs of
identical indices. Apply the same fix to the similar gather operations also
appearing in the code (the additional locations mentioned in the comment).
In `@models/deepseek/v4/decode_sparse_attn.py`:
- Around line 346-350: The issue is that the slicing of freqs_cos and freqs_sin
in the code around lines 346-350 (variables cs_cos and cs_sin) is only
extracting the contiguous first half and then duplicating it, but these
frequency tables are now full interleaved tables. Instead of slicing just cp_r0
: cp_r0 + ROPE_TILE and duplicating through cs_dup_idx, you need to slice the
full interleaved tile from the frequency tables and use even lanes to access the
correct interleaved positions. Apply the same fix to the duplicate code section
mentioned at lines 626-630.
---
Outside diff comments:
In `@models/deepseek/v4/decode_attention_csa.py`:
- Around line 202-214: The slicing of freqs_cos and freqs_sin in the
step_cos/step_sin and cmp_cos/cmp_sin tensor assembly blocks produces
interleaved duplicated values [c0, c0, c1, c1, ...] instead of the required
half-width format [c0, c1, ...]. Extract only the even-indexed lanes (0, 2, 4,
...) from the sliced results for both the step rope computation block (around
the pl.slice calls for step_cos and step_sin) and the compress rope computation
block (around the pl.slice calls for cmp_cos and cmp_sin). This ensures the
tensors contain the correct unique RoPE values at the required positions without
duplication.
In `@models/deepseek/v4/decode_attention_hca.py`:
- Around line 152-155: The slicing in the HCA compressor RoPE assignments is
using 0 : ROPE_HEAD_DIM // 2 which extracts the first half of duplicated
low-pair frequencies from the full interleaved freqs_cos and freqs_sin arrays.
Replace the slice syntax 0 : ROPE_HEAD_DIM // 2 with 0::2 in the extraction of
cmp_cos_row and cmp_sin_row to gather only the even lanes from the full row
instead. Apply the same fix to the corresponding assignments in both the current
location and the other occurrence mentioned at lines 423-424 to ensure
consistent even-lane extraction across all HCA compressor RoPE operations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ce2a7cda-b605-4343-8380-ec0039b51b4a
📒 Files selected for processing (16)
models/deepseek/v4/decode_attention_csa.pymodels/deepseek/v4/decode_attention_hca.pymodels/deepseek/v4/decode_attention_swa.pymodels/deepseek/v4/decode_compressor_ratio128.pymodels/deepseek/v4/decode_compressor_ratio4.pymodels/deepseek/v4/decode_layer.pymodels/deepseek/v4/decode_sparse_attn.pymodels/deepseek/v4/decode_sparse_attn_hca.pymodels/deepseek/v4/decode_sparse_attn_swa.pymodels/deepseek/v4/prefill_attention_csa.pymodels/deepseek/v4/prefill_attention_hca.pymodels/deepseek/v4/prefill_attention_swa.pymodels/deepseek/v4/prefill_compressor_ratio128.pymodels/deepseek/v4/prefill_compressor_ratio4.pymodels/deepseek/v4/prefill_sparse_attn.pymodels/deepseek/v4/qkv_proj_rope.py
💤 Files with no reviewable changes (1)
- models/deepseek/v4/decode_layer.py
| cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32) | ||
| cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32) | ||
| rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx) | ||
| rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul( | ||
| pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign) |
There was a problem hiding this comment.
Keep the HCA sparse path on the full interleaved RoPE layout.
The kernel and golden reference still consume a contiguous half-table, and the fixture builds [c0, c1, ..., c0, c1, ...]. That masks the mismatch with the full interleaved contract [c0, c0, c1, c1, ...] and can rotate production inputs with the wrong frequencies.
Proposed layout fix
- cs_dup_idx = pl.cast(cs_dup_f, target_type=pl.INT32) # j>>1
cs_lane = pl.sub(cs_col, pl.mul(cs_dup_f, 2.0)) # j%2
cs_sign = pl.neg(pl.sub(pl.mul(cs_lane, 2.0), 1.0)) # [+1,-1,...] (conjugate)
- cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
- cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
- rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+ cs_cos = pl.cast(freqs_cos[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+ cs_sin = pl.cast(freqs_sin[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+ rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = cs_cos
rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
- pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)
+ cs_sin, cs_sign)- cos_half = cos[:, :HALF_ROPE].unsqueeze(1)
- sin_half = sin[:, :HALF_ROPE].unsqueeze(1)
+ cos_half = cos[:, 0::2].unsqueeze(1)
+ sin_half = sin[:, 0::2].unsqueeze(1)- return torch.cat([cos_half, cos_half], dim=-1)
+ return torch.repeat_interleave(cos_half, repeats=2, dim=-1)- return torch.cat([sin_half, sin_half], dim=-1)
+ return torch.repeat_interleave(sin_half, repeats=2, dim=-1)Also applies to: 594-598, 701-711
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@models/deepseek/v4/decode_sparse_attn_hca.py` around lines 314 - 318, The
RoPE indexing in the gather operations for cs_cos and cs_sin is not maintaining
the full interleaved layout contract which should be [c0, c0, c1, c1, ...] but
instead produces [c0, c1, ..., c0, c1, ...]. Fix the cs_dup_idx construction or
adjust how the gather operations use it to ensure the output follows the proper
full interleaved RoPE layout with consecutive pairs of identical indices. Apply
the same fix to the similar gather operations also appearing in the code (the
additional locations mentioned in the comment).
| cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32) | ||
| cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32) | ||
| rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx) | ||
| rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul( | ||
| pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign) |
There was a problem hiding this comment.
Consume the full interleaved RoPE lanes instead of re-expanding the first half.
freqs_cos/freqs_sin are now full interleaved tables, but these paths still slice the contiguous first half and duplicate it. For an interleaved row like [c0, c0, c1, c1, ...], that shifts the frequency mapping after the first pair. Slice the full interleaved tile in the kernel and use even lanes in the golden reference.
Proposed layout fix
- cs_dup_idx = pl.cast(cs_dup_f, target_type=pl.INT32) # j>>1
cs_lane = pl.sub(cs_col, pl.mul(cs_dup_f, 2.0)) # j%2
cs_sign = pl.neg(pl.sub(pl.mul(cs_lane, 2.0), 1.0)) # [+1,-1,...] (conjugate)
- cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
- cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
- rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+ cs_cos = pl.cast(freqs_cos[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+ cs_sin = pl.cast(freqs_sin[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+ rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = cs_cos
rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
- pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)
+ cs_sin, cs_sign)- cos_half = cos[:, :HALF_ROPE].unsqueeze(1)
- sin_half = sin[:, :HALF_ROPE].unsqueeze(1)
+ cos_half = cos[:, 0::2].unsqueeze(1)
+ sin_half = sin[:, 0::2].unsqueeze(1)Also applies to: 626-630
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@models/deepseek/v4/decode_sparse_attn.py` around lines 346 - 350, The issue
is that the slicing of freqs_cos and freqs_sin in the code around lines 346-350
(variables cs_cos and cs_sin) is only extracting the contiguous first half and
then duplicating it, but these frequency tables are now full interleaved tables.
Instead of slicing just cp_r0 : cp_r0 + ROPE_TILE and duplicating through
cs_dup_idx, you need to slice the full interleaved tile from the frequency
tables and use even lanes to access the correct interleaved positions. Apply the
same fix to the duplicate code section mentioned at lines 626-630.
#578) ## Summary - Retile the DeepSeek-V4 `qkv_proj_rope` projection matmuls to the 512B L2 cache line and fuse RMSNorm with RoPE. **Decode end-to-end −56%** (a2a3 L2 swimlane, 5-rep median: 936µs → 407µs); golden green on decode and prefill. - `qr_proj` / `kv_proj`: split-K (zero-seed + atomic-add) with N-tile 32 → 256, so each `wq_a`/`wkv` row-read fills a full 512B cache line instead of a 64B sub-line (was 8× weight over-fetch). Kernel occupancy −84% / −75%. - `qproj_matmul`: decouple the matmul N-tile from the dequant N-tile and bump matmul `TN` 128 → 256 (256B/row), capped by the L0C `Acc` limit (`TM*TN*4 ≤ 128KB`). `TN=512` needs an M-split (`TM=64`) and measured no faster end-to-end on device. - Fuse per-head RMSNorm + NOPE + RoPE into `q_head_rms_nope_rope`, and KV RMSNorm + RoPE into `kv_rms_norm_rope`: `inv_rms` stays in registers (no GM round-trip via the old `q_head_inv_rms_all` / `kv_inv_rms_tensor`), collapsing each pair of dispatches into one. RoPE keeps the interleaved (CANN A3) swap-gather layout. ## Related Issues - The RMSNorm+RoPE fusion re-introduces fused rope on top of the **interleaved** layout restored by #575 (the revert of #570); it does not bring back the split-half layout. The matmul retiling is independent of the rope layout.
Reverts #570