Revert "feat(deepseek/v4): gather-free split-half RoPE for decode + prefill" #575

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from lwDavid:revert-570-research/pr564-splithalf-rope
Jun 22, 2026

Conversation

@lwDavid lwDavid commented Jun 22, 2026

Reverts #570

@lwDavid lwDavid requested a review from zhangqi-chen June 22, 2026 12:10
@lwDavid lwDavid self-assigned this Jun 22, 2026
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

📝 Walkthrough

Across the DeepSeek-V4 decode and prefill attention stack, the split-half NeoX RoPE frequency tables (rope_cos_half/rope_sin_half, FP32, shape [T, HALF_ROPE]) are replaced with full interleaved frequency tables (freqs_cos/freqs_sin, BF16, shape [T, ROPE_DIM]). The rotation math in the sparse attention kernels, compressor kernels, and QKV projection is rewritten from contiguous low/high half-vector operations to an A3-style swap-gather, in which lane j reads its pair partner at index j^1 (j XOR 1) and a per-lane sign mask is applied to the sine term. The attention harnesses drop half-table construction and pass the full tables through. Golden references and test specs are updated throughout.
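
As a rough illustration of the swap-gather rotation described above, the following plain-Python sketch rotates adjacent lane pairs and checks the result against a complex-multiplication reference. The function names, the forward sign convention, and the sample values are assumptions made for the example; this is not the PyPTO kernel code.

```python
import cmath
import math


def rope_interleaved_swap_gather(x, cos_full, sin_full):
    """Rotate adjacent pairs (x[2i], x[2i+1]) with a j^1 swap-gather.

    cos_full / sin_full are full-width rows laid out [c0, c0, c1, c1, ...].
    """
    out = []
    for j in range(len(x)):
        partner = x[j ^ 1]                    # lane j reads its pair partner
        sign = -1.0 if j % 2 == 0 else 1.0    # even lanes subtract, odd lanes add
        out.append(x[j] * cos_full[j] + sign * partner * sin_full[j])
    return out


def rope_pairs_complex(x, angles):
    """Reference: treat each pair as a complex number and multiply by e^{i*theta}."""
    out = []
    for i, theta in enumerate(angles):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out


angles = [0.1, 0.7, 1.3, 2.0]                                # illustrative angles
cos_full = [math.cos(t) for t in angles for _ in range(2)]   # [c0, c0, c1, c1, ...]
sin_full = [math.sin(t) for t in angles for _ in range(2)]
x = [1.0, 2.0, -3.0, 0.5, 4.0, -1.0, 0.25, 8.0]

got = rope_interleaved_swap_gather(x, cos_full, sin_full)
ref = rope_pairs_complex(x, angles)
max_err = max(abs(a - b) for a, b in zip(got, ref))
```

The point of the formulation is that every lane performs the same multiply-add on full-width operands, so no low/high split of the vector is needed; the cost is the j^1 gather.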

Changes

DeepSeek-V4 RoPE: split-half → interleaved frequency tables

| Layer / File(s) | Summary |
|---|---|
| Sparse attention kernel signatures + interleaved inverse-RoPE: models/deepseek/v4/decode_sparse_attn.py, models/deepseek/v4/decode_sparse_attn_hca.py, models/deepseek/v4/decode_sparse_attn_swa.py, models/deepseek/v4/prefill_sparse_attn.py | Adds ROPE_TILE/ROPE_INTERLEAVE_TILE constants and get_standalone_cmp_valid helper. Changes sparse_attn, sparse_attn_hca, sparse_attn_swa, and prefill_sparse_attn signatures from rope_cos_half/rope_sin_half (FP32, half-width) to freqs_cos/freqs_sin (BF16, full-width). Rewrites in-kernel inverse-RoPE to precompute head-invariant interleaved cos/signed-sin buffers and apply j^1 swap-gather rotation into o_packed. Updates test wrappers, golden references, and build_tensor_specs to use rope_tables helpers. |
| Compressor NeoX→interleaved RoPE rewrite: models/deepseek/v4/decode_compressor_ratio4.py, models/deepseek/v4/decode_compressor_ratio128.py, models/deepseek/v4/prefill_compressor_ratio4.py, models/deepseek/v4/prefill_compressor_ratio128.py | Rewrites the NOPE_HEAD_DIM:HEAD_DIM RoPE rotation in all four compressor kernels: replaces gamma_lo/gamma_hi split-half NeoX rotation with even/odd lane gather (P0101/P1010), j^1 swap-gather, sign-adjusted rotation, and scatter back into rope_buf. Golden references change from torch.cat(rot_lo, rot_hi) to unflatten-into-pairs + stack + flatten. |
| QKV projection Q and KV interleaved RoPE rewrite: models/deepseek/v4/qkv_proj_rope.py | Replaces split-half Q RoPE (q_lo/q_hi, two-region writeback) and split-half KV RoPE (gamma_lo/gamma_hi) with a single-pass interleaved swap-gather rotation for both paths. Builds per-lane j^1 and sign mask in-kernel, gathers duplicated cos/sin per lane, folds gamma before swapping, and writes the full ROPE_DIM slice. Updates golden apply_rope from cat-after-rotation to unflatten-stack pair rotation. |
| Attention harness half-table removal and call-site wiring: models/deepseek/v4/decode_attention_csa.py, models/deepseek/v4/decode_attention_hca.py, models/deepseek/v4/decode_attention_swa.py, models/deepseek/v4/prefill_attention_csa.py, models/deepseek/v4/prefill_attention_hca.py, models/deepseek/v4/prefill_attention_swa.py, models/deepseek/v4/decode_layer.py | Removes allocation and per-token assembly of rope_cos_half_t/rope_sin_half_t in all six decode/prefill CSA, HCA, and SWA attention harnesses. Sparse attention call sites updated to pass full rope_cos_t/rope_sin_t tables directly as freqs_cos/freqs_sin. Golden references remove FP32 half-table construction and update golden_sparse_attn invocations accordingly. Removes the NeoX split-half documentation comment from decode_layer.py. |
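
The two golden-reference styles named in the summary (cat of rotated halves versus pair-wise rotation) can be compared with a small plain-Python sketch. It is only an illustration under assumed names and sample values, using lists in place of torch tensors; it shows that the two layouts compute the same rotation once the lanes are reordered.

```python
import math


def rope_split_half(x, cos_half, sin_half):
    """NeoX style: lane i pairs with lane i + half; result is cat(rot_lo, rot_hi)."""
    half = len(x) // 2
    lo, hi = x[:half], x[half:]
    rot_lo = [lo[i] * cos_half[i] - hi[i] * sin_half[i] for i in range(half)]
    rot_hi = [hi[i] * cos_half[i] + lo[i] * sin_half[i] for i in range(half)]
    return rot_lo + rot_hi


def rope_interleaved(x, cos_half, sin_half):
    """Interleaved style: split into pairs, rotate each pair, flatten back."""
    pairs = [(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)]
    rotated = [(a * c - b * s, b * c + a * s)
               for (a, b), c, s in zip(pairs, cos_half, sin_half)]
    return [v for pair in rotated for v in pair]


def interleaved_to_split(v):
    """Reorder [a0, b0, a1, b1, ...] into [a0, a1, ..., b0, b1, ...]."""
    return v[0::2] + v[1::2]


angles = [0.2, 0.9, 1.7]                       # illustrative, one angle per pair
cos_half = [math.cos(t) for t in angles]
sin_half = [math.sin(t) for t in angles]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

via_interleaved = interleaved_to_split(rope_interleaved(x, cos_half, sin_half))
via_split_half = rope_split_half(interleaved_to_split(x), cos_half, sin_half)
layouts_agree = all(abs(a - b) < 1e-12
                    for a, b in zip(via_interleaved, via_split_half))
```

Because the layouts differ only in lane order, a kernel, its frequency tables, and its golden reference have to agree on one of them; mixing a table in one layout with a rotation in the other silently pairs lanes with the wrong frequencies.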

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#480: Updates the same decode compressor ratio4/ratio128 paths to use A3 interleaved swap-gather RoPE, directly overlapping with the compressor kernel rewrites in this PR.
  • hw-native-sys/pypto-lib#525: Modifies decode_sparse_attn.py's inverse-RoPE pipeline to use freqs_cos/freqs_sin with interleaved precompute (rope_cos_il/signed sine), which is the same kernel-level change this PR finalizes.
  • hw-native-sys/pypto-lib#538: Switches build_tensor_specs initializers across the same test harnesses to use rope_tables.build_deepseek_v4_rope_tables for freqs_cos/freqs_sin, directly matching this PR's test-spec wiring changes.

Poem

🐇 Hop hop, the half-tables are gone today,
Full freqs_cos and freqs_sin lead the way!
No more split lo and hi in a NeoX pair,
j^1 swap-gather rotates through the air.
The bunny approves—interleaved is chic,
BF16 all the way, what a sleek technique!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly identifies that this is a revert of a previous gather-free split-half RoPE feature for DeepSeek v4, matching the actual changeset that removes this optimization and restores the prior RoPE table approach. |
| Description check | ✅ Passed | The PR description correctly identifies that this reverts PR #570, which is directly related to the changeset that updates RoPE handling across multiple DeepSeek v4 files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@lwDavid lwDavid added the enhancement label Jun 22, 2026
@lwDavid lwDavid moved this to Done in pto project Jun 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Rotary Position Embedding (RoPE) implementation across multiple prefill and decode attention/compressor modules, moving from a split-half (NeoX) layout to an interleaved layout (CANN A3 rotate_interleaved). The change removes the half-width RoPE tables and simplifies the kernel signatures by passing the full-width freqs_cos and freqs_sin tensors directly. However, prefill_sparse_attn.py references the undefined constant ROPE_HALF in several places where HALF_ROPE is intended, which will raise a NameError at runtime.

Comment thread models/deepseek/v4/prefill_sparse_attn.py
Comment thread models/deepseek/v4/prefill_sparse_attn.py
Comment thread models/deepseek/v4/prefill_sparse_attn.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
models/deepseek/v4/decode_attention_hca.py (1)

152-155: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive HCA compressor RoPE halves from even lanes.

cmp_cos/cmp_sin are half-width compressor inputs. With full interleaved freqs_* rows, the slice 0 : ROPE_HEAD_DIM // 2 returns the duplicated low-pair frequencies ([c0, c0, c1, c1, ...]) rather than one value per pair. Gather the even lanes from the full row instead, and update the golden reference to use 0::2.

Golden-side fix
-        cmp_cos[b] = freqs_cos[cmp_pos_b, :half_rd].float()
-        cmp_sin[b] = freqs_sin[cmp_pos_b, :half_rd].float()
+        cmp_cos[b] = freqs_cos[cmp_pos_b, 0::2].float()
+        cmp_sin[b] = freqs_sin[cmp_pos_b, 0::2].float()

Also applies to: 423-424
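
The difference between the two slices can be seen on a toy row. This sketch assumes the full row really is laid out [c0, c0, c1, c1, ...] as the comment states; the width of 8 and the string labels are illustrative only.

```python
ROPE_HEAD_DIM = 8                  # illustrative width, not the model's value
half_rd = ROPE_HEAD_DIM // 2

# One full interleaved row: each per-pair frequency appears twice.
unique = ["c0", "c1", "c2", "c3"]
freqs_row = [c for c in unique for _ in range(2)]

prefix_half = freqs_row[:half_rd]  # contiguous prefix: repeats the low pairs
even_lanes = freqs_row[0::2]       # strided slice: one value per pair
```

Here prefix_half is ["c0", "c0", "c1", "c1"], which never reaches c2 or c3, while even_lanes recovers all four frequencies in order.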

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_attention_hca.py` around lines 152 - 155, The
slicing in the HCA compressor RoPE assignments is using 0 : ROPE_HEAD_DIM // 2
which extracts the first half of duplicated low-pair frequencies from the full
interleaved freqs_cos and freqs_sin arrays. Replace the slice syntax 0 :
ROPE_HEAD_DIM // 2 with 0::2 in the extraction of cmp_cos_row and cmp_sin_row to
gather only the even lanes from the full row instead. Apply the same fix to the
corresponding assignments in both the current location and the other occurrence
mentioned at lines 423-424 to ensure consistent even-lane extraction across all
HCA compressor RoPE operations.
models/deepseek/v4/decode_attention_csa.py (1)

202-214: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Extract compressor/indexer RoPE halves from even interleaved lanes.

step_cos/step_sin and cmp_cos/cmp_sin are half-width inputs, but slicing freqs_*[:, :HALF_ROPE] from a full interleaved row yields [c0, c0, c1, c1, ...] instead of [c0, c1, ...]. Gather lanes 0, 2, 4, ... in the PyPTO path and mirror that with 0::2 in the golden path.

Golden-side fix to mirror the expected half layout
-    step_cos = freqs_cos[first_pos, :HALF_ROPE].float().contiguous()
-    step_sin = freqs_sin[first_pos, :HALF_ROPE].float().contiguous()
+    step_cos = freqs_cos[first_pos, 0::2].float().contiguous()
+    step_sin = freqs_sin[first_pos, 0::2].float().contiguous()
     cmp_pos = first_pos + (COMPRESS_RATIO - (first_pos % COMPRESS_RATIO)) - COMPRESS_RATIO
-    cmp_cos = freqs_cos[cmp_pos, :HALF_ROPE].float().contiguous()
-    cmp_sin = freqs_sin[cmp_pos, :HALF_ROPE].float().contiguous()
+    cmp_cos = freqs_cos[cmp_pos, 0::2].float().contiguous()
+    cmp_sin = freqs_sin[cmp_pos, 0::2].float().contiguous()

Also applies to: 521-525
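
The cmp_pos expression in the hunk above is easier to read once simplified: adding and subtracting COMPRESS_RATIO cancels, leaving the position rounded down to a multiple of the ratio. A short check, using an illustrative COMPRESS_RATIO of 4 and plain integers instead of tensors (both are assumptions for the example):

```python
COMPRESS_RATIO = 4   # illustrative value only


def cmp_pos_as_written(first_pos):
    """The expression exactly as it appears in the golden reference."""
    return first_pos + (COMPRESS_RATIO - (first_pos % COMPRESS_RATIO)) - COMPRESS_RATIO


def cmp_pos_simplified(first_pos):
    """Equivalent form: round down to the start of the compression block."""
    return first_pos - first_pos % COMPRESS_RATIO


table = [(p, cmp_pos_as_written(p)) for p in range(9)]
forms_match = all(cmp_pos_as_written(p) == cmp_pos_simplified(p) for p in range(1000))
```

For positions 0 through 8 this gives 0, 0, 0, 0, 4, 4, 4, 4, 8, so the compressor frequencies are read at the first position of the block that contains first_pos.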

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_attention_csa.py` around lines 202 - 214, The
slicing of freqs_cos and freqs_sin in the step_cos/step_sin and cmp_cos/cmp_sin
tensor assembly blocks produces interleaved duplicated values [c0, c0, c1, c1,
...] instead of the required half-width format [c0, c1, ...]. Extract only the
even-indexed lanes (0, 2, 4, ...) from the sliced results for both the step rope
computation block (around the pl.slice calls for step_cos and step_sin) and the
compress rope computation block (around the pl.slice calls for cmp_cos and
cmp_sin). This ensures the tensors contain the correct unique RoPE values at the
required positions without duplication.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between cdb64e0 and 4e76c17.

📒 Files selected for processing (16)
  • models/deepseek/v4/decode_attention_csa.py
  • models/deepseek/v4/decode_attention_hca.py
  • models/deepseek/v4/decode_attention_swa.py
  • models/deepseek/v4/decode_compressor_ratio128.py
  • models/deepseek/v4/decode_compressor_ratio4.py
  • models/deepseek/v4/decode_layer.py
  • models/deepseek/v4/decode_sparse_attn.py
  • models/deepseek/v4/decode_sparse_attn_hca.py
  • models/deepseek/v4/decode_sparse_attn_swa.py
  • models/deepseek/v4/prefill_attention_csa.py
  • models/deepseek/v4/prefill_attention_hca.py
  • models/deepseek/v4/prefill_attention_swa.py
  • models/deepseek/v4/prefill_compressor_ratio128.py
  • models/deepseek/v4/prefill_compressor_ratio4.py
  • models/deepseek/v4/prefill_sparse_attn.py
  • models/deepseek/v4/qkv_proj_rope.py
💤 Files with no reviewable changes (1)
  • models/deepseek/v4/decode_layer.py

Comment on lines +314 to +318
cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
    pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the HCA sparse path on the full interleaved RoPE layout.

The kernel and golden reference still consume a contiguous half-table, and the fixture builds [c0, c1, ..., c0, c1, ...]. That masks the mismatch with the full interleaved contract [c0, c0, c1, c1, ...] and can rotate production inputs with the wrong frequencies.

Proposed layout fix
-        cs_dup_idx = pl.cast(cs_dup_f, target_type=pl.INT32)                                      # j>>1
         cs_lane = pl.sub(cs_col, pl.mul(cs_dup_f, 2.0))                                           # j%2
         cs_sign = pl.neg(pl.sub(pl.mul(cs_lane, 2.0), 1.0))                                       # [+1,-1,...] (conjugate)
-        cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+        cs_cos = pl.cast(freqs_cos[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        cs_sin = pl.cast(freqs_sin[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = cs_cos
         rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
-            pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)
+            cs_sin, cs_sign)
-    cos_half = cos[:, :HALF_ROPE].unsqueeze(1)
-    sin_half = sin[:, :HALF_ROPE].unsqueeze(1)
+    cos_half = cos[:, 0::2].unsqueeze(1)
+    sin_half = sin[:, 0::2].unsqueeze(1)
-        return torch.cat([cos_half, cos_half], dim=-1)
+        return torch.repeat_interleave(cos_half, repeats=2, dim=-1)
-        return torch.cat([sin_half, sin_half], dim=-1)
+        return torch.repeat_interleave(sin_half, repeats=2, dim=-1)

Also applies to: 594-598, 701-711
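
The per-lane index and sign arithmetic annotated in the hunk (j>>1, j%2, and the [+1,-1,...] conjugate sign) can be reproduced in plain Python. The sketch below also checks that rotating with the conjugate sign undoes a forward rotation; the forward sign convention and the helper names are assumptions for the example, not taken from the kernel.

```python
import math


def lane_tables(width):
    """Per-lane pair index (j >> 1), lane parity (j % 2), and conjugate sign."""
    dup_idx = [j >> 1 for j in range(width)]
    lane = [j - 2 * d for j, d in zip(range(width), dup_idx)]
    sign = [-(2.0 * l - 1.0) for l in lane]       # [+1, -1, +1, -1, ...]
    return dup_idx, lane, sign


def swap_gather_rotate(x, cos_il, sin_signed):
    """out[j] = x[j] * cos[j] + x[j ^ 1] * signed_sin[j]."""
    return [x[j] * cos_il[j] + x[j ^ 1] * sin_signed[j] for j in range(len(x))]


angles = [0.3, 1.1, 2.5]                          # illustrative, one per pair
width = 2 * len(angles)
dup_idx, lane, conj_sign = lane_tables(width)
cos_il = [math.cos(angles[d]) for d in dup_idx]   # duplicated per lane
sin_il = [math.sin(angles[d]) for d in dup_idx]

forward_sin = [-s * g for s, g in zip(sin_il, conj_sign)]   # [-sin, +sin, ...]
inverse_sin = [s * g for s, g in zip(sin_il, conj_sign)]    # [+sin, -sin, ...]

x = [1.0, -2.0, 0.5, 4.0, -3.0, 0.25]
rotated = swap_gather_rotate(x, cos_il, forward_sin)
roundtrip = swap_gather_rotate(rotated, cos_il, inverse_sin)
max_err = max(abs(a - b) for a, b in zip(roundtrip, x))
```

Whatever the source layout of the frequency table, the buffers fed to the rotation must end up in this duplicated per-lane order, which is the contract the comment is defending.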

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_sparse_attn_hca.py` around lines 314 - 318, The
RoPE indexing in the gather operations for cs_cos and cs_sin is not maintaining
the full interleaved layout contract which should be [c0, c0, c1, c1, ...] but
instead produces [c0, c1, ..., c0, c1, ...]. Fix the cs_dup_idx construction or
adjust how the gather operations use it to ensure the output follows the proper
full interleaved RoPE layout with consecutive pairs of identical indices. Apply
the same fix to the similar gather operations also appearing in the code (the
additional locations mentioned in the comment).

Comment on lines +346 to +350
cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
    pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consume the full interleaved RoPE lanes instead of re-expanding the first half.

freqs_cos/freqs_sin are now full interleaved tables, but these paths still slice the contiguous first half and duplicate it. For an interleaved row like [c0, c0, c1, c1, ...], that shifts the frequency mapping after the first pair. Slice the full interleaved tile in the kernel and use even lanes in the golden reference.

Proposed layout fix
-        cs_dup_idx = pl.cast(cs_dup_f, target_type=pl.INT32)                                      # j>>1
         cs_lane = pl.sub(cs_col, pl.mul(cs_dup_f, 2.0))                                           # j%2
         cs_sign = pl.neg(pl.sub(pl.mul(cs_lane, 2.0), 1.0))                                       # [+1,-1,...] (conjugate)
-        cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+        cs_cos = pl.cast(freqs_cos[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        cs_sin = pl.cast(freqs_sin[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = cs_cos
         rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
-            pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)
+            cs_sin, cs_sign)
-    cos_half = cos[:, :HALF_ROPE].unsqueeze(1)
-    sin_half = sin[:, :HALF_ROPE].unsqueeze(1)
+    cos_half = cos[:, 0::2].unsqueeze(1)
+    sin_half = sin[:, 0::2].unsqueeze(1)

Also applies to: 626-630
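
A toy model of the "slice the first half, then duplicate" path makes the reported mismatch concrete. The sketch uses integer frequency ids in place of cosine values and assumes a single tile covering the whole row with ROPE_TILE equal to half the width; both are simplifications for illustration.

```python
ROPE_DIM = 8                 # illustrative width
ROPE_TILE = ROPE_DIM // 2    # assumed: half-width source tile

dup_idx = [j >> 1 for j in range(ROPE_DIM)]     # [0, 0, 1, 1, 2, 2, 3, 3]


def re_expand_first_half(row):
    """Mimic the kernel path: take the contiguous first half, gather with j >> 1."""
    first_half = row[:ROPE_TILE]
    return [first_half[d] for d in dup_idx]


interleaved_row = [0, 0, 1, 1, 2, 2, 3, 3]      # full interleaved contract
fixture_row = [0, 1, 2, 3, 0, 1, 2, 3]          # half-table repeated twice

from_interleaved = re_expand_first_half(interleaved_row)
from_fixture = re_expand_first_half(fixture_row)
wrong_lanes = [j for j in range(ROPE_DIM)
               if from_interleaved[j] != interleaved_row[j]]
```

With an interleaved input the re-expanded row is [0, 0, 0, 0, 1, 1, 1, 1], so every lane after the first pair gets the wrong frequency. With a fixture of the form [c0, c1, ..., c0, c1, ...] the same path yields the correct [0, 0, 1, 1, 2, 2, 3, 3], which is how a test built on that fixture could pass while hiding the problem.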

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_sparse_attn.py` around lines 346 - 350, The issue
is that the slicing of freqs_cos and freqs_sin in the code around lines 346-350
(variables cs_cos and cs_sin) is only extracting the contiguous first half and
then duplicating it, but these frequency tables are now full interleaved tables.
Instead of slicing just cp_r0 : cp_r0 + ROPE_TILE and duplicating through
cs_dup_idx, you need to slice the full interleaved tile from the frequency
tables and use even lanes to access the correct interleaved positions. Apply the
same fix to the duplicate code section mentioned at lines 626-630.

@zhangqi-chen zhangqi-chen merged commit 5fdf7ec into hw-native-sys:main Jun 22, 2026
5 of 7 checks passed
zhangqi-chen pushed a commit that referenced this pull request Jun 23, 2026
#578)

## Summary
- Retile the DeepSeek-V4 `qkv_proj_rope` projection matmuls to the 512B
L2 cache line and fuse RMSNorm with RoPE. **Decode end-to-end −56%**
(a2a3 L2 swimlane, 5-rep median: 936µs → 407µs); golden green on decode
and prefill.
- `qr_proj` / `kv_proj`: split-K (zero-seed + atomic-add) with N-tile 32
→ 256, so each `wq_a`/`wkv` row-read fills a full 512B cache line
instead of a 64B sub-line (was 8× weight over-fetch). Kernel occupancy
−84% / −75%.
- `qproj_matmul`: decouple the matmul N-tile from the dequant N-tile and
bump matmul `TN` 128 → 256 (256B/row), capped by the L0C `Acc` limit
(`TM*TN*4 ≤ 128KB`). `TN=512` needs an M-split (`TM=64`) and measured no
faster end-to-end on device.
- Fuse per-head RMSNorm + NOPE + RoPE into `q_head_rms_nope_rope`, and
KV RMSNorm + RoPE into `kv_rms_norm_rope`: `inv_rms` stays in registers
(no GM round-trip via the old `q_head_inv_rms_all` /
`kv_inv_rms_tensor`), collapsing each pair of dispatches into one. RoPE
keeps the interleaved (CANN A3) swap-gather layout.
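
The tile-size figures quoted in the summary can be cross-checked with a little arithmetic. The sketch below assumes 2-byte weight elements (inferred from 32 elements corresponding to 64B) and a 4-byte accumulator (from the `TM*TN*4` term); the constant and function names are hypothetical.

```python
CACHE_LINE_BYTES = 512
BYTES_PER_WEIGHT = 2            # assumption: inferred from 32 elements -> 64B
L0C_ACC_BYTES = 128 * 1024      # the 128KB Acc limit quoted above


def row_read_bytes(n_tile):
    """Bytes touched per weight row for a given N-tile."""
    return n_tile * BYTES_PER_WEIGHT


def max_tm(tn, acc_bytes_per_elem=4):
    """Largest TM satisfying TM * TN * 4 <= 128KB."""
    return L0C_ACC_BYTES // (tn * acc_bytes_per_elem)


old_bytes = row_read_bytes(32)                   # 64B sub-line
new_bytes = row_read_bytes(256)                  # one full 512B line
over_fetch = CACHE_LINE_BYTES // old_bytes       # line bytes fetched per useful 64B

tm_at_256 = max_tm(256)
tm_at_512 = max_tm(512)

reduction_pct = (936 - 407) / 936 * 100          # decode latency, 936us -> 407us
```

Under these assumptions the numbers line up with the text: an 8x over-fetch at N-tile 32, TM capped at 128 for `TN=256` and at 64 for `TN=512`, and a latency reduction of roughly 56.5%.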

## Related Issues
- The RMSNorm+RoPE fusion re-introduces fused rope on top of the
**interleaved** layout restored by #575 (the revert of #570); it does
not bring back the split-half layout. The matmul retiling is independent
of the rope layout.