feat(deepseek/v4): gather-free split-half RoPE for decode + prefill #570

Merged
zhangqi-chen merged 3 commits into hw-native-sys:main from lwDavid:research/pr564-splithalf-rope on Jun 22, 2026
@lwDavid lwDavid commented Jun 22, 2026

Contributor

Summary

Converts the DeepSeek-V4 RoPE path from the interleaved (GPT-J, gather-based) layout to split-half (GPT-NeoX, gather-free) for both decode and prefill, so the whole chain is layout-consistent. The rotation partner of lane k becomes lane k+HALF — a contiguous lo=[:HALF]/hi=[HALF:] slice instead of a j^1 swap-gather + j>>1 cos/sin dup-gather. Every RoPE rotation becomes contiguous slices + plain FMAs, with no cross-lane op and no in-kernel rope_cs dup-gather pre-pass.
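
To make the two lane-pairing schemes concrete, here is a minimal NumPy sketch of the index patterns described above. The width `D = 8` is purely illustrative and is not the model's real rope dimension; nothing here is taken from the kernels themselves.

```python
import numpy as np

D = 8            # illustrative rope width (assumption, not the real ROPE_DIM)
HALF = D // 2

j = np.arange(D)

# Interleaved (GPT-J): lane j rotates with its neighbour j^1, and lanes 2i and
# 2i+1 share the angle stored at index j>>1. Both lookups are per-element gathers.
swap_idx = j ^ 1
dup_idx = j >> 1

# Split-half (GPT-NeoX): lane k rotates with lane k+HALF, so the two partners
# are just the two contiguous halves of the vector.
lo = slice(0, HALF)
hi = slice(HALF, D)

print(swap_idx.tolist())               # [1, 0, 3, 2, 5, 4, 7, 6]
print(dup_idx.tolist())                # [0, 0, 1, 1, 2, 2, 3, 3]
print(j[lo].tolist(), j[hi].tolist())  # [0, 1, 2, 3] [4, 5, 6, 7]
```

The interleaved scheme needs two index vectors per rotation, while the split-half scheme only needs two slices.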

  • Decode (commit 1): adopts the split-half conversion across qkv_proj_rope (forward, shared), decode_compressor_ratio{4,128} (forward), and decode_sparse_attn{,_hca,_swa} (inverse), with the callers feeding half-width FP32 rope_cos_half/rope_sin_half. Two precision details vs current main: the compressor gamma_rope is kept cast to FP32 (since norm_w is BF16), and the inverse-rope stage (already FP32) is read directly — no FP32→FP32 identity cast.
  • Prefill (commit 2): mirrors the same conversion onto prefill_compressor_ratio{4,128} (forward) and prefill_sparse_attn (inverse), with prefill_attention_{csa,hca,swa} building and passing the half-width tables. This removes the latent "half-converted" hazard that would arise from converting the shared qkv_proj_rope without converting prefill's own compressors/inverse. Rebased on top of #569 (Fix dsv4 prefill_sparse_attn per-head NOPE corruption; align to decode); its per-head NOPE fix and the prefill_sparse_attn_padded_indices removal are preserved.

Forward: out_lo = x_lo*cos − x_hi*sin, out_hi = x_lo*sin + x_hi*cos. Inverse (conjugate): out_lo = x_lo*cos + x_hi*sin, out_hi = x_hi*cos − x_lo*sin.
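
The forward and inverse formulas above can be checked with a short NumPy sketch. The function names `rope_forward` and `rope_inverse` and the tensor sizes are hypothetical and chosen only for illustration; the real kernels are written against the pypto API, which is not reproduced here.

```python
import numpy as np

def rope_forward(x, cos, sin):
    """Split-half forward rotation: contiguous lo/hi slices, no gather."""
    half = x.shape[-1] // 2
    x_lo, x_hi = x[..., :half], x[..., half:]
    out_lo = x_lo * cos - x_hi * sin
    out_hi = x_lo * sin + x_hi * cos
    return np.concatenate([out_lo, out_hi], axis=-1)

def rope_inverse(x, cos, sin):
    """Split-half inverse (conjugate) rotation: same tables, sign of sin flipped."""
    half = x.shape[-1] // 2
    x_lo, x_hi = x[..., :half], x[..., half:]
    out_lo = x_lo * cos + x_hi * sin
    out_hi = x_hi * cos - x_lo * sin
    return np.concatenate([out_lo, out_hi], axis=-1)

T, HALF = 5, 4                                   # illustrative sizes
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=(T, HALF))
cos_half, sin_half = np.cos(theta), np.sin(theta)  # half-width tables
x = rng.standard_normal((T, 2 * HALF))

# The inverse rotation undoes the forward rotation (up to rounding).
round_trip = rope_inverse(rope_forward(x, cos_half, sin_half), cos_half, sin_half)
print(np.allclose(round_trip, x))
```

Both directions read the same unsigned half-width cos/sin tables, so no signed-sine table is needed in this sketch.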

The lightning indexer ({decode,prefill}_indexer{,_compressor}) is intentionally left interleaved: it is a self-contained RoPE subsystem (own query/KV rope from the freqs tables) that feeds sparse attention only integer top-k indices, so it is decoupled from the main path.

Why

On-device profiling showed that the per-element gather (j^1 swap + j>>1 dup), not the arithmetic, is the dominant RoPE cost. An earlier L2 swimlane comparison against the interleaved baseline on the HCA attention module measured RoPE compute dropping 2970 → 877 µs (−70.5%) and module wall-clock dropping 9.6% from this change.

Validation (a2a3sim, golden, all PASS)

decode_compressor_ratio4/128, decode_sparse_attn{,_hca,_swa}, qkv_proj_rope, decode_attention_{csa,hca,swa}, decode_layer; prefill_compressor_ratio4/128, prefill_sparse_attn, prefill_attention_{csa,hca,swa}, prefill_layer. The *_sparse_attn_swa standalone test is occasionally flaky (passing about 4 of 6 runs) but identically so on main (unseeded fixtures from #563 against a tight tolerance), so the flakiness is not introduced by this change.

Caveat

Real checkpoints need an offline interleaved→split-half permutation of the trained q-proj/k_pe/wo_a rope columns to stay bit-identical to the trained model. Synthetic tests need none. Not yet validated on-device or end-to-end by the serving system.
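
As a rough illustration of why a column permutation is enough, the sketch below shows that applying split-half RoPE to a projection whose output columns were permuted (even lanes first, then odd lanes) gives the interleaved result in permuted order. The weight `W`, the input `h`, and all sizes are hypothetical; which axis of the real q-proj/k_pe/wo_a weights carries the rope lanes is not stated here and is an assumption of the sketch.

```python
import numpy as np

def rope_interleaved(x, cos, sin):
    """GPT-J style: lanes (2i, 2i+1) form a pair sharing angle i."""
    out = np.empty_like(x)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_split_half(x, cos, sin):
    """GPT-NeoX style: lanes (k, k+HALF) form a pair sharing angle k."""
    half = x.shape[-1] // 2
    x_lo, x_hi = x[..., :half], x[..., half:]
    return np.concatenate([x_lo * cos - x_hi * sin,
                           x_lo * sin + x_hi * cos], axis=-1)

D, HALF, HIDDEN = 8, 4, 16                      # illustrative sizes
perm = np.concatenate([np.arange(0, D, 2), np.arange(1, D, 2)])  # evens, then odds

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=HALF)
cos, sin = np.cos(theta), np.sin(theta)

h = rng.standard_normal((3, HIDDEN))            # hypothetical activations
W = rng.standard_normal((HIDDEN, D))            # hypothetical "trained" rope columns
W_perm = W[:, perm]                             # offline permutation of the columns

q_interleaved = rope_interleaved(h @ W, cos, sin)
q_split = rope_split_half(h @ W_perm, cos, sin)

# Same values, only the lane order differs.
print(np.allclose(q_split, q_interleaved[:, perm]))
```

Because queries and keys are permuted the same way, their dot products are unchanged in this sketch, which is why the layout change is invisible to attention scores once the weights are permuted.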

Related Issues

lwDavid added 2 commits June 22, 2026 15:40
… main

Cherry-pick of c5aa9c5 (feat: gather-free split-half RoPE for the decode
path) resolved onto main (ea299a1). 5 of 10 files auto-merged; 5 conflicted
(both sides fully rewrote the same RoPE block).

Resolution:
- All 5 conflicts: adopt the PR split-half (NeoX) side, discard main's
  interleaved (GPT-J) side.
- decode_compressor_ratio{4,128}: restore `gamma_rope = pl.cast(..., pl.FP32)`
  on the split-half branch. hw-native-sys#568 flipped norm_w FP32->BF16 and added the cast
  at every apply site; the PR (branched pre-hw-native-sys#568) dropped it. Without the cast
  the per-column gamma fold would run in BF16, asymmetric with the NOPE branch
  and the float golden rmsnorm.
- decode_sparse_attn{,_hca,_swa}: drop the PR's `pl.cast(r_tile, FP32)`. The PR
  added it when attn_rope_stage was BF16; hw-native-sys#568 made the stage FP32, so the cast
  is an FP32->FP32 identity that the pypto op registry rejects (trace-time
  failure). Read the already-FP32 stage directly, matching main's interleaved
  version.

Validated on a2a3sim (golden): decode_compressor_ratio4/128 PASS, qkv_proj_rope
PASS, decode_sparse_attn PASS, decode_sparse_attn_hca PASS. decode_sparse_attn_swa
is flaky (4/6) but identically so on clean main (FAIL PASS FAIL PASS PASS PASS on
both) -- pre-existing unseeded-input tolerance flakiness from hw-native-sys#563, not a merge
regression.

Decode-only, inheriting the PR's caveat: qkv_proj_rope is shared with prefill,
whose sparse_attn/compressors stay interleaved, so prefill is latently
half-converted. Landing requires the prefill split-half follow-up + offline
weight permutation for real checkpoints.

Rebased onto upstream hw-native-sys#569 (which rewrote prefill_sparse_attn + fixed the
per-head NOPE corruption and removed prefill_sparse_attn_padded_indices). This
re-applies the prefill split-half conversion on top of hw-native-sys#569 so the whole prefill
chain is layout-consistent with the now-split-half shared qkv_proj_rope forward.

Converted (kernel + golden + standalone fixtures), mirroring the validated decode
analogs on this branch:
- prefill_compressor_ratio4 / ratio128: forward rope P0101/P1010 even/odd
  gather+scatter -> contiguous lo/hi slices; gamma folded per-half in FP32.
- prefill_sparse_attn: inverse rope -- removed the rope_cs dup-gather pre-pass and
  the per-head j^1 swap-gather; now a gather-free contiguous lo/hi conjugate rotate
  (out_lo=x_lo*cos+x_hi*sin, out_hi=x_hi*cos-x_lo*sin) reading half-width FP32
  rope_cos_half/rope_sin_half (both prefill_sparse_attn and prefill_sparse_attn_test
  signatures + the in-file test call); golden + build_tensor_specs fixture updated.
- prefill_attention_{csa,hca,swa}: build half-width FP32 rope_cos_half/sin_half and
  pass them to the (now directly-called) prefill_sparse_attn; golden dict keys
  renamed. qkv forward still gets the full BF16 tables and slices [:HALF] internally.

Indexer (prefill_indexer / prefill_indexer_compressor) intentionally left
interleaved -- self-contained, feeds sparse_attn only integer indices.

Validated on a2a3sim (golden), all PASS: prefill_compressor_ratio4/128,
prefill_sparse_attn, prefill_attention_csa/hca/swa, prefill_layer.

Real checkpoints still need the offline interleaved->split-half permutation of the
trained q-proj/k_pe/wo_a rope columns (unchanged PR caveat).

coderabbitai Bot commented Jun 22, 2026

Walkthrough

This PR migrates all DeepSeek-V4 decode and prefill RoPE rotation code from an interleaved swap/gather scheme to a gather-free split-half (NeoX) scheme. Sparse attention kernels receive new half-width FP32 rope_cos_half/rope_sin_half tables instead of full-width BF16 freqs_cos/freqs_sin. Compressor and QKV projection kernels replace even/odd lane gather rotation with contiguous lo/hi half rotation. All callers, golden references, and test harnesses are updated throughout.
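
A minimal sketch of how half-width FP32 tables might be derived from full-width BF16 ones follows. NumPy has no BF16 dtype, so `bf16_round` is a hypothetical helper that emulates BF16 rounding in float32 storage; the table contents and the sizes are assumptions for illustration, and only the "take the first HALF_ROPE columns, keep them as FP32" step comes from the description above.

```python
import numpy as np

def bf16_round(x):
    """Round float32 values to the nearest BF16 value (ties to even), kept as float32."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = ((u >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    return ((u + bias) & np.uint32(0xFFFF0000)).view(np.float32)

T, ROPE_DIM = 3, 8                     # illustrative sizes (assumption)
HALF_ROPE = ROPE_DIM // 2

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=(T, ROPE_DIM)).astype(np.float32)

# Stand-ins for the full-width BF16 tables (BF16 values held in float32 storage).
freqs_cos = bf16_round(np.cos(angles))
freqs_sin = bf16_round(np.sin(angles))

# Half-width FP32 tables: the first HALF_ROPE columns, stored contiguously.
rope_cos_half = np.ascontiguousarray(freqs_cos[:, :HALF_ROPE], dtype=np.float32)
rope_sin_half = np.ascontiguousarray(freqs_sin[:, :HALF_ROPE], dtype=np.float32)

print(rope_cos_half.shape, rope_cos_half.dtype)   # (3, 4) float32
```

The values stay BF16-representable after the slice, which matches the "BF16-rounded FP32" wording used for the test fixtures.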

Changes

Split-half NeoX RoPE migration across DeepSeek-V4

| Layer | File(s) | Summary |
| --- | --- | --- |
| Sparse attention kernel signature + inverse RoPE rewrite | models/deepseek/v4/decode_sparse_attn.py, models/deepseek/v4/decode_sparse_attn_hca.py, models/deepseek/v4/decode_sparse_attn_swa.py, models/deepseek/v4/prefill_sparse_attn.py | sparse_attn, sparse_attn_hca, sparse_attn_swa, and prefill_sparse_attn replace freqs_cos/freqs_sin (BF16, ROPE_DIM) parameters with rope_cos_half/rope_sin_half (FP32, HALF_ROPE). The inverse RoPE kernel body in each file is rewritten from precomputed interleaved cosine/signed-sine + per-head j^1 gather to gather-free split-half lo/hi rotation writing directly into o_packed rope columns. Test harnesses, golden references, and TensorSpec initializers are updated to generate and use the BF16-rounded FP32 half-width tables. |
| Attention orchestrators: allocate and wire half-width RoPE tables | models/deepseek/v4/decode_attention_csa.py, models/deepseek/v4/decode_attention_hca.py, models/deepseek/v4/decode_attention_swa.py, models/deepseek/v4/prefill_attention_csa.py, models/deepseek/v4/prefill_attention_hca.py, models/deepseek/v4/prefill_attention_swa.py | All six attention orchestrators allocate FP32 rope_cos_half_t/rope_sin_half_t tensors (first HALF_ROPE columns of per-token RoPE snapshots) and pass them to the sparse attention calls instead of the prior full-width tables. Golden references in each file are updated in parallel. |
| Compressor forward RoPE rewrite | models/deepseek/v4/decode_compressor_ratio4.py, models/deepseek/v4/decode_compressor_ratio128.py, models/deepseek/v4/prefill_compressor_ratio4.py, models/deepseek/v4/prefill_compressor_ratio128.py | All four compressor kernels replace gather-based interleaved even/odd forward RoPE with split-half NeoX rotation: rope segment is split into lo/hi, gamma is folded per half, and results are computed and written back as two contiguous halves of normed_kv. Golden references use the same x_lo/x_hi concat pattern. |
| QKV projection split-half RoPE rewrite | models/deepseek/v4/qkv_proj_rope.py | Both Q and KV fused RoPE paths in qkv_proj_rope remove in-kernel swap/sign index construction and gather rotation, and instead fold inv_rms/gamma into contiguous ROPE_HALF lo/hi slices before applying the standard split-half formulas. The apply_rope golden reference is updated to the same lo/hi concat scheme. |
| Documentation | models/deepseek/v4/decode_layer.py | Three comments added to build_tensor_specs clarifying that split-half NeoX cos/sin tables are sourced from the first half-columns of freqs_cos/freqs_sin with no separate interleaved table. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#468: Updates qkv_proj_rope.py to use split-half contiguous lo/hi RoPE rotation, which is the same QKV RoPE layout change applied in this PR.
  • hw-native-sys/pypto-lib#568: Adjusts inverse-RoPE numerics and rounding (attn_rope_stage FP32, mode="rint") in the same decode_sparse_attn*.py kernels modified here.
  • hw-native-sys/pypto-lib#533: Modifies the decode RoPE and compressor paths in the same decode_compressor_ratio4.py and decode_compressor_ratio128.py files changed here.

Suggested labels

enhancement

🐇 The old gather indices are gone—hooray!
Two clean halves now rotate and play,
lo times cos, hi times sin,
NeoX style, no gather needed within.
Split-half tables, FP32 and bright—
The bunny hops left and right! 🌀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.74% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main technical change: converting DeepSeek-V4 RoPE from gather-based to gather-free split-half across both decode and prefill paths. |
| Description check | ✅ Passed | The description provides comprehensive context on the RoPE conversion, performance improvements, validation approach, and implementation details across decode and prefill paths. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@lwDavid lwDavid self-assigned this Jun 22, 2026
@lwDavid lwDavid added the enhancement New feature or request label Jun 22, 2026
@lwDavid lwDavid moved this to In Progress in pto project Jun 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Rotary Position Embedding (RoPE) implementation across DeepSeek v4 attention and compressor modules to use a split-half (NeoX) layout instead of an interleaved layout. This simplifies the kernels by removing in-kernel index building and gather operations, using half-width unsigned cosine and sine tables instead. The code review feedback identifies several opportunities to optimize memory access in decode_attention_csa.py, decode_attention_hca.py, and prefill_attention_csa.py by slicing already-populated local tensors directly rather than performing redundant global memory lookups.

Comment thread models/deepseek/v4/decode_attention_csa.py
Comment thread models/deepseek/v4/decode_attention_hca.py
Comment thread models/deepseek/v4/prefill_attention_csa.py Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/prefill_attention_csa.py`:
- Around line 185-198: The rope_cos_half_t and rope_sin_half_t tensors are
created but only filled for rows where half_t < num_tokens, leaving padding rows
(half_t >= num_tokens) uninitialized which causes divergence in subsequent RoPE
operations. Initialize both rope_cos_half_t and rope_sin_half_t with finite
identity defaults (cos = 1.0, sin = 0.0) immediately after tensor creation and before the loop
starting with "for half_t in pl.range(T)" to ensure all T rows have valid values
before the RoPE multiply operations use them.

In `@models/deepseek/v4/prefill_attention_hca.py`:
- Around line 153-161: The rope_cos_half_t and rope_sin_half_t tensors are
created but only populated for rows where half_t < num_tokens, leaving rows >=
num_tokens uninitialized. Since the sparse-attn inverse-RoPE pass reads all T
rows, the uninitialized rows will cause issues. Add initialization code within
the pl.at context block to set rope_cos_half_t and rope_sin_half_t to identity
values (cos values of 1.0 and sin values of 0.0) for the entire T rows before
the loop that conditionally overwrites only the active token rows.
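
The padding-row concern can be sketched in NumPy as follows. This is only an illustration of identity initialization (cos = 1, sin = 0) followed by an active-row overwrite; the array names mirror the ones in the review text, but the sizes and the use of NumPy in place of the pypto tensor API are assumptions.

```python
import numpy as np

T, HALF, num_tokens = 6, 4, 4     # illustrative sizes: 2 padding rows

# Identity defaults over all T rows, so padding rows are finite and act as a no-op.
rope_cos_half_t = np.ones((T, HALF), dtype=np.float32)
rope_sin_half_t = np.zeros((T, HALF), dtype=np.float32)

# Only the active rows get real angles.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=(num_tokens, HALF)).astype(np.float32)
rope_cos_half_t[:num_tokens] = np.cos(theta)
rope_sin_half_t[:num_tokens] = np.sin(theta)

# Inverse split-half rotation applied to all T rows, padding included.
x = rng.standard_normal((T, 2 * HALF)).astype(np.float32)
x_lo, x_hi = x[:, :HALF], x[:, HALF:]
out = np.concatenate([x_lo * rope_cos_half_t + x_hi * rope_sin_half_t,
                      x_hi * rope_cos_half_t - x_lo * rope_sin_half_t], axis=1)

print(np.isfinite(out).all())                       # True
print(np.array_equal(out[num_tokens:], x[num_tokens:]))  # True: padding rows pass through
```

With cos = 1 and sin = 0 the rotation leaves padding rows unchanged, whereas an all-zero default would keep them finite but zero them out.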

📥 Commits

Reviewing files that changed from the base of the PR and between 86c3b04 and 13f8853.

📒 Files selected for processing (16)
  • models/deepseek/v4/decode_attention_csa.py
  • models/deepseek/v4/decode_attention_hca.py
  • models/deepseek/v4/decode_attention_swa.py
  • models/deepseek/v4/decode_compressor_ratio128.py
  • models/deepseek/v4/decode_compressor_ratio4.py
  • models/deepseek/v4/decode_layer.py
  • models/deepseek/v4/decode_sparse_attn.py
  • models/deepseek/v4/decode_sparse_attn_hca.py
  • models/deepseek/v4/decode_sparse_attn_swa.py
  • models/deepseek/v4/prefill_attention_csa.py
  • models/deepseek/v4/prefill_attention_hca.py
  • models/deepseek/v4/prefill_attention_swa.py
  • models/deepseek/v4/prefill_compressor_ratio128.py
  • models/deepseek/v4/prefill_compressor_ratio4.py
  • models/deepseek/v4/prefill_sparse_attn.py
  • models/deepseek/v4/qkv_proj_rope.py

Comment thread models/deepseek/v4/prefill_attention_csa.py Outdated
Comment thread models/deepseek/v4/prefill_attention_hca.py

lwDavid commented Jun 22, 2026

Contributor Author

Performance (a2a3, real device, L2 swimlane)

Measured decode_attention_hca on a2a3 (real NPU), baseline = main (interleaved GPT-J) vs this branch (split-half GPT-NeoX), summing per-kernel AICore busy time from the L2 swimlane (--enable-l2-swimlane), one clean run per branch.

RoPE perf comparison

| metric | baseline (interleaved) | optimized (split-half) | Δ |
| --- | --- | --- | --- |
| RoPE compute (Σ AICore busy) | 3123.0 us | 920.6 us | −70.5% |
| HCA attention module wall-clock | 1508.6 us | 1450.5 us | −3.8% |
| whole-module AICore busy | 55418.9 us | 53046.8 us | −4.3% |

Per-kernel:

| kernel | baseline (us) | optimized (us) | Δ |
| --- | --- | --- | --- |
| q_head_rope_fused (q fwd) | 1443.8 | 439.2 | −69.6% |
| rope (inverse) | 1419.0 | 288.9 | −79.6% |
| kv_rope_fused (kv fwd) | 46.1 | 13.5 | −70.7% |
| rmsnorm_rope | 69.6 | 52.6 | −24.4% |
| rope_cs (dup-gather pre-pass) | 53.3 | 0 | eliminated |
| hca_rope | 91.3 | 126.5 | +38% (tiny, ×1) |

The RoPE kernels drop ~70% (gather → contiguous lo/hi slice + plain FMAs; every VGATHER and the whole rope_cs dup-gather pre-pass removed), independently reproducing the original −70.5% figure. End-to-end this module is ~−4% wall-clock because RoPE is only ~5.6% of its AICore busy time and the chip runs ~36 cores in parallel — the win grows on RoPE-heavier / more dispatch-bound configs.
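
The headline percentages can be re-derived from the per-kernel figures with a few lines of Python. The numbers are copied from the tables above; the summed totals differ from the reported 3123.0 / 920.6 by 0.1 us because the per-kernel values are already rounded.

```python
# (baseline_us, optimized_us) per RoPE kernel, copied from the per-kernel table.
kernels = {
    "q_head_rope_fused": (1443.8, 439.2),
    "rope (inverse)": (1419.0, 288.9),
    "kv_rope_fused": (46.1, 13.5),
    "rmsnorm_rope": (69.6, 52.6),
    "rope_cs": (53.3, 0.0),
    "hca_rope": (91.3, 126.5),
}
module_busy_base, module_busy_opt = 55418.9, 53046.8   # whole-module AICore busy

rope_base = sum(base for base, _ in kernels.values())
rope_opt = sum(opt for _, opt in kernels.values())

rope_delta_pct = (rope_opt - rope_base) / rope_base * 100.0
rope_share_pct = rope_base / module_busy_base * 100.0
busy_delta_pct = (module_busy_opt - module_busy_base) / module_busy_base * 100.0

print(f"RoPE busy: {rope_base:.1f} -> {rope_opt:.1f} us ({rope_delta_pct:.1f}%)")
print(f"RoPE share of baseline module busy time: {rope_share_pct:.1f}%")
print(f"Whole-module busy delta: {busy_delta_pct:.1f}%")
```

A roughly 70% cut on a component that is under 6% of busy time bounds the whole-module gain at about 4%, which is consistent with the measured figures.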

Notes: the device shows an occasional transient 507018 fault (retried for a clean run); x_out validates bit-exact on device. The same gather-elimination applies to the prefill kernels converted here.



- prefill_attention_{csa,hca,swa}: identity-init rope_cos_half/rope_sin_half
  (cos=1, sin=0) over all T rows so padding rows (>= num_tokens) stay finite --
  prefill_sparse_attn rotates all T rows (CodeRabbit review). Active rows are
  overwritten below as before; no effect when num_tokens == T.
- prefill_attention_csa: fill the active rows by slicing the already-materialized
  rope_cos_t/rope_sin_t instead of re-reading freqs_cos from GM, consistent with
  hca/swa (gemini review).

The gemini suggestions to slice the local cos_row/step_cos_row in the DECODE
callers were NOT applied: in decode_attention_csa it trips a PTO2 runtime
assertion (index < output_count_); kept the original freqs_cos slice.

Validated a2a3sim: prefill_attention_csa/hca PASS, prefill_layer PASS. swa
standalone stays pre-existing flaky (unseeded hw-native-sys#563 fixtures), unaffected.

lwDavid commented Jun 22, 2026

Contributor Author

Review comments addressed (022b525)

All 5 review threads resolved:

  • CodeRabbit (Major) — prefill_attention_csa / _hca padding rows: Fixed. rope_cos_half/rope_sin_half are now identity-initialized (cos=1, sin=0) over all T rows before the active-row overwrite, so padding rows (>= num_tokens) stay finite for the all-T-row inverse RoPE in prefill_sparse_attn. Applied to csa/hca/swa.
  • gemini — prefill_attention_csa redundant GM reads: Applied. csa now fills active rows by slicing the already-materialized rope_cos_t/rope_sin_t (consistent with hca/swa) instead of re-reading freqs_cos.
  • gemini — decode_attention_csa / _hca: Not applied. Slicing the local cos_row inside pl.assemble trips a PTO2 runtime assertion (index < output_count_) on a2a3sim; kept the original freqs_cos slice (one-time per-token setup-loop read, negligible).

Validated on a2a3sim: prefill_attention_csa/_hca PASS, prefill_layer PASS.

CI failure triage (none are code defects in this PR)

  • a2a3 (device): halMemCtl failed rc=13 / run_prepared code 13 on a few kernels — an intermittent CI-device register-access fault (csa/swa/decode_layer passed on the same run).
  • sim (a2a3sim / a5sim): decode_layer hit "No space left on device" (CI-runner shm/disk); decode_sparse_attn{,_hca} attn_out was slightly over tolerance (0.99–8.8%, a different one each run, swa passed). This is the pre-existing unseeded-fixture flakiness from #563 (Refactor: drop decode fixture seeds + EP selector in decode_layer): these standalone tests draw fresh torch.rand inputs against a tight max_error_ratio=0.005 and flake identically on main.

Every prefill kernel this PR adds passes on each run. A CI re-run is in progress for 022b525.

@lwDavid lwDavid moved this from In Progress to Done in pto project Jun 22, 2026
@zhangqi-chen zhangqi-chen merged commit cdb64e0 into hw-native-sys:main Jun 22, 2026
5 of 7 checks passed
@lwDavid lwDavid requested a review from zhangqi-chen June 22, 2026 09:50
zhangqi-chen pushed a commit that referenced this pull request Jun 22, 2026
zhangqi-chen pushed a commit that referenced this pull request Jun 23, 2026
#578)

## Summary
- Retile the DeepSeek-V4 `qkv_proj_rope` projection matmuls to the 512B
L2 cache line and fuse RMSNorm with RoPE. **Decode end-to-end −56%**
(a2a3 L2 swimlane, 5-rep median: 936µs → 407µs); golden green on decode
and prefill.
- `qr_proj` / `kv_proj`: split-K (zero-seed + atomic-add) with N-tile 32
→ 256, so each `wq_a`/`wkv` row-read fills a full 512B cache line
instead of a 64B sub-line (was 8× weight over-fetch). Kernel occupancy
−84% / −75%.
- `qproj_matmul`: decouple the matmul N-tile from the dequant N-tile and
bump matmul `TN` 128 → 256 (256B/row), capped by the L0C `Acc` limit
(`TM*TN*4 ≤ 128KB`). `TN=512` needs an M-split (`TM=64`) and measured no
faster end-to-end on device.
- Fuse per-head RMSNorm + NOPE + RoPE into `q_head_rms_nope_rope`, and
KV RMSNorm + RoPE into `kv_rms_norm_rope`: `inv_rms` stays in registers
(no GM round-trip via the old `q_head_inv_rms_all` /
`kv_inv_rms_tensor`), collapsing each pair of dispatches into one. RoPE
keeps the interleaved (CANN A3) swap-gather layout.
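
The tile-size arithmetic in the bullets above can be sanity-checked with a small sketch. The 128KB limit, the factor of 4, and the 512B line size come from the text; reading the factor of 4 as a 4-byte accumulator element and using a 2-byte weight element (inferred from "N-tile 32" corresponding to a 64B sub-line) are assumptions.

```python
L0C_ACC_BYTES = 128 * 1024   # "TM*TN*4 <= 128KB" limit from the text
ACC_ELEM_BYTES = 4           # assumption: the "*4" is a 4-byte accumulator element
CACHE_LINE_BYTES = 512       # L2 cache line size from the text
WEIGHT_ELEM_BYTES = 2        # assumption: inferred from N-tile 32 <-> 64B

def max_tm(tn: int) -> int:
    """Largest M-tile that keeps a TM x TN accumulator within the L0C limit."""
    return L0C_ACC_BYTES // (tn * ACC_ELEM_BYTES)

def row_read_bytes(n_tile: int) -> int:
    """Bytes fetched per weight row for a given N-tile width."""
    return n_tile * WEIGHT_ELEM_BYTES

for tn in (128, 256, 512):
    print(f"TN={tn}: max TM = {max_tm(tn)}")

old_bytes, new_bytes = row_read_bytes(32), row_read_bytes(256)
over_fetch = CACHE_LINE_BYTES // old_bytes
print(f"N-tile 32 reads {old_bytes}B, N-tile 256 reads {new_bytes}B, old over-fetch = {over_fetch}x")
```

Under these assumptions TN=512 caps TM at 64, matching the M-split mentioned above, and the old 32-wide tile uses one eighth of each cache line.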

## Related Issues
- The RMSNorm+RoPE fusion re-introduces fused rope on top of the
**interleaved** layout restored by #575 (the revert of #570); it does
not bring back the split-half layout. The matmul retiling is independent
of the rope layout.
Labels

enhancement New feature or request

Projects

Status: Done

2 participants