Revert "feat(deepseek/v4): gather-free split-half RoPE for decode + prefill" by lwDavid · Pull Request #575 · hw-native-sys/pypto-lib

lwDavid · 2026-06-22T12:10:36Z

Reverts #570

…refill (…" This reverts commit cdb64e0.

coderabbitai · 2026-06-22T12:10:51Z

📝 Walkthrough

Across the DeepSeek-V4 decode and prefill attention stack, split-half NeoX RoPE frequency tables (rope_cos_half/rope_sin_half, FP32, shape [T, HALF_ROPE]) are replaced with full interleaved frequency tables (freqs_cos/freqs_sin, BF16, shape [T, ROPE_DIM]). The rotation math in sparse attention kernels, compressor kernels, and QKV projection is rewritten from contiguous low/high half-vector operations to an A3-style swap-gather (j^1 index + sign mask). Attention harnesses remove half-table construction and wire full tables through. Golden references and test specs are updated throughout.
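The "swap-gather (j^1 index + sign mask)" formulation mentioned above can be illustrated in plain Python. This is a minimal sketch, not the PyPTO kernel code: the function names are hypothetical, the sign mask uses the forward-rotation convention `[-1, +1, ...]` (the review snippets later on this page show the conjugate `[+1, -1, ...]` used for inverse RoPE), and lists stand in for tensors.

```python
import math

def rope_pairwise(x, cos, sin):
    """Reference: rotate each adjacent pair (x[2i], x[2i+1]) by angle i."""
    out = []
    for i in range(len(x) // 2):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * cos[i] - b * sin[i], a * sin[i] + b * cos[i]]
    return out

def rope_swap_gather(x, cos, sin):
    """Same rotation expressed per lane: gather the partner lane j^1,
    then combine it with a per-lane sign mask (forward convention assumed)."""
    out = []
    for j in range(len(x)):
        partner = x[j ^ 1]                   # j^1 swaps lanes within each pair
        sign = -1.0 if j % 2 == 0 else 1.0   # sign mask [-1, +1, -1, +1, ...]
        c, s = cos[j >> 1], sin[j >> 1]      # duplicated table [c0, c0, c1, c1, ...]
        out.append(x[j] * c + sign * partner * s)
    return out

angles = [0.3, 1.1, -0.7]
cos_t = [math.cos(a) for a in angles]
sin_t = [math.sin(a) for a in angles]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rope_pairwise(x, cos_t, sin_t))
print(rope_swap_gather(x, cos_t, sin_t))
```

Both functions compute the same values; the second form needs only one gather and elementwise multiplies per lane, which is presumably why it maps well onto vector hardware.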

Changes

DeepSeek-V4 RoPE: split-half → interleaved frequency tables

Layer / File(s)	Summary
Sparse attention kernel signatures + interleaved inverse-RoPE `models/deepseek/v4/decode_sparse_attn.py`, `models/deepseek/v4/decode_sparse_attn_hca.py`, `models/deepseek/v4/decode_sparse_attn_swa.py`, `models/deepseek/v4/prefill_sparse_attn.py`	Adds `ROPE_TILE`/`ROPE_INTERLEAVE_TILE` constants and `get_standalone_cmp_valid` helper. Changes `sparse_attn`, `sparse_attn_hca`, `sparse_attn_swa`, and `prefill_sparse_attn` signatures from `rope_cos_half`/`rope_sin_half` (FP32, half-width) to `freqs_cos`/`freqs_sin` (BF16, full-width). Rewrites in-kernel inverse-RoPE to precompute head-invariant interleaved cos/signed-sin buffers and apply j^1 swap-gather rotation into `o_packed`. Updates test wrappers, golden references, and `build_tensor_specs` to use `rope_tables` helpers.
Compressor NeoX→interleaved RoPE rewrite `models/deepseek/v4/decode_compressor_ratio4.py`, `models/deepseek/v4/decode_compressor_ratio128.py`, `models/deepseek/v4/prefill_compressor_ratio4.py`, `models/deepseek/v4/prefill_compressor_ratio128.py`	Rewrites the `NOPE_HEAD_DIM:HEAD_DIM` RoPE rotation in all four compressor kernels: replaces `gamma_lo`/`gamma_hi` split-half NeoX rotation with even/odd lane gather (`P0101`/`P1010`), j^1 swap-gather, sign-adjusted rotation, and scatter back into `rope_buf`. Golden references change from `torch.cat(rot_lo, rot_hi)` to unflatten-into-pairs + stack + flatten.
QKV projection Q and KV interleaved RoPE rewrite `models/deepseek/v4/qkv_proj_rope.py`	Replaces split-half Q RoPE (`q_lo`/`q_hi`, two-region writeback) and split-half KV RoPE (`gamma_lo`/`gamma_hi`) with a single-pass interleaved swap-gather rotation for both paths. Builds per-lane j^1 and sign mask in-kernel, gathers duplicated cos/sin per lane, folds `gamma` before swapping, and writes the full `ROPE_DIM` slice. Updates golden `apply_rope` from `cat`-after-rotation to unflatten-stack pair rotation.
Attention harness half-table removal and call-site wiring `models/deepseek/v4/decode_attention_csa.py`, `models/deepseek/v4/decode_attention_hca.py`, `models/deepseek/v4/decode_attention_swa.py`, `models/deepseek/v4/prefill_attention_csa.py`, `models/deepseek/v4/prefill_attention_hca.py`, `models/deepseek/v4/prefill_attention_swa.py`, `models/deepseek/v4/decode_layer.py`	Removes allocation and per-token assembly of `rope_cos_half_t`/`rope_sin_half_t` in all six decode/prefill CSA, HCA, and SWA attention harnesses. Sparse attention call sites updated to pass full `rope_cos_t`/`rope_sin_t` tables directly as `freqs_cos`/`freqs_sin`. Golden references remove FP32 half-table construction and update `golden_sparse_attn` invocations accordingly. Removes the NeoX split-half documentation comment from `decode_layer.py`.
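To make the golden-reference change in the table concrete (from `cat` of rotated halves to unflatten, stack, flatten), here is a hedged plain-Python sketch of the two layouts. The function names are illustrative and lists stand in for torch tensors; the actual `apply_rope` in the repository may differ in detail.

```python
import math

def rope_split_half(x, cos, sin):
    """NeoX-style: lane i pairs with lane i + half; result is cat(rot_lo, rot_hi)."""
    half = len(x) // 2
    lo, hi = x[:half], x[half:]
    rot_lo = [lo[i] * cos[i] - hi[i] * sin[i] for i in range(half)]
    rot_hi = [lo[i] * sin[i] + hi[i] * cos[i] for i in range(half)]
    return rot_lo + rot_hi

def rope_interleaved(x, cos, sin):
    """Interleaved: unflatten into adjacent pairs, rotate, stack, flatten."""
    pairs = [(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)]
    rotated = [(a * c - b * s, a * s + b * c) for (a, b), c, s in zip(pairs, cos, sin)]
    return [v for pair in rotated for v in pair]

def deinterleave(v):
    """Move even lanes to the low half and odd lanes to the high half."""
    return v[0::2] + v[1::2]

angles = [0.2, 0.9]
cos_t = [math.cos(a) for a in angles]
sin_t = [math.sin(a) for a in angles]
x = [1.0, 2.0, 3.0, 4.0]

y_interleaved = rope_interleaved(x, cos_t, sin_t)
y_split = rope_split_half(deinterleave(x), cos_t, sin_t)
print(deinterleave(y_interleaved))
print(y_split)
```

The two layouts apply the same rotation to differently ordered lanes, so they agree only after a lane permutation. A kernel and its golden reference therefore have to use the same convention.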

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/pypto-lib#480: Updates the same decode compressor ratio4/ratio128 paths to use A3 interleaved swap-gather RoPE, directly overlapping with the compressor kernel rewrites in this PR.
hw-native-sys/pypto-lib#525: Modifies decode_sparse_attn.py's inverse-RoPE pipeline to use freqs_cos/freqs_sin with interleaved precompute (rope_cos_il/signed sine), which is the same kernel-level change this PR finalizes.
hw-native-sys/pypto-lib#538: Switches build_tensor_specs initializers across the same test harnesses to use rope_tables.build_deepseek_v4_rope_tables for freqs_cos/freqs_sin, directly matching this PR's test-spec wiring changes.

Poem

🐇 Hop hop, the half-tables are gone today,
Full freqs_cos and freqs_sin lead the way!
No more split lo and hi in a NeoX pair,
j^1 swap-gather rotates through the air.
The bunny approves—interleaved is chic,
BF16 all the way, what a sleek technique! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 56.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly identifies that this is a revert of a previous gather-free split-half RoPE feature for DeepSeek v4, matching the actual changeset that removes this optimization and restores the prior RoPE table approach.
Description check	✅ Passed	The PR description correctly identifies that this reverts PR `#570`, which is directly related to the changeset that updates RoPE handling across multiple DeepSeek v4 files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request refactors the Rotary Position Embedding (RoPE) implementation across multiple prefill and decode attention/compressor modules, transitioning from a split-half (NeoX) layout to an interleaved layout (CANN A3 rotate_interleaved). This change eliminates the need for half-width RoPE tables and simplifies the kernel signatures by directly utilizing the full-width freqs_cos and freqs_sin tensors. However, in prefill_sparse_attn.py, the undefined constant ROPE_HALF is used instead of HALF_ROPE in several places, which will cause runtime NameErrors.
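As a generic Python illustration of why a misspelled constant such as `ROPE_HALF` surfaces only at run time: name lookup happens when the line executes, not when the function is defined. The constant value and function below are hypothetical and are not taken from `prefill_sparse_attn.py`.

```python
HALF_ROPE = 32  # hypothetical value, for illustration only

def first_half(row):
    # Typo: ROPE_HALF is never defined; Python resolves the name only when
    # this line actually runs.
    return row[:ROPE_HALF]

def call_and_report(row):
    """Return the exception type name raised by first_half, or None."""
    try:
        first_half(row)
    except NameError as exc:
        return type(exc).__name__
    return None

print(call_and_report(list(range(64))))
```

Defining `first_half` succeeds, so an import-only smoke test would not catch the typo; the failing line has to be executed (or flagged by a linter).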

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

models/deepseek/v4/decode_attention_hca.py (1)

152-155: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive HCA compressor RoPE halves from even lanes.

cmp_cos/cmp_sin are half-width compressor inputs. With full interleaved freqs_*, taking 0 : ROPE_HEAD_DIM // 2 feeds duplicated low-pair frequencies; gather even lanes from the full row instead, and update the golden reference to use 0::2.

Golden-side fix

-        cmp_cos[b] = freqs_cos[cmp_pos_b, :half_rd].float()
-        cmp_sin[b] = freqs_sin[cmp_pos_b, :half_rd].float()
+        cmp_cos[b] = freqs_cos[cmp_pos_b, 0::2].float()
+        cmp_sin[b] = freqs_sin[cmp_pos_b, 0::2].float()

Also applies to: 423-424
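A small sketch of the slicing difference this comment describes, using string labels in place of frequency values. It assumes the full-width row has the interleaved layout `[c0, c0, c1, c1, ...]` stated by the reviewer; the helper name is hypothetical.

```python
def interleaved_row(half_freqs):
    """Build a full-width row by duplicating each frequency into adjacent lanes."""
    return [c for c in half_freqs for _ in range(2)]

half_freqs = ["c0", "c1", "c2", "c3"]
row = interleaved_row(half_freqs)   # full-width interleaved row
half_rd = len(row) // 2

contiguous = row[:half_rd]          # what `[:half_rd]` extracts
even_lanes = row[0::2]              # what `0::2` extracts

print(contiguous)
print(even_lanes)
```

Only the even-lane slice recovers the unique half-width table; the contiguous slice repeats the low frequencies and drops the high ones.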

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_attention_hca.py` around lines 152 - 155, The
slicing in the HCA compressor RoPE assignments is using 0 : ROPE_HEAD_DIM // 2
which extracts the first half of duplicated low-pair frequencies from the full
interleaved freqs_cos and freqs_sin arrays. Replace the slice syntax 0 :
ROPE_HEAD_DIM // 2 with 0::2 in the extraction of cmp_cos_row and cmp_sin_row to
gather only the even lanes from the full row instead. Apply the same fix to the
corresponding assignments in both the current location and the other occurrence
mentioned at lines 423-424 to ensure consistent even-lane extraction across all
HCA compressor RoPE operations.

models/deepseek/v4/decode_attention_csa.py (1)

202-214: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Extract compressor/indexer RoPE halves from even interleaved lanes.

step_cos/step_sin and cmp_cos/cmp_sin are half-width inputs, but slicing freqs_*[:, :HALF_ROPE] from a full interleaved row yields [c0, c0, c1, c1, ...] instead of [c0, c1, ...]. Gather lanes 0, 2, 4, ... in the PyPTO path and mirror that with 0::2 in the golden path.

Golden-side fix to mirror the expected half layout

-    step_cos = freqs_cos[first_pos, :HALF_ROPE].float().contiguous()
-    step_sin = freqs_sin[first_pos, :HALF_ROPE].float().contiguous()
+    step_cos = freqs_cos[first_pos, 0::2].float().contiguous()
+    step_sin = freqs_sin[first_pos, 0::2].float().contiguous()
     cmp_pos = first_pos + (COMPRESS_RATIO - (first_pos % COMPRESS_RATIO)) - COMPRESS_RATIO
-    cmp_cos = freqs_cos[cmp_pos, :HALF_ROPE].float().contiguous()
-    cmp_sin = freqs_sin[cmp_pos, :HALF_ROPE].float().contiguous()
+    cmp_cos = freqs_cos[cmp_pos, 0::2].float().contiguous()
+    cmp_sin = freqs_sin[cmp_pos, 0::2].float().contiguous()

Also applies to: 521-525
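The `cmp_pos` expression in the snippet above algebraically reduces to rounding `first_pos` down to a multiple of `COMPRESS_RATIO`. The sketch below checks this for plain integers; `COMPRESS_RATIO = 4` is an illustrative value, and in the repository `first_pos` may be a tensor rather than an `int`.

```python
COMPRESS_RATIO = 4  # illustrative value; the repository defines its own constant

def cmp_pos_as_written(first_pos, ratio=COMPRESS_RATIO):
    """The expression exactly as it appears in the golden reference."""
    return first_pos + (ratio - (first_pos % ratio)) - ratio

def cmp_pos_simplified(first_pos, ratio=COMPRESS_RATIO):
    """Equivalent form: round down to the nearest multiple of ratio."""
    return first_pos - (first_pos % ratio)

for pos in (0, 3, 8, 10):
    print(pos, cmp_pos_as_written(pos), cmp_pos_simplified(pos))
```

For example, with a ratio of 4, position 10 maps to 10 + (4 - 2) - 4 = 8, the start of its compression block.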

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_attention_csa.py` around lines 202 - 214, The
slicing of freqs_cos and freqs_sin in the step_cos/step_sin and cmp_cos/cmp_sin
tensor assembly blocks produces interleaved duplicated values [c0, c0, c1, c1,
...] instead of the required half-width format [c0, c1, ...]. Extract only the
even-indexed lanes (0, 2, 4, ...) from the sliced results for both the step rope
computation block (around the pl.slice calls for step_cos and step_sin) and the
compress rope computation block (around the pl.slice calls for cmp_cos and
cmp_sin). This ensures the tensors contain the correct unique RoPE values at the
required positions without duplication.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/decode_sparse_attn_hca.py`:
- Around line 314-318: The RoPE indexing in the gather operations for cs_cos and
cs_sin is not maintaining the full interleaved layout contract which should be
[c0, c0, c1, c1, ...] but instead produces [c0, c1, ..., c0, c1, ...]. Fix the
cs_dup_idx construction or adjust how the gather operations use it to ensure the
output follows the proper full interleaved RoPE layout with consecutive pairs of
identical indices. Apply the same fix to the similar gather operations also
appearing in the code (the additional locations mentioned in the comment).

In `@models/deepseek/v4/decode_sparse_attn.py`:
- Around line 346-350: The issue is that the slicing of freqs_cos and freqs_sin
in the code around lines 346-350 (variables cs_cos and cs_sin) is only
extracting the contiguous first half and then duplicating it, but these
frequency tables are now full interleaved tables. Instead of slicing just cp_r0
: cp_r0 + ROPE_TILE and duplicating through cs_dup_idx, you need to slice the
full interleaved tile from the frequency tables and use even lanes to access the
correct interleaved positions. Apply the same fix to the duplicate code section
mentioned at lines 626-630.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ce2a7cda-b605-4343-8380-ec0039b51b4a

📥 Commits

Reviewing files that changed from the base of the PR and between cdb64e0 and 4e76c17.

📒 Files selected for processing (16)

models/deepseek/v4/decode_attention_csa.py
models/deepseek/v4/decode_attention_hca.py
models/deepseek/v4/decode_attention_swa.py
models/deepseek/v4/decode_compressor_ratio128.py
models/deepseek/v4/decode_compressor_ratio4.py
models/deepseek/v4/decode_layer.py
models/deepseek/v4/decode_sparse_attn.py
models/deepseek/v4/decode_sparse_attn_hca.py
models/deepseek/v4/decode_sparse_attn_swa.py
models/deepseek/v4/prefill_attention_csa.py
models/deepseek/v4/prefill_attention_hca.py
models/deepseek/v4/prefill_attention_swa.py
models/deepseek/v4/prefill_compressor_ratio128.py
models/deepseek/v4/prefill_compressor_ratio4.py
models/deepseek/v4/prefill_sparse_attn.py
models/deepseek/v4/qkv_proj_rope.py

💤 Files with no reviewable changes (1)

models/deepseek/v4/decode_layer.py

coderabbitai · 2026-06-22T12:22:32Z

+        cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
+        cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
+        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+        rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
+            pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the HCA sparse path on the full interleaved RoPE layout.

The kernel and golden reference still consume a contiguous half-table, and the fixture builds [c0, c1, ..., c0, c1, ...]. That masks the mismatch with the full interleaved contract [c0, c0, c1, c1, ...] and can rotate production inputs with the wrong frequencies.

Proposed layout fix

-        cs_dup_idx = pl.cast(cs_dup_f, target_type=pl.INT32)  # j>>1
         cs_lane = pl.sub(cs_col, pl.mul(cs_dup_f, 2.0))  # j%2
         cs_sign = pl.neg(pl.sub(pl.mul(cs_lane, 2.0), 1.0))  # [+1,-1,...] (conjugate)
-        cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+        cs_cos = pl.cast(freqs_cos[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        cs_sin = pl.cast(freqs_sin[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = cs_cos
         rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
-            pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)
+            cs_sin, cs_sign)

-    cos_half = cos[:, :HALF_ROPE].unsqueeze(1)
-    sin_half = sin[:, :HALF_ROPE].unsqueeze(1)
+    cos_half = cos[:, 0::2].unsqueeze(1)
+    sin_half = sin[:, 0::2].unsqueeze(1)

-    return torch.cat([cos_half, cos_half], dim=-1)
+    return torch.repeat_interleave(cos_half, repeats=2, dim=-1)

-    return torch.cat([sin_half, sin_half], dim=-1)
+    return torch.repeat_interleave(sin_half, repeats=2, dim=-1)

Also applies to: 594-598, 701-711
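The reviewer's point that the fixture "masks the mismatch" can be reproduced with a toy model. The sketch assumes `ROPE_INTERLEAVE_TILE == 2 * ROPE_TILE` and models the reviewed path as "slice the first `ROPE_TILE` lanes, then gather with index `j >> 1`". The function name is hypothetical and string labels stand in for frequency values.

```python
def kernel_style_expand(table_row, rope_tile):
    """Mimic the reviewed path: take the first rope_tile lanes, then
    duplicate each one via a gather with index j >> 1."""
    first = table_row[:rope_tile]
    return [first[j >> 1] for j in range(2 * rope_tile)]

half = ["c0", "c1", "c2", "c3"]
tiled_fixture = half + half                           # [c0, c1, c2, c3, c0, c1, c2, c3]
interleaved = [c for c in half for _ in range(2)]     # [c0, c0, c1, c1, ...]

from_fixture = kernel_style_expand(tiled_fixture, 4)
from_interleaved = kernel_style_expand(interleaved, 4)

print(from_fixture == interleaved)   # fixture layout happens to give the right answer
print(from_interleaved)              # interleaved input gets the wrong frequencies
```

With the tiled fixture the expanded row matches the intended interleaved layout, so the test passes. Given a genuinely interleaved table, the same path uses only `c0` and `c1`.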

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_sparse_attn_hca.py` around lines 314 - 318, The RoPE indexing in the gather operations for cs_cos and cs_sin is not maintaining the full interleaved layout contract which should be [c0, c0, c1, c1, ...] but instead produces [c0, c1, ..., c0, c1, ...]. Fix the cs_dup_idx construction or adjust how the gather operations use it to ensure the output follows the proper full interleaved RoPE layout with consecutive pairs of identical indices. Apply the same fix to the similar gather operations also appearing in the code (the additional locations mentioned in the comment).

coderabbitai · 2026-06-22T12:22:32Z

+        cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
+        cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
+        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+        rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
+            pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consume the full interleaved RoPE lanes instead of re-expanding the first half.

freqs_cos/freqs_sin are now full interleaved tables, but these paths still slice the contiguous first half and duplicate it. For an interleaved row like [c0, c0, c1, c1, ...], that shifts the frequency mapping after the first pair. Slice the full interleaved tile in the kernel and use even lanes in the golden reference.

Proposed layout fix

-        cs_dup_idx = pl.cast(cs_dup_f, target_type=pl.INT32)  # j>>1
         cs_lane = pl.sub(cs_col, pl.mul(cs_dup_f, 2.0))  # j%2
         cs_sign = pl.neg(pl.sub(pl.mul(cs_lane, 2.0), 1.0))  # [+1,-1,...] (conjugate)
-        cs_cos = pl.cast(freqs_cos[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        cs_sin = pl.cast(freqs_sin[0:T, cp_r0 : cp_r0 + ROPE_TILE], target_type=pl.FP32)
-        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.gather(cs_cos, dim=-1, index=cs_dup_idx)
+        cs_cos = pl.cast(freqs_cos[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        cs_sin = pl.cast(freqs_sin[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE], target_type=pl.FP32)
+        rope_cos_il[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = cs_cos
         rope_sin_signed[0:T, cp_c0 : cp_c0 + ROPE_INTERLEAVE_TILE] = pl.mul(
-            pl.gather(cs_sin, dim=-1, index=cs_dup_idx), cs_sign)
+            cs_sin, cs_sign)

-    cos_half = cos[:, :HALF_ROPE].unsqueeze(1)
-    sin_half = sin[:, :HALF_ROPE].unsqueeze(1)
+    cos_half = cos[:, 0::2].unsqueeze(1)
+    sin_half = sin[:, 0::2].unsqueeze(1)

Also applies to: 626-630
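The inline comments in the snippet above (`j>>1`, `j%2`, `[+1,-1,...]`) describe per-lane index arithmetic. The sketch below reproduces it with plain integers and shows how the signed sine row would be formed from an already interleaved sine row, as in the proposed fix. The tile width and function names are illustrative assumptions, not PyPTO API.

```python
ROPE_INTERLEAVE_TILE = 8  # hypothetical tile width, for illustration only

def lane_masks(width):
    """Derive the pair index, lane parity, and conjugate sign for each column j."""
    cols = list(range(width))
    dup_idx = [j >> 1 for j in cols]                     # j>>1: frequency pair of lane j
    lane = [j - 2 * d for j, d in zip(cols, dup_idx)]    # j%2: position within the pair
    sign = [-(2 * l - 1) for l in lane]                  # [+1, -1, ...] (conjugate)
    return dup_idx, lane, sign

def signed_sin(sin_row, sign):
    """Elementwise product of an interleaved sine row and the sign mask."""
    return [s * g for s, g in zip(sin_row, sign)]

dup_idx, lane, sign = lane_masks(ROPE_INTERLEAVE_TILE)
sin_row = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4]       # already interleaved
print(dup_idx)
print(sign)
print(signed_sin(sin_row, sign))
```

When the input row is already interleaved, no gather is needed for cos or sin; only the sign mask is applied to the sine row.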

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_sparse_attn.py` around lines 346 - 350, The issue is that the slicing of freqs_cos and freqs_sin in the code around lines 346-350 (variables cs_cos and cs_sin) is only extracting the contiguous first half and then duplicating it, but these frequency tables are now full interleaved tables. Instead of slicing just cp_r0 : cp_r0 + ROPE_TILE and duplicating through cs_dup_idx, you need to slice the full interleaved tile from the frequency tables and use even lanes to access the correct interleaved positions. Apply the same fix to the duplicate code section mentioned at lines 626-630.

#578)

## Summary

- Retile the DeepSeek-V4 `qkv_proj_rope` projection matmuls to the 512B L2 cache line and fuse RMSNorm with RoPE. **Decode end-to-end −56%** (a2a3 L2 swimlane, 5-rep median: 936µs → 407µs); golden green on decode and prefill.
- `qr_proj` / `kv_proj`: split-K (zero-seed + atomic-add) with N-tile 32 → 256, so each `wq_a`/`wkv` row-read fills a full 512B cache line instead of a 64B sub-line (was 8× weight over-fetch). Kernel occupancy −84% / −75%.
- `qproj_matmul`: decouple the matmul N-tile from the dequant N-tile and bump matmul `TN` 128 → 256 (256B/row), capped by the L0C `Acc` limit (`TM*TN*4 ≤ 128KB`). `TN=512` needs an M-split (`TM=64`) and measured no faster end-to-end on device.
- Fuse per-head RMSNorm + NOPE + RoPE into `q_head_rms_nope_rope`, and KV RMSNorm + RoPE into `kv_rms_norm_rope`: `inv_rms` stays in registers (no GM round-trip via the old `q_head_inv_rms_all` / `kv_inv_rms_tensor`), collapsing each pair of dispatches into one. RoPE keeps the interleaved (CANN A3) swap-gather layout.

## Related Issues

- The RMSNorm+RoPE fusion re-introduces fused rope on top of the **interleaved** layout restored by #575 (the revert of #570); it does not bring back the split-half layout. The matmul retiling is independent of the rope layout.
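The figures quoted in the #578 summary can be cross-checked with some quick arithmetic. This is a sketch under stated assumptions: the 2-byte element size for `qr_proj` / `kv_proj` is inferred from "N-tile 32" corresponding to a 64B sub-line, "128KB" is read as 128 × 1024 bytes, and the helper names are hypothetical.

```python
CACHE_LINE_BYTES = 512
SUB_LINE_BYTES = 64
ELEM_BYTES = SUB_LINE_BYTES // 32   # inferred: 32 elements per 64B read -> 2 bytes each

def row_read_bytes(n_tile, elem_bytes=ELEM_BYTES):
    """Bytes fetched per weight row for a given N-tile width."""
    return n_tile * elem_bytes

def max_tm(tn, acc_limit_bytes=128 * 1024, acc_elem_bytes=4):
    """Largest TM satisfying TM * TN * 4 <= 128KB (KB read as 1024 bytes)."""
    return acc_limit_bytes // (tn * acc_elem_bytes)

over_fetch = CACHE_LINE_BYTES // row_read_bytes(32)   # cache line bytes per useful sub-line
latency_reduction = (936 - 407) / 936                 # 5-rep medians quoted in the summary

print(row_read_bytes(32), row_read_bytes(256), over_fetch)
print(max_tm(256), max_tm(512))
print(f"{latency_reduction:.1%}")
```

Under these assumptions, an N-tile of 256 reads exactly one 512B line per row, `TN=256` allows `TM` up to 128, and `TN=512` halves that to 64, consistent with the M-split mentioned in the summary.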

Revert "feat(deepseek/v4): gather-free split-half RoPE for decode + p…

4e76c17

…refill (…" This reverts commit cdb64e0.

lwDavid requested a review from zhangqi-chen June 22, 2026 12:10

lwDavid self-assigned this Jun 22, 2026

lwDavid added the enhancement New feature or request label Jun 22, 2026

lwDavid added this to pto project Jun 22, 2026

lwDavid moved this to Done in pto project Jun 22, 2026

gemini-code-assist Bot reviewed Jun 22, 2026

Comment thread models/deepseek/v4/prefill_sparse_attn.py

Comment thread models/deepseek/v4/prefill_sparse_attn.py

Comment thread models/deepseek/v4/prefill_sparse_attn.py

coderabbitai Bot reviewed Jun 22, 2026

zhangqi-chen merged commit 5fdf7ec into hw-native-sys:main Jun 22, 2026
5 of 7 checks passed

Hzfengsy mentioned this pull request Jun 22, 2026

perf(deepseek/v4): qkv_proj_rope tiling + fused rms+rope (decode -56%) #578

Merged

zhangqi-chen merged 1 commit into hw-native-sys:main from lwDavid:revert-570-research/pr564-splithalf-rope
