perf(dsv4 qkv): precompute q-head rope factors by wangqin1723-max · Pull Request #648 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-06-30T06:49:05Z

Summary

Move the head-invariant q-head RoPE setup out of each q_head_rms_nope_rope task and into a single q_rope_cs task.

Before this change, each q-head task rebuilt the interleaved RoPE setup locally:

dup_idx = j >> 1
sign = [-1, +1, ...]
swap_idx = j ^ 1
duplicated/interleaved cos and sin

This PR precomputes:

q_rope_cos_il
q_rope_sin_signed
q_rope_swap_idx

Then q_head_rms_nope_rope only loads those factors and keeps the same j^1 gather rotation:

out[j] = inv_rms * (x[j] * cos_il[j] + x[j^1] * sign[j] * sin_il[j])

No kernel input is added. The implementation avoids the even/odd scatter path, which was measured much slower.
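The split between the one-time setup and the per-head rotation can be sketched in NumPy. This is an illustration only, not the pypto kernel: the function names, the `[T, ROPE_DIM // 2]` layout of the cos/sin tables, and the complex-number reference are assumptions made for the example.

```python
import numpy as np


def precompute_q_rope_factors(cos_half, sin_half):
    """Head-invariant setup, done once (hypothetical stand-in for q_rope_cs).

    cos_half, sin_half: [T, ROPE_DIM // 2] tables (assumed layout).
    """
    rope_dim = 2 * cos_half.shape[1]
    j = np.arange(rope_dim)
    dup_idx = j >> 1                         # 0, 0, 1, 1, 2, 2, ...
    sign = np.where(j % 2 == 0, -1.0, 1.0)   # [-1, +1, -1, +1, ...]
    swap_idx = j ^ 1                         # 1, 0, 3, 2, ...
    cos_il = cos_half[:, dup_idx]            # duplicated/interleaved cos
    sin_signed = sin_half[:, dup_idx] * sign # sign folded into sin
    return cos_il, sin_signed, swap_idx


def rotate_head(x, inv_rms, cos_il, sin_signed, swap_idx):
    """Per-head work: out[j] = inv_rms * (x[j]*cos_il[j] + x[j^1]*sin_signed[j])."""
    return inv_rms * (x * cos_il + x[:, swap_idx] * sin_signed)


def rotate_head_reference(x, inv_rms, cos_half, sin_half):
    """Reference: rotate each (even, odd) pair as a complex number."""
    z = (x[:, 0::2] + 1j * x[:, 1::2]) * (cos_half + 1j * sin_half)
    out = np.empty_like(x)
    out[:, 0::2] = z.real
    out[:, 1::2] = z.imag
    return inv_rms * out
```

In this sketch the factors are built once and every head calls only `rotate_head`, which mirrors the intent of the change: the index, sign, and gather setup no longer runs per head.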

Results

Measured on a2a3, fixed --device 3, 3 runs each. Logs and copied swimlane JSON files are under /data/w00956228/newpto/pypto-lib/.

Standalone `qkv_proj_rope.py` decode

Command:

python models/deepseek/v4/qkv_proj_rope.py -p a2a3 --mode decode --enable-l2-swimlane

metric	before avg (n=3)	after avg (n=3)	delta
Total Test Time	220.967us	221.673us	+0.32%
swimlane wall	258.613us	262.353us	+1.45%
`q_head_rms_nope_rope` Avg Exec	22.227us	19.130us	-13.93%
`q_head_rms_nope_rope` span	34.420us	31.347us	-8.93%
new `q_rope_cs` Avg Exec	n/a	5.047us	n/a
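
The delta column is the relative change of the averaged values. A small sketch that recomputes it from the table above (the helper name `pct_delta` is made up for this example):

```python
def pct_delta(before_us, after_us):
    """Relative change in percent; negative means the 'after' run is faster."""
    return (after_us - before_us) / before_us * 100.0


# (before avg, after avg) in microseconds, copied from the standalone table.
rows = {
    "Total Test Time": (220.967, 221.673),
    "swimlane wall": (258.613, 262.353),
    "q_head_rms_nope_rope Avg Exec": (22.227, 19.130),
    "q_head_rms_nope_rope span": (34.420, 31.347),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_delta(before, after):+.2f}%")
```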

Per-run logs:

qkv_proj_rope_qhead_ropecs_before_r{1,2,3}_dev3.log
qkv_proj_rope_qhead_ropecs_after_r{1,2,3}_dev3.log
qkv_proj_rope_qhead_ropecs_dev3_summary.txt

Full `decode_attention_hca.py`

Command:

python models/deepseek/v4/decode_attention_hca.py -p a2a3 --enable-l2-swimlane 1

Total Test Time in the HCA text table is unreliable (0.00us) due to L2 swimlane skipped-record warnings, so wall is computed from merged_swimlane_*.json, excluding setup events.

metric	before avg (n=3)	after avg (n=3)	delta
HCA swimlane wall	578.607us	569.740us	-1.53%
`q_head_rms_nope_rope` Avg Exec	14.197us	11.007us	-22.47%
`q_head_rms_nope_rope` span	16.867us	14.873us	-11.82%
new `q_rope_cs` Avg Exec	n/a	4.127us	n/a

Per-run logs:

decode_attention_hca_qkv_ropecs_before_r{1,2,3}_dev3.log
decode_attention_hca_qkv_ropecs_after_r{1,2,3}_dev3.log
decode_attention_hca_qkv_ropecs_dev3_summary.txt

Correctness

All measured runs PASS.

Standalone qkv_proj_rope.py validates:

q PASS
kv PASS
qr PASS
qr_scale PASS

Full decode_attention_hca.py validates:

kv_cache PASS
x_out PASS

Also checked:

python -m py_compile models/deepseek/v4/qkv_proj_rope.py


coderabbitai · 2026-06-30T06:49:35Z

Walkthrough

In qkv_proj_rope.py, the interleaved RoPE index/sign and cos/sin dup-gather preparation is moved from inside each q_head_rms_nope_rope SPMD task into a single kernel-wide pl.spmd(1) precomputation block. Three tensors—q_rope_cos_il, q_rope_sin_signed, and q_rope_swap_idx—are computed once and reused per head. The rotation writeback is updated to use q_rope_sin_signed directly, removing the q_sign * q_sin_il multiplication.

Changes

RoPE Precomputation Refactor

Layer / File(s)	Summary
Kernel-wide RoPE precompute and updated writeback `models/deepseek/v4/qkv_proj_rope.py`	Adds a `pl.spmd(1)` block before the per-head loop to precompute `q_rope_cos_il`, `q_rope_sin_signed` (sin with sign folded in), and `q_rope_swap_idx` (`j^1` swap index). The per-head loop slices these tensors instead of rebuilding index/sign/gathered values. The rotation writeback at line 332 drops the explicit `q_sign * q_sin_il` multiplication and uses `q_rope_sin_signed` directly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

hw-native-sys/pypto-lib#480: Implements the same interleaved RoPE swap-gather index/sign logic in qkv_proj_rope.py that this PR moves out of the per-head task.
hw-native-sys/pypto-lib#525: Applies the same sign-folding-into-sin refactor pattern (removing separate sign multiplication) in a different kernel file.
hw-native-sys/pypto-lib#578: Modifies the same q_head_rms_nope_rope RoPE interleaved swap/cos/sin preparation and writeback logic in qkv_proj_rope.py.


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main change: precomputing q-head RoPE factors for DeepSeek v4 QKV performance.
Description check	✅ Passed	The description accurately summarizes the refactor, motivation, and measured results, and it aligns with the code changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/qkv_proj_rope.py`:
- Around line 269-271: `q_rope_cos_il` and `q_rope_sin_signed` in
`qkv_proj_rope.py` are allocated with the runtime-derived `t_dim`, which
violates the static allocation pattern used by this kernel. Update the
`create_tensor` shapes in the `qkv_proj_rope` path to use the compile-time
`T_MAX` for the row dimension, matching the other GM allocations like
`x_matmul`, `qr_fp32`, `qr_i8_matmul`, and `q_proj_fp32/i32`, while keeping
`t_dim` only for later views or slices; `q_rope_swap_idx` already follows the
correct static sizing.


📥 Commits

Reviewing files that changed from the base of the PR and between 9d923d9 and bff20f4.

📒 Files selected for processing (1)

models/deepseek/v4/qkv_proj_rope.py

coderabbitai · 2026-06-30T06:52:38Z

+    q_rope_cos_il = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
+    q_rope_sin_signed = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
+    q_rope_swap_idx = pl.create_tensor([Q_ROPE_T_TILE, ROPE_DIM], dtype=pl.INT32)


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Allocate precompute tensors with the static T_MAX, not the dynamic t_dim.

q_rope_cos_il and q_rope_sin_signed are allocated with t_dim = pl.tensor.dim(x, 0), a runtime-derived dimension off the dynamic T_DYN. Every other GM allocation in this kernel (x_matmul, qr_fp32, qr_i8_matmul, q_proj_fp32/i32, kv_fp32) sizes the row dimension with the compile-time T_MAX, and q_rope_swap_idx here correctly uses the static Q_ROPE_T_TILE. Sizing an allocation with a dynamic dimension breaks the static-allocation contract; keep t_dim for the views/slices only.

🛠️ Proposed fix

-    q_rope_cos_il = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
-    q_rope_sin_signed = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
+    q_rope_cos_il = pl.create_tensor([T_MAX, ROPE_DIM], dtype=pl.FP32)
+    q_rope_sin_signed = pl.create_tensor([T_MAX, ROPE_DIM], dtype=pl.FP32)
     q_rope_swap_idx = pl.create_tensor([Q_ROPE_T_TILE, ROPE_DIM], dtype=pl.INT32)

Based on learnings: "avoid passing dynamic dimension variables ... to pl.create_tensor() shape arguments. Tensor allocations must use compile-time static dimension values (e.g., use the compile-time batch parameter ...)."
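
The "allocate at the static bound, slice to the runtime size" pattern that the comment asks for can be illustrated with NumPy. This is not the pypto API: `T_MAX = 128`, `ROPE_DIM = 64`, and both helper names are hypothetical values and names chosen for the sketch.

```python
import numpy as np

T_MAX = 128    # hypothetical compile-time upper bound on the token count
ROPE_DIM = 64  # hypothetical rope width


def alloc_rope_factor_buffer():
    """Allocate once with the static row count, independent of the input."""
    return np.zeros((T_MAX, ROPE_DIM), dtype=np.float32)


def active_rows(buf, t_dim):
    """Return a view of the first t_dim rows; no new allocation happens here."""
    if not 0 < t_dim <= T_MAX:
        raise ValueError(f"t_dim={t_dim} outside (0, {T_MAX}]")
    return buf[:t_dim]


buf = alloc_rope_factor_buffer()
view = active_rows(buf, t_dim=5)  # runtime size only shapes the view
view[:] = 1.0                     # writes land in the static buffer
```

The buffer shape never depends on the runtime dimension; only the view does, which is the property the review comment describes as the kernel's static-allocation convention.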

#!/bin/bash
# Confirm the file's create_tensor convention: row dim should be T_MAX, not t_dim.
rg -nP 'pl\.create_tensor\(\s*\[' models/deepseek/v4/qkv_proj_rope.py
# Show T_MAX definition / origin.
rg -nP '\bT_MAX\b' models/deepseek/v4/qkv_proj_rope.py


Source: Learnings

perf(dsv4 qkv): precompute q-head rope factors

bff20f4

coderabbitai Bot reviewed Jun 30, 2026

wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:perf/dsv4-qkv-rope-cs-precompute
