perf(dsv4 qkv): precompute q-head rope factors #648

Open
wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:perf/dsv4-qkv-rope-cs-precompute
Conversation

@wangqin1723-max

Contributor

Summary

Move the head-invariant q-head RoPE setup out of each q_head_rms_nope_rope task and into a single q_rope_cs task.

Before this change, each q-head task rebuilt the interleaved RoPE setup locally:

  • dup_idx = j >> 1
  • sign = [-1, +1, ...]
  • swap_idx = j ^ 1
  • duplicated/interleaved cos and sin

This PR precomputes:

  • q_rope_cos_il
  • q_rope_sin_signed
  • q_rope_swap_idx

Then q_head_rms_nope_rope only loads those factors and keeps the same j^1 gather rotation:

out[j] = inv_rms * (x[j] * cos_il[j] + x[j^1] * sign[j] * sin_il[j])

No kernel input is added. The implementation avoids the even/odd scatter path, which measured much slower.
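The factor precompute and the j^1 gather rotation described above can be sketched in NumPy to check that they match the usual pairwise interleaved RoPE. This is a minimal host-side sketch, not the kernel code: the function names, the shapes, and the layout of `cos_half`/`sin_half` (one value per rotary pair) are assumptions made for illustration.

```python
import numpy as np


def precompute_rope_factors(cos_half, sin_half):
    """Build head-invariant factors once (hypothetical helper).

    cos_half, sin_half: [T, ROPE_DIM // 2], one value per rotary pair (assumed layout).
    """
    rope_dim = 2 * cos_half.shape[-1]
    j = np.arange(rope_dim)
    dup_idx = j >> 1                            # 0, 0, 1, 1, ...
    sign = np.where((j & 1) == 1, 1.0, -1.0)    # [-1, +1, -1, +1, ...]
    swap_idx = j ^ 1                            # 1, 0, 3, 2, ...
    cos_il = cos_half[..., dup_idx]             # duplicated/interleaved cos
    sin_signed = sin_half[..., dup_idx] * sign  # sign folded into sin
    return cos_il, sin_signed, swap_idx


def rotate_gather(x, cos_il, sin_signed, swap_idx, inv_rms=1.0):
    """out[j] = inv_rms * (x[j] * cos_il[j] + x[j^1] * sin_signed[j])."""
    return inv_rms * (x * cos_il + x[..., swap_idx] * sin_signed)


def rotate_reference(x, cos_half, sin_half, inv_rms=1.0):
    """Plain even/odd pair rotation, used only as a cross-check."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos_half - x_odd * sin_half
    out[..., 1::2] = x_odd * cos_half + x_even * sin_half
    return inv_rms * out


rng = np.random.default_rng(0)
num_heads, t_dim, rope_dim = 4, 3, 8            # illustrative sizes only
angles = rng.standard_normal((t_dim, rope_dim // 2))
cos_half, sin_half = np.cos(angles), np.sin(angles)
x = rng.standard_normal((num_heads, t_dim, rope_dim))

# Factors are computed once and reused by every head via broadcasting.
cos_il, sin_signed, swap_idx = precompute_rope_factors(cos_half, sin_half)
out = rotate_gather(x, cos_il, sin_signed, swap_idx, inv_rms=0.5)
ref = rotate_reference(x, cos_half, sin_half, inv_rms=0.5)
```

Because the three factors depend only on the token position and not on the head, every head can reuse them, which is the property the PR exploits.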

Results

Measured on a2a3, fixed --device 3, 3 runs each. Logs and copied swimlane JSON files are under /data/w00956228/newpto/pypto-lib/.

Standalone qkv_proj_rope.py decode

Command:

python models/deepseek/v4/qkv_proj_rope.py -p a2a3 --mode decode --enable-l2-swimlane

| metric | before avg (n=3) | after avg (n=3) | delta |
|---|---|---|---|
| Total Test Time | 220.967us | 221.673us | +0.32% |
| swimlane wall | 258.613us | 262.353us | +1.45% |
| q_head_rms_nope_rope Avg Exec | 22.227us | 19.130us | -13.93% |
| q_head_rms_nope_rope span | 34.420us | 31.347us | -8.93% |
| new q_rope_cs Avg Exec | n/a | 5.047us | n/a |
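The delta column appears to be the relative change of the "after" average against the "before" average. A small sketch, assuming that definition (the helper name `pct_delta` is hypothetical), recomputes it from the standalone numbers:

```python
def pct_delta(before_us, after_us):
    """Relative change of `after_us` versus `before_us`, in percent."""
    return (after_us - before_us) / before_us * 100.0


# (before avg, after avg) in microseconds, copied from the standalone decode results.
rows = {
    "Total Test Time": (220.967, 221.673),
    "swimlane wall": (258.613, 262.353),
    "q_head_rms_nope_rope Avg Exec": (22.227, 19.130),
    "q_head_rms_nope_rope span": (34.420, 31.347),
}

deltas = {name: pct_delta(before, after) for name, (before, after) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```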

Per-run logs:

  • qkv_proj_rope_qhead_ropecs_before_r{1,2,3}_dev3.log
  • qkv_proj_rope_qhead_ropecs_after_r{1,2,3}_dev3.log
  • qkv_proj_rope_qhead_ropecs_dev3_summary.txt

Full decode_attention_hca.py

Command:

python models/deepseek/v4/decode_attention_hca.py -p a2a3 --enable-l2-swimlane 1

Total Test Time in the HCA text table is unreliable (0.00us) due to L2 swimlane skipped-record warnings, so wall is computed from merged_swimlane_*.json, excluding setup events.

| metric | before avg (n=3) | after avg (n=3) | delta |
|---|---|---|---|
| HCA swimlane wall | 578.607us | 569.740us | -1.53% |
| q_head_rms_nope_rope Avg Exec | 14.197us | 11.007us | -22.47% |
| q_head_rms_nope_rope span | 16.867us | 14.873us | -11.82% |
| new q_rope_cs Avg Exec | n/a | 4.127us | n/a |

Per-run logs:

  • decode_attention_hca_qkv_ropecs_before_r{1,2,3}_dev3.log
  • decode_attention_hca_qkv_ropecs_after_r{1,2,3}_dev3.log
  • decode_attention_hca_qkv_ropecs_dev3_summary.txt

Correctness

All measured runs PASS.

Standalone qkv_proj_rope.py validates:

  • q PASS
  • kv PASS
  • qr PASS
  • qr_scale PASS

Full decode_attention_hca.py validates:

  • kv_cache PASS
  • x_out PASS

Also checked:

python -m py_compile models/deepseek/v4/qkv_proj_rope.py

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jun 30, 2026


Walkthrough

In qkv_proj_rope.py, the interleaved RoPE index/sign and cos/sin dup-gather preparation is moved from inside each q_head_rms_nope_rope SPMD task into a single kernel-wide pl.spmd(1) precomputation block. Three tensors—q_rope_cos_il, q_rope_sin_signed, and q_rope_swap_idx—are computed once and reused per head. The rotation writeback is updated to use q_rope_sin_signed directly, removing the q_sign * q_sin_il multiplication.

Changes

RoPE Precomputation Refactor

| Layer / File(s) | Summary |
|---|---|
| Kernel-wide RoPE precompute and updated writeback (models/deepseek/v4/qkv_proj_rope.py) | Adds a pl.spmd(1) block before the per-head loop to precompute q_rope_cos_il, q_rope_sin_signed (sin with sign folded in), and q_rope_swap_idx (j^1 swap index). The per-head loop slices these tensors instead of rebuilding index/sign/gathered values. The rotation writeback at line 332 drops the explicit q_sign * q_sin_il multiplication and uses q_rope_sin_signed directly. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#480: Implements the same interleaved RoPE swap-gather index/sign logic in qkv_proj_rope.py that this PR moves out of the per-head task.
  • hw-native-sys/pypto-lib#525: Applies the same sign-folding-into-sin refactor pattern (removing separate sign multiplication) in a different kernel file.
  • hw-native-sys/pypto-lib#578: Modifies the same q_head_rms_nope_rope RoPE interleaved swap/cos/sin preparation and writeback logic in qkv_proj_rope.py.

Poem

🐇 Hop, hop—no more repeat!
I've gathered cos and sin so neat,
One block to rule the swap and sign,
Each head just slices down the line.
Less work per loop, the kernels gleam—
A rabbit's tidy precompute dream! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly matches the main change: precomputing q-head RoPE factors for DeepSeek v4 QKV performance. |
| Description check | ✅ Passed | The description accurately summarizes the refactor, motivation, and measured results, and it aligns with the code changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/qkv_proj_rope.py`:
- Around line 269-271: `q_rope_cos_il` and `q_rope_sin_signed` in
`qkv_proj_rope.py` are allocated with the runtime-derived `t_dim`, which
violates the static allocation pattern used by this kernel. Update the
`create_tensor` shapes in the `qkv_proj_rope` path to use the compile-time
`T_MAX` for the row dimension, matching the other GM allocations like
`x_matmul`, `qr_fp32`, `qr_i8_matmul`, and `q_proj_fp32/i32`, while keeping
`t_dim` only for later views or slices; `q_rope_swap_idx` already follows the
correct static sizing.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 01af0ff6-6726-48a3-ab70-e3bd9c59e7b1

📥 Commits

Reviewing files that changed from the base of the PR and between 9d923d9 and bff20f4.

📒 Files selected for processing (1)
  • models/deepseek/v4/qkv_proj_rope.py

Comment on lines +269 to +271
q_rope_cos_il = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
q_rope_sin_signed = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
q_rope_swap_idx = pl.create_tensor([Q_ROPE_T_TILE, ROPE_DIM], dtype=pl.INT32)

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Allocate precompute tensors with the static T_MAX, not the dynamic t_dim.

q_rope_cos_il and q_rope_sin_signed are allocated with t_dim = pl.tensor.dim(x, 0), a runtime-derived dimension off the dynamic T_DYN. Every other GM allocation in this kernel (x_matmul, qr_fp32, qr_i8_matmul, q_proj_fp32/i32, kv_fp32) sizes the row dimension with the compile-time T_MAX, and q_rope_swap_idx here correctly uses the static Q_ROPE_T_TILE. Sizing an allocation with a dynamic dimension breaks the static-allocation contract; keep t_dim for the views/slices only.

🛠️ Proposed fix
-    q_rope_cos_il = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
-    q_rope_sin_signed = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
+    q_rope_cos_il = pl.create_tensor([T_MAX, ROPE_DIM], dtype=pl.FP32)
+    q_rope_sin_signed = pl.create_tensor([T_MAX, ROPE_DIM], dtype=pl.FP32)
     q_rope_swap_idx = pl.create_tensor([Q_ROPE_T_TILE, ROPE_DIM], dtype=pl.INT32)

Based on learnings: "avoid passing dynamic dimension variables ... to pl.create_tensor() shape arguments. Tensor allocations must use compile-time static dimension values (e.g., use the compile-time batch parameter ...)."
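The allocate-static, view-dynamic pattern the comment asks for can be illustrated with a loose NumPy analogy. This is not the pypto API: the values of `T_MAX` and `ROPE_DIM` and the helper `live_rows` are hypothetical, and NumPy itself has no static-allocation requirement. The sketch only shows a buffer sized by a compile-time bound with the runtime `t_dim` used for a view.

```python
import numpy as np

T_MAX = 16     # assumed compile-time upper bound on rows (hypothetical value)
ROPE_DIM = 8   # assumed rotary width (hypothetical value)

# Allocation shape depends only on compile-time constants.
q_rope_cos_il = np.zeros((T_MAX, ROPE_DIM), dtype=np.float32)


def live_rows(buf, t_dim):
    """Return a view of the first t_dim rows; no new allocation is made."""
    if not 0 <= t_dim <= buf.shape[0]:
        raise ValueError("t_dim exceeds the static row bound")
    return buf[:t_dim]


# The runtime dimension is used only to slice the statically sized buffer.
view = live_rows(q_rope_cos_il, 5)
view[:] = 1.0
```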

#!/bin/bash
# Confirm the file's create_tensor convention: row dim should be T_MAX, not t_dim.
rg -nP 'pl\.create_tensor\(\s*\[' models/deepseek/v4/qkv_proj_rope.py
# Show T_MAX definition / origin.
rg -nP '\bT_MAX\b' models/deepseek/v4/qkv_proj_rope.py
Source: Learnings
