perf(dsv4 qkv): precompute q-head rope factors#648
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
📝 WalkthroughWalkthroughIn ChangesRoPE Precomputation Refactor
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@models/deepseek/v4/qkv_proj_rope.py`:
- Around line 269-271: `q_rope_cos_il` and `q_rope_sin_signed` in
`qkv_proj_rope.py` are allocated with the runtime-derived `t_dim`, which
violates the static allocation pattern used by this kernel. Update the
`create_tensor` shapes in the `qkv_proj_rope` path to use the compile-time
`T_MAX` for the row dimension, matching the other GM allocations like
`x_matmul`, `qr_fp32`, `qr_i8_matmul`, and `q_proj_fp32/i32`, while keeping
`t_dim` only for later views or slices; `q_rope_swap_idx` already follows the
correct static sizing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 01af0ff6-6726-48a3-ab70-e3bd9c59e7b1
📒 Files selected for processing (1)
models/deepseek/v4/qkv_proj_rope.py
| q_rope_cos_il = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32) | ||
| q_rope_sin_signed = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32) | ||
| q_rope_swap_idx = pl.create_tensor([Q_ROPE_T_TILE, ROPE_DIM], dtype=pl.INT32) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
Allocate precompute tensors with the static T_MAX, not the dynamic t_dim.
q_rope_cos_il and q_rope_sin_signed are allocated with t_dim = pl.tensor.dim(x, 0), a runtime-derived dimension off the dynamic T_DYN. Every other GM allocation in this kernel (x_matmul, qr_fp32, qr_i8_matmul, q_proj_fp32/i32, kv_fp32) sizes the row dimension with the compile-time T_MAX, and q_rope_swap_idx here correctly uses the static Q_ROPE_T_TILE. Sizing an allocation with a dynamic dimension breaks the static-allocation contract; keep t_dim for the views/slices only.
🛠️ Proposed fix
- q_rope_cos_il = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
- q_rope_sin_signed = pl.create_tensor([t_dim, ROPE_DIM], dtype=pl.FP32)
+ q_rope_cos_il = pl.create_tensor([T_MAX, ROPE_DIM], dtype=pl.FP32)
+ q_rope_sin_signed = pl.create_tensor([T_MAX, ROPE_DIM], dtype=pl.FP32)
q_rope_swap_idx = pl.create_tensor([Q_ROPE_T_TILE, ROPE_DIM], dtype=pl.INT32)Based on learnings: "avoid passing dynamic dimension variables ... to pl.create_tensor() shape arguments. Tensor allocations must use compile-time static dimension values (e.g., use the compile-time batch parameter ...)."
#!/bin/bash
# Confirm the file's create_tensor convention: row dim should be T_MAX, not t_dim.
rg -nP 'pl\.create_tensor\(\s*\[' models/deepseek/v4/qkv_proj_rope.py
# Show T_MAX definition / origin.
rg -nP '\bT_MAX\b' models/deepseek/v4/qkv_proj_rope.py🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@models/deepseek/v4/qkv_proj_rope.py` around lines 269 - 271, `q_rope_cos_il`
and `q_rope_sin_signed` in `qkv_proj_rope.py` are allocated with the
runtime-derived `t_dim`, which violates the static allocation pattern used by
this kernel. Update the `create_tensor` shapes in the `qkv_proj_rope` path to
use the compile-time `T_MAX` for the row dimension, matching the other GM
allocations like `x_matmul`, `qr_fp32`, `qr_i8_matmul`, and `q_proj_fp32/i32`,
while keeping `t_dim` only for later views or slices; `q_rope_swap_idx` already
follows the correct static sizing.
Source: Learnings
Summary
Move the head-invariant q-head RoPE setup out of each
q_head_rms_nope_ropetask and into a singleq_rope_cstask.Before this change, each q-head task rebuilt the interleaved RoPE setup locally:
dup_idx = j >> 1sign = [-1, +1, ...]swap_idx = j ^ 1cosandsinThis PR precomputes:
q_rope_cos_ilq_rope_sin_signedq_rope_swap_idxThen
q_head_rms_nope_ropeonly loads those factors and keeps the samej^1gather rotation:No kernel input is added. The implementation avoids the even/odd scatter path, which was measured much slower.
Results
Measured on a2a3, fixed
--device 3, 3 runs each. Logs and copied swimlane JSON files are under/data/w00956228/newpto/pypto-lib/.Standalone
qkv_proj_rope.pydecodeCommand:
q_head_rms_nope_ropeAvg Execq_head_rms_nope_ropespanq_rope_csAvg ExecPer-run logs:
qkv_proj_rope_qhead_ropecs_before_r{1,2,3}_dev3.logqkv_proj_rope_qhead_ropecs_after_r{1,2,3}_dev3.logqkv_proj_rope_qhead_ropecs_dev3_summary.txtFull
decode_attention_hca.pyCommand:
Total Test Timein the HCA text table is unreliable (0.00us) due to L2 swimlane skipped-record warnings, so wall is computed frommerged_swimlane_*.json, excludingsetupevents.q_head_rms_nope_ropeAvg Execq_head_rms_nope_ropespanq_rope_csAvg ExecPer-run logs:
decode_attention_hca_qkv_ropecs_before_r{1,2,3}_dev3.logdecode_attention_hca_qkv_ropecs_after_r{1,2,3}_dev3.logdecode_attention_hca_qkv_ropecs_dev3_summary.txtCorrectness
All measured runs PASS.
Standalone
qkv_proj_rope.pyvalidates:qPASSkvPASSqrPASSqr_scalePASSFull
decode_attention_hca.pyvalidates:kv_cachePASSx_outPASSAlso checked: