Add: vector (AIV) row_sum variant for qwen3-14b K/V projection#586
Add: vector (AIV) row_sum variant for qwen3-14b K/V projection#586Inspiron-st wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe Qwen3-14B decode layer gains optional VECTOR/AIV "row-sum" code paths for both K and V projections. Two environment-variable toggles ( ChangesAIV Row-Sum K/V Projection for Qwen3-14B Decode
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces prototype toggles to offload the V and K projections to the VECTOR (AIV) unit using a row-sum/dot-product form to overlap with cube-resident projections. The review feedback correctly identifies critical issues with the newly introduced tiling constants: both K_RS_NV and V_RS_NV are set to 16 instead of 128. These incorrect values contradict the code comments and lead to severe runtime issues, including out-of-bounds array access during K-projection compilation/execution and untracked task dependencies causing race conditions in the V-projection.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| K_RS_NV = 16 | ||
| K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles | ||
| K_RS_KC = 512 | ||
| K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile | ||
| K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5) |
There was a problem hiding this comment.
The constant K_RS_NV is set to 16, which contradicts the comment # 8 N sub-tiles and the preceding docstring stating K_RS_NV=128 -> K_RS_TPN=4 (<=5), K_RS_NTILES=8.
If K_RS_NV is 16, then:
K_RS_TPNbecomes512 // 16 = 32.K_RS_PADbecomes5 - 32 = -27.- The loop at lines 525-527 will attempt to write to
k_tile_tids[_kt * QKV_OK + _t]for_tup to 31. Sincek_tile_tidshas a size of only10(KV_ON * QKV_OK), this will cause an out-of-bounds array access error during compilation or execution.
Setting K_RS_NV to 128 resolves all these issues, making K_RS_TPN = 4, K_RS_PAD = 1, and K_RS_NTILES = 8, which perfectly aligns with the comments and constraints.
| K_RS_NV = 16 | |
| K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles | |
| K_RS_KC = 512 | |
| K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile | |
| K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5) | |
| K_RS_NV = 128 | |
| K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles | |
| K_RS_KC = 512 | |
| K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile | |
| K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5) |
| V_RS_NV = 16 | ||
| V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots) |
There was a problem hiding this comment.
The constant V_RS_NV is set to 16, which contradicts the comment # 8 N sub-tiles (<= 10 v_tile_tids slots) since KV_HIDDEN // 16 is 64 sub-tiles.
Furthermore, if V_RS_NV is 16, V_RS_NTILES becomes 64. Since v_tile_tids only has 10 slots (defined as KV_ON * QKV_OK), only the first 10 task IDs are tracked in v_tile_tids (lines 612-613), leaving the remaining 54 tasks completely untracked. This will cause downstream tasks (like rope_qkv) to execute before those 54 tasks have finished writing to v_proj, leading to race conditions and data corruption.
Setting V_RS_NV to 128 resolves this inconsistency and ensures all task dependencies are correctly tracked.
| V_RS_NV = 16 | |
| V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots) | |
| V_RS_NV = 128 | |
| V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots) |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@models/qwen3/14b/decode_layer.py`:
- Around line 174-176: The constant V_RS_NV is set to 16, which causes
V_RS_NTILES to compute to 64 sub-tiles instead of the documented 8 sub-tiles.
This exceeds the 10 slot budget (derived from KV_ON and QKV_OK) and creates a
data race where rope_qkv depends only on v_tile_tids[0..9] but leaves
rs_tids[10..63] unreferenced, causing those AIV tasks to complete without being
properly awaited and leaving v_proj unprotected. Change the value of V_RS_NV
from 16 to 128 to ensure V_RS_NTILES correctly computes to 8 and all sub-tile
references are properly tracked in the v_tile_tids slots.
- Around line 183-187: Change the constants K_RS_NV and V_RS_NV from their
current value of 16 to 128. This will fix the derived calculations: K_RS_TPN
will become 4 (valid, since it must be ≤ QKV_OK which is 5), K_RS_PAD will
become 1 (no longer negative), K_RS_NTILES will become 8, and V_RS_NTILES will
become 8 (within the ≤ 10 slots constraint). These changes ensure all dependent
array indices and loop unrolls throughout the decode layer remain valid and
within bounds, particularly at the locations where K_RS_PAD is used with
pl.unroll() and where k_tile_tids is indexed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1f734bec-d921-4aca-bea0-ca66bef0a8ff
📒 Files selected for processing (1)
models/qwen3/14b/decode_layer.py
| V_RS_NV = 16 | ||
| V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots) | ||
| V_RS_KC = 512 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
V_RS_NV = 16 likely wrong — same root cause as K_RS_NV.
The comment claims V_RS_NTILES is "8 N sub-tiles (<= 10 v_tile_tids slots)", but with V_RS_NV = 16 and KV_HIDDEN = 1024, V_RS_NTILES = 1024 // 16 = 64, which exceeds the KV_ON * QKV_OK = 10 slot budget. The fan-out at Lines 612-613 then leaves rs_tids[10..63] unreferenced, so rope_qkv (deps on v_tile_tids[0..9]) never waits on those AIV tasks — a data race on v_proj. To get the documented 8 sub-tiles, V_RS_NV should be 128.
🐛 Proposed fix
-V_RS_NV = 16
+V_RS_NV = 128
V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots)
V_RS_KC = 512🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@models/qwen3/14b/decode_layer.py` around lines 174 - 176, The constant
V_RS_NV is set to 16, which causes V_RS_NTILES to compute to 64 sub-tiles
instead of the documented 8 sub-tiles. This exceeds the 10 slot budget (derived
from KV_ON and QKV_OK) and creates a data race where rope_qkv depends only on
v_tile_tids[0..9] but leaves rs_tids[10..63] unreferenced, causing those AIV
tasks to complete without being properly awaited and leaving v_proj unprotected.
Change the value of V_RS_NV from 16 to 128 to ensure V_RS_NTILES correctly
computes to 8 and all sub-tile references are properly tracked in the
v_tile_tids slots.
| K_RS_NV = 16 | ||
| K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles | ||
| K_RS_KC = 512 | ||
| K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile | ||
| K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Resolve KV_HIDDEN, QKV_N_TILE, QKV_OK (and contributing HEAD_DIM / NUM_KV_HEADS)
rg -nP '\b(KV_HIDDEN|QKV_N_TILE|QKV_OK|HEAD_DIM|NUM_KV_HEADS|KV_ON)\b\s*=' models/qwen3/14b/decode_layer.pyRepository: hw-native-sys/pypto-lib
Length of output: 912
🏁 Script executed:
# Check V_RS_NV at lines 174-176
sed -n '174,176p' models/qwen3/14b/decode_layer.py
# Check pl.unroll usage around line 528
sed -n '525,531p' models/qwen3/14b/decode_layer.pyRepository: hw-native-sys/pypto-lib
Length of output: 634
Fix K_RS_NV and V_RS_NV from 16 to 128.
The comments at lines 175 and 185 document intended values: K_RS_NV=128 → K_RS_TPN=4 (≤5), K_RS_NTILES=8 and V_RS_NTILES=8 (≤10 slots). The current code has both set to 16, breaking every derived constant:
With K_RS_NV = 16 and QKV_N_TILE = 512, QKV_OK = 5:
K_RS_TPN = 512 // 16 = 32(must be ≤QKV_OK = 5)K_RS_PAD = 5 - 32 = -27→pl.unroll(K_RS_PAD)at line 529 receives negative count (invalid)k_tile_tidshas onlyKV_ON * QKV_OK = 10slots, but line 527 indexes up to1 * 5 + 31 = 36(out of bounds)
With V_RS_NV = 16 and KV_HIDDEN = 1024:
V_RS_NTILES = 1024 // 16 = 64(violates stated≤ 10 v_tile_tids slotsin comment)
Correct value for both: 128.
K_RS_TPN = 512 // 128 = 4✓K_RS_PAD = 5 - 4 = 1✓K_RS_NTILES = 1024 // 128 = 8✓V_RS_NTILES = 1024 // 128 = 8✓
Fix
-V_RS_NV = 16
+V_RS_NV = 128
V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots)
-K_RS_NV = 16
+K_RS_NV = 128
K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles
K_RS_KC = 512
K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@models/qwen3/14b/decode_layer.py` around lines 183 - 187, Change the
constants K_RS_NV and V_RS_NV from their current value of 16 to 128. This will
fix the derived calculations: K_RS_TPN will become 4 (valid, since it must be ≤
QKV_OK which is 5), K_RS_PAD will become 1 (no longer negative), K_RS_NTILES
will become 8, and V_RS_NTILES will become 8 (within the ≤ 10 slots constraint).
These changes ensure all dependent array indices and loop unrolls throughout the
decode layer remain valid and within bounds, particularly at the locations where
K_RS_PAD is used with pl.unroll() and where k_tile_tids is indexed.
…arison
BATCH=1 decode benchmark that extracts only Q/K/V projections
from decode_layer.py, supporting:
- Q: always Cube (SPMD split-K)
- K/V: toggle between Cube matmul and AIV VECTOR row_sum
(env vars K_PROJ_ON_AIV / V_PROJ_ON_AIV)
Key tiling: AIV path uses col_expand+mul+row_sum+reshape which
eliminates the per-column loop, SAFE_BATCH padding, and transposed
accumulator — 4 ops per K-block instead of 16×4.
NV=16 is the hardware minimum: the BF16 source tile [KC,NV] cast
to FP32 must satisfy row-major row-byte alignment >= 32 B,
i.e. NV*sizeof(BF16)=NV*2>=32 -> NV>=16. KC=1024 is the UB max
(F32 transposed tile [16,1024] = 64 KB).
4bc67c7 to
942d2a4
Compare
Summary
K_PROJ_ON_AIV/V_PROJ_ON_AIVthat run the qwen3-14b K/V projection on the VECTOR (AIV) unit as a dot-product / row_sum GEMM instead of the default cube matmul.K_RS_*tiling: qk_norm consumes k_proj per 512-wide N-tile, so each AIV task is fanned into that N-tile's dependency slots.Related Issues
N/A