Add: vector (AIV) row_sum variant for qwen3-14b K/V projection by Inspiron-st · Pull Request #586 · hw-native-sys/pypto-lib

Inspiron-st · 2026-06-23T09:37:39Z

Summary

Add env toggles K_PROJ_ON_AIV / V_PROJ_ON_AIV that run the qwen3-14b K/V projection on the VECTOR (AIV) unit as a dot-product / row_sum GEMM instead of the default cube matmul.
Each wk/wv K-block is staged to UB and transposed on-chip so that every output column becomes a contiguous row (this avoids the strided GM column loads that the AIV TLOAD rejects); operands are upcast to FP32 to match the cube path's bf16-in / fp32-accumulate behavior.
K uses its own K_RS_* tiling: qk_norm consumes k_proj per 512-wide N-tile, so each AIV task is fanned into that N-tile's dependency slots.
Default behavior is unchanged (cube); both variants validated on device against the torch golden (ratio_allclose 3e-3, <=2% outliers).
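As a rough illustration of the dot-product / row_sum form described above, here is a minimal NumPy sketch; it is not the PR's kernel. It assumes a toy problem size, uses `float16` as a stand-in for bf16 (NumPy has no native bf16), and the names `proj_matmul` and `proj_row_sum` are hypothetical. It only shows that transposing each weight K-block, broadcast-multiplying by the activation, and summing along rows reproduces the matmul result with FP32 accumulation.

```python
import numpy as np


def proj_matmul(x, w):
    """Reference projection: upcast to fp32, then a plain matmul."""
    return x.astype(np.float32) @ w.astype(np.float32)


def proj_row_sum(x, w, kc=512):
    """Row-sum form of the same projection, processed one K-block at a time.

    Each weight block [kc, n] is transposed to [n, kc] so that every output
    column becomes a contiguous row; the activation block is broadcast across
    those rows, multiplied elementwise, and reduced along each row.
    """
    k, n = w.shape
    acc = np.zeros(n, dtype=np.float32)  # fp32 accumulator
    for k0 in range(0, k, kc):
        w_t = w[k0:k0 + kc, :].astype(np.float32).T   # [n, kc]
        x_blk = x[k0:k0 + kc].astype(np.float32)      # [kc]
        acc += (w_t * x_blk[None, :]).sum(axis=1)     # one dot product per row
    return acc


# Toy sizes chosen for illustration only (not the model's real dimensions).
rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float16)
w = rng.standard_normal((1024, 128)).astype(np.float16)

ref = proj_matmul(x, w)
out = proj_row_sum(x, w)
print(np.allclose(out, ref, rtol=1e-3, atol=1e-2))
```

The two paths differ only in summation order, so small FP32 rounding differences are expected; that is why the comparison uses a tolerance rather than exact equality.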

Related Issues

N/A

coderabbitai · 2026-06-23T09:37:53Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3aa8efa-3f23-4e13-87da-c502f758c857


📝 Walkthrough


The Qwen3-14B decode layer gains optional VECTOR/AIV "row-sum" code paths for both K and V projections. Two environment-variable toggles (K_PROJ_ON_AIV, V_PROJ_ON_AIV) select the variant at trace/codegen time. New tiling constants define the AIV schedule, and the projection blocks are replaced with conditional branches between the AIV row-sum path and the original CUBE split-K + atomic-add path.

Changes

AIV Row-Sum K/V Projection for Qwen3-14B Decode

| Layer / File(s) | Summary |
|---|---|
| Tiling constants and feature toggles `models/qwen3/14b/decode_layer.py` | Adds `V_RS_NV`, `V_RS_NTILES`, `V_RS_KC`, `K_RS_NV`, `K_RS_NTILES`, `K_RS_KC`, `K_RS_TPN`, `K_RS_PAD` tiling constants and `_V_PROJ_ON_AIV`/`_K_PROJ_ON_AIV` environment-variable toggles that statically select the projection implementation at trace/codegen time. |
| Conditional K and V projection implementations `models/qwen3/14b/decode_layer.py` | Replaces the prior cube-only K and V projection sections with conditional branches: the AIV path uses `pl.row_sum`/`pl.col_expand_mul` with an on-chip transpose and fans per-task IDs (with padding) into `k_tile_tids`/`v_tile_tids`; the fallback retains the original CUBE split-K + atomic-add matmul code. |
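The toggles are described as being read once and then used to pick a branch statically at trace/codegen time. A minimal sketch of that pattern follows. How the PR actually parses the variables is not shown on this page, so the `env_flag` helper, its accepted truthy strings, and the `select_proj_variant` function are all hypothetical.

```python
import os


def env_flag(name, default=False, environ=None):
    """Parse a boolean environment toggle (hypothetical helper).

    `environ` can be passed explicitly for testing; by default the real
    process environment is used.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Evaluated once at import time, so the choice is fixed before tracing starts.
_K_PROJ_ON_AIV = env_flag("K_PROJ_ON_AIV")
_V_PROJ_ON_AIV = env_flag("V_PROJ_ON_AIV")


def select_proj_variant(on_aiv):
    """Return a label for the branch that would be traced (illustrative only)."""
    return "aiv_row_sum" if on_aiv else "cube_split_k_atomic_add"


print("K projection:", select_proj_variant(_K_PROJ_ON_AIV))
print("V projection:", select_proj_variant(_V_PROJ_ON_AIV))
```

Because the flags are plain Python booleans resolved before tracing, an unset variable leaves the default cube path in place, which matches the PR's statement that default behavior is unchanged.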

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

hw-native-sys/pypto-lib#449: Both PRs make substantive changes to models/qwen3/14b/decode_layer.py—this PR adds AIV row-sum K/V projection variants while #449 replaced decode_fwd with a manual-scope e2e kernel in the same file.

Suggested labels

enhancement

🐇 Hoppity-hop through the decode lane,
Row-sums and transposes, a new refrain!
K and V projections take two roads now,
AIV or CUBE — an env-var vow.
The rabbit computes with a satisfied grin,
Toggle the flag and let the fast path begin! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding a vector (AIV) row_sum variant for qwen3-14b K/V projections, which is the core purpose of this PR. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the environment toggles, implementation details of the AIV variant, tiling scheme, and validation results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist

Code Review

This pull request introduces prototype toggles to offload the V and K projections to the VECTOR (AIV) unit using a row-sum/dot-product form to overlap with cube-resident projections. The review feedback correctly identifies critical issues with the newly introduced tiling constants: both K_RS_NV and V_RS_NV are set to 16 instead of 128. These incorrect values contradict the code comments and lead to severe runtime issues, including out-of-bounds array access during K-projection compilation/execution and untracked task dependencies causing race conditions in the V-projection.


gemini-code-assist · 2026-06-23T09:39:37Z

+K_RS_NV = 16
+K_RS_NTILES = KV_HIDDEN // K_RS_NV       # 8 N sub-tiles
+K_RS_KC = 512
+K_RS_TPN = QKV_N_TILE // K_RS_NV         # 4 AIV tasks per qk_norm N-tile
+K_RS_PAD = QKV_OK - K_RS_TPN             # 1 padding slot per N-tile (QKV_OK=5)


The constant K_RS_NV is set to 16, which contradicts the comment # 8 N sub-tiles and the preceding docstring stating K_RS_NV=128 -> K_RS_TPN=4 (<=5), K_RS_NTILES=8.

If K_RS_NV is 16, then:

K_RS_TPN becomes 512 // 16 = 32.

K_RS_PAD becomes 5 - 32 = -27.

The loop at lines 525-527 will attempt to write to k_tile_tids[_kt * QKV_OK + _t] for _t up to 31. Since k_tile_tids has a size of only 10 (KV_ON * QKV_OK), this will cause an out-of-bounds array access error during compilation or execution.

Setting K_RS_NV to 128 resolves all these issues, making K_RS_TPN = 4, K_RS_PAD = 1, and K_RS_NTILES = 8, which perfectly aligns with the comments and constraints.
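The out-of-bounds claim can be checked with a small plain-Python model of the fan-out. This is a sketch, not the PR's loop: the loop shape is reconstructed from the indexing expression quoted in the review (`k_tile_tids[_kt * QKV_OK + _t]`), `KV_ON = 2` is inferred from "10 = KV_ON * QKV_OK", and `fan_out_k` is a hypothetical name.

```python
QKV_N_TILE = 512   # value quoted in the review
QKV_OK = 5         # value quoted in the review
KV_ON = 2          # assumption: inferred from KV_ON * QKV_OK == 10 slots


def fan_out_k(k_rs_nv):
    """Model writing one task ID per AIV task into the k_tile_tids slots."""
    k_rs_tpn = QKV_N_TILE // k_rs_nv          # AIV tasks per qk_norm N-tile
    k_tile_tids = [None] * (KV_ON * QKV_OK)   # 10 dependency slots
    for kt in range(KV_ON):
        for t in range(k_rs_tpn):
            # Raises IndexError once kt * QKV_OK + t reaches 10.
            k_tile_tids[kt * QKV_OK + t] = (kt, t)
    return k_tile_tids


# NV = 128: four tasks per N-tile, one slot per N-tile left for padding.
print(fan_out_k(128))

# NV = 16: 32 tasks per N-tile overflow the 10-slot array.
try:
    fan_out_k(16)
except IndexError as exc:
    print("NV=16 overflows k_tile_tids:", exc)
```

With `K_RS_NV = 128`, slots 4 and 9 stay empty in this model, which corresponds to the single padding slot per N-tile (`K_RS_PAD = 1`) mentioned in the comments.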

Suggested change

-K_RS_NV = 16
+K_RS_NV = 128
 K_RS_NTILES = KV_HIDDEN // K_RS_NV       # 8 N sub-tiles
 K_RS_KC = 512
 K_RS_TPN = QKV_N_TILE // K_RS_NV         # 4 AIV tasks per qk_norm N-tile
 K_RS_PAD = QKV_OK - K_RS_TPN             # 1 padding slot per N-tile (QKV_OK=5)

gemini-code-assist · 2026-06-23T09:39:37Z

+V_RS_NV = 16
+V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)


The constant V_RS_NV is set to 16, which contradicts the comment # 8 N sub-tiles (<= 10 v_tile_tids slots) since KV_HIDDEN // 16 is 64 sub-tiles.

Furthermore, if V_RS_NV is 16, V_RS_NTILES becomes 64. Since v_tile_tids only has 10 slots (defined as KV_ON * QKV_OK), only the first 10 task IDs are tracked in v_tile_tids (lines 612-613), leaving the remaining 54 tasks completely untracked. This will cause downstream tasks (like rope_qkv) to execute before those 54 tasks have finished writing to v_proj, leading to race conditions and data corruption.

Setting V_RS_NV to 128 resolves this inconsistency and ensures all task dependencies are correctly tracked.
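The "54 untracked tasks" figure follows directly from the slot count. The sketch below models only the bookkeeping, not the scheduler: `split_tracked` is a hypothetical helper, and the assumption that exactly the first 10 task IDs land in `v_tile_tids` is taken from the review text rather than from the code itself.

```python
KV_HIDDEN = 1024     # value quoted in the review
V_TILE_SLOTS = 10    # KV_ON * QKV_OK, as stated in the review


def split_tracked(v_rs_nv, slots=V_TILE_SLOTS):
    """Split AIV task indices into those a consumer waits on and those it does not."""
    v_rs_ntiles = KV_HIDDEN // v_rs_nv
    rs_tids = list(range(v_rs_ntiles))   # one task per N sub-tile
    tracked = rs_tids[:slots]            # IDs that fit into v_tile_tids
    untracked = rs_tids[slots:]          # IDs no downstream task depends on
    return tracked, untracked


for nv in (16, 128):
    tracked, untracked = split_tracked(nv)
    print(f"V_RS_NV={nv}: {len(tracked)} tracked, {len(untracked)} untracked")
```

Any task in the untracked list could still be writing to `v_proj` when a consumer that waits only on the tracked IDs starts, which is the race the review describes.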

Suggested change

-V_RS_NV = 16
+V_RS_NV = 128
 V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/qwen3/14b/decode_layer.py`:
- Around line 174-176: The constant V_RS_NV is set to 16, which causes
V_RS_NTILES to compute to 64 sub-tiles instead of the documented 8 sub-tiles.
This exceeds the 10 slot budget (derived from KV_ON and QKV_OK) and creates a
data race where rope_qkv depends only on v_tile_tids[0..9] but leaves
rs_tids[10..63] unreferenced, causing those AIV tasks to complete without being
properly awaited and leaving v_proj unprotected. Change the value of V_RS_NV
from 16 to 128 to ensure V_RS_NTILES correctly computes to 8 and all sub-tile
references are properly tracked in the v_tile_tids slots.
- Around line 183-187: Change the constants K_RS_NV and V_RS_NV from their
current value of 16 to 128. This will fix the derived calculations: K_RS_TPN
will become 4 (valid, since it must be ≤ QKV_OK which is 5), K_RS_PAD will
become 1 (no longer negative), K_RS_NTILES will become 8, and V_RS_NTILES will
become 8 (within the ≤ 10 slots constraint). These changes ensure all dependent
array indices and loop unrolls throughout the decode layer remain valid and
within bounds, particularly at the locations where K_RS_PAD is used with
pl.unroll() and where k_tile_tids is indexed.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f734bec-d921-4aca-bea0-ca66bef0a8ff

📥 Commits

Reviewing files that changed from the base of the PR and between b51a34c and 4bc67c7.

📒 Files selected for processing (1)

models/qwen3/14b/decode_layer.py

coderabbitai · 2026-06-23T09:42:42Z

+V_RS_NV = 16
+V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)
+V_RS_KC = 512


🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

V_RS_NV = 16 likely wrong — same root cause as K_RS_NV.

The comment claims V_RS_NTILES is "8 N sub-tiles (<= 10 v_tile_tids slots)", but with V_RS_NV = 16 and KV_HIDDEN = 1024, V_RS_NTILES = 1024 // 16 = 64, which exceeds the KV_ON * QKV_OK = 10 slot budget. The fan-out at Lines 612-613 then leaves rs_tids[10..63] unreferenced, so rope_qkv (deps on v_tile_tids[0..9]) never waits on those AIV tasks — a data race on v_proj. To get the documented 8 sub-tiles, V_RS_NV should be 128.

🐛 Proposed fix

-V_RS_NV = 16
+V_RS_NV = 128
 V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)
 V_RS_KC = 512


coderabbitai · 2026-06-23T09:42:42Z

+K_RS_NV = 16
+K_RS_NTILES = KV_HIDDEN // K_RS_NV       # 8 N sub-tiles
+K_RS_KC = 512
+K_RS_TPN = QKV_N_TILE // K_RS_NV         # 4 AIV tasks per qk_norm N-tile
+K_RS_PAD = QKV_OK - K_RS_TPN             # 1 padding slot per N-tile (QKV_OK=5)


🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Resolve KV_HIDDEN, QKV_N_TILE, QKV_OK (and contributing HEAD_DIM / NUM_KV_HEADS)
rg -nP '\b(KV_HIDDEN|QKV_N_TILE|QKV_OK|HEAD_DIM|NUM_KV_HEADS|KV_ON)\b\s*=' models/qwen3/14b/decode_layer.py

Repository: hw-native-sys/pypto-lib

Length of output: 912

🏁 Script executed:

# Check V_RS_NV at lines 174-176
sed -n '174,176p' models/qwen3/14b/decode_layer.py
# Check pl.unroll usage around line 528
sed -n '525,531p' models/qwen3/14b/decode_layer.py

Repository: hw-native-sys/pypto-lib

Length of output: 634

Fix K_RS_NV and V_RS_NV from 16 to 128.

The comments at lines 175 and 185 document intended values: K_RS_NV=128 → K_RS_TPN=4 (≤5), K_RS_NTILES=8 and V_RS_NTILES=8 (≤10 slots). The current code has both set to 16, breaking every derived constant:

With K_RS_NV = 16 and QKV_N_TILE = 512, QKV_OK = 5:

K_RS_TPN = 512 // 16 = 32 (must be ≤ QKV_OK = 5)

K_RS_PAD = 5 - 32 = -27 → pl.unroll(K_RS_PAD) at line 529 receives a negative count (invalid)

k_tile_tids has only KV_ON * QKV_OK = 10 slots, but line 527 indexes up to 1 * 5 + 31 = 36 (out of bounds)

With V_RS_NV = 16 and KV_HIDDEN = 1024:

V_RS_NTILES = 1024 // 16 = 64 (violates stated ≤ 10 v_tile_tids slots in comment)

Correct value for both: 128.

K_RS_TPN = 512 // 128 = 4 ✓

K_RS_PAD = 5 - 4 = 1 ✓

K_RS_NTILES = 1024 // 128 = 8 ✓

V_RS_NTILES = 1024 // 128 = 8 ✓
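One way to catch this class of mismatch early is to check the derived constants against the slot budget before tracing. This is only a sketch of such a guard and is not something the PR contains: `check_rs_tiling` is a hypothetical helper, and `KV_ON = 2` is inferred from the 10-slot figure above.

```python
KV_HIDDEN = 1024   # value quoted in the review
QKV_N_TILE = 512   # value quoted in the review
QKV_OK = 5         # value quoted in the review
KV_ON = 2          # assumption: inferred from KV_ON * QKV_OK == 10


def check_rs_tiling(k_rs_nv, v_rs_nv):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    slots = KV_ON * QKV_OK
    k_rs_tpn = QKV_N_TILE // k_rs_nv
    k_rs_pad = QKV_OK - k_rs_tpn
    v_rs_ntiles = KV_HIDDEN // v_rs_nv
    if k_rs_tpn > QKV_OK:
        problems.append(f"K_RS_TPN={k_rs_tpn} exceeds QKV_OK={QKV_OK}")
    if k_rs_pad < 0:
        problems.append(f"K_RS_PAD={k_rs_pad} is negative")
    if v_rs_ntiles > slots:
        problems.append(f"V_RS_NTILES={v_rs_ntiles} exceeds {slots} v_tile_tids slots")
    return problems


print(check_rs_tiling(128, 128))  # consistent with the documented tiling
print(check_rs_tiling(16, 16))    # lists each violated constraint
```

A check of this kind could run at import time as assertions, so an inconsistent tiling would fail before any kernel is traced.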

Fix

-V_RS_NV = 16
+V_RS_NV = 128
 V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)

-K_RS_NV = 16
+K_RS_NV = 128
 K_RS_NTILES = KV_HIDDEN // K_RS_NV       # 8 N sub-tiles
 K_RS_KC = 512
 K_RS_TPN = QKV_N_TILE // K_RS_NV         # 4 AIV tasks per qk_norm N-tile


…arison BATCH=1 decode benchmark that extracts only Q/K/V projections from decode_layer.py, supporting:
- Q: always Cube (SPMD split-K)
- K/V: toggle between Cube matmul and AIV VECTOR row_sum (env vars K_PROJ_ON_AIV / V_PROJ_ON_AIV)

Key tiling: the AIV path uses col_expand+mul+row_sum+reshape, which eliminates the per-column loop, SAFE_BATCH padding, and transposed accumulator: 4 ops per K-block instead of 16×4.

NV=16 is the hardware minimum: the BF16 source tile [KC,NV] cast to FP32 must satisfy row-major row-byte alignment >= 32 B, i.e. NV*sizeof(BF16)=NV*2>=32 -> NV>=16.

KC=1024 is the UB max (F32 transposed tile [16,1024] = 64 KB).
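The two size figures in the commit message can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers given there (a 32-byte minimum row size, 2-byte BF16, 4-byte FP32); the helper names are hypothetical and nothing here queries real hardware limits.

```python
BF16_BYTES = 2
FP32_BYTES = 4


def min_nv_for_alignment(elem_bytes, min_row_bytes=32):
    """Smallest NV such that NV * elem_bytes >= min_row_bytes (ceiling division)."""
    return -(-min_row_bytes // elem_bytes)


def tile_bytes(rows, cols, elem_bytes):
    """Size in bytes of a dense [rows, cols] tile."""
    return rows * cols * elem_bytes


min_nv = min_nv_for_alignment(BF16_BYTES)            # rows of the BF16 [KC, NV] tile
transposed_fp32 = tile_bytes(16, 1024, FP32_BYTES)   # FP32 [NV, KC] tile with KC=1024

print("minimum NV:", min_nv)
print("transposed FP32 tile:", transposed_fp32 // 1024, "KB")
```

Note that this alignment-driven choice of NV=16 is the same value the earlier review comments flag against the 10-slot dependency budget, so the two constraints have to be reconciled elsewhere in the tiling.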

gemini-code-assist Bot reviewed Jun 23, 2026


coderabbitai Bot reviewed Jun 23, 2026


Inspiron-st force-pushed the feat/qwen3-14b-kv-proj-aiv-rowsum branch from 4bc67c7 to 942d2a4 Compare June 26, 2026 07:14

Inspiron-st wants to merge 1 commit into hw-native-sys:main from Inspiron-st:feat/qwen3-14b-kv-proj-aiv-rowsum
