Skip to content

Add: vector (AIV) row_sum variant for qwen3-14b K/V projection#586

Open
Inspiron-st wants to merge 1 commit into
hw-native-sys:mainfrom
Inspiron-st:feat/qwen3-14b-kv-proj-aiv-rowsum
Open

Add: vector (AIV) row_sum variant for qwen3-14b K/V projection#586
Inspiron-st wants to merge 1 commit into
hw-native-sys:mainfrom
Inspiron-st:feat/qwen3-14b-kv-proj-aiv-rowsum

Conversation

@Inspiron-st

Copy link
Copy Markdown
Collaborator

Summary

  • Add env toggles K_PROJ_ON_AIV / V_PROJ_ON_AIV that run the qwen3-14b K/V projection on the VECTOR (AIV) unit as a dot-product / row_sum GEMM instead of the default cube matmul.
  • Each wk/wv K-block is staged to UB and transposed on-chip so every output column becomes a contiguous row (avoids the strided GM column loads the AIV TLOAD rejects); operands upcast to FP32 to match the cube path's bf16-in / fp32-accumulate.
  • K uses its own K_RS_* tiling: qk_norm consumes k_proj per 512-wide N-tile, so each AIV task is fanned into that N-tile's dependency slots.
  • Default behavior is unchanged (cube); both variants validated on device against the torch golden (ratio_allclose 3e-3, <=2% outliers).

Related Issues

N/A

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3aa8efa-3f23-4e13-87da-c502f758c857

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The Qwen3-14B decode layer gains optional VECTOR/AIV "row-sum" code paths for both K and V projections. Two environment-variable toggles (K_PROJ_ON_AIV, V_PROJ_ON_AIV) select the variant at trace/codegen time. New tiling constants define the AIV schedule, and the projection blocks are replaced with conditional branches between the AIV row-sum path and the original CUBE split-K + atomic-add path.

Changes

AIV Row-Sum K/V Projection for Qwen3-14B Decode

Layer / File(s) Summary
Tiling constants and feature toggles
models/qwen3/14b/decode_layer.py
Adds V_RS_NV, V_RS_NTILES, V_RS_KC, K_RS_NV, K_RS_NTILES, K_RS_KC, K_RS_TPN, K_RS_PAD tiling constants and _V_PROJ_ON_AIV/_K_PROJ_ON_AIV environment-variable toggles that statically select the projection implementation at trace/codegen time.
Conditional K and V projection implementations
models/qwen3/14b/decode_layer.py
Replaces the prior cube-only K and V projection sections with conditional branches: the AIV path uses pl.row_sum/pl.col_expand_mul with an on-chip transpose and fans per-task IDs (with padding) into k_tile_tids/v_tile_tids; the fallback retains the original CUBE split-K + atomic-add matmul code.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#449: Both PRs make substantive changes to models/qwen3/14b/decode_layer.py—this PR adds AIV row-sum K/V projection variants while #449 replaced decode_fwd with a manual-scope e2e kernel in the same file.

Suggested labels

enhancement

🐇 Hoppity-hop through the decode lane,
Row-sums and transposes, a new refrain!
K and V projections take two roads now,
AIV or CUBE — an env-var vow.
The rabbit computes with a satisfied grin,
Toggle the flag and let the fast path begin! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a vector (AIV) row_sum variant for qwen3-14b K/V projections, which is the core purpose of this PR.
Description check ✅ Passed The description is directly related to the changeset, explaining the environment toggles, implementation details of the AIV variant, tiling scheme, and validation results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces prototype toggles to offload the V and K projections to the VECTOR (AIV) unit using a row-sum/dot-product form to overlap with cube-resident projections. The review feedback correctly identifies critical issues with the newly introduced tiling constants: both K_RS_NV and V_RS_NV are set to 16 instead of 128. These incorrect values contradict the code comments and lead to severe runtime issues, including out-of-bounds array access during K-projection compilation/execution and untracked task dependencies causing race conditions in the V-projection.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread models/qwen3/14b/decode_layer.py Outdated
Comment on lines +183 to +187
K_RS_NV = 16
K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles
K_RS_KC = 512
K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile
K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The constant K_RS_NV is set to 16, which contradicts the comment # 8 N sub-tiles and the preceding docstring stating K_RS_NV=128 -> K_RS_TPN=4 (<=5), K_RS_NTILES=8.

If K_RS_NV is 16, then:

  1. K_RS_TPN becomes 512 // 16 = 32.
  2. K_RS_PAD becomes 5 - 32 = -27.
  3. The loop at lines 525-527 will attempt to write to k_tile_tids[_kt * QKV_OK + _t] for _t up to 31. Since k_tile_tids has a size of only 10 (KV_ON * QKV_OK), this will cause an out-of-bounds array access error during compilation or execution.

Setting K_RS_NV to 128 resolves all these issues, making K_RS_TPN = 4, K_RS_PAD = 1, and K_RS_NTILES = 8, which perfectly aligns with the comments and constraints.

Suggested change
K_RS_NV = 16
K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles
K_RS_KC = 512
K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile
K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5)
K_RS_NV = 128
K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles
K_RS_KC = 512
K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile
K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5)

Comment thread models/qwen3/14b/decode_layer.py Outdated
Comment on lines +174 to +175
V_RS_NV = 16
V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The constant V_RS_NV is set to 16, which contradicts the comment # 8 N sub-tiles (<= 10 v_tile_tids slots) since KV_HIDDEN // 16 is 64 sub-tiles.

Furthermore, if V_RS_NV is 16, V_RS_NTILES becomes 64. Since v_tile_tids only has 10 slots (defined as KV_ON * QKV_OK), only the first 10 task IDs are tracked in v_tile_tids (lines 612-613), leaving the remaining 54 tasks completely untracked. This will cause downstream tasks (like rope_qkv) to execute before those 54 tasks have finished writing to v_proj, leading to race conditions and data corruption.

Setting V_RS_NV to 128 resolves this inconsistency and ensures all task dependencies are correctly tracked.

Suggested change
V_RS_NV = 16
V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots)
V_RS_NV = 128
V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/qwen3/14b/decode_layer.py`:
- Around line 174-176: The constant V_RS_NV is set to 16, which causes
V_RS_NTILES to compute to 64 sub-tiles instead of the documented 8 sub-tiles.
This exceeds the 10 slot budget (derived from KV_ON and QKV_OK) and creates a
data race where rope_qkv depends only on v_tile_tids[0..9] but leaves
rs_tids[10..63] unreferenced, causing those AIV tasks to complete without being
properly awaited and leaving v_proj unprotected. Change the value of V_RS_NV
from 16 to 128 to ensure V_RS_NTILES correctly computes to 8 and all sub-tile
references are properly tracked in the v_tile_tids slots.
- Around line 183-187: Change the constants K_RS_NV and V_RS_NV from their
current value of 16 to 128. This will fix the derived calculations: K_RS_TPN
will become 4 (valid, since it must be ≤ QKV_OK which is 5), K_RS_PAD will
become 1 (no longer negative), K_RS_NTILES will become 8, and V_RS_NTILES will
become 8 (within the ≤ 10 slots constraint). These changes ensure all dependent
array indices and loop unrolls throughout the decode layer remain valid and
within bounds, particularly at the locations where K_RS_PAD is used with
pl.unroll() and where k_tile_tids is indexed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f734bec-d921-4aca-bea0-ca66bef0a8ff

📥 Commits

Reviewing files that changed from the base of the PR and between b51a34c and 4bc67c7.

📒 Files selected for processing (1)
  • models/qwen3/14b/decode_layer.py

Comment thread models/qwen3/14b/decode_layer.py Outdated
Comment on lines +174 to +176
V_RS_NV = 16
V_RS_NTILES = KV_HIDDEN // V_RS_NV # 8 N sub-tiles (<= 10 v_tile_tids slots)
V_RS_KC = 512

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

V_RS_NV = 16 likely wrong — same root cause as K_RS_NV.

The comment claims V_RS_NTILES is "8 N sub-tiles (<= 10 v_tile_tids slots)", but with V_RS_NV = 16 and KV_HIDDEN = 1024, V_RS_NTILES = 1024 // 16 = 64, which exceeds the KV_ON * QKV_OK = 10 slot budget. The fan-out at Lines 612-613 then leaves rs_tids[10..63] unreferenced, so rope_qkv (deps on v_tile_tids[0..9]) never waits on those AIV tasks — a data race on v_proj. To get the documented 8 sub-tiles, V_RS_NV should be 128.

🐛 Proposed fix
-V_RS_NV = 16
+V_RS_NV = 128
 V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)
 V_RS_KC = 512
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/qwen3/14b/decode_layer.py` around lines 174 - 176, The constant
V_RS_NV is set to 16, which causes V_RS_NTILES to compute to 64 sub-tiles
instead of the documented 8 sub-tiles. This exceeds the 10 slot budget (derived
from KV_ON and QKV_OK) and creates a data race where rope_qkv depends only on
v_tile_tids[0..9] but leaves rs_tids[10..63] unreferenced, causing those AIV
tasks to complete without being properly awaited and leaving v_proj unprotected.
Change the value of V_RS_NV from 16 to 128 to ensure V_RS_NTILES correctly
computes to 8 and all sub-tile references are properly tracked in the
v_tile_tids slots.

Comment thread models/qwen3/14b/decode_layer.py Outdated
Comment on lines +183 to +187
K_RS_NV = 16
K_RS_NTILES = KV_HIDDEN // K_RS_NV # 8 N sub-tiles
K_RS_KC = 512
K_RS_TPN = QKV_N_TILE // K_RS_NV # 4 AIV tasks per qk_norm N-tile
K_RS_PAD = QKV_OK - K_RS_TPN # 1 padding slot per N-tile (QKV_OK=5)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Resolve KV_HIDDEN, QKV_N_TILE, QKV_OK (and contributing HEAD_DIM / NUM_KV_HEADS)
rg -nP '\b(KV_HIDDEN|QKV_N_TILE|QKV_OK|HEAD_DIM|NUM_KV_HEADS|KV_ON)\b\s*=' models/qwen3/14b/decode_layer.py

Repository: hw-native-sys/pypto-lib

Length of output: 912


🏁 Script executed:

# Check V_RS_NV at lines 174-176
sed -n '174,176p' models/qwen3/14b/decode_layer.py

# Check pl.unroll usage around line 528
sed -n '525,531p' models/qwen3/14b/decode_layer.py

Repository: hw-native-sys/pypto-lib

Length of output: 634


Fix K_RS_NV and V_RS_NV from 16 to 128.

The comments at lines 175 and 185 document intended values: K_RS_NV=128 → K_RS_TPN=4 (≤5), K_RS_NTILES=8 and V_RS_NTILES=8 (≤10 slots). The current code has both set to 16, breaking every derived constant:

With K_RS_NV = 16 and QKV_N_TILE = 512, QKV_OK = 5:

  • K_RS_TPN = 512 // 16 = 32 (must be ≤ QKV_OK = 5)
  • K_RS_PAD = 5 - 32 = -27pl.unroll(K_RS_PAD) at line 529 receives negative count (invalid)
  • k_tile_tids has only KV_ON * QKV_OK = 10 slots, but line 527 indexes up to 1 * 5 + 31 = 36 (out of bounds)

With V_RS_NV = 16 and KV_HIDDEN = 1024:

  • V_RS_NTILES = 1024 // 16 = 64 (violates stated ≤ 10 v_tile_tids slots in comment)

Correct value for both: 128.

  • K_RS_TPN = 512 // 128 = 4
  • K_RS_PAD = 5 - 4 = 1
  • K_RS_NTILES = 1024 // 128 = 8
  • V_RS_NTILES = 1024 // 128 = 8
Fix
-V_RS_NV = 16
+V_RS_NV = 128
 V_RS_NTILES = KV_HIDDEN // V_RS_NV  # 8 N sub-tiles (<= 10 v_tile_tids slots)

-K_RS_NV = 16
+K_RS_NV = 128
 K_RS_NTILES = KV_HIDDEN // K_RS_NV       # 8 N sub-tiles
 K_RS_KC = 512
 K_RS_TPN = QKV_N_TILE // K_RS_NV         # 4 AIV tasks per qk_norm N-tile
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/qwen3/14b/decode_layer.py` around lines 183 - 187, Change the
constants K_RS_NV and V_RS_NV from their current value of 16 to 128. This will
fix the derived calculations: K_RS_TPN will become 4 (valid, since it must be ≤
QKV_OK which is 5), K_RS_PAD will become 1 (no longer negative), K_RS_NTILES
will become 8, and V_RS_NTILES will become 8 (within the ≤ 10 slots constraint).
These changes ensure all dependent array indices and loop unrolls throughout the
decode layer remain valid and within bounds, particularly at the locations where
K_RS_PAD is used with pl.unroll() and where k_tile_tids is indexed.

…arison

BATCH=1 decode benchmark that extracts only Q/K/V projections
from decode_layer.py, supporting:
  - Q: always Cube (SPMD split-K)
  - K/V: toggle between Cube matmul and AIV VECTOR row_sum
    (env vars K_PROJ_ON_AIV / V_PROJ_ON_AIV)

Key tiling: AIV path uses col_expand+mul+row_sum+reshape which
eliminates the per-column loop, SAFE_BATCH padding, and transposed
accumulator — 4 ops per K-block instead of 16×4.

NV=16 is the hardware minimum: the BF16 source tile [KC,NV] cast
to FP32 must satisfy row-major row-byte alignment >= 32 B,
i.e. NV*sizeof(BF16)=NV*2>=32 -> NV>=16. KC=1024 is the UB max
(F32 transposed tile [16,1024] = 64 KB).
@Inspiron-st Inspiron-st force-pushed the feat/qwen3-14b-kv-proj-aiv-rowsum branch from 4bc67c7 to 942d2a4 Compare June 26, 2026 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants