Add Qwen3 14B A8W8 kernels by vegetabledoww · Pull Request #642 · hw-native-sys/pypto-lib

vegetabledoww · 2026-06-29T09:07:16Z

Add the Qwen3-14B A8W8 prefill_hidden and decode_fwd PyPTO kernels used by the native serving path.

Keep the existing BF16 Qwen3 path isolated while wiring the A8W8 kernel constants, runner support, and golden runner compile fixes needed by the generated callables.

Slim the lib-side delivery by removing obsolete standalone golden/smoke harnesses and unused JIT entry points after the scheduling path moved to pypto-serving. The debug-stage branches remain available for future numerical diagnosis.

Tracking issue: [Tracking] Qwen3-14B A8W8 decode kernels and TPOT optimization #665
Serving-side PR: Add native Qwen3 14B A8W8 serving path pypto-serving#48
Serving-side tracking issue: [Feature] Add native Qwen3-14B A8W8 serving path pypto-serving#52

coderabbitai · 2026-06-29T09:07:45Z

📝 Walkthrough

Extends golden/runner.py with L3 runtime compatibility shims, a _compile_jit_with_compat fallback helper, graceful rebuild handling, and a save_actual_data parameter in run/run_jit. Adds two new Qwen3-14B A8W8 JIT kernel files (decode_layer_a8w8.py, prefill_fwd_a8w8.py) totaling ~2800 lines, and fixes a pl.tensor.set_validshape API call in rms_lm_head.py.

Changes

Golden Runner Compatibility and save_actual_data

Layer / File(s)	Summary
L3 and JIT compatibility patch helpers `golden/runner.py`	Adds `re`/`importlib.util` imports and internal helpers for: C++ bitcast attribute patching, Python orchestration SSA rewriting, `Worker.chip_contexts` shim, `submit_next_level` wrapping, and per-kernel `block_dim`/`aicpu_thread_num` propagation from generated `kernel_config.py` files.
Wiring compat helpers + `_compile_jit_with_compat` + tests `golden/runner.py`, `tests/golden/test_runner.py`	Calls L3 shims in `_try_l3_dispatch`, applies patch helpers in `run()` before execution, adds graceful `ModuleNotFoundError` fallback in `_setup_runtime_dir`, introduces `_compile_jit_with_compat` (prefers `fn.compile`, falls back to `_bind_args`+`_compile_to_program`+`pypto.ir.compile`), and covers both paths in `TestJitCompileCompat`.
`save_actual_data` in `run()` and `run_jit()` `golden/runner.py`	Adds `save_actual_data: bool = False` to both public signatures, updates docstrings, and persists runtime outputs to `work_dir/data/actual` when `save_data` is enabled and either no `golden_data` dir is set or `save_actual_data` is `True`.

Qwen3-14B A8W8 Decode Kernel

Layer / File(s)	Summary
Architecture constants, shapes, and assertions `models/qwen3/14b/decode_layer_a8w8.py`	Defines module header, debug-stage config, all fixed architecture constants, derived tensor/cache/tiling shapes, and alignment assertions.
`_decode_layer`: RMSNorm, QKV projection, RoPE, KV cache `models/qwen3/14b/decode_layer_a8w8.py`	Implements `_decode_layer` entrypoint: tensor allocation, input RMSNorm, INT8 activation quantization, Q/K/V INT8 matmul+dequant, RoPE, paged KV cache INT8 write with per-row scales, and `fa_work_table` construction.
Fused paged attention and online softmax `models/qwen3/14b/decode_layer_a8w8.py`	Creates accumulator scratch, dequantizes K-cache tiles, executes `fa_fused` for per-block softmax partials, then reduces via `online_softmax` to produce `attn_out`.
Out-projection, MLP, debug branches, and `decode_fwd` `models/qwen3/14b/decode_layer_a8w8.py`	Implements split-K out-projection, residual add, post-RMSNorm, MLP gate/up/SiLU/down split-K accumulation, debug-stage branches, final output consolidation, and the 40-layer `decode_fwd` wrapper.
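
The A8W8 flow summarized in the table above (INT8 activation quantization, INT8 matmul, then dequantization with per-row scales) can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the kernel's code: it assumes symmetric per-row activation scales and per-output-channel weight scales, and the helper names `quantize_rows_int8` and `a8w8_matmul` are hypothetical. The thread does not show the scaling scheme the kernel actually uses.

```python
import numpy as np


def quantize_rows_int8(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-row INT8 quantization (assumed scheme, for illustration only)."""
    amax = np.max(np.abs(x), axis=1, keepdims=True)
    scales = np.where(amax > 0, amax / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scales), -127, 127).astype(np.int8)
    return q, scales[:, 0]


def a8w8_matmul(x: np.ndarray, w_q: np.ndarray, w_scales: np.ndarray) -> np.ndarray:
    """Quantize activations per row, accumulate in INT32, then dequantize to FP32."""
    x_q, act_scales = quantize_rows_int8(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # Apply the weight scale per output column and the activation scale per row.
    return acc.astype(np.float32) * w_scales[None, :] * act_scales[:, None]


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)

# Quantize the weights per output channel by reusing the row helper on w.T.
w_q_t, w_scales = quantize_rows_int8(w.T)
w_q = w_q_t.T

y = a8w8_matmul(x, w_q, w_scales)
rel_err = np.linalg.norm(y - x @ w) / np.linalg.norm(x @ w)
print(f"relative error vs FP32 matmul: {rel_err:.4f}")
```

The relative error printed at the end gives a rough feel for the precision cost of 8-bit activations and weights on random data; it says nothing about the accuracy of the real kernels.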

Qwen3-14B A8W8 Prefill Kernel and rms_lm_head Fix

Layer / File(s)	Summary
Constants, `prefill_layer` signature, and dynamic bindings `models/qwen3/14b/prefill_fwd_a8w8.py`	Defines module header, dimension/tiling constants, debug-stage config, and `prefill_layer` exported kernel signature with all dynamic tensor shapes.
RMSNorm, activation quantization, QKV projections, and head RMSNorm `models/qwen3/14b/prefill_fwd_a8w8.py`	Outer batch/token-tile iteration, input RMSNorm, INT8 activation quantization, A8W8 Q projection with dequant, shared K/V projections, and per-head Q/K RMSNorm.
RoPE, KV cache update, and chunked causal attention `models/qwen3/14b/prefill_fwd_a8w8.py`	Applies RoPE, writes quantized K/V into paged cache with per-row scales, and runs chunked causal attention with online softmax and on-the-fly K/V dequantization.
Out-projection, RMSNorm, MLP, debug stages, and per-layer return `models/qwen3/14b/prefill_fwd_a8w8.py`	A8W8 output projection with residual add and post-RMSNorm, debug-stage branches for intermediate tensor writes, non-debug MLP (SiLU gate, up, down projections), and per-layer return.
`prefill_hidden_a8w8` wrapper and `rms_lm_head` API fix `models/qwen3/14b/prefill_fwd_a8w8.py`, `models/qwen3/14b/rms_lm_head.py`	`prefill_hidden_a8w8` kernel: token-major tiling copy, multi-layer `prefill_layer` iteration, and final hidden-state write. Fixes `pl.set_validshape` → `pl.tensor.set_validshape` in `rms_lm_head.py`.
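
The "chunked attention with online softmax" mentioned in the table above can be sketched in NumPy as follows. This is a generic illustration of the online-softmax recurrence under simplifying assumptions (no causal mask, no score scaling, no K/V dequantization), and the function name `attention_online_softmax` is hypothetical; it is not taken from `prefill_fwd_a8w8.py`.

```python
import numpy as np


def attention_online_softmax(q, k, v, chunk):
    """Attention over K/V chunks, keeping a running max, denominator, and output."""
    m = np.full((q.shape[0], 1), -np.inf)      # running row max of the scores
    denom = np.zeros((q.shape[0], 1))          # running softmax denominator
    acc = np.zeros((q.shape[0], v.shape[1]))   # running unnormalized output
    for start in range(0, k.shape[0], chunk):
        s = q @ k[start:start + chunk].T       # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        alpha = np.exp(m - m_new)              # rescales state from earlier chunks
        p = np.exp(s - m_new)
        denom = alpha * denom + p.sum(axis=1, keepdims=True)
        acc = alpha * acc + p @ v[start:start + chunk]
        m = m_new
    return acc / denom


rng = np.random.default_rng(0)
q = rng.standard_normal((3, 16))
k = rng.standard_normal((40, 16))
v = rng.standard_normal((40, 8))

# Reference: softmax over all keys at once.
scores = q @ k.T
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (weights / weights.sum(axis=1, keepdims=True)) @ v

out = attention_online_softmax(q, k, v, chunk=16)
max_abs_diff = np.abs(out - ref).max()
```

The chunked result matches the all-at-once softmax up to floating-point rounding, which is why per-block partials can be reduced afterwards without materializing the full score matrix.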

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

hw-native-sys/pypto-lib#222: Introduces the run_jit() golden harness that this PR extends with save_actual_data and _compile_jit_with_compat.
hw-native-sys/pypto-lib#484: Refactors run_jit to use the public JITFunction.compile API, directly preceding this PR's fallback compatibility layer for when that API is absent.

Suggested labels

enhancement

🐇 A rabbit writes tensors in INT8 today,
With paged KV cache tucked neatly away.
save_actual_data now writes what we ran,
And compat shims patch every generated scan.
Forty decode layers, prefill in a loop—
The warren compiles, and the kernels regroup! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 48.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title succinctly states the main change: adding Qwen3 14B A8W8 kernels.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description matches the changeset, covering the new Qwen3 A8W8 kernels, runner support, and related cleanup.

gemini-code-assist

Code Review

This pull request introduces full-layer decode and prefill forward passes for Qwen3-14B with A8W8 quantization, alongside compatibility shims and patching utilities in the test runner. The review feedback focuses on performance optimizations in the prefill implementation, suggesting the vectorization of row-by-row dequantization loops (for Q, K, and V projections) and the SiLU activation loop to eliminate scalar loop overhead on the NPU. Additionally, the reviewer recommends removing redundant pre-zeroing of the SiLU tile and simplifying a conditional boolean assignment in the runner utility.

gemini-code-assist · 2026-06-29T09:09:13Z

+                            q_deq = pl.create_tensor([TOK_TILE, Q_OUT_CHUNK], dtype=pl.FP32)
+                            for q_deq_ti in pl.range(TOK_TILE):
+                                q_deq_row = pl.slice(q_deq_weighted, [1, Q_OUT_CHUNK], [q_deq_ti, 0])
+                                q_deq_scale = pl.read(act_scales, q_deq_ti)
+                                q_deq = pl.assemble(q_deq, pl.mul(q_deq_row, q_deq_scale), [q_deq_ti, 0])


The row-by-row dequantization loop can be completely vectorized using elementwise multiplication with a reshaped act_scales tensor. This avoids the overhead of slicing, reading, and assembling row-by-row on the NPU, significantly improving performance.

Suggested change

- q_deq = pl.create_tensor([TOK_TILE, Q_OUT_CHUNK], dtype=pl.FP32)
- for q_deq_ti in pl.range(TOK_TILE):
-     q_deq_row = pl.slice(q_deq_weighted, [1, Q_OUT_CHUNK], [q_deq_ti, 0])
-     q_deq_scale = pl.read(act_scales, q_deq_ti)
-     q_deq = pl.assemble(q_deq, pl.mul(q_deq_row, q_deq_scale), [q_deq_ti, 0])
+ q_deq = pl.mul(q_deq_weighted, pl.reshape(act_scales, [TOK_TILE, 1]))
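
As a NumPy analogy for the suggestion above (and for the matching K and V suggestions), the sketch below checks that the row-by-row scaling loop and a single broadcast multiply against a `[TOK_TILE, 1]` column give the same result. It only illustrates the arithmetic; whether `pl.mul` broadcasts a `[TOK_TILE, 1]` operand this way is an assumption that the thread does not confirm, and the tile sizes are made up.

```python
import numpy as np

TOK_TILE, Q_OUT_CHUNK = 8, 16  # illustrative sizes, not the kernel's constants
rng = np.random.default_rng(0)
q_deq_weighted = rng.standard_normal((TOK_TILE, Q_OUT_CHUNK)).astype(np.float32)
act_scales = rng.uniform(0.01, 0.1, size=TOK_TILE).astype(np.float32)

# Row-by-row form: slice one row, read its scale, write the scaled row back.
q_deq_loop = np.empty_like(q_deq_weighted)
for ti in range(TOK_TILE):
    q_deq_loop[ti, :] = q_deq_weighted[ti, :] * act_scales[ti]

# Vectorized form: one multiply against the scales reshaped to a column.
q_deq_vec = q_deq_weighted * act_scales.reshape(TOK_TILE, 1)

same = bool(np.allclose(q_deq_loop, q_deq_vec))
```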

gemini-code-assist · 2026-06-29T09:09:13Z

+                            k_deq = pl.create_tensor([TOK_TILE, KV_OUT_CHUNK], dtype=pl.FP32)
+                            for k_deq_ti in pl.range(TOK_TILE):
+                                k_deq_row = pl.slice(k_deq_weighted, [1, KV_OUT_CHUNK], [k_deq_ti, 0])
+                                k_deq_scale = pl.read(act_scales, k_deq_ti)
+                                k_deq = pl.assemble(k_deq, pl.mul(k_deq_row, k_deq_scale), [k_deq_ti, 0])


The row-by-row dequantization loop for the K projection can be vectorized using elementwise multiplication with a reshaped act_scales tensor, eliminating the scalar loop overhead.

Suggested change

- k_deq = pl.create_tensor([TOK_TILE, KV_OUT_CHUNK], dtype=pl.FP32)
- for k_deq_ti in pl.range(TOK_TILE):
-     k_deq_row = pl.slice(k_deq_weighted, [1, KV_OUT_CHUNK], [k_deq_ti, 0])
-     k_deq_scale = pl.read(act_scales, k_deq_ti)
-     k_deq = pl.assemble(k_deq, pl.mul(k_deq_row, k_deq_scale), [k_deq_ti, 0])
+ k_deq = pl.mul(k_deq_weighted, pl.reshape(act_scales, [TOK_TILE, 1]))

gemini-code-assist · 2026-06-29T09:09:13Z

+                            v_deq = pl.create_tensor([TOK_TILE, KV_OUT_CHUNK], dtype=pl.FP32)
+                            for v_deq_ti in pl.range(TOK_TILE):
+                                v_deq_row = pl.slice(v_deq_weighted, [1, KV_OUT_CHUNK], [v_deq_ti, 0])
+                                v_deq_scale = pl.read(act_scales, v_deq_ti)
+                                v_deq = pl.assemble(v_deq, pl.mul(v_deq_row, v_deq_scale), [v_deq_ti, 0])


The row-by-row dequantization loop for the V projection can be vectorized using elementwise multiplication with a reshaped act_scales tensor, eliminating the scalar loop overhead.

Suggested change

- v_deq = pl.create_tensor([TOK_TILE, KV_OUT_CHUNK], dtype=pl.FP32)
- for v_deq_ti in pl.range(TOK_TILE):
-     v_deq_row = pl.slice(v_deq_weighted, [1, KV_OUT_CHUNK], [v_deq_ti, 0])
-     v_deq_scale = pl.read(act_scales, v_deq_ti)
-     v_deq = pl.assemble(v_deq, pl.mul(v_deq_row, v_deq_scale), [v_deq_ti, 0])
+ v_deq = pl.mul(v_deq_weighted, pl.reshape(act_scales, [TOK_TILE, 1]))

gemini-code-assist · 2026-06-29T09:09:13Z

+                        for debug_kb in pl.range(HIDDEN_BLOCKS):
+                            debug_mlp_k0 = debug_kb * K_CHUNK
+                            with pl.at(level=pl.Level.CORE_GROUP, name_hint="debug_mlp_silu_out"):
+                                debug_mlp_chunk = pl.slice(mlp_silu_tile, [TOK_TILE, K_CHUNK], [0, debug_mlp_k0])
+                                out = pl.assemble(out, debug_mlp_chunk, [token_p0, debug_mlp_k0])
+
+                    if DEBUG_STAGE_ID == 10:
+                        pass
+                    elif DEBUG_STAGE_ID == 12:
+                        debug_down_partial0_tensor = pl.create_tensor([TOK_TILE, K_CHUNK], dtype=pl.FP32)
+                        with pl.at(level=pl.Level.CORE_GROUP, name_hint="debug_down_partial0"):
+                            debug_mlp_chunk0 = pl.slice(mlp_down_tile, [TOK_TILE, MLP_OUT_CHUNK], [0, 0])
+                            debug_w_down_chunk0 = pl.slice(w_down, [MLP_OUT_CHUNK, K_CHUNK], [layer_inter_base, 0])
+                            debug_down_partial0 = pl.matmul(
+                                debug_mlp_chunk0,


The SiLU activation loop can be fully vectorized across the 2D tensor chunks without row-by-row slicing. Additionally, pre-zeroing the mlp_silu_tile with mlp_zero is redundant because pl.assemble completely overwrites the destination slice.

Suggested change

-     for debug_kb in pl.range(HIDDEN_BLOCKS):
-         debug_mlp_k0 = debug_kb * K_CHUNK
-         with pl.at(level=pl.Level.CORE_GROUP, name_hint="debug_mlp_silu_out"):
-             debug_mlp_chunk = pl.slice(mlp_silu_tile, [TOK_TILE, K_CHUNK], [0, debug_mlp_k0])
-             out = pl.assemble(out, debug_mlp_chunk, [token_p0, debug_mlp_k0])
-
- if DEBUG_STAGE_ID == 10:
-     pass
- elif DEBUG_STAGE_ID == 12:
-     debug_down_partial0_tensor = pl.create_tensor([TOK_TILE, K_CHUNK], dtype=pl.FP32)
-     with pl.at(level=pl.Level.CORE_GROUP, name_hint="debug_down_partial0"):
-         debug_mlp_chunk0 = pl.slice(mlp_down_tile, [TOK_TILE, MLP_OUT_CHUNK], [0, 0])
-         debug_w_down_chunk0 = pl.slice(w_down, [MLP_OUT_CHUNK, K_CHUNK], [layer_inter_base, 0])
-         debug_down_partial0 = pl.matmul(
-             debug_mlp_chunk0,
+ mlp_silu_sigmoid = pl.recip(pl.add(pl.exp(pl.neg(gate_acc)), 1.0))
+ mlp_silu_chunk = pl.mul(pl.mul(gate_acc, mlp_silu_sigmoid), up_acc)
+ mlp_silu_tile = pl.assemble(
+     mlp_silu_tile,
+     pl.cast(mlp_silu_chunk, target_type=pl.BF16),
+     [0, mlp_out_o0],
+ )
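
A small NumPy sketch of the two claims in this comment: the gated SiLU can be computed on a whole 2D chunk at once, and pre-zeroing the destination tile changes nothing when every slice is later overwritten. The sizes and the helper name `silu_gate` are hypothetical, and NumPy slice assignment only stands in for `pl.assemble`; the second claim holds only if the chunks cover the full tile.

```python
import numpy as np


def silu_gate(gate_acc: np.ndarray, up_acc: np.ndarray) -> np.ndarray:
    """Elementwise SiLU(gate) * up, written as recip(1 + exp(-gate)) like the suggestion."""
    sigmoid = 1.0 / (1.0 + np.exp(-gate_acc))
    return gate_acc * sigmoid * up_acc


TOK_TILE, MLP_OUT_CHUNK, N_CHUNKS = 4, 8, 3  # illustrative sizes
rng = np.random.default_rng(0)
gate = rng.standard_normal((TOK_TILE, MLP_OUT_CHUNK * N_CHUNKS))
up = rng.standard_normal((TOK_TILE, MLP_OUT_CHUNK * N_CHUNKS))

tile_zeroed = np.zeros_like(gate)          # destination that was pre-zeroed
tile_uninit = np.full_like(gate, np.nan)   # stands in for an un-zeroed buffer

for c in range(N_CHUNKS):
    o0 = c * MLP_OUT_CHUNK
    chunk = silu_gate(gate[:, o0:o0 + MLP_OUT_CHUNK], up[:, o0:o0 + MLP_OUT_CHUNK])
    # Each write covers the whole [TOK_TILE, MLP_OUT_CHUNK] slice.
    tile_zeroed[:, o0:o0 + MLP_OUT_CHUNK] = chunk
    tile_uninit[:, o0:o0 + MLP_OUT_CHUNK] = chunk

whole_tile = silu_gate(gate, up)
```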

gemini-code-assist · 2026-06-29T09:09:13Z

+    if hasattr(Worker, "chip_contexts"):
+        chip_contexts_installed = True
+    else:
+        chip_contexts_installed = False


This conditional assignment can be simplified to a direct boolean assignment.

chip_contexts_installed = hasattr(Worker, "chip_contexts")

coderabbitai

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@golden/runner.py`:
- Line 317: Ruff B010 is flagging the constant-name setattr usage in the Worker
setup. Replace the setattr-based property assignment in the code that defines
chip_contexts, and the similar assignment mentioned for the other location, with
direct class attribute assignment while preserving the same property behavior.
Use the existing Worker and property(_chip_contexts) symbols to locate the
affected spots and update both occurrences consistently.
- Around line 804-805: The docstring for the data-saving behavior is incomplete:
`save_actual_data` is documented only for the `golden_data` path, but `Runner`
also writes `data/actual` in the non-golden path even when it is false. Update
the documentation near the `Runner`/data persistence logic to describe both
branches clearly, or adjust the condition so `data/actual` is only saved when
intended; make sure the explanation matches the behavior in the relevant
`save_actual_data` handling and `golden_data` flow.
- Around line 391-395: The import handling in runner.py is too broad because the
ModuleNotFoundError around rebuild_kernel_cpp_from_pto also hides missing
transitive dependencies inside pypto.runtime.debug.pto_rebuild. Update the
fallback in the code path that imports rebuild_kernel_cpp_from_pto so it only
returns work_dir when the debug module itself is unavailable, and let other
import failures surface instead of silently reusing stale artifacts.
- Around line 1100-1101: run_jit() is missing the same generated-artifact
patching that run() applies, so JIT-compiled or reloaded runtime_dir artifacts
can still fail on bitcast/host_orch compatibility. After
_compile_jit_with_compat() and after any runtime_dir reload logic in run_jit(),
invoke the same patch helpers used by run() on the compiled output_dir /
work_dir before proceeding. Use the existing run(), _compile_jit_with_compat(),
and runtime_dir handling paths as the reference points so the JIT flow matches
the non-JIT artifact normalization.
- Around line 965-975: The JIT fallback compile path in runner.py is dropping
distributed settings, so entries that rely on L3/distributed mode are compiled
with the wrong configuration. Update the ir.compile call in the fallback that
rebuilds the ir.Program to forward compile_cfg["distributed_config"] (or the
equivalent distributed config value) alongside the existing run_config fields,
using the ir.compile entry point and the surrounding fallback logic as the place
to fix it.

In `@models/qwen3/14b/decode_layer_a8w8.py`:
- Line 193: The comments in the affected decode layer sections use ambiguous
multiplication glyphs that Ruff flags as RUF003. Update the marked comment text
in decode_layer_a8w8.py to replace each non-ASCII multiplication symbol with
plain ASCII wording like x or by, and apply the same cleanup in the other
flagged comment blocks near the related SEQ_TILE/head references.
- Around line 19-25: The copied harness docstring in decode_layer_a8w8.py is
stale and conflicts with the current serving path. Update or remove the header
block near the decode harness so it no longer says the program does not compile
and no longer references obsolete standalone script names; instead, describe the
native serving flow through decode_fwd and keep the docstring aligned with the
current kernel entrypoint and behavior.

In `@models/qwen3/14b/prefill_fwd_a8w8.py`:
- Around line 1120-1125: The final writeback in the prefill path is assembling a
full token tile into dynamic output even when the last tile is partial. Update
the writeback logic around out_chunk, out_out_quant_chunk_bf16, and pl.assemble
in prefill_fwd_a8w8 to trim the chunk with the same set_validshape handling used
for resid1_chunk before any out writeback, and make sure the same fix is applied
in the debug branches as well.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 25cbdbf0-d212-4a1a-b664-b3666d82adc1

📥 Commits

Reviewing files that changed from the base of the PR and between fd651c0 and c0e0128.

📒 Files selected for processing (5)

golden/runner.py
models/qwen3/14b/decode_layer_a8w8.py
models/qwen3/14b/prefill_fwd_a8w8.py
models/qwen3/14b/rms_lm_head.py
tests/golden/test_runner.py

coderabbitai · 2026-06-29T09:17:40Z

+        def _chip_contexts(self: Any) -> list[Any]:  # noqa: ANN001 - runtime compatibility shim
+            return []
+
+        setattr(Worker, "chip_contexts", property(_chip_contexts))


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Use direct attribute assignment to satisfy Ruff B010.

The constant-name setattr() calls are flagged by Ruff and can be replaced without changing behavior.

Proposed fix

- setattr(Worker, "chip_contexts", property(_chip_contexts))
+ Worker.chip_contexts = property(_chip_contexts)
- setattr(_submit_next_level_compat, "_pypto_legacy_chip_callable_compat", True)
+ _submit_next_level_compat._pypto_legacy_chip_callable_compat = True

Also applies to: 339-339

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 317-317: Do not call setattr with a constant attribute value. It is not any safer than normal property access.

Replace setattr with assignment

(B010)

Source: Linters/SAST tools

coderabbitai · 2026-06-29T09:17:40Z

+    try:
+        from pypto.runtime.debug.pto_rebuild import rebuild_kernel_cpp_from_pto
+    except ModuleNotFoundError:
+        print("[runtime_only] pypto.runtime.debug unavailable; using existing runtime artifacts", flush=True)
+        return work_dir


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Avoid swallowing transitive import failures.

This catches any ModuleNotFoundError raised while importing pto_rebuild, including missing dependencies inside that module, and then silently reuses stale artifacts. Only fall back when the debug module itself is unavailable.

Proposed fix

-    except ModuleNotFoundError:
+    except ModuleNotFoundError as e:
+        if e.name not in {"pypto.runtime.debug", "pypto.runtime.debug.pto_rebuild"}:
+            raise
         print("[runtime_only] pypto.runtime.debug unavailable; using existing runtime artifacts", flush=True)
         return work_dir

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-     try:
-         from pypto.runtime.debug.pto_rebuild import rebuild_kernel_cpp_from_pto
-     except ModuleNotFoundError:
-         print("[runtime_only] pypto.runtime.debug unavailable; using existing runtime artifacts", flush=True)
-         return work_dir
+     try:
+         from pypto.runtime.debug.pto_rebuild import rebuild_kernel_cpp_from_pto
+     except ModuleNotFoundError as e:
+         if e.name not in {"pypto.runtime.debug", "pypto.runtime.debug.pto_rebuild"}:
+             raise
+         print("[runtime_only] pypto.runtime.debug unavailable; using existing runtime artifacts", flush=True)
+         return work_dir
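
The `e.name` check can be exercised without PyPTO installed. The sketch below uses throwaway module names (`hypo_absent_module`, `hypo_rebuild`, `hypo_missing_dep`) and a hypothetical helper `import_optional`; it shows that a module that is itself absent is tolerated, while a module that exists but imports a missing dependency still raises.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Optional, Set


def import_optional(module_name: str, allowed_missing: Set[str]) -> Optional[ModuleType]:
    """Import a module; return None only when the missing name is in allowed_missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        if e.name not in allowed_missing:
            raise  # a transitive dependency is missing: surface the failure
        return None


with tempfile.TemporaryDirectory() as tmp:
    # This module exists, but its own import of a dependency fails.
    Path(tmp, "hypo_rebuild.py").write_text("import hypo_missing_dep\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        absent = import_optional("hypo_absent_module", {"hypo_absent_module"})
        try:
            import_optional("hypo_rebuild", {"hypo_rebuild"})
            transitive_error = None
        except ModuleNotFoundError as e:
            transitive_error = e.name
    finally:
        sys.path.remove(tmp)
```

Note that `e.name` is the first name that could not be found, so if a parent package were missing, the name reported would be that parent rather than the full dotted path.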

coderabbitai · 2026-06-29T09:17:40Z

+        save_actual_data: When True with *golden_data*, also persist runtime
+            outputs to ``{work_dir}/data/actual`` for downstream consumers.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clarify when actual outputs are saved.

The implementation saves data/actual by default when golden_data is not used, even if save_actual_data=False; the docstring only describes the golden_data + save_actual_data=True case. Please document both branches or tighten the condition.

Also applies to: 1030-1031
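
The behavior described in the walkthrough (persist `data/actual` when `save_data` is enabled and either no `golden_data` dir is set or `save_actual_data` is `True`) can be written as a single predicate. The function name `should_save_actual` is hypothetical and the sketch only restates that description; it is not copied from `golden/runner.py`.

```python
from itertools import product
from pathlib import Path
from typing import Optional


def should_save_actual(save_data: bool, golden_data: Optional[Path], save_actual_data: bool) -> bool:
    """True when runtime outputs would be written to {work_dir}/data/actual."""
    return save_data and (golden_data is None or save_actual_data)


# Enumerate all eight combinations so both branches are visible.
table = {
    (save_data, golden is not None, save_actual): should_save_actual(save_data, golden, save_actual)
    for save_data, golden, save_actual in product(
        (False, True), (None, Path("golden")), (False, True)
    )
}
for (save_data, has_golden, save_actual), result in table.items():
    print(f"save_data={save_data!s:5} golden_data={has_golden!s:5} "
          f"save_actual_data={save_actual!s:5} -> {result}")
```

The row with `save_data=True`, no `golden_data`, and `save_actual_data=False` is the case the comment flags: outputs are saved even though the flag is off.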

coderabbitai · 2026-06-29T09:17:40Z

+    return ir.compile(
+        program,
+        output_dir=run_config.save_kernels_dir,
+        strategy=run_config.strategy,
+        backend_type=run_config.backend_type,
+        dump_passes=run_config.dump_passes,
+        diagnostic_phase=run_config.diagnostic_phase,
+        disabled_diagnostics=run_config.disabled_diagnostics,
+        platform=run_config.platform,
+        profiling=run_config.compile_profiling,
+    )


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the PyPTO compile config contract without executing repo code.
rg -n "distributed_config|def compile|class RunConfig" -C3 .

Repository: hw-native-sys/pypto-lib

Length of output: 11268

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== golden/runner.py around run_jit / fallback =="
sed -n '930,990p' golden/runner.py
echo
echo "== golden/runner.py around JIT docs / compile_cfg handling =="
sed -n '990,1145p' golden/runner.py
echo
echo "== tests in tests/golden/test_runner.py around JIT / distributed_config =="
sed -n '108,140p' tests/golden/test_runner.py
sed -n '1228,1264p' tests/golden/test_runner.py

Repository: hw-native-sys/pypto-lib

Length of output: 13393

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== files under pypto-related paths =="
git ls-files | rg '(^|/)(pypto|golden|tests/golden)/'
echo
echo "== search for ir.compile / distributed_config in source and tests =="
rg -n "ir\.compile\(|distributed_config|DistributedCompiledProgram|_maybe_reload_l3|_compile_jit_with_compat" golden tests -C 2

Repository: hw-native-sys/pypto-lib

Length of output: 13597

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== golden/runner.py around the non-JIT compile path =="
sed -n '836,874p' golden/runner.py
echo
echo "== tests for the fallback branch of _compile_jit_with_compat =="
sed -n '140,190p' tests/golden/test_runner.py

Repository: hw-native-sys/pypto-lib

Length of output: 4483

Forward distributed_config in golden/runner.py:965-975.
The JIT fallback rebuilds an ir.Program and calls ir.compile() without passing compile_cfg["distributed_config"], so JIT entries that need L3/distributed settings will compile in the wrong mode.

coderabbitai · 2026-06-29T09:17:40Z

+            compiled = _compile_jit_with_compat(fn, dummy_args, cfg)
            work_dir = Path(compiled.output_dir)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Apply the generated-artifact patches in run_jit() too.

run() patches the compiled/runtime directory before execution, but run_jit() skips those helpers after JIT compilation or runtime_dir reload. JIT L3 artifacts can still hit the same bitcast/host_orch compatibility failures.

Proposed fix

     if compile_only:
         total = time.time() - start
         print(f"[RUN] PASS ({total:.2f}s)", flush=True)
         return RunResult(passed=True, execution_time=total, work_dir=work_dir)
+
+    if work_dir is not None:
+        _patch_aicore_bitcast_helpers(work_dir)
+        _patch_l3_single_submit_host_orch(work_dir)
+        _patch_l3_host_orch_ssa_aliases(work_dir)

     # Generate Inputs

Suggested change

- compiled = _compile_jit_with_compat(fn, dummy_args, cfg)
- work_dir = Path(compiled.output_dir)
+ compiled = _compile_jit_with_compat(fn, dummy_args, cfg)
+ work_dir = Path(compiled.output_dir)
+ if work_dir is not None:
+     _patch_aicore_bitcast_helpers(work_dir)
+     _patch_l3_single_submit_host_orch(work_dir)
+     _patch_l3_host_orch_ssa_aliases(work_dir)

coderabbitai · 2026-06-29T09:17:40Z

+EXPECTED / INTENT program (the dense block-level load balancer). NOTE: this does
+NOT compile on the current toolchain — the data-dependent ``pl.read`` scalar that
+feeds the store offset (``g_base + sb * Q_HEAD_PAD``) trips a PTO codegen
+limitation (``GetOrCreateTensorView`` / ptoas ``index vs i64``; see
+``KNOWN_ISSUES.md``). It is written to capture the desired structure; the
+affine fallback that DOES compile lives in
+``qwen3_manual_scope_fused_kvsplit_static.py`` (coprime-stride, ~1.9x balance).


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Refresh the copied harness docstring.

This new serving kernel is expected to compile/run via decode_fwd, but the header still says the program does not compile and points users at obsolete standalone script names. Please update or remove this stale usage block so it matches the PR’s native serving path.

Also applies to: 52-55

coderabbitai · 2026-06-29T09:17:40Z

+# ragged seq_lens, at the cost of more fa_fused tasks
+# (BATCH * (NUM_KV_HEADS // 2) * KV_SPLITS).
+# Dispatch unit = ONE seq block (TOKENS_PER_SPLIT == SEQ_TILE). Every fa_fused
+# work item is then a single SEQ_TILE block (×2 heads) — equal cost regardless of


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Replace ambiguous multiplication-sign glyphs flagged by Ruff.

Ruff reports RUF003 on these comment lines. Use plain x/by wording to keep lint clean.

Proposed cleanup

-# Dispatch unit = ONE seq block (TOKENS_PER_SPLIT == SEQ_TILE). Every fa_fused
-# work item is then a single SEQ_TILE block (×2 heads)
+# Dispatch unit = ONE seq block (TOKENS_PER_SPLIT == SEQ_TILE). Every fa_fused
+# work item is then a single SEQ_TILE block (by 2 heads)

Apply the same replacement to the other flagged comment lines.

Also applies to: 215-215, 227-227, 817-817
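
To locate the remaining occurrences before rerunning Ruff, a short script along these lines could list comment lines that contain the multiplication sign (U+00D7). The helper name `find_multiplication_signs` is hypothetical, the sample text is adapted from the comment quoted above, and the sketch only looks at full-line comments, not trailing ones.

```python
MULTIPLICATION_SIGN = "\u00d7"


def find_multiplication_signs(source: str) -> list[tuple[int, str]]:
    """Return (line number, ASCII replacement) for full-line comments containing U+00D7."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.lstrip().startswith("#") and MULTIPLICATION_SIGN in line:
            hits.append((lineno, line.replace(MULTIPLICATION_SIGN, "x")))
    return hits


sample = (
    "# Dispatch unit = ONE seq block (TOKENS_PER_SPLIT == SEQ_TILE).\n"
    "# work item is then a single SEQ_TILE block (\u00d72 heads)\n"
    "x = 1\n"
)
hits = find_multiplication_signs(sample)
for lineno, fixed in hits:
    print(f"line {lineno}: {fixed}")
```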

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 193-193: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)

Source: Linters/SAST tools

coderabbitai · 2026-06-29T09:17:40Z

+                                    out_chunk = pl.add(
+                                        down_acc,
+                                        pl.slice(resid1_tile, [TOK_TILE, K_CHUNK], [0, down_proj_d0]),
+                                    )
+                                    out_out_quant_chunk_bf16 = pl.cast(out_chunk, target_type=pl.BF16)
+                                    out = pl.assemble(out, out_out_quant_chunk_bf16, [token_p0, down_proj_d0])


🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

Trim the final partial tile before assembling into out.

Line 1125 writes a full [TOK_TILE, K_CHUNK] chunk into dynamic out; when valid_tok < TOK_TILE, this can write invalid rows past the packed prefill output. Apply the same set_validshape trim used for resid1_chunk before all out writebacks, including debug branches.

🐛 Proposed fix for the final writeback

-                                    out_out_quant_chunk_bf16 = pl.cast(out_chunk, target_type=pl.BF16)
-                                    out = pl.assemble(out, out_out_quant_chunk_bf16, [token_p0, down_proj_d0])
+                                    out_out_quant_chunk_bf16 = pl.cast(out_chunk, target_type=pl.BF16)
+                                    out_out_quant_chunk_valid = pl.tensor.set_validshape(
+                                        out_out_quant_chunk_bf16,
+                                        valid_tok,
+                                        K_CHUNK,
+                                    )
+                                    out = pl.assemble(out, out_out_quant_chunk_valid, [token_p0, down_proj_d0])

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-                                    out_chunk = pl.add(
-                                        down_acc,
-                                        pl.slice(resid1_tile, [TOK_TILE, K_CHUNK], [0, down_proj_d0]),
-                                    )
-                                    out_out_quant_chunk_bf16 = pl.cast(out_chunk, target_type=pl.BF16)
-                                    out = pl.assemble(out, out_out_quant_chunk_bf16, [token_p0, down_proj_d0])
+                                    out_chunk = pl.add(
+                                        down_acc,
+                                        pl.slice(resid1_tile, [TOK_TILE, K_CHUNK], [0, down_proj_d0]),
+                                    )
+                                    out_out_quant_chunk_bf16 = pl.cast(out_chunk, target_type=pl.BF16)
+                                    out_out_quant_chunk_valid = pl.tensor.set_validshape(
+                                        out_out_quant_chunk_bf16,
+                                        valid_tok,
+                                        K_CHUNK,
+                                    )
+                                    out = pl.assemble(out, out_out_quant_chunk_valid, [token_p0, down_proj_d0])

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/qwen3/14b/prefill_fwd_a8w8.py` around lines 1120 - 1125, The final writeback in the prefill path is assembling a full token tile into dynamic output even when the last tile is partial. Update the writeback logic around out_chunk, out_out_quant_chunk_bf16, and pl.assemble in prefill_fwd_a8w8 to trim the chunk with the same set_validshape handling used for resid1_chunk before any out writeback, and make sure the same fix is applied in the debug branches as well.
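
To make the failure mode concrete, here is a plain-Python model of the writeback, offered as a sketch only. It does not use PyPTO. The tile height, the `write_tile` and `write_sequence` helpers, and the packed-buffer layout are all assumptions chosen for illustration, and the trimming step stands in for what the comment above describes `set_validshape` as doing.

```python
TOK_TILE = 4  # hypothetical tile height, not the kernel's real value


def write_tile(out, tile, row0, rows):
    """Copy the first `rows` rows of `tile` into `out`, starting at `row0`."""
    for r in range(rows):
        out[row0 + r] = tile[r]


def write_sequence(out, seq_start, seq_len, trim):
    """Write one sequence tile by tile into a packed output buffer.

    With trim=False every tile is written at full height, so the padding
    rows of the last partial tile land on rows that belong to whatever is
    packed after this sequence.
    """
    for p0 in range(0, seq_len, TOK_TILE):
        valid_tok = min(TOK_TILE, seq_len - p0)
        # Real rows are "A"; "pad" marks rows beyond the valid token count.
        tile = ["A"] * valid_tok + ["pad"] * (TOK_TILE - valid_tok)
        rows = valid_tok if trim else TOK_TILE
        write_tile(out, tile, seq_start + p0, rows)
    return out


if __name__ == "__main__":
    # Sequence A owns rows 0-5; sequence B already occupies rows 6-7.
    untrimmed = write_sequence([None] * 6 + ["B", "B"], 0, 6, trim=False)
    trimmed = write_sequence([None] * 6 + ["B", "B"], 0, 6, trim=True)
    print(untrimmed)
    print(trimmed)
```

In this model the untrimmed run overwrites sequence B's two rows with padding, while the trimmed run leaves them intact. Whether the real kernel corrupts neighbouring rows or writes past the end of the buffer depends on layout details not shown in this thread.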

gemini-code-assist Bot reviewed Jun 29, 2026

coderabbitai Bot reviewed Jun 29, 2026

Add Qwen3 14B A8W8 kernels

c7ada3a

vegetabledoww force-pushed the qwen3-a8w8-stage4-generate branch from c0e0128 to c7ada3a on July 1, 2026 02:12

Optimize A8W8 decode RoPE quality path

d220b6a

Add an explicit batched Q RoPE path guarded by QWEN_A8W8_Q_ROPE_BATCH_EXPLICIT and keep fused QK norm dependencies explicit so the A8W8 decode fast path preserves text quality during generation.

This was referenced Jul 1, 2026

[Tracking] Qwen3-14B A8W8 decode kernels and TPOT optimization #665

Open

[Feature] Add native Qwen3-14B A8W8 serving path hw-native-sys/pypto-serving#52

Open

Add native Qwen3 14B A8W8 serving path hw-native-sys/pypto-serving#48

Open

		save_actual_data: When True with golden_data, also persist runtime
		outputs to ``{work_dir}/data/actual`` for downstream consumers.

		compiled = _compile_jit_with_compat(fn, dummy_args, cfg)
		work_dir = Path(compiled.output_dir)
