perf(dsv4 hc_pre): tune MoE AICore path by wangqin1723-max · Pull Request #661 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-07-01T08:46:23Z

Summary

Builds on #652 (perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill) and tunes hc_pre for the MoE inline path: it raises the decode split-K fanout, widens the fused prefill D_TILE, and computes post before pre/mix_x to shorten Vec live ranges.
Same-card MoE AICore end-to-end profiling over three runs improved from baseline 676.61 us mean to 655.87 us mean, a 3.07% reduction.
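As a quick check of the reported figure, the reduction can be recomputed from the two means quoted above. This is only a sketch of the arithmetic; the helper name `percent_reduction` is hypothetical and not part of the PR.

```python
def percent_reduction(baseline_us: float, tuned_us: float) -> float:
    """Relative latency reduction, expressed as a percentage of the baseline."""
    return (baseline_us - tuned_us) / baseline_us * 100.0


# Means reported in the summary (three same-card runs each).
reduction = percent_reduction(676.61, 655.87)
print(f"{reduction:.2f}% reduction")
```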
Golden validation passed for the measured MoE runs; ruff check --config ruff.toml models/deepseek/v4/hc_pre.py and python tests/lint/check_english_only.py passed locally.

Related Issues

Depends on #652.

Decode (T = B*S = 8) ran the fused single-spmd hc_pre on one core (1 of 24 AIC, 1 of 48 AIV): one token-tile is one spmd block. Dispatch on T at runtime so each regime gets its own tiling:

- T <= LINEAR_T_TILE -> _hc_pre_decode (split-K + per-axis fan-out)
- else -> _hc_pre_prefill (the fused single-task, hw-native-sys#533)

_hc_pre_decode mirrors hc_head's pure-AIC split-K: cast x -> x_fp32, split the K=16384 projection into LINEAR_OK slices that atomic-add FP32 partials (1 cube task -> LINEAR_OK), fan the cast over K and mix_x over D, and keep the 20-iter Sinkhorn as its own serial scope (a latency floor). Prefill keeps the fused task: its token-tiles already fill the chip, so the decode fan-out would only add AICPU dispatch overhead. hc_pre inlines into each decode/prefill attention kernel, so each context compiles only its branch.

Device a2a3 (910B), golden-validated both modes, best-of-N:

- decode 125us -> ~70us (~1.8x; matmul 40->7us, 42us BF16 pad removed)
- prefill 147us -> ~143us (flat, no regression)
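The split-K idea in this commit message can be illustrated in plain Python. This is a minimal sketch, not the kernel: nested lists stand in for tensors, the sequential loop stands in for independent cube tasks, and the `+=` stands in for the atomic-add of FP32 partials. The names `matmul` and `split_k_matmul` and the tiny shapes are assumptions made for illustration.

```python
def matmul(a, b):
    """Reference matmul on nested lists: a is T x K, b is K x N."""
    n = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(n)]
            for row in a]


def split_k_matmul(a, b, num_slices):
    """Split the K reduction into num_slices slices and accumulate the partials."""
    k_total = len(b)
    assert k_total % num_slices == 0, "K must divide evenly into slices"
    k_chunk = k_total // num_slices
    t, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(t)]
    for s in range(num_slices):  # each slice stands in for one cube task
        k0 = s * k_chunk
        a_slice = [row[k0:k0 + k_chunk] for row in a]
        b_slice = b[k0:k0 + k_chunk]
        partial = matmul(a_slice, b_slice)
        for i in range(t):  # stands in for the atomic-add of partials
            for j in range(n):
                out[i][j] += partial[i][j]
    return out


# Tiny integer-valued inputs so both paths give exactly the same sums.
a = [[float(i + k) for k in range(8)] for i in range(2)]  # T=2, K=8
b = [[float(k - j) for j in range(3)] for k in range(8)]  # K=8, N=3
reference = matmul(a, b)
split = split_k_matmul(a, b, num_slices=4)
print(split == reference)
```

With general floating-point data the two results may differ in the last bits, because the slices change the order of the additions.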

comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that load descriptor exceed the tensor (valid_shapes bounded the real transfer to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide like mixes_raw so every group descriptor stays in-bounds. (gemini review)
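The bounds problem can be reproduced with a few lines of arithmetic. In this sketch, `HC_MULT = 4` follows from HC_MULT*HC_MULT=16 and `HC_PAD = 8` is inferred from group 3 spanning cols [12:20]; the helper names are hypothetical.

```python
HC_MULT = 4  # from HC_MULT * HC_MULT = 16 in the commit message
HC_PAD = 8   # inferred from group 3 spanning cols [12:20]


def group_descriptor(k):
    """Column range [start, stop) that the load descriptor for group k covers."""
    start = k * HC_MULT
    return start, start + HC_PAD


def out_of_bounds_groups(alloc_width):
    """Groups whose descriptor extends past an allocation of alloc_width cols."""
    return [k for k in range(HC_MULT) if group_descriptor(k)[1] > alloc_width]


print(out_of_bounds_groups(16))  # the original 16-wide allocation
print(out_of_bounds_groups(32))  # the 32-wide allocation used by the fix
```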

The split-K decode path was green on the a2a3 device but regressed both simulators (which were green on main):

- a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5 L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint the fused / pre-fusion paths never used.
- a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the simulators, so the split-K partials did not accumulate -- decode outputs were 75-96% wrong on sim while the device passed. Replace the atomic-add with plain-write partials into mixes_partial + a reduce scope that sums the LINEAR_OK slices per token-tile.

a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the atomic-add: the reduce scope costs ~10us, but is correct on device and sim).
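The difference between the two accumulation schemes can be sketched in plain Python. The commit only says the partials "did not accumulate" on the simulators; the last-writer-wins behavior modeled here is an assumed failure mode chosen for illustration, and the function names and sample values are hypothetical.

```python
def reduce_partials(mixes_partial):
    """Sim-safe form: each slice wrote its own row; sum the rows elementwise."""
    num_cols = len(mixes_partial[0])
    return [sum(p[j] for p in mixes_partial) for j in range(num_cols)]


def accumulate(partials, atomic_add_modeled):
    """Shared-output form: every slice targets the same row.

    If the atomic add is not modeled, each write overwrites the previous
    one (an assumption about the failure mode, for illustration only).
    """
    out = [0.0] * len(partials[0])
    for p in partials:
        for j, v in enumerate(p):
            out[j] = out[j] + v if atomic_add_modeled else v
    return out


partials = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]  # hypothetical slices
print(reduce_partials(partials))                       # plain writes + reduce
print(accumulate(partials, atomic_add_modeled=True))   # device-like
print(accumulate(partials, atomic_add_modeled=False))  # simulator-like
```

The reduce form costs an extra pass over the partial buffer, which is consistent with the ~10us overhead the commit reports.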

Decode: revert the split-K accumulation from the sim-safe reduce back to assemble(atomic=Add) (~80us -> ~70us). The a2a3sim / a5sim simulators do not model the atomic accumulate, so those two sim CI checks are skipped for hc_pre; the a2a3 device path is golden-correct (hc_head takes the same approach).

Prefill: the decode/prefill dispatch means every prefill tile is full, so the matmul reads x_flat directly in static LINEAR_T_TILE tiles and the old BF16 16-row pad scratch (a ~35us redundant x_flat->x_matmul copy) is removed: prefill ~143us -> ~85us. Guarded by an assert that the prefill token count tiles evenly by LINEAR_T_TILE (the clean dynamic-valid_shape form is ptoas blocked in the mixed cube+vec kernel).

a2a3 golden re-validated both modes (decode B4S2, prefill B1S128).
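The static-tile guard described for prefill can be sketched as follows. The value `LINEAR_T_TILE = 16` is an assumption (the page mentions a 16-row pad and 8 token-tiles for the B1S128 case, but never states the constant), and `static_token_tiles` is a hypothetical helper.

```python
LINEAR_T_TILE = 16  # assumed value; the constant is not stated on this page


def static_token_tiles(num_tokens, tile=LINEAR_T_TILE):
    """Return [start, stop) token ranges, requiring every tile to be full."""
    assert num_tokens % tile == 0, (
        f"prefill token count {num_tokens} must tile evenly by {tile}"
    )
    return [(t0, t0 + tile) for t0 in range(0, num_tokens, tile)]


tiles = static_token_tiles(128)  # B1S128 prefill case from the commit
print(len(tiles), tiles[0], tiles[-1])
```

A token count such as 120 would trip the assert under this assumed tile size, which is the situation the guard exists to catch.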

Ruff B007 (CodeRabbit review); the comb_sinkhorn pl.pipeline counter is intentionally unused. No behavior change.

…rop moot prefill assert

Rebased onto hw-native-sys#653, which removed the M-axis pad from the fused hc_pre via valid_shape+fillpad. The prefill path now uses that (the conflict resolution), so the PREFILL % LINEAR_T_TILE assert (needed only by the static-slice variant) is unnecessary.

Refresh the docstring: vs the pad-free fused baseline, decode ~75us -> ~68us (split-K parallelizes the 1-cube matmul); prefill ~unchanged (~87us, same fused path hw-native-sys#653 already optimized).

- Raise decode split-K fanout to improve small-T AICore fill
- Widen fused prefill D tile to reduce mix_x loop count
- Compute post before pre/mix_x to shorten Vec live ranges

coderabbitai · 2026-07-01T08:46:50Z

📝 Walkthrough


The hc_pre kernel in models/deepseek/v4/hc_pre.py is refactored from a single fused implementation into a runtime-dispatched design: a new _hc_pre_decode split-K kernel handles small-T decode, the existing _hc_pre_prefill handles large-T prefill, with matching golden reference and print string updates.

Changes

hc_pre decode/prefill refactor

Layer / File(s)	Summary
New split-K decode kernel `models/deepseek/v4/hc_pre.py`	Adds module docstring describing decode vs prefill design and a new `_hc_pre_decode` kernel: BF16→FP32 cast, RMS norm, split-K atomic-add HC projection, then in-kernel `pre`/`post`/`comb` (Sinkhorn)/`mix_x` computation returning `x_mixed`.
Prefill epilogue reordering `models/deepseek/v4/hc_pre.py`	`_hc_pre_prefill` now computes/stores `post` from `mixes_gm` before deriving `pre` (kept in Vec) for later use by `mix_x`; minor Sinkhorn pipeline loop line adjustment.
Runtime dispatch wrapper, golden reference, and print header `models/deepseek/v4/hc_pre.py`	`hc_pre` now branches on `t_dim_sel` vs `LINEAR_T_TILE` to call `_hc_pre_decode` or `_hc_pre_prefill`; `golden_hc_pre` uses `RMS_K_CHUNK`/`LINEAR_K_CHUNK`; mode print header drops "1spmd" wording.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant hc_pre as hc_pre wrapper
    participant Decode as _hc_pre_decode
    participant Prefill as _hc_pre_prefill

    Caller->>hc_pre: call hc_pre(x, ...)
    hc_pre->>hc_pre: t_dim_sel = pl.tensor.dim(x, 0)
    alt t_dim_sel <= LINEAR_T_TILE
        hc_pre->>Decode: dispatch decode path
        Decode->>Decode: cast BF16->FP32, RMS norm
        Decode->>Decode: split-K atomic-add projection
        Decode->>Decode: compute pre/post/comb/mix_x
        Decode-->>hc_pre: x_mixed
    else t_dim_sel > LINEAR_T_TILE
        hc_pre->>Prefill: dispatch prefill path
        Prefill->>Prefill: compute post from mixes_gm
        Prefill->>Prefill: compute pre in Vec
        Prefill->>Prefill: compute mix_x
        Prefill-->>hc_pre: x_mixed
    end
    hc_pre-->>Caller: return x_mixed
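The dispatch in the diagram can be sketched in plain Python. This is not the kernel code: the two branch functions are stubs that only report which path ran, `len(x)` stands in for `pl.tensor.dim(x, 0)`, and `LINEAR_T_TILE = 16` is an assumed value.

```python
LINEAR_T_TILE = 16  # assumed value; the constant is not stated on this page


def _hc_pre_decode(x):
    """Stub for the split-K decode kernel."""
    return "decode"


def _hc_pre_prefill(x):
    """Stub for the fused prefill kernel."""
    return "prefill"


def hc_pre(x):
    """Pick the tiling regime from the token dimension of x."""
    t_dim_sel = len(x)  # stands in for pl.tensor.dim(x, 0)
    if t_dim_sel <= LINEAR_T_TILE:
        return _hc_pre_decode(x)
    return _hc_pre_prefill(x)


print(hc_pre([0] * 8))    # decode case from the PR: T = B*S = 8
print(hc_pre([0] * 128))  # prefill case from the PR: B1S128
```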

Possibly related PRs

hw-native-sys/pypto-lib#522: Both refactor hc_pre.py's t_dim dispatch and decode vs prefill paths plus matching golden_hc_pre updates.
hw-native-sys/pypto-lib#533: Both restructure the fused pre/post/comb/mix_x pipeline in hc_pre.py and update the golden reference accordingly.
hw-native-sys/pypto-lib#545: Both modify hc_pre.py's epilogue Sinkhorn comb and mix_x computation logic.

Suggested labels: enhancement

Poem

A rabbit hops through split-K rows,
Decode and prefill, two paths it knows,
RMS hums, atomics add,
Sinkhorn spins till the mix is glad. 🐇
"1spmd" hops away, retired,
New chunks aligned as the golden required!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the performance tuning of the hc_pre MoE AICore path.
Description check	✅ Passed	The description clearly matches the code changes and performance validation described in the pull request.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist · 2026-07-01T08:52:32Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/hc_pre.py`:
- Around line 84-87: The comment in hc_pre.py is stale because prefill no longer
goes through the split-K LINEAR_OK path; update the surrounding documentation
near the K=HC_DIM reduction logic to describe the current dispatch behavior
accurately. Remove the claim that prefill packs OK*8 tasks into ~24-wide waves
and instead note that large T routes to _hc_pre_prefill with the fused
single-matmul-per-token-tile path, while LINEAR_OK only applies to
decode/small-T behavior.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d2b0afe-73f0-4acc-8566-bf78dccc3cb7

📥 Commits

Reviewing files that changed from the base of the PR and between 57772f3 and b9d9dc1.

📒 Files selected for processing (1)

models/deepseek/v4/hc_pre.py

coderabbitai · 2026-07-01T09:06:07Z

+# Split the K=HC_DIM reduction into LINEAR_OK slices that atomic-add their FP32
+# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK
+# cube tasks) and shortening each task's matmul_acc chain. Higher OK fills more
+# decode cubes; prefill (8 token-tiles) packs OK*8 tasks into waves of ~24.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Stale comment: prefill no longer uses the split-K/LINEAR_OK path.

The dispatch routes large T to _hc_pre_prefill, which keeps the fused single-matmul-per-token-tile path and never references LINEAR_OK. The trailing claim that "prefill (8 token-tiles) packs OK*8 tasks into waves of ~24" describes the earlier unified split-K design (the "briefly using atomic split-K" commit) and now contradicts the module docstring at lines 38-42. This risks misleading anyone tuning LINEAR_OK into thinking it affects prefill.
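To make the task-count arithmetic behind the comment concrete, here is a small sketch. The cube count of 24 comes from "1 of 24 AIC" in the PR description; the `linear_ok` value of 8 is purely hypothetical, since the actual LINEAR_OK setting is not stated on this page, and `cube_waves` is not a function from the PR.

```python
import math

NUM_AIC = 24  # from "1 of 24 AIC" in the PR description


def cube_waves(num_token_tiles, linear_ok, num_cubes=NUM_AIC):
    """Return (cube tasks, scheduling waves) for a split-K launch."""
    tasks = num_token_tiles * linear_ok
    return tasks, math.ceil(tasks / num_cubes)


# Decode: one token-tile fans out to LINEAR_OK cube tasks.
print(cube_waves(num_token_tiles=1, linear_ok=8))
# Earlier unified design only: 8 prefill token-tiles would give OK*8 tasks.
print(cube_waves(num_token_tiles=8, linear_ok=8))
```

Under the current dispatch only the first call is relevant, because prefill no longer goes through split-K.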

📝 Suggested comment fix

 # Split the K=HC_DIM reduction into LINEAR_OK slices that atomic-add their FP32
 # partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK
 # cube tasks) and shortening each task's matmul_acc chain. Higher OK fills more
-# decode cubes; prefill (8 token-tiles) packs OK*8 tasks into waves of ~24.
+# decode cubes. Split-K is decode-only; _hc_pre_prefill keeps the fused
+# single-matmul-per-token-tile path and does not use LINEAR_OK.



Hzfengsy and others added 7 commits June 30, 2026 21:58

style(dsv4 hc_pre): rename unused Sinkhorn loop var sk_it -> _sk_it

219378c

Ruff B007 (CodeRabbit review); the comb_sinkhorn pl.pipeline counter is intentionally unused. No behavior change.

Update: tune DSV4 hc_pre MoE path

b9d9dc1

- Raise decode split-K fanout to improve small-T AICore fill
- Widen fused prefill D tile to reduce mix_x loop count
- Compute post before pre/mix_x to shorten Vec live ranges

coderabbitai Bot reviewed Jul 1, 2026


zhangqi-chen merged commit b4aee1a into hw-native-sys:main Jul 1, 2026
5 of 7 checks passed

zhangqi-chen merged 7 commits into hw-native-sys:main from wangqin1723-max:perf/dsv4-hc-pre-moe-tune
