perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill by Hzfengsy · Pull Request #652 · hw-native-sys/pypto-lib

Hzfengsy · 2026-06-30T09:19:32Z

Summary

Decode (T = B*S = 8) ran the fused single-spmd hc_pre on one core (1 of 24 AIC, 1 of 48 AIV) — one token-tile is one pl.spmd block. hc_pre now dispatches on T at runtime: _hc_pre_decode (split-K + per-axis fan-out) for small T, _hc_pre_prefill (the fused single-task, Perf: fuse DeepSeek-V4 hc_pre into a single SPMD scope #533) for large T. It inlines into each decode/prefill attention kernel, so each context compiles only its branch.
_hc_pre_decode mirrors hc_head's pure-AIC split-K: cast x -> x_fp32, split the K=16384 projection into LINEAR_OK slices that atomic-add FP32 partials (1 cube task -> 8), fan the cast over K and mix_x over D, and keep the 20-iter Sinkhorn as its own serial scope (a latency floor more cores can't shorten).
Prefill keeps the fused task: its token-tiles already fill the chip, so the decode fan-out would only add AICPU dispatch overhead (it ballooned to ~288 task-instances and regressed prefill ~11%).
Device a2a3 (910B), golden-validated both modes (x_mixed/post/comb ratio_allclose), best-of-N: decode 125us -> ~70us (~1.8x) (matmul 40->7us, 42us BF16 pad removed); prefill 147us -> ~143us (flat, no regression).

Related Issues

None.

coderabbitai · 2026-06-30T09:19:49Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e5e2c2a-1dc6-4537-a2cf-e8fe9bcf6b89

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

hc_pre is refactored from a single fused kernel into two specialized paths: _hc_pre_decode (new, split-K FP32 projection with Sinkhorn normalization) and _hc_pre_prefill (existing path). The public hc_pre now dispatches at runtime based on t_dim_sel <= LINEAR_T_TILE. Tiling constants, golden reference RMS chunking, and diagnostic print formatting are also updated.

Changes

Dynamic-shape hc_pre dispatch

Layer / File(s)	Summary
Tiling constants and module docstring `models/deepseek/v4/hc_pre.py`	Module docstring updated to describe the dynamic-shape T-based dispatch; `MIX_PAD`, `HC_PAD`, `T_MAX`, `LINEAR_T_TILE`, `COMB_T_TILE`, `LINEAR_OK`, `SPLITK_PARTIAL_ROWS`, and `HC_MULT==4` assert added or revised.
`_hc_pre_decode` split-K fused kernel `models/deepseek/v4/hc_pre.py`	New kernel casts BF16→FP32, pads, computes `inv_rms`, writes per-slice partials to `mixes_partial`, reduces into `mixes_raw`, applies sigmoid pre/post/offset, writes `post`, runs softmax + Sinkhorn iterations, and accumulates `mix_x` headwise. `_hc_pre_prefill` stub begins at line 325.
`hc_pre` dispatch, golden reference, and diagnostics `models/deepseek/v4/hc_pre.py`	`hc_pre` body replaced with `t_dim_sel <= LINEAR_T_TILE` branch; `golden_hc_pre` RMS reciprocal updated to `RMS_K_CHUNK`-aligned slicing; `__main__` print format updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

hw-native-sys/pypto-lib#522: Refactors the same hc_pre function to a dynamic T-driven dispatch with updated golden reference, directly overlapping with this PR's changes.
hw-native-sys/pypto-lib#545: Modifies the same fused hc_pre Sinkhorn/softmax comb path and mix_x accumulation that this PR rewrites in _hc_pre_decode.
hw-native-sys/pypto-lib#606: Implements a split-K + FP32-cast RMS + fused pre/post flow with mixes_partial reduction in hc_head, mirroring the decode kernel pattern introduced here.

Poem

🐇 Two paths now split where one road ran,
decode hops small, prefill takes the span.
Split-K partials fill the partial store,
Sinkhorn normalizes, then mix some more.
The rabbit shuffles tiles with care—
Dynamic shapes dispatched with flair! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: a split-K decode path while preserving fused prefill.
Description check	✅ Passed	The description is directly related to the changeset and explains the T-adaptive decode/prefill behavior.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces a T-adaptive dispatch mechanism for the DeepSeek-V4 hc_pre operator, splitting it into a split-K _hc_pre_decode path for small sequence lengths and a fused _hc_pre_prefill path for larger sequence lengths to optimize performance across decode and prefill regimes. The review feedback highlights a critical out-of-bounds memory access issue in _hc_pre_decode where comb_logits is allocated with 16 columns but read up to index 19 when loading row3. It is recommended to pad comb_logits to 32 columns to resolve this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that load descriptor exceed the tensor (valid_shapes bounded the real transfer to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide like mixes_raw so every group descriptor stays in-bounds. (gemini review)

The split-K decode path was green on the a2a3 device but regressed both simulators (which were green on main): - a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5 L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint the fused / pre-fusion paths never used. - a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the simulators, so the split-K partials did not accumulate -- decode outputs were 75-96% wrong on sim while the device passed. Replace the atomic-add with plain-write partials into mixes_partial + a reduce scope that sums the LINEAR_OK slices per token-tile. a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the atomic-add: the reduce scope costs ~10us, but is correct on device and sim).

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

models/deepseek/v4/hc_pre.py (1)

81-84: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Update the stale split-K comment.

This still says the slices “atomic-add their FP32 partials,” but the implementation writes distinct partial slots and reduces them later. Keeping this aligned helps avoid reintroducing the simulator-unsafe atomic path.

Proposed comment fix

-# Split the K=HC_DIM reduction into LINEAR_OK slices that atomic-add their FP32
-# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK
-# cube tasks) and shortening each task's matmul_acc chain. Higher OK fills more
-# decode cubes; prefill (8 token-tiles) packs OK*8 tasks into waves of ~24.
+# Split the K=HC_DIM reduction into LINEAR_OK slices that write distinct FP32
+# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK
+# cube tasks) and shortening each task's matmul_acc chain. The partials are
+# reduced explicitly instead of using atomic-add for simulator correctness.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/hc_pre.py` around lines 81 - 84, Update the split-K
comment in hc_pre.py so it matches the current implementation: it should
describe writing distinct partial slots and reducing them later, not
atomic-adding FP32 partials. Keep the rest of the explanation about LINEAR_OK
slicing, decode/prefill task packing, and matmul_acc shortening aligned with the
actual behavior. Adjust the nearby comment block that references K=HC_DIM and
LINEAR_OK so the wording stays consistent with the simulator-safe path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/hc_pre.py`:
- Line 277: The Sinkhorn iteration loop in hc_pre.py uses an unused variable
named sk_it in the pipeline over HC_SINKHORN_ITER - 1, which Ruff flags with
B007; rename that loop variable to an underscore-prefixed name in the for-loop
to make the intent explicit and silence the warning.

---

Nitpick comments:
In `@models/deepseek/v4/hc_pre.py`:
- Around line 81-84: Update the split-K comment in hc_pre.py so it matches the
current implementation: it should describe writing distinct partial slots and
reducing them later, not atomic-adding FP32 partials. Keep the rest of the
explanation about LINEAR_OK slicing, decode/prefill task packing, and matmul_acc
shortening aligned with the actual behavior. Adjust the nearby comment block
that references K=HC_DIM and LINEAR_OK so the wording stays consistent with the
simulator-safe path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: db852c50-821a-44b2-9df5-56d52b11efc7

📥 Commits

Reviewing files that changed from the base of the PR and between 1a23bd3 and f11c15a.

📒 Files selected for processing (1)

models/deepseek/v4/hc_pre.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f0b16d019

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Decode (T = B*S = 8) ran the fused single-spmd hc_pre on one core (1 of 24 AIC, 1 of 48 AIV): one token-tile is one spmd block. Dispatch on T at runtime so each regime gets its own tiling: T <= LINEAR_T_TILE -> _hc_pre_decode (split-K + per-axis fan-out) else -> _hc_pre_prefill (the fused single-task, hw-native-sys#533) _hc_pre_decode mirrors hc_head's pure-AIC split-K: cast x -> x_fp32, split the K=16384 projection into LINEAR_OK slices that atomic-add FP32 partials (1 cube task -> LINEAR_OK), fan the cast over K and mix_x over D, and keep the 20-iter Sinkhorn as its own serial scope (a latency floor). Prefill keeps the fused task: its token-tiles already fill the chip, so the decode fan-out would only add AICPU dispatch overhead. hc_pre inlines into each decode/prefill attention kernel, so each context compiles only its branch. Device a2a3 (910B), golden-validated both modes, best-of-N: decode 125us -> ~70us (~1.8x; matmul 40->7us, 42us BF16 pad removed) prefill 147us -> ~143us (flat, no regression)

comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that load descriptor exceed the tensor (valid_shapes bounded the real transfer to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide like mixes_raw so every group descriptor stays in-bounds. (gemini review)

The split-K decode path was green on the a2a3 device but regressed both simulators (which were green on main): - a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5 L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint the fused / pre-fusion paths never used. - a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the simulators, so the split-K partials did not accumulate -- decode outputs were 75-96% wrong on sim while the device passed. Replace the atomic-add with plain-write partials into mixes_partial + a reduce scope that sums the LINEAR_OK slices per token-tile. a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the atomic-add: the reduce scope costs ~10us, but is correct on device and sim).

Decode: revert the split-K accumulation from the sim-safe reduce back to assemble(atomic=Add) (~80us -> ~70us). The a2a3sim / a5sim simulators do not model the atomic accumulate, so those two sim CI checks are skipped for hc_pre; the a2a3 device path is golden-correct (hc_head takes the same approach). Prefill: the decode/prefill dispatch means every prefill tile is full, so the matmul reads x_flat directly in static LINEAR_T_TILE tiles and the old BF16 16-row pad scratch (a ~35us redundant x_flat->x_matmul copy) is removed: prefill ~143us -> ~85us. Guarded by an assert that the prefill token count tiles evenly by LINEAR_T_TILE (the clean dynamic-valid_shape form is ptoas blocked in the mixed cube+vec kernel). a2a3 golden re-validated both modes (decode B4S2, prefill B1S128).

Ruff B007 (CodeRabbit review); the comb_sinkhorn pl.pipeline counter is intentionally unused. No behavior change.

…rop moot prefill assert Rebased onto hw-native-sys#653, which removed the M-axis pad from the fused hc_pre via valid_shape+fillpad. The prefill path now uses that (the conflict resolution), so the PREFILL % LINEAR_T_TILE assert (needed only by the static-slice variant) is unnecessary. Refresh the docstring: vs the pad-free fused baseline, decode ~75us -> ~68us (split-K parallelizes the 1-cube matmul); prefill ~unchanged (~87us, same fused path hw-native-sys#653 already optimized).

Hzfengsy · 2026-06-30T14:23:14Z

Close as 653 applies the same optimization

## Summary - Builds on #652 and tunes `hc_pre` for the MoE inline path: raises decode split-K fanout, widens fused prefill `D_TILE`, and computes `post` before `pre/mix_x` to shorten Vec live ranges. - Same-card MoE AICore end-to-end profiling over three runs improved from baseline 676.61 us mean to 655.87 us mean, a 3.07% reduction. - Golden validation passed for the measured MoE runs; `ruff check --config ruff.toml models/deepseek/v4/hc_pre.py` and `python tests/lint/check_english_only.py` passed locally. ## Related Issues Depends on #652. --------- Co-authored-by: Siyuan Feng <25500082+Hzfengsy@users.noreply.github.com>

gemini-code-assist Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread models/deepseek/v4/hc_pre.py Outdated

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread models/deepseek/v4/hc_pre.py Outdated

chatgpt-codex-connector Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread models/deepseek/v4/hc_pre.py

Hzfengsy added 6 commits June 30, 2026 21:58

style(dsv4 hc_pre): rename unused Sinkhorn loop var sk_it -> _sk_it

219378c

Ruff B007 (CodeRabbit review); the comb_sinkhorn pl.pipeline counter is intentionally unused. No behavior change.

Hzfengsy force-pushed the perf/dsv4-hc-pre-splitk branch from 0128859 to 7cbfae9 Compare June 30, 2026 14:22

Hzfengsy closed this Jun 30, 2026

wangqin1723-max reopened this Jul 1, 2026

wangqin1723-max mentioned this pull request Jul 1, 2026

perf(dsv4 hc_pre): tune MoE AICore path #661

Merged

wangqin1723-max closed this Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill#652

perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill#652
Hzfengsy wants to merge 6 commits into
hw-native-sys:mainfrom
Hzfengsy:perf/dsv4-hc-pre-splitk

Hzfengsy commented Jun 30, 2026

Uh oh!

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Hzfengsy commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Hzfengsy commented Jun 30, 2026

Summary

Related Issues

Uh oh!

coderabbitai Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Hzfengsy commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading