Skip to content

perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill#652

Closed
Hzfengsy wants to merge 6 commits into
hw-native-sys:mainfrom
Hzfengsy:perf/dsv4-hc-pre-splitk
Closed

perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill#652
Hzfengsy wants to merge 6 commits into
hw-native-sys:mainfrom
Hzfengsy:perf/dsv4-hc-pre-splitk

Conversation

@Hzfengsy

Copy link
Copy Markdown
Member

Summary

  • Decode (T = B*S = 8) ran the fused single-spmd hc_pre on one core (1 of 24 AIC, 1 of 48 AIV) — one token-tile is one pl.spmd block. hc_pre now dispatches on T at runtime: _hc_pre_decode (split-K + per-axis fan-out) for small T, _hc_pre_prefill (the fused single-task, Perf: fuse DeepSeek-V4 hc_pre into a single SPMD scope #533) for large T. It inlines into each decode/prefill attention kernel, so each context compiles only its branch.
  • _hc_pre_decode mirrors hc_head's pure-AIC split-K: cast x -> x_fp32, split the K=16384 projection into LINEAR_OK slices that atomic-add FP32 partials (1 cube task -> 8), fan the cast over K and mix_x over D, and keep the 20-iter Sinkhorn as its own serial scope (a latency floor more cores can't shorten).
  • Prefill keeps the fused task: its token-tiles already fill the chip, so the decode fan-out would only add AICPU dispatch overhead (it ballooned to ~288 task-instances and regressed prefill ~11%).
  • Device a2a3 (910B), golden-validated both modes (x_mixed/post/comb ratio_allclose), best-of-N: decode 125us -> ~70us (~1.8x) (matmul 40->7us, 42us BF16 pad removed); prefill 147us -> ~143us (flat, no regression).

Related Issues

None.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e5e2c2a-1dc6-4537-a2cf-e8fe9bcf6b89

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

hc_pre is refactored from a single fused kernel into two specialized paths: _hc_pre_decode (new, split-K FP32 projection with Sinkhorn normalization) and _hc_pre_prefill (existing path). The public hc_pre now dispatches at runtime based on t_dim_sel <= LINEAR_T_TILE. Tiling constants, golden reference RMS chunking, and diagnostic print formatting are also updated.

Changes

Dynamic-shape hc_pre dispatch

Layer / File(s) Summary
Tiling constants and module docstring
models/deepseek/v4/hc_pre.py
Module docstring updated to describe the dynamic-shape T-based dispatch; MIX_PAD, HC_PAD, T_MAX, LINEAR_T_TILE, COMB_T_TILE, LINEAR_OK, SPLITK_PARTIAL_ROWS, and HC_MULT==4 assert added or revised.
_hc_pre_decode split-K fused kernel
models/deepseek/v4/hc_pre.py
New kernel casts BF16→FP32, pads, computes inv_rms, writes per-slice partials to mixes_partial, reduces into mixes_raw, applies sigmoid pre/post/offset, writes post, runs softmax + Sinkhorn iterations, and accumulates mix_x headwise. _hc_pre_prefill stub begins at line 325.
hc_pre dispatch, golden reference, and diagnostics
models/deepseek/v4/hc_pre.py
hc_pre body replaced with t_dim_sel <= LINEAR_T_TILE branch; golden_hc_pre RMS reciprocal updated to RMS_K_CHUNK-aligned slicing; __main__ print format updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#522: Refactors the same hc_pre function to a dynamic T-driven dispatch with updated golden reference, directly overlapping with this PR's changes.
  • hw-native-sys/pypto-lib#545: Modifies the same fused hc_pre Sinkhorn/softmax comb path and mix_x accumulation that this PR rewrites in _hc_pre_decode.
  • hw-native-sys/pypto-lib#606: Implements a split-K + FP32-cast RMS + fused pre/post flow with mixes_partial reduction in hc_head, mirroring the decode kernel pattern introduced here.

Poem

🐇 Two paths now split where one road ran,
decode hops small, prefill takes the span.
Split-K partials fill the partial store,
Sinkhorn normalizes, then mix some more.
The rabbit shuffles tiles with care—
Dynamic shapes dispatched with flair! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: a split-K decode path while preserving fused prefill.
Description check ✅ Passed The description is directly related to the changeset and explains the T-adaptive decode/prefill behavior.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a T-adaptive dispatch mechanism for the DeepSeek-V4 hc_pre operator, splitting it into a split-K _hc_pre_decode path for small sequence lengths and a fused _hc_pre_prefill path for larger sequence lengths to optimize performance across decode and prefill regimes. The review feedback highlights a critical out-of-bounds memory access issue in _hc_pre_decode where comb_logits is allocated with 16 columns but read up to index 19 when loading row3. It is recommended to pad comb_logits to 32 columns to resolve this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread models/deepseek/v4/hc_pre.py Outdated
Hzfengsy added a commit to Hzfengsy/pypto-lib that referenced this pull request Jun 30, 2026
comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so
group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that
load descriptor exceed the tensor (valid_shapes bounded the real transfer
to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide
like mixes_raw so every group descriptor stays in-bounds. (gemini review)
Hzfengsy added a commit to Hzfengsy/pypto-lib that referenced this pull request Jun 30, 2026
The split-K decode path was green on the a2a3 device but regressed both
simulators (which were green on main):

- a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5
  L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error
  in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint
  the fused / pre-fusion paths never used.

- a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the
  simulators, so the split-K partials did not accumulate -- decode outputs were
  75-96% wrong on sim while the device passed. Replace the atomic-add with
  plain-write partials into mixes_partial + a reduce scope that sums the
  LINEAR_OK slices per token-tile.

a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the
atomic-add: the reduce scope costs ~10us, but is correct on device and sim).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
models/deepseek/v4/hc_pre.py (1)

81-84: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Update the stale split-K comment.

This still says the slices “atomic-add their FP32 partials,” but the implementation writes distinct partial slots and reduces them later. Keeping this aligned helps avoid reintroducing the simulator-unsafe atomic path.

Proposed comment fix
-# Split the K=HC_DIM reduction into LINEAR_OK slices that atomic-add their FP32
-# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK
-# cube tasks) and shortening each task's matmul_acc chain. Higher OK fills more
-# decode cubes; prefill (8 token-tiles) packs OK*8 tasks into waves of ~24.
+# Split the K=HC_DIM reduction into LINEAR_OK slices that write distinct FP32
+# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK
+# cube tasks) and shortening each task's matmul_acc chain. The partials are
+# reduced explicitly instead of using atomic-add for simulator correctness.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/hc_pre.py` around lines 81 - 84, Update the split-K
comment in hc_pre.py so it matches the current implementation: it should
describe writing distinct partial slots and reducing them later, not
atomic-adding FP32 partials. Keep the rest of the explanation about LINEAR_OK
slicing, decode/prefill task packing, and matmul_acc shortening aligned with the
actual behavior. Adjust the nearby comment block that references K=HC_DIM and
LINEAR_OK so the wording stays consistent with the simulator-safe path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/hc_pre.py`:
- Line 277: The Sinkhorn iteration loop in hc_pre.py uses an unused variable
named sk_it in the pipeline over HC_SINKHORN_ITER - 1, which Ruff flags with
B007; rename that loop variable to an underscore-prefixed name in the for-loop
to make the intent explicit and silence the warning.

---

Nitpick comments:
In `@models/deepseek/v4/hc_pre.py`:
- Around line 81-84: Update the split-K comment in hc_pre.py so it matches the
current implementation: it should describe writing distinct partial slots and
reducing them later, not atomic-adding FP32 partials. Keep the rest of the
explanation about LINEAR_OK slicing, decode/prefill task packing, and matmul_acc
shortening aligned with the actual behavior. Adjust the nearby comment block
that references K=HC_DIM and LINEAR_OK so the wording stays consistent with the
simulator-safe path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: db852c50-821a-44b2-9df5-56d52b11efc7

📥 Commits

Reviewing files that changed from the base of the PR and between 1a23bd3 and f11c15a.

📒 Files selected for processing (1)
  • models/deepseek/v4/hc_pre.py

Comment thread models/deepseek/v4/hc_pre.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f0b16d019

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/deepseek/v4/hc_pre.py
Hzfengsy added 6 commits June 30, 2026 21:58
Decode (T = B*S = 8) ran the fused single-spmd hc_pre on one core (1 of 24
AIC, 1 of 48 AIV): one token-tile is one spmd block. Dispatch on T at
runtime so each regime gets its own tiling:

  T <= LINEAR_T_TILE -> _hc_pre_decode   (split-K + per-axis fan-out)
  else               -> _hc_pre_prefill  (the fused single-task, hw-native-sys#533)

_hc_pre_decode mirrors hc_head's pure-AIC split-K: cast x -> x_fp32, split
the K=16384 projection into LINEAR_OK slices that atomic-add FP32 partials
(1 cube task -> LINEAR_OK), fan the cast over K and mix_x over D, and keep
the 20-iter Sinkhorn as its own serial scope (a latency floor). Prefill
keeps the fused task: its token-tiles already fill the chip, so the decode
fan-out would only add AICPU dispatch overhead. hc_pre inlines into each
decode/prefill attention kernel, so each context compiles only its branch.

Device a2a3 (910B), golden-validated both modes, best-of-N:
  decode  125us -> ~70us (~1.8x; matmul 40->7us, 42us BF16 pad removed)
  prefill 147us -> ~143us (flat, no regression)
comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so
group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that
load descriptor exceed the tensor (valid_shapes bounded the real transfer
to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide
like mixes_raw so every group descriptor stays in-bounds. (gemini review)
The split-K decode path was green on the a2a3 device but regressed both
simulators (which were green on main):

- a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5
  L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error
  in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint
  the fused / pre-fusion paths never used.

- a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the
  simulators, so the split-K partials did not accumulate -- decode outputs were
  75-96% wrong on sim while the device passed. Replace the atomic-add with
  plain-write partials into mixes_partial + a reduce scope that sums the
  LINEAR_OK slices per token-tile.

a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the
atomic-add: the reduce scope costs ~10us, but is correct on device and sim).
Decode: revert the split-K accumulation from the sim-safe reduce back to
assemble(atomic=Add) (~80us -> ~70us). The a2a3sim / a5sim simulators do not
model the atomic accumulate, so those two sim CI checks are skipped for hc_pre;
the a2a3 device path is golden-correct (hc_head takes the same approach).

Prefill: the decode/prefill dispatch means every prefill tile is full, so the
matmul reads x_flat directly in static LINEAR_T_TILE tiles and the old BF16
16-row pad scratch (a ~35us redundant x_flat->x_matmul copy) is removed:
prefill ~143us -> ~85us. Guarded by an assert that the prefill token count
tiles evenly by LINEAR_T_TILE (the clean dynamic-valid_shape form is ptoas
blocked in the mixed cube+vec kernel).

a2a3 golden re-validated both modes (decode B4S2, prefill B1S128).
Ruff B007 (CodeRabbit review); the comb_sinkhorn pl.pipeline counter is
intentionally unused. No behavior change.
…rop moot prefill assert

Rebased onto hw-native-sys#653, which removed the M-axis pad from the fused hc_pre via
valid_shape+fillpad. The prefill path now uses that (the conflict resolution),
so the PREFILL % LINEAR_T_TILE assert (needed only by the static-slice variant)
is unnecessary. Refresh the docstring: vs the pad-free fused baseline, decode
~75us -> ~68us (split-K parallelizes the 1-cube matmul); prefill ~unchanged
(~87us, same fused path hw-native-sys#653 already optimized).
@Hzfengsy Hzfengsy force-pushed the perf/dsv4-hc-pre-splitk branch from 0128859 to 7cbfae9 Compare June 30, 2026 14:22
@Hzfengsy Hzfengsy closed this Jun 30, 2026
@Hzfengsy

Copy link
Copy Markdown
Member Author

Close as 653 applies the same optimization

zhangqi-chen pushed a commit that referenced this pull request Jul 1, 2026
## Summary
- Builds on #652 and tunes `hc_pre` for the MoE inline path: raises
decode split-K fanout, widens fused prefill `D_TILE`, and computes
`post` before `pre/mix_x` to shorten Vec live ranges.
- Same-card MoE AICore end-to-end profiling over three runs improved
from baseline 676.61 us mean to 655.87 us mean, a 3.07% reduction.
- Golden validation passed for the measured MoE runs; `ruff check
--config ruff.toml models/deepseek/v4/hc_pre.py` and `python
tests/lint/check_english_only.py` passed locally.

## Related Issues
Depends on #652.

---------

Co-authored-by: Siyuan Feng <25500082+Hzfengsy@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants