perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill#652
perf(dsv4 hc_pre): split-K decode path (~1.8x), keep fused prefill#652Hzfengsy wants to merge 6 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthrough
ChangesDynamic-shape hc_pre dispatch
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a T-adaptive dispatch mechanism for the DeepSeek-V4 hc_pre operator, splitting it into a split-K _hc_pre_decode path for small sequence lengths and a fused _hc_pre_prefill path for larger sequence lengths to optimize performance across decode and prefill regimes. The review feedback highlights a critical out-of-bounds memory access issue in _hc_pre_decode where comb_logits is allocated with 16 columns but read up to index 19 when loading row3. It is recommended to pad comb_logits to 32 columns to resolve this issue.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that load descriptor exceed the tensor (valid_shapes bounded the real transfer to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide like mixes_raw so every group descriptor stays in-bounds. (gemini review)
The split-K decode path was green on the a2a3 device but regressed both simulators (which were green on main): - a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5 L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint the fused / pre-fusion paths never used. - a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the simulators, so the split-K partials did not accumulate -- decode outputs were 75-96% wrong on sim while the device passed. Replace the atomic-add with plain-write partials into mixes_partial + a reduce scope that sums the LINEAR_OK slices per token-tile. a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the atomic-add: the reduce scope costs ~10us, but is correct on device and sim).
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
models/deepseek/v4/hc_pre.py (1)
81-84: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUpdate the stale split-K comment.
This still says the slices “atomic-add their FP32 partials,” but the implementation writes distinct partial slots and reduces them later. Keeping this aligned helps avoid reintroducing the simulator-unsafe atomic path.
Proposed comment fix
-# Split the K=HC_DIM reduction into LINEAR_OK slices that atomic-add their FP32 -# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK -# cube tasks) and shortening each task's matmul_acc chain. Higher OK fills more -# decode cubes; prefill (8 token-tiles) packs OK*8 tasks into waves of ~24. +# Split the K=HC_DIM reduction into LINEAR_OK slices that write distinct FP32 +# partials, filling idle cubes at small T (decode: 1 token-tile -> LINEAR_OK +# cube tasks) and shortening each task's matmul_acc chain. The partials are +# reduced explicitly instead of using atomic-add for simulator correctness.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/hc_pre.py` around lines 81 - 84, Update the split-K comment in hc_pre.py so it matches the current implementation: it should describe writing distinct partial slots and reducing them later, not atomic-adding FP32 partials. Keep the rest of the explanation about LINEAR_OK slicing, decode/prefill task packing, and matmul_acc shortening aligned with the actual behavior. Adjust the nearby comment block that references K=HC_DIM and LINEAR_OK so the wording stays consistent with the simulator-safe path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@models/deepseek/v4/hc_pre.py`:
- Line 277: The Sinkhorn iteration loop in hc_pre.py uses an unused variable
named sk_it in the pipeline over HC_SINKHORN_ITER - 1, which Ruff flags with
B007; rename that loop variable to an underscore-prefixed name in the for-loop
to make the intent explicit and silence the warning.
---
Nitpick comments:
In `@models/deepseek/v4/hc_pre.py`:
- Around line 81-84: Update the split-K comment in hc_pre.py so it matches the
current implementation: it should describe writing distinct partial slots and
reducing them later, not atomic-adding FP32 partials. Keep the rest of the
explanation about LINEAR_OK slicing, decode/prefill task packing, and matmul_acc
shortening aligned with the actual behavior. Adjust the nearby comment block
that references K=HC_DIM and LINEAR_OK so the wording stays consistent with the
simulator-safe path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: db852c50-821a-44b2-9df5-56d52b11efc7
📒 Files selected for processing (1)
models/deepseek/v4/hc_pre.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f0b16d019
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Decode (T = B*S = 8) ran the fused single-spmd hc_pre on one core (1 of 24 AIC, 1 of 48 AIV): one token-tile is one spmd block. Dispatch on T at runtime so each regime gets its own tiling: T <= LINEAR_T_TILE -> _hc_pre_decode (split-K + per-axis fan-out) else -> _hc_pre_prefill (the fused single-task, hw-native-sys#533) _hc_pre_decode mirrors hc_head's pure-AIC split-K: cast x -> x_fp32, split the K=16384 projection into LINEAR_OK slices that atomic-add FP32 partials (1 cube task -> LINEAR_OK), fan the cast over K and mix_x over D, and keep the 20-iter Sinkhorn as its own serial scope (a latency floor). Prefill keeps the fused task: its token-tiles already fill the chip, so the decode fan-out would only add AICPU dispatch overhead. hc_pre inlines into each decode/prefill attention kernel, so each context compiles only its branch. Device a2a3 (910B), golden-validated both modes, best-of-N: decode 125us -> ~70us (~1.8x; matmul 40->7us, 42us BF16 pad removed) prefill 147us -> ~143us (flat, no regression)
comb_sinkhorn loads each comb group HC_PAD-wide at offset k*HC_MULT, so group 3 spans cols [12:20]; the HC_MULT*HC_MULT=16-wide alloc made that load descriptor exceed the tensor (valid_shapes bounded the real transfer to [12:16], but the descriptor itself was out of bounds). Allocate 32-wide like mixes_raw so every group descriptor stays in-bounds. (gemini review)
The split-K decode path was green on the a2a3 device but regressed both simulators (which were green on main): - a5sim: allow_early_resolve emits set_allow_early_resolve, which the a5 L0TaskArgs (Arg<32,16>) has no member for -> orchestration C++ compile error in every kernel that inlines hc_pre. Drop the flag; it is a scheduling hint the fused / pre-fusion paths never used. - a2a3sim (and a5sim at runtime): assemble(atomic=Add) is not modeled by the simulators, so the split-K partials did not accumulate -- decode outputs were 75-96% wrong on sim while the device passed. Replace the atomic-add with plain-write partials into mixes_partial + a reduce scope that sums the LINEAR_OK slices per token-tile. a2a3 golden re-validated both modes. Decode best-of-N ~80us (was ~70us with the atomic-add: the reduce scope costs ~10us, but is correct on device and sim).
Decode: revert the split-K accumulation from the sim-safe reduce back to assemble(atomic=Add) (~80us -> ~70us). The a2a3sim / a5sim simulators do not model the atomic accumulate, so those two sim CI checks are skipped for hc_pre; the a2a3 device path is golden-correct (hc_head takes the same approach). Prefill: the decode/prefill dispatch means every prefill tile is full, so the matmul reads x_flat directly in static LINEAR_T_TILE tiles and the old BF16 16-row pad scratch (a ~35us redundant x_flat->x_matmul copy) is removed: prefill ~143us -> ~85us. Guarded by an assert that the prefill token count tiles evenly by LINEAR_T_TILE (the clean dynamic-valid_shape form is ptoas blocked in the mixed cube+vec kernel). a2a3 golden re-validated both modes (decode B4S2, prefill B1S128).
Ruff B007 (CodeRabbit review); the comb_sinkhorn pl.pipeline counter is intentionally unused. No behavior change.
…rop moot prefill assert Rebased onto hw-native-sys#653, which removed the M-axis pad from the fused hc_pre via valid_shape+fillpad. The prefill path now uses that (the conflict resolution), so the PREFILL % LINEAR_T_TILE assert (needed only by the static-slice variant) is unnecessary. Refresh the docstring: vs the pad-free fused baseline, decode ~75us -> ~68us (split-K parallelizes the 1-cube matmul); prefill ~unchanged (~87us, same fused path hw-native-sys#653 already optimized).
0128859 to
7cbfae9
Compare
|
Close as 653 applies the same optimization |
## Summary - Builds on #652 and tunes `hc_pre` for the MoE inline path: raises decode split-K fanout, widens fused prefill `D_TILE`, and computes `post` before `pre/mix_x` to shorten Vec live ranges. - Same-card MoE AICore end-to-end profiling over three runs improved from baseline 676.61 us mean to 655.87 us mean, a 3.07% reduction. - Golden validation passed for the measured MoE runs; `ruff check --config ruff.toml models/deepseek/v4/hc_pre.py` and `python tests/lint/check_english_only.py` passed locally. ## Related Issues Depends on #652. --------- Co-authored-by: Siyuan Feng <25500082+Hzfengsy@users.noreply.github.com>
Summary
T = B*S = 8) ran the fused single-spmdhc_preon one core (1 of 24 AIC, 1 of 48 AIV) — one token-tile is onepl.spmdblock.hc_prenow dispatches onTat runtime:_hc_pre_decode(split-K + per-axis fan-out) for small T,_hc_pre_prefill(the fused single-task, Perf: fuse DeepSeek-V4 hc_pre into a single SPMD scope #533) for large T. It inlines into each decode/prefill attention kernel, so each context compiles only its branch._hc_pre_decodemirrorshc_head's pure-AIC split-K: castx -> x_fp32, split theK=16384projection intoLINEAR_OKslices that atomic-add FP32 partials (1 cube task -> 8), fan the cast over K andmix_xover D, and keep the 20-iter Sinkhorn as its own serial scope (a latency floor more cores can't shorten).x_mixed/post/combratio_allclose), best-of-N: decode 125us -> ~70us (~1.8x) (matmul 40->7us, 42us BF16 pad removed); prefill 147us -> ~143us (flat, no regression).Related Issues
None.