dsv4 MoE: gate perf + drop EP=1, inline dispatch/combine by zhangqi-chen · Pull Request #660 · hw-native-sys/pypto-lib

zhangqi-chen · 2026-07-01T07:35:35Z

Summary

perf(dsv4 gate): route the gate over pl.spmd, fan the gate matmul across experts (1 core/block), read bf16 x_norm with QUANT_TILE=256, and widen the ffn-norm D_TILE to 256.
refactor(MoE): remove dispatch.py and combine.py; move the distributed dispatch/combine definitions into moe.py (N_LOCAL_EXPERTS/EP_WORLD_SIZE renamed to moe's N_LOCAL/N_RANKS; X_STAGE_ROWS constant added).
Drop the single-card EP=1 path (moe_ep1, moe_ep1_test, golden_moe_ep1, build_tensor_specs_ep1, the *_ep1 dispatch/combine variants); --ep choices are now (2, 4, 8) and __main__ only runs the distributed l3_moe path.
Remove the now-dead EP_ROUTING_GLOBAL flag: gate always routes over the full global expert set (N_EXPERTS = M.n_routed_experts), which moe shrinks to 32*EP before import, so the distributed path is bit-for-bit unchanged.

Related Issues

N/A

coderabbitai · 2026-07-01T07:35:58Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1c47527-1410-4ec2-b5af-145b5bf2cd9d


Walkthrough

This PR removes the single-card DeepSeek-V4 MoE EP=1 path entirely (deleting dispatch.py, combine.py, and the moe_ep1-related code), restricts the EP world size to {2, 4, 8}, and introduces an in-file distributed dispatch/combine implementation in moe.py. Separately, gate.py's tiling constants, gate matmul logic, and CLI options are refactored, and EP_ROUTING_GLOBAL is removed from config.py.

Changes

EP1 removal and distributed MoE dispatch/combine

Layer / File(s)	Summary
Config flag removal `models/deepseek/v4/config.py`	`EP_ROUTING_GLOBAL` flag and its comment are removed.
MoE EP world size restriction and setup `models/deepseek/v4/moe.py`	`--ep` choices restricted to `(2,4,8)`, `EP_WORLD_SIZE` set directly, EP1 dispatch/combine imports removed, new distributed protocol constants and preconditions added.
Distributed dispatch implementation `models/deepseek/v4/moe.py`	New in-file `dispatch()` builds routing histograms, uses epoch-scoped notify/wait barriers, computes prefix offsets, and remote-stores payloads into receive windows.
Distributed combine implementation `models/deepseek/v4/moe.py`	New in-file `combine()` remote-puts rows into `routed_y_buf` keyed by `r_route`, synchronizes via barrier, and reduces FP32 accumulation into `ffn_out`; related comments updated.
EP1 path removal and run_jit wiring `models/deepseek/v4/moe.py`, `models/deepseek/v4/combine.py`	Deletes `combine.py` module and `moe_ep1`/`moe_ep1_test`/golden/fixture helpers; replaces `EP==1` branching with a single `run_jit` call to `l3_moe`.

Estimated code review effort: 4 (Complex) | ~60 minutes
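
The histogram and prefix-offset steps summarized in the dispatch row can be pictured with a small host-side simulation. This is a plain-Python sketch, not the PyPTO kernel: the function names, the toy sizes, and the rule that rows from lower-numbered source ranks come first in each bucket are assumptions for illustration. Only the `eid // N_LOCAL` split into destination rank and local expert, and the per-expert receive count summed over source ranks, follow the diff quoted later in this PR.

```python
N_RANKS = 2   # toy EP world size
N_LOCAL = 4   # toy number of experts per rank


def build_histogram(indices):
    """Count one rank's routes per (destination rank, local expert) bucket."""
    hist = [0] * (N_RANKS * N_LOCAL)
    for token_experts in indices:
        for eid in token_experts:
            dst = eid // N_LOCAL
            loc_e = eid - dst * N_LOCAL
            hist[dst * N_LOCAL + loc_e] += 1
    return hist


def recv_counts(all_hists, my_rank):
    """Rows each local expert on my_rank receives, summed over source ranks."""
    return [
        sum(all_hists[s][my_rank * N_LOCAL + e] for s in range(N_RANKS))
        for e in range(N_LOCAL)
    ]


def slot_offsets(all_hists, src_rank):
    """Assumed rule: rows from lower-numbered source ranks fill a bucket first."""
    return [
        sum(all_hists[s][b] for s in range(src_rank))
        for b in range(N_RANKS * N_LOCAL)
    ]


# Two ranks, two tokens each, TOPK = 2, global expert ids 0..7.
indices_per_rank = [
    [[0, 5], [0, 6]],   # routes chosen on rank 0
    [[0, 1], [5, 7]],   # routes chosen on rank 1
]
hists = [build_histogram(ix) for ix in indices_per_rank]
print(recv_counts(hists, 0))    # [3, 1, 0, 0]
print(slot_offsets(hists, 1))   # [2, 0, 0, 0, 0, 1, 1, 0]
```

In this toy run rank 1 starts writing at slot 2 of expert 0 on rank 0, because rank 0 already owns slots 0 and 1 of that bucket. Every rank needs the complete counts before it can compute these offsets, which is consistent with the count exchange being fenced by a barrier before any payload is stored.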

Gate routing refactor

Layer / File(s)	Summary
Gate config and tiling constants `models/deepseek/v4/gate.py`	Removes `EP_ROUTING_GLOBAL`-conditional `N_EXPERTS`; updates `D_TILE`, `QUANT_TILE`, adds `GATE_N_TILE` with divisibility assertion.
Gate compute body refactor and CLI options `models/deepseek/v4/gate.py`	Restructures RMSNorm/quantization/matmul paths using `pl.spmd` and tiled expert-score buffers with padded-column handling; updates `--layer-id` and `--enable-l2-swimlane` CLI defaults.

Estimated code review effort: 4 (Complex) | ~45 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Rank as dispatch/combine (rank)
  participant Peer as Remote rank
  participant Window as Receive window / routed_y_buf

  Rank->>Rank: build routing histogram
  Rank->>Peer: notify(count_done, exact counts)
  Rank->>Peer: wait(count_done)
  Rank->>Window: remote-store x/scale/weights/r_route
  Rank->>Peer: notify(data_done)
  Window->>Rank: copy rows into output tensors

  Rank->>Window: remote-put recv_y keyed by r_route
  Rank->>Peer: notify(combine_done)
  Rank->>Peer: wait(combine_done)
  Window->>Rank: reduce TOPK contributions (FP32)
  Rank->>Rank: write BF16 ffn_out
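
The last two steps of the diagram (reduce the TOPK contributions in FP32, then write BF16) can be sketched on the host as follows. This is an illustrative plain-Python model, not the kernel: the buffer layout `t * TOPK + k`, the helper names, and the assumption that rows arrive already weighted are all hypothetical, and the rounding helper only handles finite values.

```python
import struct

TOPK = 2


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float32 value to bfloat16 (ties to even), as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def combine(routed_y_buf, n_tokens, width):
    """Sum each token's TOPK rows in FP32, then round the result to BF16."""
    out = []
    for t in range(n_tokens):
        acc = [0.0] * width
        for k in range(TOPK):
            row = routed_y_buf[t * TOPK + k]   # assumed layout of the route key
            acc = [to_f32(a + v) for a, v in zip(acc, row)]
        out.append([to_bf16(a) for a in acc])
    return out


buf = [[1.0, 0.5], [2.0, 0.25], [0.125, 3.0], [0.125, -3.0]]
print(combine(buf, n_tokens=2, width=2))   # [[3.0, 0.75], [0.25, 0.0]]
```

Accumulating in FP32 and rounding once at the end avoids compounding a BF16 rounding error on every partial sum, which is presumably why the reduction is described as FP32 with only the final `ffn_out` in BF16.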

Possibly related PRs

hw-native-sys/pypto-lib#280: Refactors the same MoE combine logic being removed here into a split moe_dispatch/moe_combine module.
hw-native-sys/pypto-lib#496: Uses distributed EP2 combine composed with moe_ep, directly coupled to the dispatch/combine rework in moe.py.
hw-native-sys/pypto-lib#524: Modifies the same gate.py tiling/quantization constants and matmul tiling logic touched here.

Suggested labels: enhancement


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: gate performance work, removing EP=1, and inlining dispatch/combine.
Description check	✅ Passed	The description matches the changeset and accurately describes the gate refactor, dispatch/combine move, and EP=1 removal.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request removes the legacy single-card (EP=1) execution path and its associated standalone files (combine.py and dispatch.py), refactoring the DeepSeek-V4 MoE implementation to focus entirely on multi-rank distributed execution. The dispatch and combine kernels are now inlined directly into moe.py, and the gate.py module has been optimized by transitioning from pl.parallel to pl.spmd loops and parallelizing the gate matmul over expert columns. Feedback on the changes suggests further optimizing the gate matmul loop in gate.py by replacing pl.range with pl.pipeline to enable pipelining of global memory loads on Ascend hardware.


gemini-code-assist · 2026-07-01T07:38:14Z

+        for kb in pl.range(0, D // GATE_D_TILE):
+            gd_kd = kb * GATE_D_TILE
+            gd_x = x_norm_gate_buf[t1 : t1 + GATE_M_TILE, gd_kd : gd_kd + GATE_D_TILE]
+            gd_w = gate_w[n0 : n0 + GATE_N_TILE, gd_kd : gd_kd + GATE_D_TILE]
+            if gd_kd == 0:
+                gate_logits_tile = pl.matmul(gd_x, gd_w, out_dtype=pl.FP32, b_trans=True)
+            else:
+                gate_logits_tile = pl.matmul_acc(gate_logits_tile, gd_x, gd_w, b_trans=True)


The gate matmul loop currently uses pl.range to iterate over the hidden dimension chunks. On CANN/Ascend hardware, using pl.pipeline with a conditional check inside the loop is preferred over a standard range loop. This allows the compiler to pipeline the global memory loads (gd_x and gd_w) and overlap them with the matrix multiplication computation, significantly improving performance on the critical path.

Suggested change

-        for kb in pl.range(0, D // GATE_D_TILE):
-            gd_kd = kb * GATE_D_TILE
-            gd_x = x_norm_gate_buf[t1 : t1 + GATE_M_TILE, gd_kd : gd_kd + GATE_D_TILE]
-            gd_w = gate_w[n0 : n0 + GATE_N_TILE, gd_kd : gd_kd + GATE_D_TILE]
-            if gd_kd == 0:
-                gate_logits_tile = pl.matmul(gd_x, gd_w, out_dtype=pl.FP32, b_trans=True)
-            else:
-                gate_logits_tile = pl.matmul_acc(gate_logits_tile, gd_x, gd_w, b_trans=True)
+        for gd_kd in pl.pipeline(0, D, GATE_D_TILE, stage=2):
+            gd_x = x_norm_gate_buf[t1 : t1 + GATE_M_TILE, gd_kd : gd_kd + GATE_D_TILE]
+            gd_w = gate_w[n0 : n0 + GATE_N_TILE, gd_kd : gd_kd + GATE_D_TILE]
+            if gd_kd == 0:
+                gate_logits_tile = pl.matmul(gd_x, gd_w, out_dtype=pl.FP32, b_trans=True)
+            else:
+                gate_logits_tile = pl.matmul_acc(gate_logits_tile, gd_x, gd_w, b_trans=True)

References

In PyPTO on CANN/Ascend hardware, keeping a conditional check (e.g., if db == 0) inside a pl.pipeline loop can be preferred over peeling the first iteration. This allows the first chunk's load to overlap with the rest of the pipeline rather than running as an un-pipelined prologue, provided the compiler successfully pipelines the loop-index branch.
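
Independent of whether the loop is pipelined, the arithmetic in the loop under review is a K-tiled product: the first tile initializes the accumulator and every later tile adds to it. The plain-Python reference below is a sketch with hypothetical helper names and toy shapes. It only checks that the in-loop `k0 == 0` branch gives the same result as the untiled `x @ w.T`, and says nothing about scheduling or performance.

```python
def matmul_bt(x, w):
    """Return x @ w.T for row-major nested lists (the b_trans=True convention)."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


def tiled_matmul_bt(x, w, d_tile):
    """Accumulate x @ w.T over chunks of the shared dimension."""
    d = len(x[0])
    assert d % d_tile == 0, "D must be a multiple of the tile width"
    acc = None
    for k0 in range(0, d, d_tile):
        x_tile = [row[k0:k0 + d_tile] for row in x]
        w_tile = [row[k0:k0 + d_tile] for row in w]
        part = matmul_bt(x_tile, w_tile)
        if k0 == 0:
            acc = part   # first chunk initializes the accumulator
        else:
            acc = [[a + p for a, p in zip(ar, pr)] for ar, pr in zip(acc, part)]
    return acc


x = [[1, 2, 3, 4], [0, 1, 0, 1]]                   # 2 tokens, D = 4
w = [[1, 0, 1, 0], [2, 2, 2, 2], [0, 0, 0, 1]]     # 3 experts, D = 4
print(tiled_matmul_bt(x, w, d_tile=2))             # [[4, 20, 4], [0, 4, 1]]
```

Because the loop index steps by the tile width in the suggested form, the `gd_kd == 0` test and the `kb == 0` test of the original select the same iteration, so the two loops compute identical logits.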

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/moe.py`:
- Around line 178-206: The dynamic routing in moe.py uses `RECV_MAX` as a
per-expert slot stride, but `payload_push` can still compute a `slot` from
`cursor[bucket] + my_slot_at_dst[bucket]` that exceeds the allocated capacity.
Add an explicit capacity check or hard fail in the `payload_push` loop before
computing `row` so `slot` never crosses `RECV_MAX`, and apply the same guard in
the corresponding remote store/copy path referenced by the same routing logic.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8b087a29-a20d-46c0-ace3-d4e47ddb406f

📥 Commits

Reviewing files that changed from the base of the PR and between b92fe93 and 561da20.

📒 Files selected for processing (5)

models/deepseek/v4/combine.py
models/deepseek/v4/config.py
models/deepseek/v4/dispatch.py
models/deepseek/v4/gate.py
models/deepseek/v4/moe.py

💤 Files with no reviewable changes (3)

models/deepseek/v4/dispatch.py
models/deepseek/v4/combine.py
models/deepseek/v4/config.py

coderabbitai · 2026-07-01T07:42:59Z

+        for e in pl.range(N_LOCAL):
+            acc = pl.const(0, pl.INT32)
+            for s in pl.range(N_RANKS):
+                acc = acc + pl.read(pub_counts, [s * N_RANKS + my_rank, e])
+            pl.write(recv_count_out, [e, 0], acc)
+
+        # ---------- payload_push: 4 channels per (t, k) ----------
+        cursor = pl.array.create(N_RANKS * N_LOCAL, pl.INT32)
+        for d in pl.range(N_RANKS):
+            for e in pl.range(N_LOCAL):
+                cursor[d * N_LOCAL + e] = 0
+
+        # Pad tiles, zero-initialised once; only column 0 is overwritten per
+        # push (UB tile + remote_store is the proven path for runtime-computed
+        # scalars — a GM pack table written by scalar pl.write corrupts).
+        scale_tile = pl.tile.full([1, W_PAD], dtype=pl.FP32, value=0.0)
+        w_tile = pl.tile.full([1, W_PAD], dtype=pl.FP32, value=0.0)
+        idx_tile = pl.tile.full([1, IDX_PAD], dtype=pl.INT32, value=0)
+
+        for t in pl.range(active_tokens):
+            for k in pl.range(TOPK):
+                eid = pl.read(indices, [t, k])
+                dst = eid // N_LOCAL
+                loc_e = eid - dst * N_LOCAL
+                bucket = dst * N_LOCAL + loc_e
+                cur_val = cursor[bucket]
+                slot_off = my_slot_at_dst[bucket]
+                slot = slot_off + cur_val
+                row = loc_e * RECV_MAX + slot


🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

Guard RECV_MAX before using dynamic route counts as slots.

A skewed gate can make acc/slot exceed RECV_MAX; then row = loc_e * RECV_MAX + slot writes into another expert’s region or past the distributed window. Add a hard capacity contract here: provision worst-case capacity, enforce gate capacity, or fail before the remote stores and scalar copies.

Also applies to: 258-264
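
One way to express the requested guard is shown below as a host-side Python model of the slot computation. It is a sketch under stated assumptions, not a patch for moe.py: `RouteOverflow` and `plan_rows` are hypothetical names, the cursor is assumed to advance by one per pushed row, and how a device kernel should report the failure is left open.

```python
N_LOCAL = 4    # toy number of experts per rank
RECV_MAX = 3   # toy per-expert receive capacity


class RouteOverflow(RuntimeError):
    """Raised when a route would land outside its expert's receive region."""


def plan_rows(indices, my_slot_at_dst, n_ranks):
    """Return (dst, row) per route, failing before an out-of-range row is used."""
    cursor = [0] * (n_ranks * N_LOCAL)
    rows = []
    for token_experts in indices:
        for eid in token_experts:
            dst = eid // N_LOCAL
            loc_e = eid - dst * N_LOCAL
            bucket = dst * N_LOCAL + loc_e
            slot = my_slot_at_dst[bucket] + cursor[bucket]
            if slot >= RECV_MAX:   # guard runs before row is computed
                raise RouteOverflow(
                    f"expert {eid}: slot {slot} >= RECV_MAX {RECV_MAX}"
                )
            rows.append((dst, loc_e * RECV_MAX + slot))
            cursor[bucket] += 1   # assumed: one slot consumed per pushed row
    return rows


print(plan_rows([[0, 5], [0, 6]], [0] * 8, n_ranks=2))
# [(0, 0), (1, 3), (0, 1), (1, 6)]
```

With a starting offset of 2 for expert 0, a second route to that expert would need slot 3 and raises `RouteOverflow` instead of writing into expert 1's region, which starts at row `1 * RECV_MAX`.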


…tiles

Convert the four pl.parallel+pl.at loops (ffn_norm / x_norm_quant / gate / route) to pl.spmd. Fan the gate matmul over expert columns (GATE_N_TILE=16) so each spmd block lands on its own cube core; move the [N_EXPERTS,SCORE_PAD) pad-init to a separate gate_pad_init scope since a conditional write to an internal buffer inside the spmd body breaks orch return mapping.

x_norm_quant reads the bf16 x_norm output instead of the fp32 buffer (same values, half the bytes) and widens QUANT_TILE 32->256; ffn_norm widens D_TILE 128->256.

decode_layer a2a3 ep2 (layer 10 / CSA, sort routing) swimlane, per layer: gate 69->~18us, x_norm_quant 43->~12us, ffn_norm 17.7->~11.4us. Standalone gate.py (hash + sort) and decode_layer x_next/kv_cache PASS.

Also: default --layer-id 10 (matches decode_layer's CSA layer) and switch --enable-l2-swimlane to the int 0/1/2 form.

- Remove dispatch.py and combine.py; move the distributed dispatch and combine definitions into moe.py, renaming N_LOCAL_EXPERTS/EP_WORLD_SIZE to moe's N_LOCAL/N_RANKS and adding the X_STAGE_ROWS constant.
- Delete the single-card EP=1 path (moe_ep1, moe_ep1_test, golden_moe_ep1, build_tensor_specs_ep1) and the *_ep1 dispatch/combine variants; --ep choices are now (2, 4, 8) and __main__ only runs the distributed l3_moe path.
- Remove the now-dead EP_ROUTING_GLOBAL flag: gate.py always routes over the full global expert set (N_EXPERTS = M.n_routed_experts), which moe shrinks to 32*EP before import, so the distributed path is unchanged.
- Drop the dead EP_RANK / EXPERTS_START_IDX constants left over from the removed EP=1 dispatch.
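
The two commit messages together fix the sizes that drive the gate fan-out: `N_EXPERTS` is `32*EP` on the distributed path and the gate matmul is split into `GATE_N_TILE=16` expert columns. The arithmetic below is a small sketch assuming one spmd block per expert-column tile; whether token tiling multiplies the block count is not stated in this PR, and `gate_blocks` is a hypothetical helper.

```python
GATE_N_TILE = 16   # value quoted in the commit message


def gate_blocks(ep):
    """Expert-column tiles for a given EP size, assuming N_EXPERTS = 32 * EP."""
    n_experts = 32 * ep
    # Mirrors the divisibility assertion mentioned in the walkthrough.
    assert n_experts % GATE_N_TILE == 0, "N_EXPERTS must be a multiple of GATE_N_TILE"
    return n_experts // GATE_N_TILE


for ep in (2, 4, 8):   # the remaining --ep choices
    print(ep, 32 * ep, gate_blocks(ep))
# 2 64 4
# 4 128 8
# 8 256 16
```

Under that assumption every supported EP size divides evenly, so the divisibility assertion only guards against future changes to the expert count or the tile width.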

gemini-code-assist Bot reviewed Jul 1, 2026

View reviewed changes

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

zhangqi-chen force-pushed the perf-moe branch from 561da20 to 4353460 Compare July 1, 2026 07:46

zhangqi-chen added 2 commits July 1, 2026 15:50

zhangqi-chen force-pushed the perf-moe branch from 4353460 to 1d873a0 Compare July 1, 2026 07:51

zhangqi-chen merged commit 2637cf6 into hw-native-sys:main Jul 1, 2026
5 of 7 checks passed

zhangqi-chen deleted the perf-moe branch July 1, 2026 08:16
