[feat] Add Qwen3 MoE true-on-policy parity by maocheng23 · Pull Request #30 · radixark/Megatron-LM

maocheng23 · 2026-05-01T07:48:43Z

Summary

Adds the Qwen3-MoE true-on-policy Megatron implementation on top of the dense true-on-policy stack.

This PR is stacked on feat/true_on_policy_qwen_dense; the intended review surface is now the direct SGLang MoE path plus the small dense-stack compatibility fixes needed for MoE parity.

This is one of three tightly coupled MoE PRs that should land together because they share the qwen3_moe_true_on_policy_v1 contract.

Companion PRs, must land in lockstep:

SGLang fork-side review: [feat] Init true on policy with qwen_moe maocheng23/sglang#3
Miles: [feat] Init true on policy with qwen_moe miles#1059

Stacked on the dense PRs, which must land first:

Megatron-LM dense: [feat] Init true on policy with qwen_dense #29
SGLang dense: [feat] Init true on policy with qwen_dense sgl-project/sglang#23961
Miles dense: [feat] Init true on policy with qwen_dense miles#1052

Target

Bit-identical logprob parity between SGLang rollout and Megatron training for scored response tokens on Qwen3-30B-A3B MoE, with differentiable Megatron backward.

Previously validated on H200 x8 at Megatron TP=1/EP=4/CP=2/PP=1 with two SGLang engines at TP=4/EP=4: train_rollout_logprob_abs_diff = 0.0 for the checked steps, with non-zero gradients across attention, embeddings, layernorms, MoE experts, MoE routers, and the output layer.
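The parity target can be checked mechanically. The sketch below is a minimal illustration, not the check used by this PR: `logprob_parity` and `float_bits` are hypothetical names, plain Python lists stand in for tensors, float32 storage is assumed, and the max reduction is an assumption (the PR names the metric train_rollout_logprob_abs_diff but does not state how it is reduced).

```python
import struct


def float_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def logprob_parity(rollout, train, mask):
    """Compare per-token logprobs on scored (mask != 0) response tokens.

    Returns (max_abs_diff, bit_identical). Hypothetical helper for
    illustration; the reduction used by the real metric is not stated
    in the PR.
    """
    if not (len(rollout) == len(train) == len(mask)):
        raise ValueError("rollout, train and mask must have equal length")
    scored = [(r, t) for r, t, m in zip(rollout, train, mask) if m]
    max_abs_diff = max((abs(r - t) for r, t in scored), default=0.0)
    bit_identical = all(float_bits(r) == float_bits(t) for r, t in scored)
    return max_abs_diff, bit_identical


# The third token differs but is not scored, so parity still holds.
rollout_lp = [-0.25, -0.5, -1.0]
train_lp = [-0.25, -0.5, -1.5]
print(logprob_parity(rollout_lp, train_lp, [1, 1, 0]))
print(logprob_parity(rollout_lp, train_lp, [1, 1, 1]))
```

Comparing bit patterns is stricter than comparing the absolute difference: it also distinguishes `0.0` from `-0.0`, which an abs-diff of zero would not.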

Dense-To-MoE Delta

The dense implementation remains the base contract. This PR now keeps the MoE delta narrow:

Direct MoE forward hook: MoELayer calls the true-on-policy extension at the top of the route phase. When the qwen3_moe_true_on_policy_v1 contract requires qwen3_moe_sglang_math, the extension uses the direct SGLang EP path or raises; unsupported layouts no longer silently fall back based on helper/function existence or EP-size-derived predicates.
Router/top-k parity: the extension computes the simple Qwen3 route locally with SGLang stable_topk_softmax, using Megatron router weights but without modifying Megatron's generic router implementation.
Expert parity: no-grad/reference uses SGLang fused_experts; grad-enabled training uses the same SGLang forward wrapped by a local PyTorch autograd function and Triton backward kernels.
EP parity: local-masked expert ids plus fixed-tree EP reduction mirror SGLang's expert-parallel behavior.
Dense carry-over: matmul, RMSNorm, and residual/checkpointing behavior stay aligned with the dense stack.
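The router and EP items above can be illustrated with a small pure-Python sketch. It is not SGLang's stable_topk_softmax: the function names, the tie-breaking rule (lower expert index wins), the softmax-then-top-k-then-renormalize order, and the `-1` sentinel for non-local experts are all assumptions made for illustration.

```python
import math


def stable_topk_softmax_sketch(logits, k, renormalize=True):
    """Softmax over all experts, then select top-k with deterministic ties.

    Assumption: ties are broken toward the lower expert index, so the
    selection does not depend on the order in which experts are visited.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort by descending probability, then ascending index for ties.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    ids = order[:k]
    weights = [probs[i] for i in ids]
    if renormalize:
        s = sum(weights)
        weights = [w / s for w in weights]
    return ids, weights


def mask_to_local(ids, rank, experts_per_rank):
    """Map global expert ids to local ids; experts on other ranks become -1."""
    lo = rank * experts_per_rank
    hi = lo + experts_per_rank
    return [i - lo if lo <= i < hi else -1 for i in ids]


ids, weights = stable_topk_softmax_sketch([1.0, 3.0, 3.0, 0.0], k=2)
print(ids, weights)
print(mask_to_local(ids, rank=0, experts_per_rank=2))
print(mask_to_local(ids, rank=1, experts_per_rank=2))
```

With local masking, each EP rank sees the full routing decision but only computes the experts it owns; the per-rank partial outputs are then combined by the EP reduction.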

Main Files

miles_megatron_plugins/true_on_policy/moe_layer_ext.py
- Policy-driven direct SGLang local-masked EP path, global padded EP gather, stable top-k routing, padding compaction, and EP reduction.
miles_megatron_plugins/true_on_policy/moe_experts.py
- Provides SGLang expert weight layout helpers and the no-grad local-masked fused_experts call.
megatron/core/transformer/moe/sgl_fused_moe/
- PyTorch autograd wrapper and Triton kernels for the differentiable SGLang fused MoE training path.
megatron/core/transformer/moe/moe_layer.py
- Minimal top-level hook into the true-on-policy extension; generic router, dispatcher, and combine code are left on the dense base behavior.
megatron/core/models/gpt/gpt_layer_specs.py
- Splits dense vs MoE sharded-state key remapping so dense MLP checkpoint assumptions are not applied to MoE layers.
megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py
- Refreshes stale CPU copies for offloaded params when weights are loaded/refreshed.
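The EP reduction handled in moe_layer_ext.py matters for exact parity because floating-point addition is not associative. The sketch below shows the general idea of a fixed-tree reduction under stated assumptions: `fixed_tree_sum`, `reduce_by_rank`, and `sequential_sum` are hypothetical names, scalars stand in for per-rank partial outputs, and the pairwise tree shape is an assumption rather than the shape used by SGLang or this PR.

```python
def sequential_sum(vals):
    """Left-to-right accumulation; the result depends on the input order."""
    acc = 0.0
    for v in vals:
        acc += v
    return acc


def fixed_tree_sum(partials):
    """Sum partials along a fixed pairwise tree over their list position."""
    vals = list(partials)
    if not vals:
        raise ValueError("need at least one partial")
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # odd element is carried up unchanged
        vals = nxt
    return vals[0]


def reduce_by_rank(arrivals):
    """arrivals: iterable of (rank, partial) pairs in any arrival order."""
    ordered = [p for _, p in sorted(arrivals, key=lambda rp: rp[0])]
    return fixed_tree_sum(ordered)


# Accumulating in arrival order gives order-dependent results.
print(sequential_sum([1e16, 1.0, -1e16, 1.0]))
print(sequential_sum([1e16, -1e16, 1.0, 1.0]))

# Ordering by rank and using a fixed tree gives one result for any arrival order.
print(reduce_by_rank([(2, -1e16), (0, 1e16), (3, 1.0), (1, 1.0)]))
print(reduce_by_rank([(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]))
```

The fixed tree is not more accurate than other orders; its value is that both sides of the parity check can reproduce the same rounding sequence.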

Removed During Cleanup

Removed the straight-through path that computed SGLang exact output and Megatron MoE output in the same forward.
Removed the native Megatron GroupedMLP fallback dtype patch.
Removed generic router.py, moe_utils.py, and token_dispatcher.py parity edits; the direct path no longer relies on those fallback routes.
Removed function-existence probing from the direct true-on-policy MoE path; required SGLang wiring now fails loudly if missing.
Removed unused ordered-combine and rollout-context helper modules.

Validation

Remote H200 x8 validation from recovery/qwen3_moe_clean/journal/2026-04-30-moe-onpolicy-normal-validation.md:

Mode	Step	rollout_logp	train_logp	grad_norm
Full deterministic	0	-0.2502	-0.2502	0.0342
Full deterministic	1	-0.2335	-0.2335	0.0465
Fast decode, no fusion	0	-0.2452	-0.2452	0.0391
Fast decode, no fusion	1	-0.2350	-0.2350	0.0318
Fast decode, no fusion	2	-0.2467	-0.2467	0.0459

Current local smoke checks for the latest PR updates:

git diff --check
PYENV_VERSION=system python3 -m py_compile ... on changed Python files
PR-specific debug dump plumbing removed from the reviewable path

CPU/unit coverage:

tests/unit_tests/extension/test_sglang_extension.py
tests/unit_tests/extension/test_sglang_moe_fast_topk_route.py

Known Constraints

Router aux-loss and z-loss objectives are explicitly rejected by the direct SGLang MoE path. Keep those objectives disabled for this contract until ownership/scaling support is added.
The fused MoE backward path currently assumes the default Triton tile config used by this path. Kernel-config overrides should get a dedicated parity check before being treated as supported.
The contract still requires MoE permute fusion disabled.
DeepEP and broader TP+EP+SP layouts beyond the validated TP=1/EP=4/CP=2 setup remain out of scope for this PR.
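These constraints lend themselves to a fail-loud gate that runs before the direct path is used. The following is a minimal sketch, not the PR's gate: `MoEParityConfig`, its field names, and `check_direct_moe_contract` are hypothetical and do not correspond to confirmed Megatron or Miles options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEParityConfig:
    # All field names are hypothetical, chosen for this sketch only.
    aux_loss_coeff: float = 0.0
    z_loss_coeff: float = 0.0
    permute_fusion: bool = False
    use_deepep: bool = False


def check_direct_moe_contract(cfg: MoEParityConfig) -> None:
    """Raise ValueError if cfg violates a constraint listed above."""
    problems = []
    if cfg.aux_loss_coeff != 0.0:
        problems.append("router aux-loss must be disabled")
    if cfg.z_loss_coeff != 0.0:
        problems.append("router z-loss must be disabled")
    if cfg.permute_fusion:
        problems.append("MoE permute fusion must be disabled")
    if cfg.use_deepep:
        problems.append("DeepEP is out of scope")
    if problems:
        # Fail loudly instead of silently falling back to another path.
        raise ValueError("qwen3_moe_true_on_policy_v1: " + "; ".join(problems))


check_direct_moe_contract(MoEParityConfig())  # passes silently

try:
    check_direct_moe_contract(MoEParityConfig(permute_fusion=True, z_loss_coeff=1e-3))
except ValueError as err:
    print(err)
```

Collecting every violation before raising lets one failed launch report all misconfigured options at once.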

Test Plan

Local source sanity: whitespace diff check and Python compile for changed Python files.
GPU exact-zero E2E gate at TP=1/EP=4/CP=2, full deterministic path, validated locally.
GPU exact-zero E2E gate at TP=1/EP=4/CP=2, fast decode with no permute fusion, validated locally.
CPU unit tests pass in CI.
GPU E2E replay in CI or reviewer-owned environment.
Longer 100-step on/off-policy comparison run.

Squash merge of the dense true-on-policy Megatron branch.

Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn>
Co-authored-by: zyxiyy02 <282300612+zyxiyy02@users.noreply.github.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


- forward.py: single canonical sglang_moe_forward (102 lines)
- autograd.py: pure backward wrapper calling shared forward (182 lines)
- moe_experts.py: weight-layout adapter only, no forward logic (36 lines)
- moe_layer_ext.py: one linear orchestration path (301 lines)
- Remove verbose RuntimeError guards, replace with asserts
- Remove weight caching (premature optimization)
- Consolidate two parallel forward paths into one

Net: -574 lines, structurally impossible for forward paths to diverge.

Co-authored-by: Cursor <cursoragent@cursor.com>

This was referenced May 1, 2026

[feat] Init true on policy with qwen_moe radixark/miles#1059

Draft

[feat] Init true on policy with qwen_moe maocheng23/sglang#3

Draft

maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from c63c77f to 9c390a1 Compare May 5, 2026 01:51

maocheng23 mentioned this pull request May 7, 2026

[feat] Add SP and PP kernels for qwen_moe true on policy #31

Open

2 tasks

maocheng23 force-pushed the feat/true_on_policy_qwen_dense branch 2 times, most recently from 9546575 to 57258c8 Compare May 18, 2026 06:15

maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch 3 times, most recently from ab103f8 to 9c6c2a9 Compare May 19, 2026 18:44

maocheng23 changed the title from "[feat] Init true on policy with qwen_moe" to "[feat] Add Qwen3 MoE true-on-policy parity" May 21, 2026

maocheng23 and others added 19 commits May 22, 2026 19:02

Improve MoE true-on-policy training path

995fad7

Remove MoE straight-through fallback

e5ab72a

Drop GroupedMLP fallback dtype patch

ef0fd3e

Prune unused MoE parity fallback paths

6a88277

Make true-on-policy MoE hook policy driven

ea3ce92

Clarify true-on-policy MoE dispatch gate

666ad16

Align MoE direct predicate with true-on-policy contract

24088bf

Simplify MoE true-on-policy mode gate

cfe1c38

Drive MoE kernel gate from true-on-policy contract

c32f169

Align MoE contract gate with direct SGLang path

a143c75

Remove true-on-policy debug dump plumbing

dcf88ac

Keep GRPO loss unchanged for MoE parity

5954723

Remove batch-invariant escape hatch

1ef3a0e

Share SGLang MoE forward path

ae93354

maocheng23 and others added 3 commits May 22, 2026 19:02

Fix SGLang MoE weight adapter init

1e562d7

Update SGLang fused MoE imports

b0679e2

maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from 7ac2a80 to b0679e2 Compare May 23, 2026 02:47

maocheng23 wants to merge 23 commits into feat/true_on_policy_qwen_dense from feat/true_on_policy_qwen_moe
