[feat] Add Qwen3 MoE true-on-policy parity #30

Draft
maocheng23 wants to merge 23 commits into feat/true_on_policy_qwen_dense from feat/true_on_policy_qwen_moe

maocheng23 commented May 1, 2026

Summary

Adds the Qwen3-MoE true-on-policy Megatron implementation on top of the dense true-on-policy stack.

This PR is stacked on feat/true_on_policy_qwen_dense; the intended review surface is now the direct SGLang MoE path plus the small dense-stack compatibility fixes needed for MoE parity.

This is one of three tightly-coupled MoE PRs that should land together because they share the qwen3_moe_true_on_policy_v1 contract.

Companion PRs, must land in lockstep:

Stacked on dense, must land first:

Target

Bit-identical logprob parity between SGLang rollout and Megatron training for scored response tokens on Qwen3-30B-A3B MoE, with differentiable Megatron backward.

Previously validated on H200 x8 at Megatron TP=1/EP=4/CP=2/PP=1 with SGLang 2 engines TP=4/EP=4: train_rollout_logprob_abs_diff = 0.0 for the checked steps, with non-zero gradients across attention, embeddings, layernorms, MoE experts, MoE routers, and output layer.

Dense-To-MoE Delta

The dense implementation remains the base contract. This PR now keeps the MoE delta narrow:

  • Direct MoE forward hook: MoELayer calls the true-on-policy extension at the top of the route phase. When the qwen3_moe_true_on_policy_v1 contract requires qwen3_moe_sglang_math, the extension uses the direct SGLang EP path or raises; unsupported layouts no longer silently fall back based on helper/function existence or EP-size-derived predicates.
  • Router/top-k parity: the extension computes the simple Qwen3 route locally with SGLang stable_topk_softmax, using Megatron router weights but without modifying Megatron's generic router implementation.
  • Expert parity: the no-grad reference path uses SGLang fused_experts; grad-enabled training runs the same SGLang forward, wrapped in a local PyTorch autograd function with Triton backward kernels.
  • EP parity: local-masked expert ids plus fixed-tree EP reduction mirror SGLang's expert-parallel behavior.
  • Dense carry-over: matmul, RMSNorm, and residual/checkpointing behavior stay aligned with the dense stack.
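The router/top-k bullet above can be illustrated with a minimal pure-Python sketch. It is not SGLang's stable_topk_softmax: it assumes that "stable" means ties between equal router probabilities resolve to the lowest expert id, and that the selected weights are renormalized to sum to 1. Neither detail is stated in this PR, and the function name `stable_topk_softmax_sketch` is hypothetical.

```python
import math
from typing import List, Tuple


def stable_topk_softmax_sketch(
    logits: List[float], k: int, renormalize: bool = True
) -> Tuple[List[int], List[float]]:
    """Softmax over router logits, then a deterministic top-k selection.

    Assumption (not confirmed by the PR): ties are broken by lowest expert id,
    so two runs over the same logits always pick the same experts.
    """
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort by (-probability, expert id); the second key makes ties deterministic.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    ids = order[:k]
    weights = [probs[i] for i in ids]

    if renormalize:
        # Assumed behavior: rescale the k selected weights to sum to 1.
        s = sum(weights)
        weights = [w / s for w in weights]
    return ids, weights


if __name__ == "__main__":
    ids, weights = stable_topk_softmax_sketch([1.0, 3.0, 3.0, 0.0], k=2)
    print(ids, weights)
```

The point of the tie-break key is that rollout and training must select the same experts for a token; any ordering ambiguity between equal probabilities would break bit-identical parity even if every matmul matched.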

Main Files

  • miles_megatron_plugins/true_on_policy/moe_layer_ext.py
    • Policy-driven direct SGLang local-masked EP path, global padded EP gather, stable top-k routing, padding compaction, and EP reduction.
  • miles_megatron_plugins/true_on_policy/moe_experts.py
    • Provides SGLang expert weight layout helpers and the no-grad local-masked fused_experts call.
  • megatron/core/transformer/moe/sgl_fused_moe/
    • PyTorch autograd wrapper and Triton kernels for the differentiable SGLang fused MoE training path.
  • megatron/core/transformer/moe/moe_layer.py
    • Minimal top-level hook into the true-on-policy extension; generic router, dispatcher, and combine code are left on the dense base behavior.
  • megatron/core/models/gpt/gpt_layer_specs.py
    • Splits dense vs MoE sharded-state key remapping so dense MLP checkpoint assumptions are not applied to MoE layers.
  • megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py
    • Refreshes stale CPU copies for offloaded params when weights are loaded/refreshed.
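The local-masked expert ids and fixed-tree EP reduction named for moe_layer_ext.py can be sketched in plain Python. This is an illustration under assumptions, not the code in this PR: the helper names, the `-1` sentinel for non-local experts, and the pairwise tree shape are all hypothetical. It only shows why a fixed reduction order matters for bit-identical results.

```python
from typing import List


def mask_to_local_experts(
    topk_ids: List[List[int]], ep_rank: int, experts_per_rank: int, sentinel: int = -1
) -> List[List[int]]:
    """Map global expert ids to ids local to one EP rank.

    Experts owned by other ranks are replaced by a sentinel (assumed to be -1
    here) so the local expert kernel can skip them.
    """
    lo = ep_rank * experts_per_rank
    hi = lo + experts_per_rank
    return [[e - lo if lo <= e < hi else sentinel for e in row] for row in topk_ids]


def fixed_tree_sum(partials: List[float]) -> float:
    """Sum per-rank partial outputs in a fixed pairwise order.

    For four ranks the order is always (r0 + r1) + (r2 + r3). Floating-point
    addition is not associative, so fixing the order fixes the result.
    """
    level = list(partials)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element is carried to the next level
        level = nxt
    return level[0]


if __name__ == "__main__":
    print(mask_to_local_experts([[0, 5], [6, 3]], ep_rank=1, experts_per_rank=4))
    parts = [1e16, 1.0, -1e16, 1.0]
    left_to_right = ((parts[0] + parts[1]) + parts[2]) + parts[3]
    print(fixed_tree_sum(parts), left_to_right)
```

With the sample partials, the tree order and a left-to-right order give different floating-point sums for the same inputs, which is the kind of divergence a fixed tree is meant to rule out between rollout and training.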

Removed During Cleanup

  • Removed the straight-through path that computed SGLang exact output and Megatron MoE output in the same forward.
  • Removed the native Megatron GroupedMLP fallback dtype patch.
  • Removed generic router.py, moe_utils.py, and token_dispatcher.py parity edits; the direct path no longer relies on those fallback routes.
  • Removed function-existence probing from the direct true-on-policy MoE path; required SGLang wiring now fails loudly if missing.
  • Removed unused ordered-combine and rollout-context helper modules.

Validation

Remote H200 x8 validation from recovery/qwen3_moe_clean/journal/2026-04-30-moe-onpolicy-normal-validation.md:

| Mode | Step | rollout_logp | train_logp | abs_diff | grad_norm |
|---|---|---|---|---|---|
| Full deterministic | 0 | -0.2502 | -0.2502 | 0.0 | 0.0342 |
| Full deterministic | 1 | -0.2335 | -0.2335 | 0.0 | 0.0465 |
| Fast decode, no fusion | 0 | -0.2452 | -0.2452 | 0.0 | 0.0391 |
| Fast decode, no fusion | 1 | -0.2350 | -0.2350 | 0.0 | 0.0318 |
| Fast decode, no fusion | 2 | -0.2467 | -0.2467 | 0.0 | 0.0459 |
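For readers who want to reproduce the abs_diff column, an exact-zero gate over scored response tokens might look like the sketch below. The function names and the max reduction are assumptions; the PR does not state how train_rollout_logprob_abs_diff is reduced. For an exact-zero check the choice does not matter, since a max of non-negative values is 0.0 exactly when their mean is.

```python
from typing import Sequence


def logprob_abs_diff(
    rollout_logps: Sequence[float],
    train_logps: Sequence[float],
    response_mask: Sequence[int],
) -> float:
    """Largest absolute logprob difference over scored (mask != 0) tokens."""
    if not (len(rollout_logps) == len(train_logps) == len(response_mask)):
        raise ValueError("all inputs must have the same length")
    diffs = [
        abs(r - t)
        for r, t, m in zip(rollout_logps, train_logps, response_mask)
        if m
    ]
    if not diffs:
        raise ValueError("no scored tokens selected by response_mask")
    return max(diffs)


def passes_exact_zero_gate(
    rollout_logps: Sequence[float],
    train_logps: Sequence[float],
    response_mask: Sequence[int],
) -> bool:
    """True only when every scored token matches bit-for-bit (diff == 0.0)."""
    return logprob_abs_diff(rollout_logps, train_logps, response_mask) == 0.0


if __name__ == "__main__":
    rollout = [-0.25, -0.5, -1.0]
    train = [-0.25, -0.5, -0.75]
    # The last token is unscored, so the mismatch there is ignored.
    print(passes_exact_zero_gate(rollout, train, [1, 1, 0]))
```

The comparison is deliberately `== 0.0` rather than a tolerance check, because the stated target is bit-identical parity, not approximate agreement.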

Current local smoke checks for the latest PR updates:

  • git diff --check
  • PYENV_VERSION=system python3 -m py_compile ... on changed Python files
  • PR-specific debug dump plumbing removed from the reviewable path

CPU/unit coverage:

  • tests/unit_tests/extension/test_sglang_extension.py
  • tests/unit_tests/extension/test_sglang_moe_fast_topk_route.py

Known Constraints

  • Router aux-loss and z-loss objectives are explicitly rejected by the direct SGLang MoE path. Keep those objectives disabled for this contract until ownership/scaling support is added.
  • The fused MoE backward path currently assumes the default Triton tile config used by this path. Kernel-config overrides should get a dedicated parity check before being treated as supported.
  • The contract still requires MoE permute fusion disabled.
  • DeepEP and broader TP+EP+SP layouts beyond the validated TP=1/EP=4/CP=2 setup remain out of scope for this PR.
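The constraints above amount to a fail-loud precondition check. A minimal sketch follows, assuming a hypothetical `MoEContractConfig` dataclass; its field names are modeled on Megatron-style options but are illustrative only, and the real checks in this PR may be structured differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MoEContractConfig:
    """Hypothetical config holder; field names are illustrative assumptions."""

    moe_aux_loss_coeff: float = 0.0
    moe_z_loss_coeff: Optional[float] = None
    moe_permute_fusion: bool = False
    use_deepep: bool = False


def check_true_on_policy_contract(cfg: MoEContractConfig) -> None:
    """Raise instead of silently falling back when the contract is violated."""
    problems: List[str] = []
    if cfg.moe_aux_loss_coeff != 0.0:
        problems.append("router aux-loss must be disabled")
    if cfg.moe_z_loss_coeff:
        problems.append("router z-loss must be disabled")
    if cfg.moe_permute_fusion:
        problems.append("MoE permute fusion must be disabled")
    if cfg.use_deepep:
        problems.append("DeepEP is out of scope for this contract")
    if problems:
        raise ValueError("true-on-policy MoE contract violated: " + "; ".join(problems))


if __name__ == "__main__":
    check_true_on_policy_contract(MoEContractConfig())  # passes silently
    try:
        check_true_on_policy_contract(MoEContractConfig(moe_aux_loss_coeff=0.01))
    except ValueError as exc:
        print(exc)
```

Collecting every violation before raising gives one error message that lists all misconfigured options, rather than forcing a fix-and-rerun cycle per option.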

Test Plan

  • Local source sanity: whitespace diff check and Python compile for changed Python files.
  • GPU exact-zero E2E gate at TP=1/EP=4/CP=2, full deterministic path, validated locally.
  • GPU exact-zero E2E gate at TP=1/EP=4/CP=2, fast decode with no permute fusion, validated locally.
  • CPU unit tests pass in CI.
  • GPU E2E replay in CI or reviewer-owned environment.
  • Longer 100-step on/off-policy comparison run.

Squash merge of the dense true-on-policy Megatron branch.

Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn>
Co-authored-by: zyxiyy02 <282300612+zyxiyy02@users.noreply.github.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch 3 times, most recently from ab103f8 to 9c6c2a9, on May 19, 2026 18:44
maocheng23 changed the title from "[feat] Init true on policy with qwen_moe" to "[feat] Add Qwen3 MoE true-on-policy parity" on May 21, 2026
maocheng23 and others added 19 commits May 22, 2026 19:02
Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn>
Co-authored-by: zyxiyy02 <282300612+zyxiyy02@users.noreply.github.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maocheng23 and others added 3 commits May 22, 2026 19:02
- forward.py: single canonical sglang_moe_forward (102 lines)
- autograd.py: pure backward wrapper calling shared forward (182 lines)
- moe_experts.py: weight-layout adapter only, no forward logic (36 lines)
- moe_layer_ext.py: one linear orchestration path (301 lines)
- Remove verbose RuntimeError guards, replace with asserts
- Remove weight caching (premature optimization)
- Consolidate two parallel forward paths into one

Net: -574 lines, structurally impossible for forward paths to diverge.
Co-authored-by: Cursor <cursoragent@cursor.com>
maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from 7ac2a80 to b0679e2 on May 23, 2026 02:47