[WIP] Add bitwise parity test for MoE EP by wwwjn · Pull Request #3172 · pytorch/torchtitan

wwwjn · 2026-04-30T03:46:29Z

Stack from ghstack (oldest at bottom):

Add rl_grpo_qwen3_moe_debug_ep_batch_invariant config and
TestBitwiseParityMoEEP test class for verifying bitwise identity
between trainer prefill and vLLM generator prefill with MoE EP.

Config: TP=4, EP=4 on 4 GPUs, deterministic mode, compile disabled.
Uses /tmp/debug_moe_ckpt by default (override with MOE_HF_ASSETS_PATH).
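
The default-plus-override behavior described above could look roughly like the sketch below. The helper name `resolve_moe_assets_path` and its `env` parameter are hypothetical and not taken from the PR; only the default path and the `MOE_HF_ASSETS_PATH` variable come from the description.

```python
import os

# Default checkpoint location named in the PR description.
DEFAULT_MOE_CKPT = "/tmp/debug_moe_ckpt"


def resolve_moe_assets_path(env=None):
    """Return the checkpoint path, preferring MOE_HF_ASSETS_PATH when set.

    `env` is a hypothetical injection point (defaults to os.environ) so the
    lookup can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    return env.get("MOE_HF_ASSETS_PATH", DEFAULT_MOE_CKPT)
```

With nothing set, the helper falls back to `/tmp/debug_moe_ckpt`; exporting `MOE_HF_ASSETS_PATH` before launching `torchrun` would redirect it.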

Run with:
python scripts/rl/create_debug_moe_ckpt.py
torchrun --nproc_per_node=4 -m pytest \
torchtitan/experiments/rl/tests/test_bitwise_parity.py::TestBitwiseParityMoEEP

Note: bitwise parity is not yet achieved; the current max_delta is ~2e-2.
Likely causes: batch-invariant mode is disabled (it requires batch_invariant_ops),
attention implementation differences (varlen vs paged), and MoE token routing order.
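
To make the distinction in the note concrete, here is a minimal stdlib-only sketch of how a bitwise check differs from a max-delta check. It is not the test's actual code (which is not shown in this PR page and presumably operates on tensors); `float32_bits` and `compare_outputs` are hypothetical names, and the inputs are plain Python floats treated as float32 values.

```python
import struct


def float32_bits(value):
    """Return the IEEE-754 float32 bit pattern of `value` as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def compare_outputs(trainer, generator):
    """Return (bitwise_equal, max_delta) for two equal-length float sequences.

    bitwise_equal compares float32 bit patterns; max_delta is the largest
    absolute difference, computed on the Python floats as given.
    """
    if len(trainer) != len(generator):
        raise ValueError("outputs must have the same length")
    pairs = list(zip(trainer, generator))
    bitwise_equal = all(float32_bits(a) == float32_bits(b) for a, b in pairs)
    max_delta = max((abs(a - b) for a, b in pairs), default=0.0)
    return bitwise_equal, max_delta
```

Bitwise identity is the stricter condition: `0.0` and `-0.0` have a delta of zero but different bit patterns, so a max_delta of 0 alone would not prove parity.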


…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs * slen) is not divisible by sp_size (= TP degree). Real workloads with varlen prompts can land on non-divisible totals and crash the MoE forward.

Pad inside dispatch(): round num_tokens up to the next multiple of sp_size, padding x and top_scores with zeros and selected_experts_indices with 0 (so pad rows route deterministically to expert 0 with zero score). combine() reads metadata.original_num_tokens to size the scatter_add buffer at the padded length and slices the pad rows off before returning. When sp_size == 1 or input is already divisible, behavior is bitwise identical to today.

Pad tokens are numerically inert:
- Zero scores -> contribution to scatter_add is exactly zero either before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch shapes; the unpadded portions remain bitwise identical.

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
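
The pad-then-slice scheme described in this commit message can be illustrated with a toy, list-based sketch. This is not the dispatcher's code: `pad_for_sp` and `combine_and_slice` are hypothetical names, plain Python lists stand in for tensors, the all-to-all exchange is omitted, and the inputs are assumed to be non-empty.

```python
def pad_for_sp(x, top_scores, expert_idx, sp_size):
    """Pad token rows up to the next multiple of sp_size.

    x: list of token rows; top_scores / expert_idx: one row of top-k values
    per token. Pad rows get zero features, zero scores, and expert index 0.
    Returns the padded lists plus the original token count.
    """
    original = len(x)
    padded = -(-original // sp_size) * sp_size  # ceiling to a multiple
    n_pad = padded - original
    dim, top_k = len(x[0]), len(top_scores[0])
    x = x + [[0.0] * dim for _ in range(n_pad)]
    top_scores = top_scores + [[0.0] * top_k for _ in range(n_pad)]
    expert_idx = expert_idx + [[0] * top_k for _ in range(n_pad)]
    return x, top_scores, expert_idx, original


def combine_and_slice(expert_out, token_ids, padded_len, original):
    """Scatter-add expert output rows into a padded buffer, then drop pad rows."""
    dim = len(expert_out[0])
    buf = [[0.0] * dim for _ in range(padded_len)]
    for row, tok in zip(expert_out, token_ids):
        for j, value in enumerate(row):
            buf[tok][j] += value
    return buf[:original]


# Three tokens with sp_size=4: one pad row is appended, then sliced off again.
x, scores, idx, n = pad_for_sp(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[0.5], [0.25], [1.0]], [[1], [0], [2]], 4
)
# Stand-in for expert compute: scale each row by its (top-1) score.
scaled = [[s[0] * v for v in row] for row, s in zip(x, scores)]
out = combine_and_slice(scaled, list(range(len(x))), len(x), n)
```

In this toy run the pad row contributes only zeros and lands at index 3, outside `[0, original)`, so slicing removes it and the three real rows are unchanged by the padding.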


This was referenced Apr 30, 2026

[Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper #3142

Open

[WIP]Enable DP-to-EP for MoE inference #3171

Closed

pytorch-bot Bot added the ciflow/8gpu label Apr 30, 2026

meta-cla Bot added the CLA Signed label Apr 30, 2026

wwwjn changed the title ~~Add bitwise parity test for MoE EP~~ [WIP] Add bitwise parity test for MoE EP Apr 30, 2026

wwwjn mentioned this pull request May 1, 2026

[MoE] Pad token count to a multiple of sp_size in AllToAllTokenDispatcher #3193

Merged

wwwjn added 2 commits April 30, 2026 20:25

wwwjn closed this May 6, 2026

wwwjn wants to merge 9 commits into gh/wwwjn/17/base from gh/wwwjn/17/head
