[WIP] Add bitwise parity test for MoE EP #3172

Closed
wwwjn wants to merge 9 commits into gh/wwwjn/17/base from gh/wwwjn/17/head

@wwwjn wwwjn (Contributor) commented Apr 30, 2026

Stack from ghstack (oldest at bottom):

Add rl_grpo_qwen3_moe_debug_ep_batch_invariant config and
TestBitwiseParityMoEEP test class for verifying bitwise identity
between trainer prefill and vLLM generator prefill with MoE EP.

Config: TP=4, EP=4 on 4 GPUs, deterministic mode, compile disabled.
Uses /tmp/debug_moe_ckpt by default (override with MOE_HF_ASSETS_PATH).
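
The checkpoint override above could be resolved roughly as follows. This is a minimal sketch, not code from this PR: the helper name `resolve_moe_ckpt_path` is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os

# Default checkpoint location named in the PR description.
DEFAULT_MOE_CKPT = "/tmp/debug_moe_ckpt"


def resolve_moe_ckpt_path(env=None):
    """Hypothetical helper: prefer MOE_HF_ASSETS_PATH, else the default path."""
    env = os.environ if env is None else env
    # Assumption: an empty string is treated the same as an unset variable.
    return env.get("MOE_HF_ASSETS_PATH") or DEFAULT_MOE_CKPT
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real process environment.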

Run with:
    python scripts/rl/create_debug_moe_ckpt.py
    torchrun --nproc_per_node=4 -m pytest \
      torchtitan/experiments/rl/tests/test_bitwise_parity.py::TestBitwiseParityMoEEP

Note: bitwise parity is not yet achieved; the current max_delta is ~2e-2.
Likely causes: batch-invariant mode is disabled (it requires batch_invariant_ops),
attention implementation differences (varlen vs. paged), and MoE token routing order.
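
For context on what "bitwise parity" and `max_delta` mean here, a rough pure-Python sketch follows. It is not the test's implementation: the function names are hypothetical, and it works on plain lists of floats rather than tensors. The last lines show why reduction order alone (for example, token routing order) can break parity, since floating-point addition is not associative.

```python
import struct


def float32_bits(x):
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bitwise_equal(a, b):
    """True only if every pair of values has identical float32 bits."""
    return len(a) == len(b) and all(
        float32_bits(x) == float32_bits(y) for x, y in zip(a, b)
    )


def max_delta(a, b):
    """Largest absolute elementwise difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Summing the same three numbers in a different order gives a different result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = left != right
```

A tolerance check such as `max_delta(a, b) < 1e-3` can pass while `bitwise_equal(a, b)` still fails, which is why the test targets the stricter condition.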

@meta-cla meta-cla Bot added the CLA Signed label Apr 30, 2026
@wwwjn wwwjn changed the title from "Add bitwise parity test for MoE EP" to "[WIP] Add bitwise parity test for MoE EP" Apr 30, 2026
wwwjn added a commit that referenced this pull request May 1, 2026
ghstack-source-id: 65e6b7a
Pull Request resolved: #3172
wwwjn added 2 commits April 30, 2026 20:25
wwwjn added a commit that referenced this pull request May 1, 2026
ghstack-source-id: 4f3aef7
Pull Request resolved: #3172
wwwjn added a commit that referenced this pull request May 1, 2026
ghstack-source-id: 99cccd6
Pull Request resolved: #3172
wwwjn added a commit that referenced this pull request May 4, 2026
ghstack-source-id: 6cdd70a
Pull Request resolved: #3172
wwwjn added a commit that referenced this pull request May 4, 2026
ghstack-source-id: d4e678c
Pull Request resolved: #3172
wwwjn added a commit that referenced this pull request May 4, 2026
ghstack-source-id: 1c0e85c
Pull Request resolved: #3172
wwwjn added a commit that referenced this pull request May 5, 2026
…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs *
slen) is not divisible by sp_size (= TP degree). Real workloads with
varlen prompts can land on non-divisible totals and crash the MoE
forward.

Pad inside dispatch(): round num_tokens up to the next multiple of
sp_size, padding x and top_scores with zeros and
selected_experts_indices with 0 (so pad rows route deterministically to
expert 0 with zero score). combine() reads metadata.original_num_tokens
to size the scatter_add buffer at the padded length and slices the pad
rows off before returning. When sp_size == 1 or input is already
divisible, behavior is bitwise identical to today.
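
The padding step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the dispatcher's code: `round_up` and `pad_for_sp` are hypothetical names, and tokens are modelled as lists of floats instead of tensors.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def pad_for_sp(x, top_scores, selected_experts, sp_size):
    """Pad per-token rows so that len(x) is divisible by sp_size.

    x: list of token vectors (lists of floats)
    top_scores / selected_experts: one list of top-k values per token
    Returns the (possibly padded) inputs plus the original token count.
    """
    num_tokens = len(x)
    pad = round_up(num_tokens, sp_size) - num_tokens
    if pad == 0:
        # Already divisible (or sp_size == 1): inputs pass through untouched.
        return x, top_scores, selected_experts, num_tokens
    dim, top_k = len(x[0]), len(top_scores[0])
    # Pad rows: zero activations, zero scores, all routed to expert 0.
    x = x + [[0.0] * dim for _ in range(pad)]
    top_scores = top_scores + [[0.0] * top_k for _ in range(pad)]
    selected_experts = selected_experts + [[0] * top_k for _ in range(pad)]
    return x, top_scores, selected_experts, num_tokens
```

For example, 5 tokens with `sp_size=4` are padded to 8 rows, and the returned count of 5 plays the role of `metadata.original_num_tokens` when the pad rows are later sliced off.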

Pad tokens are numerically inert:
- Zero scores -> the pad rows' contribution to scatter_add is exactly zero whether scores are applied before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch
shapes; the unpadded portions remain bitwise identical.
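
To make the inertness argument concrete, here is a toy end-to-end sketch: pad, route, scatter-add into a buffer of the padded length, then slice the pad rows off. Everything in it is an assumption for illustration (`toy_expert` is a bias-free stand-in for an expert MLP, `moe_forward` is a hypothetical name, and per-token accumulation order is fixed), so it shows the idea rather than the real dispatch()/combine() path.

```python
def toy_expert(expert_id, vec):
    """Bias-free stand-in for an expert MLP: scales the input (assumption)."""
    return [(expert_id + 1) * v for v in vec]


def moe_forward(x, top_scores, selected_experts, sp_size):
    """Pad to a multiple of sp_size, route, scatter-add, then drop pad rows."""
    original_num_tokens = len(x)
    padded = -(-original_num_tokens // sp_size) * sp_size
    pad = padded - original_num_tokens
    dim, top_k = len(x[0]), len(top_scores[0])
    x = x + [[0.0] * dim for _ in range(pad)]
    top_scores = top_scores + [[0.0] * top_k for _ in range(pad)]
    selected_experts = selected_experts + [[0] * top_k for _ in range(pad)]

    # Scatter-add buffer sized at the padded length.
    out = [[0.0] * dim for _ in range(padded)]
    for token in range(padded):
        for score, expert_id in zip(top_scores[token], selected_experts[token]):
            y = toy_expert(expert_id, x[token])
            for d in range(dim):
                out[token][d] += score * y[d]
    # Pad rows live in [original_num_tokens, padded) and are sliced off here.
    return out[:original_num_tokens]


tokens = [[1.0, -2.0], [0.5, 0.25], [3.0, 1.5], [0.1, 0.2], [2.0, 2.0]]
scores = [[0.7, 0.3] for _ in tokens]
experts = [[0, 1], [1, 2], [2, 0], [0, 2], [1, 0]]

out_no_pad = moe_forward(tokens, scores, experts, sp_size=1)
out_pad_3 = moe_forward(tokens, scores, experts, sp_size=4)  # pads 5 -> 8
out_pad_1 = moe_forward(tokens, scores, experts, sp_size=3)  # pads 5 -> 6
```

In this toy setting the three outputs compare equal element by element, because the real tokens go through the same arithmetic in the same order regardless of how many pad rows are appended.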

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
wwwjn added a commit that referenced this pull request May 5, 2026
ghstack-source-id: 9b5e3e0
Pull Request resolved: #3172
@wwwjn wwwjn closed this May 6, 2026
Labels: ciflow/8gpu, CLA Signed
