[Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper by wwwjn · Pull Request #3142 · pytorch/torchtitan

wwwjn · 2026-04-28T21:00:16Z

Stack from ghstack (oldest at bottom):

Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs.

Key changes:

- Meta-device init + to_empty() + init_states() for large MoE models
- EP mesh: ep = tp_size (TP ranks become EP ranks for experts)
- Use enable_sp (not inference flag) for output layout and SP splitting
- enable_sequence_parallel=True, disable_loss_parallel=True for inference
- Remove stale ModelConvertersContainer references
- Add MoE debug and 30B-A3B RL configs
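
To illustrate the `ep = tp_size` item, here is a minimal sketch of how experts could be partitioned when TP ranks double as EP ranks. It assumes a contiguous, equal-sized split of experts across ranks, which this PR does not state. `experts_for_rank` is a hypothetical helper, not a torchtitan API, and the expert count of 128 is only an illustrative number.

```python
def experts_for_rank(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Return the expert ids held by one EP rank under a contiguous split.

    Assumption: experts are divided into equal contiguous blocks; the real
    assignment in torchtitan may differ.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


tp_size = 4
ep_size = tp_size  # TP ranks become EP ranks for the expert layers
layout = {rank: experts_for_rank(128, ep_size, rank) for rank in range(ep_size)}
```

With this layout the dense layers stay TP-sharded over the same 4 ranks, while each rank owns a disjoint quarter of the experts.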

Similar to #3057

Dependency

Verification

Qwen3-30B-A3B model on 8 GPUs (trainer TP4 EP4, generator TP4 EP4):

wandb: Detected [openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai
WandB logging enabled
Using renderer qwen3, of type <class 'renderers.configs.Qwen3RendererConfig'>, with args {'enable_thinking': False}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
4 generator GPUs + 4 trainer GPUs = 8 total
[actor=<root>] Initializing TorchStoreStrategy with default_transport_type=<TransportType.Unset: 1>
[actor=<root>] Pre-training validation; then 10 steps of RL training
[actor=<root>] ----------
Validation | Step:  0  validation/response_length/mean: 30.10  timing/validate: 29.13
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  1  loss/mean: 0.013  bit_wise/logprob_diff/max: 55.49  reward/_mean: 0.79  rollout_reward/_mean: 0.79  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.68  rollout/response_length/max: 42.0  train/grad_norm/mean: 0.46  train/lr: 5.0e-07  perf/tokens_per_second: 141.1  timing/step: 120.0
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  2  loss/mean: 0.0055  bit_wise/logprob_diff/max: 56.85  reward/_mean: 0.70  rollout_reward/_mean: 0.70  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.80  rollout/response_length/max: 44.0  train/grad_norm/mean: 0.37  train/lr: 1.0e-06  perf/tokens_per_second: 158.5  timing/step: 110.2
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  3  loss/mean: 0.016  bit_wise/logprob_diff/max: 40.089  reward/_mean: 0.73  rollout_reward/_mean: 0.73  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.76  rollout/response_length/max: 47.0  train/grad_norm/mean: 0.45  train/lr: 1.0e-06  perf/tokens_per_second: 156.9  timing/step: 110.0
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  4  loss/mean: 0.011  bit_wise/logprob_diff/max: 40.92  reward/_mean: 0.73  rollout_reward/_mean: 0.73  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.72  rollout/response_length/max: 45.0  train/grad_norm/mean: 0.43  train/lr: 1.0e-06  perf/tokens_per_second: 163.3  timing/step: 108.7
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  5  loss/mean: 0.0076  bit_wise/logprob_diff/max: 96.53  reward/_mean: 0.71  rollout_reward/_mean: 0.71  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.76  rollout/response_length/max: 42.0  train/grad_norm/mean: 0.36  train/lr: 1.0e-06  perf/tokens_per_second: 165.3  timing/step: 109.2
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  6  loss/mean: 0.011  bit_wise/logprob_diff/max: 48.36  reward/_mean: 0.79  rollout_reward/_mean: 0.79  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.84  rollout/response_length/max: 49.0  train/grad_norm/mean: 0.44  train/lr: 1.0e-06  perf/tokens_per_second: 147.1  timing/step: 117.7
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  7  loss/mean: 0.0055  bit_wise/logprob_diff/max: 54.41  reward/_mean: 0.65  rollout_reward/_mean: 0.65  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.72  rollout/response_length/max: 43.0  train/grad_norm/mean: 0.12  train/lr: 1.0e-06  perf/tokens_per_second: 158.1  timing/step: 110.1
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  8  loss/mean: 0.0035  bit_wise/logprob_diff/max: 51.040  reward/_mean: 0.66  rollout_reward/_mean: 0.66  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.72  rollout/response_length/max: 45.0  train/grad_norm/mean: 0.31  train/lr: 1.0e-06  perf/tokens_per_second: 168.6  timing/step: 101.4
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  9  loss/mean: 0.023  bit_wise/logprob_diff/max: 58.88  reward/_mean: 0.56  rollout_reward/_mean: 0.56  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.64  rollout/response_length/max: 46.0  train/grad_norm/mean: 0.73  train/lr: 1.0e-06  perf/tokens_per_second: 175.6  timing/step: 103.8
[actor=<root>] Dropping 2 packed rows (10 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step: 10  loss/mean: 0.020  bit_wise/logprob_diff/max: 50.84  reward/_mean: 0.52  rollout_reward/_mean: 0.52  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.64  rollout/response_length/max: 47.0  train/grad_norm/mean: 0.23  train/lr: 1.0e-06  perf/tokens_per_second: 167.2  timing/step: 109.2
[actor=<root>] ----------
Validation | Step: 10  validation/response_length/mean: 31.85  timing/validate: 25.94
[actor=<root>] ============================================================
[actor=<root>] Validation summary (pre / post):
[actor=<root>]   validation_reward/_max:  +1.000  /  +1.000
[actor=<root>]   validation_reward/_mean:  +0.639  /  +0.664
[actor=<root>]   validation_reward/_min:  +0.052  /  +0.056
[actor=<root>]   validation_reward/_std:  +0.406  /  +0.386
[actor=<root>]   validation_reward/_sum:  +12.776  /  +13.289
[actor=<root>]   validation_reward/component/RewardAlphabetSort/mean:  +0.639  /  +0.664
[actor=<root>] ============================================================
[actor=<root>] Closing: tearing down actors and process meshes.

Batch-invariant verification

TBA

Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs. Key changes: - Meta-device init + to_empty() + init_states() for large MoE models - EP mesh: ep = tp_size (TP ranks become EP ranks for experts) - Use enable_sp (not inference flag) for output layout and SP splitting - enable_sequence_parallel=True, disable_loss_parallel=True for inference - Remove stale ModelConvertersContainer references - Add MoE debug and 30B-A3B RL configs Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B. [ghstack-poisoned]

…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs * slen) is not divisible by sp_size (= TP degree). Real workloads with varlen prompts can land on non-divisible totals and crash the MoE forward.

Pad inside dispatch(): round num_tokens up to the next multiple of sp_size, padding x and top_scores with zeros and selected_experts_indices with 0 (so pad rows route deterministically to expert 0 with zero score). combine() reads metadata.original_num_tokens to size the scatter_add buffer at the padded length and slices the pad rows off before returning. When sp_size == 1 or input is already divisible, behavior is bitwise identical to today.

Pad tokens are numerically inert:
- Zero scores -> contribution to scatter_add is exactly zero either before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch shapes; the unpadded portions remain bitwise identical.

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
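
The pad-then-slice idea in the message above can be sketched in plain Python. This is an illustration only: lists stand in for tensors, and `round_up`, `pad_for_dispatch`, and `trim_after_combine` are hypothetical names, not the dispatcher's real methods.

```python
from typing import List, Tuple

Rows = List[List[float]]
Indices = List[List[int]]


def round_up(num_tokens: int, sp_size: int) -> int:
    """Smallest multiple of sp_size that is >= num_tokens."""
    return -(-num_tokens // sp_size) * sp_size


def pad_for_dispatch(
    x: Rows, top_scores: Rows, expert_indices: Indices, sp_size: int
) -> Tuple[Rows, Rows, Indices, int]:
    """Append zero rows so the token count divides sp_size.

    Pad rows get zero scores and expert index 0, so they route
    deterministically and contribute nothing to the combined output.
    """
    original = len(x)
    n_pad = round_up(original, sp_size) - original
    if n_pad == 0:
        # Already divisible (or sp_size == 1): inputs pass through unchanged.
        return x, top_scores, expert_indices, original
    hidden, top_k = len(x[0]), len(top_scores[0])
    x = x + [[0.0] * hidden for _ in range(n_pad)]
    top_scores = top_scores + [[0.0] * top_k for _ in range(n_pad)]
    expert_indices = expert_indices + [[0] * top_k for _ in range(n_pad)]
    return x, top_scores, expert_indices, original


def trim_after_combine(out: Rows, original_num_tokens: int) -> Rows:
    """Drop the pad rows, which live in [original, padded)."""
    return out[:original_num_tokens]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [[0.7], [0.9], [0.4]]
experts = [[2], [0], [1]]
px, ps, pe, n = pad_for_dispatch(x, scores, experts, sp_size=2)
restored = trim_after_combine(px, n)
```

Because the pad rows sit at the tail, slicing by the recorded original length is enough to recover the unpadded result.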

tianyu-l · 2026-05-07T20:52:49Z

+        # Replicate (no-op) and Shard(-1) (all-gather) lm_head output placements.
        if isinstance(logits, DTensor):
-            logits = logits.to_local()
+            logits = logits.full_tensor()


should work with disable_loss_parallel already?

…endency (#3242)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #3236
* #3142
* __->__ #3242

vLLM has customized config parser registry support, so we can plug in TorchTitan's config parser.

Why we need this:
- Get rid of the dependency on an HF-format checkpoint folder when initializing. Don't implicitly depend on `config.json` as the config source of truth.

Other changes in this PR:
- Remove the round-trip translation from torchtitan config -> vllm config -> torchtitan config, using a closure to bypass it.

Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs. Key changes: - Meta-device init + to_empty() + init_states() for large MoE models - EP mesh: ep = tp_size (TP ranks become EP ranks for experts) - Use enable_sp (not inference flag) for output layout and SP splitting - enable_sequence_parallel=True, disable_loss_parallel=True for inference - Remove stale ModelConvertersContainer references - Add MoE debug and 30B-A3B RL configs Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B. ghstack-source-id: 95289dc Pull Request resolved: #3142

[ghstack-poisoned]

tianyu-l · 2026-06-04T02:17:09Z

+        # All-to-all dispatch tokens to EP ranks.
+        # Use the non-autograd version under inference (vLLM), since
+        # _c10d_functional_autograd ops don't dispatch correctly without
+        # an active autograd context. Gated by a Python bool so the choice
+        # is stable at trace time.


any progress? could you follow up?
add a TODO and issue if you think it's not possible to get this done quickly

[ghstack-poisoned]

Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs. Key changes: - Meta-device init + to_empty() + init_states() for large MoE models - EP mesh: ep = tp_size (TP ranks become EP ranks for experts) - Use enable_sp (not inference flag) for output layout and SP splitting - enable_sequence_parallel=True, disable_loss_parallel=True for inference - Remove stale ModelConvertersContainer references - Add MoE debug and 30B-A3B RL configs Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B. ghstack-source-id: 03bd198 Pull Request resolved: #3142

wwwjn

Left comments for myself, to be updated after rebasing onto the latest main

[ghstack-poisoned]

… and fix combine() shape mismatch (#3595)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):
* #3236
* #3142
* __->__ #3595

## What's the problem
- [Currently] combine() wrongly assumes the tokens are evenly sharded across ranks https://github.com/pytorch/torchtitan/blob/c0428bb186f5c97e1d1b4ed89febd22916eee302/torchtitan/models/common/token_dispatcher.py#L439-L456 (it infers global SPMD inside a local SPMD region)
- If the sharding is uneven, out_TD will have different shapes across SP ranks.
- We should directly ban input batches whose number of tokens cannot be evenly sharded across SP ranks
- [Future] The router will use spmd_types soon, and the router is per SP rank. Each SP rank should have even sharding
- [Future] We want to avoid dispatching/load-balancing the padded tokens; we should be able to do that by adding a metadata field that records the actual local tokens for each SP rank

## What does this PR do?
This PR does "virtual padding" and passes metadata around
- Calculate num_local_tokens_after_padding = (T + pad_tokens) // sp_size in MoE module
- Pass num_local_tokens_after_padding to GroupedExperts module, then to combine()
- combine() returns a tensor with shape (num_local_tokens_after_padding * sp_rank, .... )
- Slice the combined tensor to (T, ...) in MoE
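
The "virtual padding" arithmetic above can be sketched as follows. This is a minimal illustration, assuming padding is conceptually appended at the tail of the global token sequence and each SP rank owns one contiguous slice. `local_real_tokens` is a hypothetical helper showing what a per-rank "actual local tokens" metadata field could record; it is not part of the PR.

```python
def num_local_tokens_after_padding(total_tokens: int, sp_size: int) -> int:
    """Per-rank token count once T is rounded up to a multiple of sp_size."""
    pad_tokens = (-total_tokens) % sp_size
    return (total_tokens + pad_tokens) // sp_size


def local_real_tokens(total_tokens: int, sp_size: int, sp_rank: int) -> int:
    """Number of non-pad tokens in one rank's slice.

    Assumption: rank r owns global rows [r * n_local, (r + 1) * n_local),
    and pad rows occupy the tail [T, T + pad_tokens).
    """
    n_local = num_local_tokens_after_padding(total_tokens, sp_size)
    start = sp_rank * n_local
    end = min(total_tokens, start + n_local)
    return max(0, end - start)


T, sp_size = 10, 4
n_local = num_local_tokens_after_padding(T, sp_size)
per_rank = [local_real_tokens(T, sp_size, r) for r in range(sp_size)]
```

Every rank sees the same padded shape, so shapes agree across SP ranks, while the real-token counts still sum to T for the final slice.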

[ghstack-poisoned]

wwwjn · 2026-06-10T19:49:28Z

      "<|vision_end|>": 2006,
      "<|image_pad|>": 2007,
-      "<|video_pad|>": 2008
+      "<|video_pad|>": 2008,


These special tokens are needed because the external renderer (AutoTokenizer in RL) checks for them, and they are required by the Qwen3 model

hf_assets_path="tests/assets/tokenizer"
renderer=RendererConfig(name="qwen3", enable_thinking=False)

[ghstack-poisoned]

wwwjn · 2026-06-11T01:06:57Z

            completion_message["reasoning_content"] = parsed.reasoning_content
        if parsed.tool_calls:
            completion_message["tool_calls"] = parsed.tool_calls
-


Need to revert

wwwjn · 2026-06-11T01:07:40Z

+    group_size = 8
+    return RLTrainer.Config(
+        model_spec=model_registry("30B-A3B", attn_backend="varlen"),
+        hf_assets_path="/data/users/jianiw/model/Qwen3-30B-A3B",


This needs to be reverted

wwwjn · 2026-06-11T01:08:37Z

The changes in this file just add basic kwargs, make the script runnable with different configs, and align it with the generator setup in RL.

wwwjn · 2026-06-11T01:11:41Z

+        # forwards cannot shard across all SP ranks, so pad to ``sp_size`` and
+        # trim before returning. This is a no-op for training/prefill and EP-off;
+        # padded tokens only affect the inference-only load-balance counter.
+        seq_pad = sp_size - L if L < sp_size else 0


This is to solve #3622

TL;DR, the issue is: when seq_len >= TP_degree, num_local_tokens_per_expert_E = routing_map_BLE.sum(dim=(0, 1)) is supposed to go
V (tokens vary across TP ranks) --sum(seq)--> P (partial per-rank count). But when seq_len < TP_degree, it degrades to R --> sum --> R, which cannot be converted to P later.

The simple fix there is to pad when seq_len < TP_degree
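
A minimal sketch of that padding rule, mirroring the `seq_pad` line in the diff. `pad_and_trim` and its identity "forward" are hypothetical stand-ins for the real MoE forward.

```python
from typing import List, Tuple


def seq_pad_len(seq_len: int, sp_size: int) -> int:
    """Pad only when the sequence is shorter than the SP/TP degree."""
    return sp_size - seq_len if seq_len < sp_size else 0


def pad_and_trim(
    tokens: List[int], sp_size: int, pad_value: int = 0
) -> Tuple[List[int], List[int]]:
    """Pad up to sp_size, run a stand-in forward, then trim the pad rows."""
    seq_len = len(tokens)
    padded = tokens + [pad_value] * seq_pad_len(seq_len, sp_size)
    out = list(padded)  # identity stand-in for the MoE forward
    return padded, out[:seq_len]


padded, trimmed = pad_and_trim([11, 12], sp_size=4)
```

For seq_len >= sp_size the pad length is 0, so the longer-sequence path is untouched.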

wwwjn requested review from fegin, tianyu-l and wconstab as code owners April 28, 2026 21:00

pytorch-bot Bot added the ciflow/8gpu label Apr 28, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 28, 2026

wwwjn mentioned this pull request Apr 28, 2026

[WIP] Enable DP-to-EP for MoE inference #3143

Closed

wwwjn commented Apr 28, 2026

Comment thread torchtitan/experiments/rl/grpo.py Outdated

Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated

wwwjn changed the title ~~Enable EP+TP for MoE inference in vLLM wrapper~~ Enable TP2EP for MoE inference in vLLM wrapper Apr 28, 2026

tianyu-l reviewed Apr 28, 2026

acisseJZhong reviewed Apr 28, 2026

Comment thread torchtitan/models/llama4/parallelize.py Outdated

tianyu-l mentioned this pull request Apr 29, 2026

Improve compilation time (reduce from ~50 seconds to ~15s for vLLM) #3145

Merged

This was referenced Apr 30, 2026

[WIP]Enable DP-to-EP for MoE inference #3171

Closed

[WIP] Add bitwise parity test for MoE EP #3172

Closed

wwwjn changed the title ~~Enable TP2EP for MoE inference in vLLM wrapper~~ [rl] Enable TP2EP for MoE inference in vLLM wrapper Apr 30, 2026

wwwjn mentioned this pull request May 1, 2026

[MoE] Pad token count to a multiple of sp_size in AllToAllTokenDispatcher #3193

Merged

wwwjn added 2 commits April 30, 2026 20:25

github-actions Bot mentioned this pull request May 1, 2026

TODO Debt Report #2936

Open

wwwjn commented May 4, 2026

Comment thread tests/assets/qwen3_moe_debug/tokenizer_config.json Outdated

Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated

Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated

wwwjn added 3 commits May 4, 2026 16:14

wwwjn mentioned this pull request May 7, 2026

[MoE] Migrate MoE token dispatch/combine from all_to_all_single_autograd to all_to_all_single #3268

Open

tianyu-l reviewed May 7, 2026

wwwjn added 5 commits May 7, 2026 15:15

Update

99a8a7c

[ghstack-poisoned]

tianyu-l reviewed Jun 4, 2026

wwwjn changed the title ~~[rl] Enable TP2EP for MoE inference in vLLM wrapper~~ [Not ready][rl] Enable TP2EP for MoE inference in vLLM wrapper Jun 4, 2026

wwwjn changed the title ~~[Not ready][rl] Enable TP2EP for MoE inference in vLLM wrapper~~ [Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper Jun 4, 2026

Update

5fde05e

[ghstack-poisoned]

wwwjn commented Jun 5, 2026

wwwjn added 2 commits June 5, 2026 12:13

Update

4fcad5e

[ghstack-poisoned]

Update

a3c41bd

[ghstack-poisoned]

wwwjn mentioned this pull request Jun 9, 2026

Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch #3595

Merged

wwwjn added 6 commits June 9, 2026 20:35

Update

90ece02

[ghstack-poisoned]

Update

88e31ff

[ghstack-poisoned]

Update

f1e83c1

[ghstack-poisoned]

Update

8e2ea62

[ghstack-poisoned]

Update

2d737b3

[ghstack-poisoned]

Update

671fef9

[ghstack-poisoned]

Update

b43d3d1

[ghstack-poisoned]

wwwjn commented Jun 10, 2026

Update

bf6ecfa

[ghstack-poisoned]

wwwjn commented Jun 11, 2026

4 participants
