[Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper#3142

Open
wwwjn wants to merge 38 commits into
gh/wwwjn/14/basefrom
gh/wwwjn/14/head

Conversation

@wwwjn

@wwwjn wwwjn commented Apr 28, 2026

Contributor

Stack from ghstack (oldest at bottom):

Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs.

Key changes:

  • Meta-device init + to_empty() + init_states() for large MoE models
  • EP mesh: ep = tp_size (TP ranks become EP ranks for experts)
  • Use enable_sp (not inference flag) for output layout and SP splitting
  • enable_sequence_parallel=True, disable_loss_parallel=True for inference
  • Remove stale ModelConvertersContainer references
  • Add MoE debug and 30B-A3B RL configs
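
As a rough illustration of the `ep = tp_size` mapping in the list above, the sketch below hands each rank a contiguous block of experts. The helper name `experts_for_rank`, the contiguous layout, and the expert count of 8 are assumptions made for illustration; in the PR the actual sharding comes from the EP mesh, not from a helper like this.

```python
def experts_for_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Return the contiguous block of expert ids owned by `rank`.

    Assumes num_experts is divisible by ep_size (an assumption of this
    sketch, not something stated in the PR).
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank out of range")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


tp_size = 4        # TP degree used for the dense layers
ep_size = tp_size  # TP2EP: the same ranks act as EP ranks for the experts
num_experts = 8    # hypothetical expert count, kept small for readability

# Map each rank to the expert ids it would hold under this layout.
layout = {r: list(experts_for_rank(num_experts, ep_size, r)) for r in range(ep_size)}
print(layout)
```

Dense layers would still be sharded by TP across the same four ranks; only the expert weights follow the per-rank ownership shown here.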

Similar to #3057

Dependency

Verification

Qwen3-30B-A3B model on 8 GPUs (trainer TP4 EP4, generator TP4 EP4):

wandb: Detected [openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai
WandB logging enabled
Using renderer qwen3, of type <class 'renderers.configs.Qwen3RendererConfig'>, with args {'enable_thinking': False}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
4 generator GPUs + 4 trainer GPUs = 8 total
[actor=<root>] Initializing TorchStoreStrategy with default_transport_type=<TransportType.Unset: 1>
[actor=<root>] Pre-training validation; then 10 steps of RL training
[actor=<root>] ----------
Validation | Step:  0  validation/response_length/mean: 30.10  timing/validate: 29.13
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  1  loss/mean: 0.013  bit_wise/logprob_diff/max: 55.49  reward/_mean: 0.79  rollout_reward/_mean: 0.79  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.68  rollout/response_length/max: 42.0  train/grad_norm/mean: 0.46  train/lr: 5.0e-07  perf/tokens_per_second: 141.1  timing/step: 120.0
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  2  loss/mean: 0.0055  bit_wise/logprob_diff/max: 56.85  reward/_mean: 0.70  rollout_reward/_mean: 0.70  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.80  rollout/response_length/max: 44.0  train/grad_norm/mean: 0.37  train/lr: 1.0e-06  perf/tokens_per_second: 158.5  timing/step: 110.2
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  3  loss/mean: 0.016  bit_wise/logprob_diff/max: 40.089  reward/_mean: 0.73  rollout_reward/_mean: 0.73  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.76  rollout/response_length/max: 47.0  train/grad_norm/mean: 0.45  train/lr: 1.0e-06  perf/tokens_per_second: 156.9  timing/step: 110.0
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  4  loss/mean: 0.011  bit_wise/logprob_diff/max: 40.92  reward/_mean: 0.73  rollout_reward/_mean: 0.73  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.72  rollout/response_length/max: 45.0  train/grad_norm/mean: 0.43  train/lr: 1.0e-06  perf/tokens_per_second: 163.3  timing/step: 108.7
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  5  loss/mean: 0.0076  bit_wise/logprob_diff/max: 96.53  reward/_mean: 0.71  rollout_reward/_mean: 0.71  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.76  rollout/response_length/max: 42.0  train/grad_norm/mean: 0.36  train/lr: 1.0e-06  perf/tokens_per_second: 165.3  timing/step: 109.2
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  6  loss/mean: 0.011  bit_wise/logprob_diff/max: 48.36  reward/_mean: 0.79  rollout_reward/_mean: 0.79  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.84  rollout/response_length/max: 49.0  train/grad_norm/mean: 0.44  train/lr: 1.0e-06  perf/tokens_per_second: 147.1  timing/step: 117.7
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  7  loss/mean: 0.0055  bit_wise/logprob_diff/max: 54.41  reward/_mean: 0.65  rollout_reward/_mean: 0.65  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.72  rollout/response_length/max: 43.0  train/grad_norm/mean: 0.12  train/lr: 1.0e-06  perf/tokens_per_second: 158.1  timing/step: 110.1
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  8  loss/mean: 0.0035  bit_wise/logprob_diff/max: 51.040  reward/_mean: 0.66  rollout_reward/_mean: 0.66  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.72  rollout/response_length/max: 45.0  train/grad_norm/mean: 0.31  train/lr: 1.0e-06  perf/tokens_per_second: 168.6  timing/step: 101.4
[actor=<root>] Dropping 1 packed rows (9 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step:  9  loss/mean: 0.023  bit_wise/logprob_diff/max: 58.88  reward/_mean: 0.56  rollout_reward/_mean: 0.56  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.64  rollout/response_length/max: 46.0  train/grad_norm/mean: 0.73  train/lr: 1.0e-06  perf/tokens_per_second: 175.6  timing/step: 103.8
[actor=<root>] Dropping 2 packed rows (10 -> 8) to fit global_batch_size
[actor=<root>] ----------
Train | Step: 10  loss/mean: 0.020  bit_wise/logprob_diff/max: 50.84  reward/_mean: 0.52  rollout_reward/_mean: 0.52  reward/_max: 1.0  rollout_reward/_max: 1.0  reward/zero_std_frac: 0.64  rollout/response_length/max: 47.0  train/grad_norm/mean: 0.23  train/lr: 1.0e-06  perf/tokens_per_second: 167.2  timing/step: 109.2
[actor=<root>] ----------
Validation | Step: 10  validation/response_length/mean: 31.85  timing/validate: 25.94
[actor=<root>] ============================================================
[actor=<root>] Validation summary (pre / post):
[actor=<root>]   validation_reward/_max:  +1.000  /  +1.000
[actor=<root>]   validation_reward/_mean:  +0.639  /  +0.664
[actor=<root>]   validation_reward/_min:  +0.052  /  +0.056
[actor=<root>]   validation_reward/_std:  +0.406  /  +0.386
[actor=<root>]   validation_reward/_sum:  +12.776  /  +13.289
[actor=<root>]   validation_reward/component/RewardAlphabetSort/mean:  +0.639  /  +0.664
[actor=<root>] ============================================================
[actor=<root>] Closing: tearing down actors and process meshes.

Batch-invariant verification

TBA

Enable Expert Parallelism for MoE models in the vLLM inference path,
with TP on dense layers. Includes meta-device init, init_states() fix
for RoPE cache, EP mesh creation, enable_sp-based output layout, and
MoE debug configs.

Key changes:
- Meta-device init + to_empty() + init_states() for large MoE models
- EP mesh: ep = tp_size (TP ranks become EP ranks for experts)
- Use enable_sp (not inference flag) for output layout and SP splitting
- enable_sequence_parallel=True, disable_loss_parallel=True for inference
- Remove stale ModelConvertersContainer references
- Add MoE debug and 30B-A3B RL configs

Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching
native vLLM on Qwen3-30B-A3B.

[ghstack-poisoned]
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 28, 2026
Comment thread torchtitan/experiments/rl/grpo.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
@wwwjn wwwjn changed the title Enable EP+TP for MoE inference in vLLM wrapper Enable TP2EP for MoE inference in vLLM wrapper Apr 28, 2026
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
Comment thread torchtitan/models/common/token_dispatcher.py Outdated
Comment thread torchtitan/models/llama4/parallelize.py Outdated
Comment thread torchtitan/models/qwen3/state_dict_adapter.py Outdated
Comment thread torchtitan/experiments/rl/grpo.py Outdated
Comment thread torchtitan/experiments/rl/actors/generator.py
Comment thread torchtitan/experiments/rl/config_registry.py
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py
Comment thread torchtitan/experiments/rl/config_registry.py Outdated
Comment thread torchtitan/experiments/rl/grpo.py Outdated
Comment thread torchtitan/models/llama4/parallelize.py Outdated
@wwwjn wwwjn changed the title Enable TP2EP for MoE inference in vLLM wrapper [rl] Enable TP2EP for MoE inference in vLLM wrapper Apr 30, 2026
wwwjn added 2 commits April 30, 2026 20:25
@github-actions github-actions Bot mentioned this pull request May 1, 2026
Comment thread tests/assets/qwen3_moe_debug/tokenizer_config.json Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
wwwjn added 3 commits May 4, 2026 16:14
wwwjn added a commit that referenced this pull request May 5, 2026
…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs *
slen) is not divisible by sp_size (= TP degree). Real workloads with
varlen prompts can land on non-divisible totals and crash the MoE
forward.

Pad inside dispatch(): round num_tokens up to the next multiple of
sp_size, padding x and top_scores with zeros and
selected_experts_indices with 0 (so pad rows route deterministically to
expert 0 with zero score). combine() reads metadata.original_num_tokens
to size the scatter_add buffer at the padded length and slices the pad
rows off before returning. When sp_size == 1 or input is already
divisible, behavior is bitwise identical to today.

Pad tokens are numerically inert:
- Zero scores -> contribution to scatter_add is exactly zero either
before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after
scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch
shapes; the unpadded portions remain bitwise identical.

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for
free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
# Replicate (no-op) and Shard(-1) (all-gather) lm_head output placements.
if isinstance(logits, DTensor):
logits = logits.to_local()
logits = logits.full_tensor()

Contributor

should work with disable_loss_parallel already?

wwwjn added 5 commits May 7, 2026 15:15
wwwjn added a commit that referenced this pull request May 11, 2026
…endency (#3242)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #3236
* #3142
* __->__ #3242

vLLM supports a custom config parser registry, so we can plug in
TorchTitan's config parser. Why we need this:
- Get rid of the dependency on an HF-format checkpoint folder at
initialization. Don't implicitly depend on `config.json` as the config
source of truth.

Other changes in this PR:
- Remove the round-trip translation from torchtitan config -> vllm
config -> torchtitan config, using a closure to bypass it.
wwwjn added a commit that referenced this pull request Jun 3, 2026
ghstack-source-id: 95289dc
Pull Request resolved: #3142
[ghstack-poisoned]
Comment thread torchtitan/experiments/rl/config_registry.py
Comment on lines +317 to +321
# All-to-all dispatch tokens to EP ranks.
# Use the non-autograd version under inference (vLLM), since
# _c10d_functional_autograd ops don't dispatch correctly without
# an active autograd context. Gated by a Python bool so the choice
# is stable at trace time.

Contributor

any progress? could you follow up?
add a TODO and issue if you think it's not possible to get this done quickly

Comment thread torchtitan/config/configs.py Outdated
@wwwjn wwwjn changed the title [rl] Enable TP2EP for MoE inference in vLLM wrapper [Not ready][rl] Enable TP2EP for MoE inference in vLLM wrapper Jun 4, 2026
@wwwjn wwwjn changed the title [Not ready][rl] Enable TP2EP for MoE inference in vLLM wrapper [Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper Jun 4, 2026
[ghstack-poisoned]
wwwjn added a commit that referenced this pull request Jun 5, 2026
ghstack-source-id: 03bd198
Pull Request resolved: #3142

@wwwjn wwwjn left a comment

Contributor Author

Left comments for myself, to be updated after rebasing onto the latest main

Comment thread .github/workflows/integration_test_8gpu_rl.yaml
Comment thread torchtitan/config/configs.py Outdated
Comment thread torchtitan/experiments/rl/actors/generator.py
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py
Comment thread torchtitan/models/common/token_dispatcher.py Outdated
Comment thread torchtitan/experiments/rl/config_registry.py
Comment thread torchtitan/experiments/rl/tests/integration_tests.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
Comment thread torchtitan/experiments/rl/models/vllm_wrapper.py Outdated
wwwjn added 2 commits June 5, 2026 12:13
[ghstack-poisoned]
[ghstack-poisoned]
wwwjn added 6 commits June 9, 2026 20:35
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
wwwjn added a commit that referenced this pull request Jun 10, 2026
… and fix combine() shape mismatch (#3595)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)
(oldest at bottom):
* #3236
* #3142
* __->__ #3595

## What's the problem
- [Currently] combine() wrongly assumes the tokens are evenly sharded on
each rank
https://github.com/pytorch/torchtitan/blob/c0428bb186f5c97e1d1b4ed89febd22916eee302/torchtitan/models/common/token_dispatcher.py#L439-L456
(infer global SPMD in local SPMD region)
- If unevenly sharded, out_TD will have different shapes across SP ranks.
- We should reject the input outright if the number of tokens in the
input batch cannot be evenly sharded across SP ranks

- [Future] Router will use spmd_types soon, and the router is per SP rank.
Each SP rank should have even sharding
- [Future] we want to avoid dispatching/load-balancing the padded tokens; we
should be able to do that by adding a metadata field to record the
actual local tokens for each SP rank

## What does this PR do?

This PR does "virtual padding" and passes metadata around:
- Calculate num_local_tokens_after_padding = (T + pad_tokens) // sp_size
in MoE module
- Pass num_local_tokens_after_padding to GroupedExperts module, then to
combine()
- combine() returns a tensor with shape (num_local_tokens_after_padding
* sp_rank, .... )
- slice the combined tensor to (T, ...)  in MoE
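
The arithmetic behind these steps can be checked with a small sketch. The helper name `virtual_padding` is hypothetical, and the sketch assumes the combined tensor's leading dimension is the per-rank count times the number of SP ranks; only the formula `(T + pad_tokens) // sp_size` comes from the description above.

```python
def virtual_padding(T: int, sp_size: int) -> tuple[int, int]:
    """Return (pad_tokens, num_local_tokens_after_padding) for T tokens."""
    pad_tokens = (-T) % sp_size  # rows needed to reach a multiple of sp_size
    num_local_tokens_after_padding = (T + pad_tokens) // sp_size
    return pad_tokens, num_local_tokens_after_padding


T, sp_size = 10, 4
pad_tokens, n_local = virtual_padding(T, sp_size)

# Assumed leading dimension of the combined result across all SP ranks.
combined_rows = n_local * sp_size

# The final slice in the MoE module keeps only the first T rows.
kept_rows = list(range(combined_rows))[:T]
print(pad_tokens, n_local, combined_rows, len(kept_rows))
```

Because every rank works with the same `n_local`, the per-rank shapes agree even when `T` itself is not divisible by `sp_size`.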
[ghstack-poisoned]
"<|vision_end|>": 2006,
"<|image_pad|>": 2007,
"<|video_pad|>": 2008
"<|video_pad|>": 2008,

Contributor Author

These special tokens are needed because the external renderer (AutoTokenizer in RL) checks for them, and the Qwen3 model requires them

Contributor Author

hf_assets_path="tests/assets/tokenizer"
renderer=RendererConfig(name="qwen3", enable_thinking=False)

[ghstack-poisoned]
completion_message["reasoning_content"] = parsed.reasoning_content
if parsed.tool_calls:
completion_message["tool_calls"] = parsed.tool_calls

Contributor Author

Need to revert

group_size = 8
return RLTrainer.Config(
model_spec=model_registry("30B-A3B", attn_backend="varlen"),
hf_assets_path="/data/users/jianiw/model/Qwen3-30B-A3B",

Contributor Author

This needs to be reverted

Contributor Author

The changes in this file just add basic kwargs, make the script runnable with different configs, and align it with the generator setup in RL.

# forwards cannot shard across all SP ranks, so pad to ``sp_size`` and
# trim before returning. This is a no-op for training/prefill and EP-off;
# padded tokens only affect the inference-only load-balance counter.
seq_pad = sp_size - L if L < sp_size else 0

Contributor Author

This is to solve #3622

TL;DR of the issue: when seq_len >= TP_degree, num_local_tokens_per_expert_E = routing_map_BLE.sum(dim=(0, 1)) is supposed to be
V (tokens vary across TP ranks) --sum(seq)--> P (partial per-rank count). But when seq_len < TP_degree, it degrades to R --> sum --> R, which cannot be converted to P later.

The simple fix is to pad when seq_len < TP_degree
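
The padding rule from the quoted line, `seq_pad = sp_size - L if L < sp_size else 0`, can be exercised on its own. The wrapper `seq_pad_for` and the string tokens below are hypothetical stand-ins for illustration; only the expression itself comes from the diff.

```python
def seq_pad_for(L: int, sp_size: int) -> int:
    """Pad only when the sequence is shorter than the SP degree."""
    return sp_size - L if L < sp_size else 0


sp_size = 4
# Short sequences are padded up to sp_size; longer ones are left alone.
pads = {L: seq_pad_for(L, sp_size) for L in (1, 3, 4, 9)}
print(pads)

# Pad a single-token (decode-style) sequence, then trim before returning.
L = 1
tokens = ["t0"]
padded = tokens + ["<pad>"] * seq_pad_for(L, sp_size)
trimmed = padded[:L]
print(len(padded), trimmed)
```

Note that this rule does not round longer sequences up to a multiple of `sp_size`; it only covers the `L < sp_size` case that the comment describes.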


Labels

ciflow/rl ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.

4 participants