[Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper#3142
wwwjn wants to merge 38 commits into
Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs.

Key changes:
- Meta-device init + to_empty() + init_states() for large MoE models
- EP mesh: ep = tp_size (TP ranks become EP ranks for experts)
- Use enable_sp (not inference flag) for output layout and SP splitting
- enable_sequence_parallel=True, disable_loss_parallel=True for inference
- Remove stale ModelConvertersContainer references
- Add MoE debug and 30B-A3B RL configs

Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B.

[ghstack-poisoned]
…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs * slen) is not divisible by sp_size (= TP degree). Real workloads with varlen prompts can land on non-divisible totals and crash the MoE forward.

Pad inside dispatch(): round num_tokens up to the next multiple of sp_size, padding x and top_scores with zeros and selected_experts_indices with 0 (so pad rows route deterministically to expert 0 with zero score). combine() reads metadata.original_num_tokens to size the scatter_add buffer at the padded length and slices the pad rows off before returning. When sp_size == 1 or input is already divisible, behavior is bitwise identical to today.

Pad tokens are numerically inert:
- Zero scores -> contribution to scatter_add is exactly zero either before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch shapes; the unpadded portions remain bitwise identical.

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
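The padding scheme described in this commit message can be illustrated with a small pure-Python sketch. It is not the torchtitan dispatcher: `pad_for_sp`, `toy_expert`, and `combine` are hypothetical names, plain lists stand in for tensors, and the "expert" is a made-up scaling function. The sketch only shows that zero-score pad rows routed to expert 0 contribute nothing and are sliced off, so the unpadded rows come out unchanged.

```python
def pad_for_sp(x, top_scores, expert_idx, sp_size):
    """Pad token-major lists so the token count is a multiple of sp_size.

    Pad rows get zero features, zero scores, and expert index 0.
    Returns the padded lists plus the original token count.
    """
    original = len(x)
    padded = -(-original // sp_size) * sp_size  # round up to a multiple
    n_pad = padded - original
    dim = len(x[0]) if x else 0
    top_k = len(top_scores[0]) if top_scores else 0
    x = x + [[0.0] * dim for _ in range(n_pad)]
    top_scores = top_scores + [[0.0] * top_k for _ in range(n_pad)]
    expert_idx = expert_idx + [[0] * top_k for _ in range(n_pad)]
    return x, top_scores, expert_idx, original


def toy_expert(e, row):
    # Hypothetical expert: scales the row by (e + 1).
    return [(e + 1) * v for v in row]


def combine(x, top_scores, expert_idx, original_num_tokens):
    """Score-weighted accumulation per token, then drop the pad rows."""
    dim = len(x[0]) if x else 0
    out = [[0.0] * dim for _ in range(len(x))]  # buffer at padded length
    for t, row in enumerate(x):
        for score, e in zip(top_scores[t], expert_idx[t]):
            y = toy_expert(e, row)
            for d in range(dim):
                out[t][d] += score * y[d]
    return out[:original_num_tokens]  # slice the pad rows off


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [[0.5], [1.0], [0.25]]
idx = [[0], [1], [2]]

px, ps, pi, n = pad_for_sp(x, scores, idx, sp_size=2)
result_padded = combine(px, ps, pi, n)
result_plain = combine(x, scores, idx, len(x))
```

Here three tokens are padded to four for `sp_size=2`, and `result_padded` equals `result_plain` because the pad row has zero score and is sliced away.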
| # Replicate (no-op) and Shard(-1) (all-gather) lm_head output placements. | ||
| if isinstance(logits, DTensor): | ||
| logits = logits.to_local() | ||
| logits = logits.full_tensor() |
should work with disable_loss_parallel already?
…endency (#3242)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #3236
* #3142
* __->__ #3242

vLLM supports a custom config parser registry, so we can plug in TorchTitan's config parser.

Why we need this:
- Get rid of the dependency on an HF-format checkpoint folder when initializing. Don't implicitly depend on `config.json` as the config source of truth.

Other changes in this PR:
- Remove the round-trip translation from torchtitan config -> vllm config -> torchtitan config, using a closure to bypass it.
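The closure idea mentioned above can be sketched in plain Python. Everything here is hypothetical: `_PARSERS`, `register_parser`, `load_config`, `TitanModelConfig`, and the dictionary keys are illustrative stand-ins, not vLLM's or TorchTitan's actual API. The point is only that a parser built as a closure can carry the original config object through unchanged, so nothing has to be read from `config.json` or translated back.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TitanModelConfig:
    # Hypothetical stand-in for a TorchTitan model config.
    dim: int
    n_layers: int
    num_experts: int


# Hypothetical registry; NOT vLLM's real parser registry.
_PARSERS: Dict[str, Callable[[str], dict]] = {}


def register_parser(name: str, parser: Callable[[str], dict]) -> None:
    _PARSERS[name] = parser


def load_config(name: str, model_path: str) -> dict:
    return _PARSERS[name](model_path)


def make_titan_parser(cfg: TitanModelConfig) -> Callable[[str], dict]:
    """Build a parser that closes over cfg instead of reading files."""

    def parse(model_path: str) -> dict:
        # model_path is ignored on purpose: no checkpoint folder is needed.
        return {
            "hidden_size": cfg.dim,          # illustrative key names
            "num_hidden_layers": cfg.n_layers,
            "titan_config": cfg,             # original object, no round trip
        }

    return parse


cfg = TitanModelConfig(dim=64, n_layers=2, num_experts=8)
register_parser("torchtitan", make_titan_parser(cfg))
parsed = load_config("torchtitan", "/path/that/does/not/exist")
```

Because `parse` captures `cfg`, the consumer can retrieve the exact object that was registered rather than reconstructing it from a translated config.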
Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs. Key changes: - Meta-device init + to_empty() + init_states() for large MoE models - EP mesh: ep = tp_size (TP ranks become EP ranks for experts) - Use enable_sp (not inference flag) for output layout and SP splitting - enable_sequence_parallel=True, disable_loss_parallel=True for inference - Remove stale ModelConvertersContainer references - Add MoE debug and 30B-A3B RL configs Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B. ghstack-source-id: 95289dc Pull Request resolved: #3142
| # All-to-all dispatch tokens to EP ranks. | ||
| # Use the non-autograd version under inference (vLLM), since | ||
| # _c10d_functional_autograd ops don't dispatch correctly without | ||
| # an active autograd context. Gated by a Python bool so the choice | ||
| # is stable at trace time. |
any progress? could you follow up?
add a TODO and issue if you think it's not possible to get this done quickly
Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs. Key changes: - Meta-device init + to_empty() + init_states() for large MoE models - EP mesh: ep = tp_size (TP ranks become EP ranks for experts) - Use enable_sp (not inference flag) for output layout and SP splitting - enable_sequence_parallel=True, disable_loss_parallel=True for inference - Remove stale ModelConvertersContainer references - Add MoE debug and 30B-A3B RL configs Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B. ghstack-source-id: 03bd198 Pull Request resolved: #3142
wwwjn
left a comment
Left comments for myself, to be updated after rebasing onto the latest main.
… and fix combine() shape mismatch (#3595)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):
* #3236
* #3142
* __->__ #3595

## What's the problem
- [Currently] combine() wrongly assumes the tokens are evenly sharded on each rank https://github.com/pytorch/torchtitan/blob/c0428bb186f5c97e1d1b4ed89febd22916eee302/torchtitan/models/common/token_dispatcher.py#L439-L456 (infer global SPMD in local SPMD region)
- If sharding is uneven, out_TD will have different shapes across SP ranks.
- We should directly ban inputs whose number of tokens in the batch cannot be evenly sharded across SP ranks.
- [Future] The router will use spmd_types soon, and the router is per SP rank. Each SP rank should have even sharding.
- [Future] We want to avoid dispatching/load-balancing the padded tokens; we should be able to do that by adding a metadata field that records the actual local tokens for each SP rank.

## What does this PR do?
This PR does "virtual padding" and passes metadata around:
- Calculate num_local_tokens_after_padding = (T + pad_tokens) // sp_size in the MoE module
- Pass num_local_tokens_after_padding to the GroupedExperts module, then to combine()
- combine() returns a tensor with shape (num_local_tokens_after_padding * sp_rank, .... )
- Slice the combined tensor to (T, ...) in MoE
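The "virtual padding" arithmetic above can be worked through with a short sketch. The formula for `num_local_tokens_after_padding` is taken from the commit message; the function names and the choice of `pad_tokens` as the smallest pad that makes `T` divisible are assumptions. `local_token_counts` additionally assumes the padded sequence is split into contiguous equal chunks, which is one possible way to record the actual local tokens per SP rank mentioned under [Future].

```python
def virtual_padding(num_tokens: int, sp_size: int):
    """Return (pad_tokens, num_local_tokens_after_padding).

    pad_tokens is assumed to be the smallest non-negative pad that
    makes num_tokens divisible by sp_size.
    """
    pad_tokens = (-num_tokens) % sp_size
    num_local_tokens_after_padding = (num_tokens + pad_tokens) // sp_size
    return pad_tokens, num_local_tokens_after_padding


def local_token_counts(num_tokens: int, sp_size: int):
    """Real (non-pad) tokens held by each SP rank.

    Assumes contiguous equal chunks of the padded sequence, so only
    the trailing rank(s) hold pad tokens.
    """
    _, n_local = virtual_padding(num_tokens, sp_size)
    return [
        max(0, min(n_local, num_tokens - rank * n_local))
        for rank in range(sp_size)
    ]


T, sp_size = 10, 4
pad_tokens, n_local = virtual_padding(T, sp_size)
counts = local_token_counts(T, sp_size)
```

With `T = 10` and `sp_size = 4`, two pad tokens are needed, every rank sees three local rows after padding, and only the last rank holds fewer real tokens than rows.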
| "<|vision_end|>": 2006, | ||
| "<|image_pad|>": 2007, | ||
| "<|video_pad|>": 2008 | ||
| "<|video_pad|>": 2008, |
These special tokens are needed because the external renderer (AutoTokenizer in RL) checks for them, and the Qwen3 model requires them.
hf_assets_path="tests/assets/tokenizer"
renderer=RendererConfig(name="qwen3", enable_thinking=False)
| completion_message["reasoning_content"] = parsed.reasoning_content | ||
| if parsed.tool_calls: | ||
| completion_message["tool_calls"] = parsed.tool_calls | ||
|
|
| group_size = 8 | ||
| return RLTrainer.Config( | ||
| model_spec=model_registry("30B-A3B", attn_backend="varlen"), | ||
| hf_assets_path="/data/users/jianiw/model/Qwen3-30B-A3B", |
This needs to be reverted
The changes in this file just add basic kwargs, make the script runnable with different configs, and align it with the generator setup in RL.
| # forwards cannot shard across all SP ranks, so pad to ``sp_size`` and | ||
| # trim before returning. This is a no-op for training/prefill and EP-off; | ||
| # padded tokens only affect the inference-only load-balance counter. | ||
| seq_pad = sp_size - L if L < sp_size else 0 |
This is to solve #3622
TL;DR: when seq_len >= TP_degree, num_local_tokens_per_expert_E = routing_map_BLE.sum(dim=(0, 1)) is supposed to go
V (tokens vary across TP ranks) --sum(seq)--> P (partial per-rank count). But when seq_len < TP_degree, it degrades to R --> sum --> R, which cannot be converted to P later.
The simple fix is to pad when seq_len < TP_degree.
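The pad-then-trim fix can be sketched as follows. The `seq_pad` expression mirrors the line in the diff above (`sp_size - L if L < sp_size else 0`); `pad_and_trim`, the use of plain lists, and the zero pad value are hypothetical and only illustrate the shape handling, not the actual MoE forward.

```python
def seq_pad(L: int, sp_size: int) -> int:
    # Same rule as the diff: pad only when the sequence is shorter
    # than the number of SP ranks.
    return sp_size - L if L < sp_size else 0


def pad_and_trim(tokens, sp_size, forward):
    """Pad a short sequence up to sp_size, run forward, trim back to L.

    `forward` is a hypothetical per-token function standing in for the
    MoE forward; the pad value 0 is an assumption for illustration.
    """
    L = len(tokens)
    padded = tokens + [0] * seq_pad(L, sp_size)
    out = forward(padded)
    return out[:L]  # drop outputs that belong to pad positions


def double(xs):
    return [2 * v for v in xs]


short = pad_and_trim([7], sp_size=4, forward=double)
long_enough = pad_and_trim([1, 2, 3, 4, 5], sp_size=4, forward=double)
```

For sequences with at least `sp_size` tokens the pad is zero, so the call is a no-op wrapper around `forward`, matching the "no-op for training/prefill" note in the code comment.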
Stack from ghstack (oldest at bottom):
Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs.
Key changes:
Similar to #3057
Dependency
Verification
Qwen3-30B-A3B model on 8 GPUs (Trainer TP4 EP4, generator TP4 EP4):
Batch-invariant verification
TBA