[WIP] Enable DP-to-EP for MoE inference #3143
Closed
wwwjn wants to merge 4 commits into
Map vLLM's data_parallel_size to dp_shard in TorchTitan's mesh math, enabling EP to span both DP and TP ranks (ep = dp * tp). For inference, the skip_dp path computes 2D meshes (fsdp + dp_replicate), then calls apply_fsdp, which uses shard_placement_fn to route expert params to the efsdp mesh and dense params to the fsdp mesh.

Changes:
- parallel_dims: dp_replicate mesh always exists (needed for 2D mesh)
- vllm_wrapper: ep_size = dp_size * tp_size, dp_shard = dp_size
- qwen3/parallelize: inference path computes 2D meshes for apply_fsdp
- llama4/parallelize: clarify shard_placement_fn comments
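The size relationships in this description can be sketched in plain Python. This is only an illustration: `MeshSizes`, `inference_mesh_sizes`, and `ep_rank` are hypothetical names, and the rank layout (TP as the innermost dimension) is an assumption, not something taken from TorchTitan or vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshSizes:
    dp_shard: int  # taken from vLLM's data_parallel_size
    tp: int        # tensor-parallel degree
    ep: int        # expert-parallel degree, spanning DP and TP ranks


def inference_mesh_sizes(dp_size: int, tp_size: int) -> MeshSizes:
    """Hypothetical helper mirroring the PR text: dp_shard = dp_size, ep = dp * tp."""
    if dp_size < 1 or tp_size < 1:
        raise ValueError("parallel degrees must be positive")
    return MeshSizes(dp_shard=dp_size, tp=tp_size, ep=dp_size * tp_size)


def ep_rank(dp_rank: int, tp_rank: int, tp_size: int) -> int:
    """Flatten (dp_rank, tp_rank) into one EP rank, assuming TP is innermost."""
    return dp_rank * tp_size + tp_rank


# Example: 2 DP replicas of a 4-way TP group give one 8-way EP group.
sizes = inference_mesh_sizes(dp_size=2, tp_size=4)
ranks = [
    ep_rank(d, t, sizes.tp)
    for d in range(sizes.dp_shard)
    for t in range(sizes.tp)
]
```

With these assumed inputs, every (DP, TP) coordinate maps to a distinct EP rank, which is the sense in which EP spans both dimensions.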
wwwjn added a commit that referenced this pull request on Apr 28, 2026
ghstack-source-id: 3b4296c
Pull Request resolved: #3143
In development. Currently it doesn't work with weight loading.
wwwjn added a commit that referenced this pull request on Apr 28, 2026
ghstack-source-id: 2228b13
Pull Request resolved: #3143
wwwjn commented on Apr 28, 2026
    # Always keep fsdp mesh with real backend so fully_shard()
    # can apply MixedPrecisionPolicy even at degree 1.
    return True
if name == "dp_replicate":
We need dp_replicate to always exist, because applying DDP via fully_shard requires a 2D mesh.
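A minimal sketch of the rule being discussed, assuming that mesh dimensions other than fsdp and dp_replicate get a real backend only when their degree is greater than 1. The function name `uses_real_backend` and the `ALWAYS_REAL` set are hypothetical and do not come from the diff:

```python
# Hypothetical: dimensions kept with a real backend even at degree 1.
ALWAYS_REAL = {"fsdp", "dp_replicate"}


def uses_real_backend(name: str, degree: int) -> bool:
    """Decide whether a mesh dimension keeps a real backend (illustrative only)."""
    if name in ALWAYS_REAL:
        # Kept so that a 2D (dp_replicate, fsdp) mesh can always be built.
        return True
    # Assumption: other dimensions only need a real backend when actually split.
    return degree > 1


decisions = {
    ("fsdp", 1): uses_real_backend("fsdp", 1),
    ("dp_replicate", 1): uses_real_backend("dp_replicate", 1),
    ("tp", 1): uses_real_backend("tp", 1),
    ("tp", 4): uses_real_backend("tp", 4),
}
```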
Stack from ghstack (oldest at bottom):
In development. Currently it doesn't work with weight loading.
Map vLLM's data_parallel_size to dp_shard in TorchTitan's mesh math, enabling EP to span both DP and TP ranks (ep = dp * tp). For inference, the skip_dp path computes 2D meshes (fsdp + dp_replicate), then calls apply_fsdp, which uses shard_placement_fn to route expert params to the efsdp mesh and dense params to the fsdp mesh.
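The routing idea can be illustrated with a plain-Python stand-in. This is not the real shard_placement_fn signature: `pick_mesh`, the parameter names, and the ".experts." naming convention are all assumptions made for the sketch.

```python
def pick_mesh(param_name: str) -> str:
    """Return a (hypothetical) mesh label for a parameter name."""
    # Assumption: expert weights contain an ".experts." path segment.
    return "efsdp" if ".experts." in param_name else "fsdp"


# Hypothetical parameter names for one MoE transformer layer.
param_names = [
    "layers.0.attention.wq.weight",
    "layers.0.moe.experts.w1",
    "layers.0.moe.router.gate.weight",
]
routing = {name: pick_mesh(name) for name in param_names}
```

Under these assumptions only the expert weight lands on the efsdp mesh; the attention and router weights are treated as dense and stay on the fsdp mesh.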
Changes: