[WIP] Enable DP-to-EP for MoE inference#3143

Closed
wwwjn wants to merge 4 commits into gh/wwwjn/15/base from gh/wwwjn/15/head

wwwjn commented Apr 28, 2026

Contributor

Stack from ghstack (oldest at bottom):

In development. Weight loading does not work yet.

Map vLLM's data_parallel_size to dp_shard in TorchTitan's mesh math, so that EP can span both DP and TP ranks (ep = dp * tp). For inference, the skip_dp path computes 2D meshes (fsdp + dp_replicate) and then calls apply_fsdp, which uses shard_placement_fn to route expert params to the efsdp mesh and dense params to the fsdp mesh.
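For readers unfamiliar with the routing step, here is a minimal sketch of the decision described above: expert params go to the efsdp mesh, everything else to the fsdp mesh. It is not the code in this PR. The names `route_param`, `group_params_by_mesh`, and the `".experts."` name marker are hypothetical, and the meshes are represented by plain string labels rather than real DeviceMesh objects.

```python
from collections import defaultdict


def route_param(param_name: str, expert_marker: str = ".experts.") -> str:
    """Return the label of the mesh a parameter would be sharded on.

    Assumption: expert parameters can be recognised by a substring in their
    fully qualified name. The real shard_placement_fn may use another test.
    """
    return "efsdp" if expert_marker in param_name else "fsdp"


def group_params_by_mesh(param_names: list[str]) -> dict[str, list[str]]:
    """Bucket parameter names by the mesh label chosen for each of them."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in param_names:
        groups[route_param(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "layers.0.attention.wq.weight",
        "layers.0.moe.experts.w1",
        "layers.0.moe.router.gate.weight",
    ]
    for mesh_label, members in group_params_by_mesh(names).items():
        print(mesh_label, members)
```

Note that the router weight lands on the fsdp mesh in this sketch, since only the expert weights themselves carry the marker.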

Changes:

  • parallel_dims: dp_replicate mesh always exists (needed for 2D mesh)
  • vllm_wrapper: ep_size = dp_size * tp_size, dp_shard = dp_size
  • qwen3/parallelize: inference path computes 2D meshes for apply_fsdp
  • llama4/parallelize: clarify shard_placement_fn comments
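To make the mesh arithmetic concrete, the sketch below works through ep_size = dp_size * tp_size and dp_shard = dp_size for a small configuration. It is illustrative only: the function `ep_layout`, the rank ordering (TP as the innermost dimension), and the contiguous assignment of experts to EP ranks are assumptions, not something this PR specifies.

```python
def ep_layout(dp_size: int, tp_size: int, num_experts: int):
    """Compute EP degree and a per-rank expert assignment.

    Assumptions (not taken from the PR): ranks are laid out row-major with
    TP innermost, and experts are split into contiguous, equal-sized chunks.
    """
    ep_size = dp_size * tp_size  # EP spans every DP and TP rank
    dp_shard = dp_size           # vLLM data_parallel_size mapped to dp_shard
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    experts_per_rank = num_experts // ep_size

    layout = {}
    for rank in range(ep_size):
        dp_rank, tp_rank = divmod(rank, tp_size)
        first = rank * experts_per_rank
        layout[rank] = {
            "dp_rank": dp_rank,
            "tp_rank": tp_rank,
            "experts": list(range(first, first + experts_per_rank)),
        }
    return ep_size, dp_shard, layout


if __name__ == "__main__":
    ep_size, dp_shard, layout = ep_layout(dp_size=2, tp_size=4, num_experts=16)
    print("ep_size:", ep_size, "dp_shard:", dp_shard)
    for rank, info in layout.items():
        print(rank, info)
```

With dp_size=2 and tp_size=4, the EP group covers all 8 ranks, so each rank holds 2 of the 16 experts.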

meta-cla bot added the CLA Signed label Apr 28, 2026
wwwjn changed the title from "Enable DP-to-EP for MoE inference" to "[WIP] Enable DP-to-EP for MoE inference" Apr 28, 2026
wwwjn added a commit that referenced this pull request Apr 28, 2026
ghstack-source-id: 3b4296c
Pull Request resolved: #3143
wwwjn added a commit that referenced this pull request Apr 28, 2026
ghstack-source-id: 3b4296c
Pull Request resolved: #3143
wwwjn added a commit that referenced this pull request Apr 28, 2026
ghstack-source-id: 2228b13
Pull Request resolved: #3143
        # Always keep fsdp mesh with real backend so fully_shard()
        # can apply MixedPrecisionPolicy even at degree 1.
        return True
    if name == "dp_replicate":

Contributor Author

dp_replicate must always exist because we need a 2D mesh to apply DDP via fully_shard.
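A rough sketch of the rule this hunk implies, for illustration only: the fsdp and dp_replicate dimensions are always kept, so a 2D mesh is available even when both degrees are 1. The helper names `needs_real_backend` and `active_mesh_dims` are hypothetical, and the "degree > 1" rule for the remaining dimensions is an assumption, not taken from the diff.

```python
def needs_real_backend(name: str, degree: int) -> bool:
    """Decide whether a mesh dimension is kept with a real backend."""
    if name == "fsdp":
        # Kept even at degree 1 so fully_shard() can apply MixedPrecisionPolicy.
        return True
    if name == "dp_replicate":
        # Kept even at degree 1 so a 2D (dp_replicate, fsdp) mesh always exists.
        return True
    # Assumption: other dimensions only matter when actually parallelised.
    return degree > 1


def active_mesh_dims(degrees: dict[str, int]) -> list[str]:
    """List the dimension names that survive, in the order given."""
    return [name for name, degree in degrees.items() if needs_real_backend(name, degree)]


if __name__ == "__main__":
    print(active_mesh_dims({"dp_replicate": 1, "fsdp": 1, "cp": 1, "tp": 4}))
```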

wwwjn closed this Apr 30, 2026

Labels

ciflow/8gpu, CLA Signed
