Using "virtual padding" to calculate number_local_tokens per SP rank, and fix `combine()` shape mismatch by wwwjn · Pull Request #3577 · pytorch/torchtitan

wwwjn · 2026-06-08T17:57:59Z

What's the problem

[Currently] combine() wrongly assumes that the tokens are evenly sharded across ranks

torchtitan/torchtitan/models/common/token_dispatcher.py

Lines 439 to 456 in c0428bb

    
    out_TD = torch.zeros(
        x_TD.shape[0] * self.sp_size,
        x_TD.shape[-1],
        device=x_TD.device,
        dtype=x_TD.dtype,
    )

    if not self.score_before_experts:
        routed_output_RD = (
            routed_output_RD.to(torch.float32)
            * metadata.topk_scores_experts_sorted_N.reshape(-1, 1)
        ).to(routed_output_RD.dtype)

    # With SP, token indices are 0-based within the local shard.
    # Offset to global positions for the full-size scatter buffer.
    if self.sp_size > 1:
        token_indices_experts_sorted_N = (
            metadata.token_indices_experts_sorted_N + x_TD.shape[0] * self.sp_rank

(This infers the global SPMD shape from inside a local SPMD region.)

If the tokens are unevenly sharded, out_TD will have different shapes across SP ranks.
We should outright reject an input batch whose number of tokens cannot be evenly sharded across SP ranks.
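
To make the mismatch concrete, here is a minimal sketch in plain Python (no torch). It assumes torch.chunk-style sharding, where every rank except possibly the last holds ceil(T / sp_size) tokens. `chunk_sizes` and `naive_combine_layout` are hypothetical helpers written for this illustration, not torchtitan functions. Each rank derives the buffer row count and the scatter offset from its own local shape only, mirroring `x_TD.shape[0] * self.sp_size` and `x_TD.shape[0] * self.sp_rank` in the excerpt above.

```python
def chunk_sizes(total_tokens: int, sp_size: int) -> list[int]:
    """Per-rank token counts under chunk-style sharding (an assumption)."""
    chunk = -(-total_tokens // sp_size)  # ceil division
    return [max(0, min(chunk, total_tokens - rank * chunk)) for rank in range(sp_size)]


def naive_combine_layout(total_tokens: int, sp_size: int) -> list[tuple[int, int]]:
    """(buffer_rows, scatter_offset) that each rank computes from its local shape alone."""
    sizes = chunk_sizes(total_tokens, sp_size)
    return [(n_local * sp_size, n_local * rank) for rank, n_local in enumerate(sizes)]


# Even case: all ranks agree on 8 rows, and the offsets tile the buffer.
even = naive_combine_layout(total_tokens=8, sp_size=4)

# Uneven case: the last rank holds 1 token, so it allocates 4 rows instead of 12
# and uses offset 3 instead of 9.
uneven = naive_combine_layout(total_tokens=10, sp_size=4)

print(even)
print(uneven)
```

In the uneven case the last rank disagrees with the others on both the buffer size and its own offset, which is the shape mismatch described above.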

[Future] The router will use spmd_types soon, and the router runs per SP rank, so each SP rank should see an even shard.
[Future] We want to avoid dispatching/load-balancing the padded tokens. We should be able to do that by adding a metadata field that records the actual number of local tokens on each SP rank.

What does this PR do?

This PR does "virtual padding" and passes metadata around:

Calculate num_local_tokens_after_padding = (T + pad_tokens) // sp_size in the MoE module.
Pass num_local_tokens_after_padding to the GroupedExperts module, then to combine().
combine() returns a tensor with shape (num_local_tokens_after_padding * sp_size, ...).
Slice the combined tensor to (T, ...) in MoE.
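
The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PR's code: pad_tokens is assumed to be the smallest count that makes T divisible by sp_size, and `virtual_padding`, `num_valid_tokens_on_rank`, and `combine_layout` are hypothetical names (the second is only a stand-in for the idea behind local_num_valid_tokens(), whose real implementation is not shown here).

```python
def virtual_padding(total_tokens: int, sp_size: int) -> tuple[int, int]:
    """Return (pad_tokens, num_local_tokens_after_padding). No tensor is padded."""
    pad_tokens = (-total_tokens) % sp_size  # assumed: smallest pad making T divisible
    num_local_tokens_after_padding = (total_tokens + pad_tokens) // sp_size
    return pad_tokens, num_local_tokens_after_padding


def num_valid_tokens_on_rank(total_tokens: int, sp_size: int, sp_rank: int) -> int:
    """Real (non-padded) tokens held by sp_rank; hypothetical helper."""
    _, n_local = virtual_padding(total_tokens, sp_size)
    start = n_local * sp_rank
    return max(0, min(n_local, total_tokens - start))


def combine_layout(total_tokens: int, sp_size: int) -> list[tuple[int, int, int]]:
    """(buffer_rows, scatter_offset, valid_tokens) per rank, derived from shared metadata."""
    _, n_local = virtual_padding(total_tokens, sp_size)
    return [
        (n_local * sp_size, n_local * rank, num_valid_tokens_on_rank(total_tokens, sp_size, rank))
        for rank in range(sp_size)
    ]


T, sp_size = 10, 4
layout = combine_layout(T, sp_size)
# Every rank now agrees on a 12-row buffer; the caller slices it back to T rows.
rows_after_slice = min(layout[0][0], T)
print(layout, rows_after_slice)
```

Because the buffer size and offsets come from num_local_tokens_after_padding rather than from each rank's own x_TD.shape[0], all ranks agree on the layout even when the last shard is short.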

wwwjn · 2026-06-08T17:59:26Z

        self.top_k = config.top_k
        self.score_before_experts = config.score_before_experts
+        # Sequence-parallel split coordinates. EP dispatchers update these in
+        # wire_meshes(); the local dispatcher keeps the TP=1 defaults.


Introducing sp_rank and sp_size in LocalTokenDispatcher is not ideal, but they are used in local_num_valid_tokens(), and I want to share the local_num_valid_tokens() implementation across AllToAllTokenDispatcher and DeepEP/HybridEP.

tianyu-l · 2026-06-09T05:46:22Z

+        # are never routed and are sliced off below.
        out_TD = torch.zeros(
-            x_TD.shape[0] * self.sp_size,
+            num_local_tokens_after_padding * self.sp_size,


My impression is that this "virtual padding" idea would work when GroupedExperts is in a local region inside a DTensor global region, where DTensor handles the shard / all-gather.

After we migrate to spmd_types, the all-gather would require this metadata as well (spmd.redistribute takes this shape arg), with which spmd_types could also achieve a similar "pad / unpad only around collectives" effect. Otherwise we still have to do "real padding" in model code. cc @pianpwk

I'm OK with this temporary solution (after cleanup) to unblock your vLLM + MoE work.

Yes let me clean up.

For spmd_types in the MoE region, we don't need "real padding" in our current setup; we just pass this "virtual padded shape" around as metadata.

torchtitan/torchtitan/models/common/MOE_SHARDING.md

Line 4 in da230bb

[`moe_sharding.py`](moe_sharding.py).

wwwjn · 2026-06-09T21:03:17Z

Closing this because of #3595. Let's move the discussion there.

tianyu-l

sounds right to me, one concrete issue before landing

pytorch-bot Bot added the ciflow/8gpu label Jun 8, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 8, 2026

wwwjn commented Jun 9, 2026


tianyu-l reviewed Jun 9, 2026


pytorch-bot Bot added the ciflow/rl label Jun 9, 2026

wwwjn added 4 commits June 9, 2026 08:49

Add local-shard MoE padding

81ee099

no padding at all

5cdea95

no padding at all

c6845d2

update

7bbe52e

wwwjn force-pushed the moe-padding branch from 7825c77 to 7bbe52e Compare June 9, 2026 15:52

wwwjn added 2 commits June 9, 2026 09:55

clean up

1e59d84

add gpt-oss

9008f5a

wwwjn marked this pull request as ready for review June 9, 2026 18:46

wwwjn requested review from fegin and wconstab as code owners June 9, 2026 18:46

wwwjn changed the title ~~[Do not review] Add local-shard MoE padding~~ Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatc Jun 9, 2026

wwwjn changed the title ~~Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatc~~ Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch Jun 9, 2026

clean up

ee74f69

wwwjn force-pushed the moe-padding branch from 6730268 to ee74f69 Compare June 9, 2026 19:02

wwwjn closed this Jun 9, 2026

tianyu-l reviewed Jun 9, 2026


tianyu-l deleted the moe-padding branch June 9, 2026 21:13

wwwjn wants to merge 7 commits into main from moe-padding
