Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch by wwwjn · Pull Request #3595 · pytorch/torchtitan

wwwjn · 2026-06-09T20:57:06Z

Stack from ghstack (oldest at bottom):

[WIP] Enable DP+EP for MoE inference in vLLM wrapper #3236
[Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper #3142
-> Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch #3595

What's the problem

[Currently] combine() wrongly assumes the tokens are evenly sharded across ranks

torchtitan/torchtitan/models/common/token_dispatcher.py

Lines 439 to 456 in c0428bb

    
    out_TD = torch.zeros(
        x_TD.shape[0] * self.sp_size,
        x_TD.shape[-1],
        device=x_TD.device,
        dtype=x_TD.dtype,
    )

    if not self.score_before_experts:
        routed_output_RD = (
            routed_output_RD.to(torch.float32)
            * metadata.topk_scores_experts_sorted_N.reshape(-1, 1)
        ).to(routed_output_RD.dtype)

    # With SP, token indices are 0-based within the local shard.
    # Offset to global positions for the full-size scatter buffer.
    if self.sp_size > 1:
        token_indices_experts_sorted_N = (
            metadata.token_indices_experts_sorted_N + x_TD.shape[0] * self.sp_rank

(infer global SPMD in local SPMD region)

If the tokens are sharded unevenly, out_TD will have different shapes across SP ranks.
We should reject the input outright if the number of tokens in the input batch cannot be evenly sharded across SP ranks.

[Future] The router will use spmd_types soon, and the router runs per SP rank, so each SP rank should see an even shard.
[Future] We want to avoid dispatching/load-balancing the padded tokens. We should be able to do that by adding a metadata field that records the actual number of local tokens on each SP rank.
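To make the mismatch concrete, here is a small pure-Python sketch of the two size computations in the snippet above. It assumes tokens are split into contiguous chunks of ceil(T / sp_size) per rank; that chunking scheme and all function names are hypothetical, chosen only for illustration, and may not match how torchtitan actually shards tokens.

```python
import math


def local_token_counts(total_tokens: int, sp_size: int) -> list[int]:
    """Hypothetical contiguous chunking: each rank takes ceil(T / sp_size)
    tokens until the tokens run out."""
    chunk = math.ceil(total_tokens / sp_size)
    counts, remaining = [], total_tokens
    for _ in range(sp_size):
        n = min(chunk, remaining)
        counts.append(n)
        remaining -= n
    return counts


def buffer_rows_per_rank(total_tokens: int, sp_size: int) -> list[int]:
    # Mirrors the buffer sizing above: x_TD.shape[0] * self.sp_size
    return [n * sp_size for n in local_token_counts(total_tokens, sp_size)]


def scatter_offsets_per_rank(total_tokens: int, sp_size: int) -> list[int]:
    # Mirrors the index offset above: x_TD.shape[0] * self.sp_rank
    counts = local_token_counts(total_tokens, sp_size)
    return [n * rank for rank, n in enumerate(counts)]
```

Under this assumed chunking, 10 tokens on 4 ranks give local counts 3, 3, 3, 1, so the ranks would allocate buffers of 12, 12, 12, and 4 rows, and the last rank would offset its indices by 3 instead of 9.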

What does this PR do?

This PR applies "virtual padding" and passes the resulting metadata along:

Calculate num_local_tokens_after_padding = (T + pad_tokens) // sp_size in the MoE module.
Pass num_local_tokens_after_padding to the GroupedExperts module, then to combine().
combine() returns a tensor with shape (num_local_tokens_after_padding * sp_size, .... ).
Slice the combined tensor to (T, ...) in MoE.
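The steps above can be sketched in pure Python. This is a single-process simulation under stated assumptions, not the PR's implementation: pad_tokens is taken to be the smallest count that makes T + pad_tokens divisible by sp_size, each token's output is a single float rather than a row of a tensor, and the helper names are hypothetical.

```python
def virtual_padding(total_tokens: int, sp_size: int) -> tuple[int, int]:
    """Return (pad_tokens, num_local_tokens_after_padding).

    Assumption: pad_tokens is the smallest value that makes
    total_tokens + pad_tokens divisible by sp_size.
    """
    pad_tokens = (-total_tokens) % sp_size
    num_local_tokens_after_padding = (total_tokens + pad_tokens) // sp_size
    return pad_tokens, num_local_tokens_after_padding


def combine(
    local_outputs_per_rank: list[list[float]],
    num_local_tokens_after_padding: int,
    sp_size: int,
) -> list[float]:
    """Scatter every rank's local outputs into one uniformly sized buffer.

    The buffer size and the per-rank offset both use the padded local token
    count, so they no longer depend on how many real tokens a rank holds.
    """
    out = [0.0] * (num_local_tokens_after_padding * sp_size)
    for sp_rank, rows in enumerate(local_outputs_per_rank):
        offset = num_local_tokens_after_padding * sp_rank
        for local_idx, value in enumerate(rows):
            out[offset + local_idx] += value
    return out


def moe_forward_tail(local_outputs_per_rank, total_tokens, sp_size):
    """Combine, then slice the padded buffer back to the real T tokens."""
    _, n_local = virtual_padding(total_tokens, sp_size)
    combined = combine(local_outputs_per_rank, n_local, sp_size)
    return combined[:total_tokens]
```

The padding is "virtual" in this sketch because no pad rows are ever materialized on a rank; only the sizes and offsets are computed as if they were there.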

[ghstack-poisoned]

tianyu-l

sounds right to me, one concrete issue before landing

tianyu-l · 2026-06-09T21:04:24Z

+            trainer_parallelism = self.trainer.parallelism
+            sp_degree = trainer_parallelism.tensor_parallel_degree
+            # RL policy inputs are shaped by BatchConfig, not TrainingConfig.
+            seq_len = self.batcher.batch.seq_len


checking these in post_init is not safe, because CLI can override -- we have to do these checks in update_from_config

I see, this is valid. Even though TrainingConfig is not used in PolicyTrainer today, a user could still mistakenly override it via the CLI flag --trainer.training.seq_len.

Then the problem becomes passing BatchConfig from the controller into PolicyTrainer's TrainingConfig(), then calling self.model.update_from_config(config). So my updated plan is:

Remove the check in post_init.

Set PolicyTrainer.TrainingConfig.seq_len to BatchConfig.seq_len after parsing the CLI overrides here.

Then we only need to check in Decoder.update_config().

Set PolicyTrainer.TrainingConfig.seq_len to BatchConfig.seq_len after parsing the CLI overrides here.

As discussed earlier, PolicyTrainer.TrainingConfig shouldn't have seq_len. In fact, for pretraining, we should have BatchConfig in Dataloader.Config, not in Trainer.TrainingConfig.

I see, that's the right direction. Let me add a TODO and also a GitHub issue to track this.
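The concern in this thread can be illustrated with a minimal sketch. It assumes CLI overrides are applied by mutating an already constructed config object, which is why a check in `__post_init__` can be bypassed. The class, field, and function names below are hypothetical and are not torchtitan's actual config classes.

```python
from dataclasses import dataclass


def check_seq_len_divisible(seq_len: int, sp_degree: int) -> None:
    """Raise if the sequence cannot be evenly sharded across SP ranks."""
    if seq_len % sp_degree != 0:
        raise ValueError(
            f"seq_len {seq_len} must be divisible by SP degree {sp_degree}."
        )


@dataclass
class EagerlyCheckedConfig:  # hypothetical config, for illustration only
    seq_len: int = 2048
    sp_degree: int = 4

    def __post_init__(self) -> None:
        # Runs once at construction, before any override is applied.
        check_seq_len_divisible(self.seq_len, self.sp_degree)


def apply_cli_overrides(config, overrides: dict) -> None:
    """Assumed override mechanism: set attributes on the existing object."""
    for name, value in overrides.items():
        setattr(config, name, value)


def validate_after_overrides(config) -> None:
    """Late check, playing the role of a check in update_from_config."""
    check_seq_len_divisible(config.seq_len, config.sp_degree)
```

In this sketch, constructing the config with defaults passes the early check, an override to seq_len=2050 goes through unnoticed, and only the late validation raises.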

pianpwk · 2026-06-09T21:15:43Z

sorry, I saw this in the closed PR:

After we migrate to spmd_types, the all-gather would require this metadata as well (spmd.redistribute takes this shape arg), with which spmd_types could also achieve similar "pad / unpad only around collectives" effect.

what's the shape arg mentioned here?

wwwjn · 2026-06-09T21:25:08Z

sorry, I saw this in the closed PR:

After we migrate to spmd_types, the all-gather would require this metadata as well (spmd.redistribute takes this shape arg), with which spmd_types could also achieve similar "pad / unpad only around collectives" effect.

what's the shape arg mentioned here?

num_local_tokens_after_padding = (T + pad_tokens) // sp_size

This parameter, num_local_tokens_after_padding. We pass it to combine() to do the virtual padding.

pianpwk · 2026-06-09T21:27:48Z

This parameter, num_local_tokens_after_padding. We pass it to combine() to do the virtual padding.

oh my question was, redistribute() takes this?

[ghstack-poisoned]

tianyu-l

nice! some nit comments

tianyu-l · 2026-06-10T06:25:37Z

                        f"length {max_seq_len}."
                    )

+


accidental?

[ghstack-poisoned]

wwwjn · 2026-06-10T16:05:37Z

The CPU test failure might be caused by the spmd_types version and seems unrelated to this PR.
I reran the test locally:

torch 2.13.0.dev20260609+cpu
spmd_types==0.2.1

Command result:

CUDA_VISIBLE_DEVICES= titan-rl/bin/python -m pytest tests/unit_tests/
Result: 1 passed in 10.50s.

… and fix combine() shape mismatch #3595 (#3619) #3595 is merged to wrong base, replay that PR

Update

45008be

[ghstack-poisoned]

wwwjn requested review from fegin, tianyu-l and wconstab as code owners June 9, 2026 20:57

pytorch-bot Bot added ciflow/8gpu ciflow/rl labels Jun 9, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 9, 2026

This was referenced Jun 9, 2026

[Not ready][rl] Enable TP2EP for unified MoE model in vLLM wrapper #3142

Open

[WIP] Enable DP+EP for MoE inference in vLLM wrapper #3236

Open

Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch #3577

Closed

wwwjn changed the title ~~Apply moe-padding changes~~ Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch Jun 9, 2026

tianyu-l reviewed Jun 9, 2026

tianyu-l mentioned this pull request Jun 10, 2026

[qwen3_5] evolve qwen3_vl to qwen3_5 #3371

Merged

wwwjn added 2 commits June 9, 2026 20:35

Update

74d4f67

[ghstack-poisoned]

Update

2cb859d

[ghstack-poisoned]

tianyu-l approved these changes Jun 10, 2026

Update

354ff36

[ghstack-poisoned]

wwwjn changed the base branch from gh/wwwjn/22/base to main June 10, 2026 14:52

Update

3325581

[ghstack-poisoned]

wwwjn changed the base branch from main to gh/wwwjn/22/base June 10, 2026 14:57

Update

d8a57bb

[ghstack-poisoned]

wwwjn merged commit bbd66c6 into gh/wwwjn/22/base Jun 10, 2026
10 of 11 checks passed

wwwjn mentioned this pull request Jun 10, 2026

Using "virtual padding" to calculate number_local_tokens per SP rank, and fix combine() shape mismatch #3595 #3619

Merged

wwwjn added a commit that referenced this pull request Jun 10, 2026

Using "virtual padding" to calculate number_local_tokens per SP rank,…

db0a723

… and fix combine() shape mismatch #3595 (#3619) #3595 is merged to wrong base, replay that PR
