Add communication-free Muon for FlexShard by weifengpy · Pull Request #3502 · pytorch/torchtitan

weifengpy · 2026-06-03T20:56:21Z

Stack from ghstack (oldest at bottom):

Place each Muon-eligible 2D matrix whole on one rank via the Owned placement, so Newton-Schulz runs locally on the owner after the backward reduce-to-owner: there is no collective in optimizer.step(), and the result is bit-exact with single-device Muon. Layers are balanced across ranks with greedy LPT (longest-processing-time-first); embeddings, LM head, and final norm stay Shard(0) + AdamW. Owned now composes with reshard_after_forward=True (broadcast ops are tagged for activation-checkpoint recompute).
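For readers unfamiliar with greedy LPT, here is a minimal sketch of the balancing idea. It is not the PR's assign_layer_owners_lpt: the function name `lpt_assign`, the dict-of-costs input, and the use of parameter count as the cost are all assumptions made for illustration.

```python
import heapq


def lpt_assign(layer_costs, world_size):
    """Greedy LPT: visit layers from most to least expensive and give each
    one to the rank with the smallest load so far.

    layer_costs: dict mapping a (hypothetical) layer name to a cost,
                 e.g. the number of Muon-eligible parameters in the layer.
    Returns a dict mapping layer name -> owning rank.
    """
    # Min-heap of (current_load, rank); ties resolve to the lowest rank.
    heap = [(0, rank) for rank in range(world_size)]
    heapq.heapify(heap)
    owners = {}
    # Sort by descending cost; break ties by name so the result is deterministic.
    for name, cost in sorted(layer_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        owners[name] = rank
        heapq.heappush(heap, (load + cost, rank))
    return owners


# Illustrative costs only; real values would come from the model.
example_costs = {"layers.0": 8, "layers.1": 7, "layers.2": 6, "layers.3": 5, "layers.4": 4}
example_owners = lpt_assign(example_costs, world_size=2)
```

Because every rank runs the same deterministic assignment, all ranks agree on the owner of each layer without exchanging any messages.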

Adds example/muon.py (comm_free_muon_buckets, build_muon_param_groups, build_comm_free_muon_optimizers, CombinedOptimizer) and example/owned.py helpers (make_owned_placement_fn, assign_layer_owners_lpt).

Also adds GroupedMuon: batched Newton-Schulz over the leading dim of stacked weight matrices (>=3D, e.g. MoE grouped experts). build_comm_free_muon_optimizers routes 2D params to torch.optim.Muon and >=3D stacks to GroupedMuon; GroupedMuon matches running torch.optim.Muon on each 2D sub-matrix.
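The equivalence between a batched Newton-Schulz pass and a per-matrix pass can be sketched in NumPy as follows. This is not the PR's GroupedMuon code: the function name `newton_schulz`, the use of NumPy in float64, and the quintic coefficients and step count (taken from the commonly published Muon recipe) are assumptions.

```python
import numpy as np

# Quintic iteration coefficients from the commonly published Muon recipe
# (assumed here; the PR may configure them differently).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize the trailing 2D matrices of `g`.

    Accepts a single (m, n) matrix or a stack (..., m, n); every leading
    dimension is treated as a batch of independent matrices.
    """
    a, b, c = NS_COEFFS
    x = np.asarray(g, dtype=np.float64)
    # Work on the wide orientation so the Gram matrix is the smaller one.
    transposed = x.shape[-2] > x.shape[-1]
    if transposed:
        x = np.swapaxes(x, -1, -2)
    # Normalize each matrix by its own Frobenius norm.
    norm = np.linalg.norm(x, axis=(-2, -1), keepdims=True)
    x = x / (norm + eps)
    for _ in range(steps):
        gram = x @ np.swapaxes(x, -1, -2)
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    if transposed:
        x = np.swapaxes(x, -1, -2)
    return x


# A stack of 4 "expert" matrices, processed in one batched call.
rng = np.random.default_rng(0)
stack = rng.standard_normal((4, 6, 3))
batched = newton_schulz(stack)
per_matrix = np.stack([newton_schulz(m) for m in stack])
```

Since normalization and the matrix products act only on the last two dimensions, `batched` and `per_matrix` agree up to floating-point rounding, which is the property the description claims for GroupedMuon relative to torch.optim.Muon on each 2D sub-matrix.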

Tests: python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_muon.py

Update

a4bdca4

pytorch-bot Bot added the ciflow/8gpu label Jun 3, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 3, 2026

This was referenced Jun 3, 2026

Introduce FlexShard for flexible bucketed parameter sharding #3239

Draft

Make FlexShard traceable by torch.compile #3317

Draft

Add grouped RaggedShard bucket layout #3407

Closed

Update

d59ad0c

Update

6e7bea8

Update

4738a6a

weifengpy marked this pull request as draft June 4, 2026 03:37

Update

c748b26

Update

d8946d9

Update

5c3d943

Update

7d58193

weifengpy mentioned this pull request Jun 4, 2026

[flex_shard] Plan: fp8 all-gather on GroupedRaggedShard (block-wise) #3537

Draft

weifengpy added 4 commits June 8, 2026 13:38

Update

220054b

Update

2042642

Update

9e0d9b5

Update

1c0931c

Update

025ee37

weifengpy mentioned this pull request Jun 10, 2026

[flex_shard] Add DeepSeek V3 eager training entry point #3603

Draft

weifengpy wants to merge 13 commits into gh/weifengpy/30/base from gh/weifengpy/30/head