Introduce FlexShard for flexible bucketed parameter sharding by weifengpy · Pull Request #3239 · pytorch/torchtitan

weifengpy · 2026-05-06T07:25:46Z

Stack from ghstack (oldest at bottom):

Introduce experimental FlexShard, a placement-driven API for flexible bucketed parameter sharding. FlexShard lets users group parameters into explicit communication buckets and assign each parameter a pluggable placement implementation. The initial runtime is eager-only.
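
To make the bucketing idea concrete, here is a minimal, library-free sketch of grouping parameter names into explicit buckets, each tagged with a placement label. It is not the FlexShard API: `ToyBucket`, `assign_to_buckets`, the glob patterns, and the parameter names are all hypothetical, and the placement is just a string borrowed from the placement names listed in this PR.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class ToyBucket:
    """Hypothetical stand-in for a bucket spec: glob patterns plus a placement label."""
    patterns: list[str]
    placement: str  # label only; real placements are objects, not strings
    params: list[str] = field(default_factory=list)


def assign_to_buckets(param_names: list[str], buckets: list[ToyBucket]) -> list[ToyBucket]:
    """Put every parameter into exactly one bucket, or fail loudly."""
    for name in param_names:
        matches = [b for b in buckets if any(fnmatch(name, p) for p in b.patterns)]
        if len(matches) != 1:
            raise ValueError(f"{name!r} matched {len(matches)} buckets; expected exactly 1")
        matches[0].params.append(name)
    return buckets


# Hypothetical parameter names, loosely shaped like a transformer block.
names = ["layers.0.attn.wq.weight", "layers.0.mlp.w1.weight", "norm.weight"]
buckets = assign_to_buckets(
    names,
    [
        ToyBucket(["layers.0.attn.*"], "Shard"),
        ToyBucket(["layers.0.mlp.*"], "RaggedShard"),
        ToyBucket(["norm.*"], "Owned"),
    ],
)
for b in buckets:
    print(b.placement, b.params)
```

Requiring exactly one match per parameter is a design choice of this sketch: it surfaces both unassigned parameters and overlapping patterns early, before any communication is set up.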

flex_shard() API with explicit BucketSpecs
Per-bucket flat byte storage for sharded parameters
Pluggable Placement contract for local layout, unshard collectives, and gradient reduction
Shard, Owned, and RaggedShard example placements to prove the placement contract
RaggedShard supports uneven flattened-prefix sharding, bucket unshard/reduce-grad, and focused CPU/CUDA runtime coverage
Eager parameter accessors backed by batched bucket unshard hooks
Per-bucket MixedPrecisionPolicy
reshard-after-forward path using selective activation checkpointing
Runtime support is eager-only; graph capture and torch.compile will be explored in a follow-up PR
Each bucket currently requires a uniform parameter dtype and placement tuple
Mixed precision is supported per bucket
CPU offload is represented in the API but rejected until implemented
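
Two items in the list above, per-bucket flat byte storage and uneven flattened-prefix sharding, come down to offset arithmetic. The sketch below shows one plausible reading in plain Python, without PyTorch. It does not reproduce FlexShard's actual layout: `ParamSlot`, `pack_bucket`, `ragged_ranges`, and the `units` weighting are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class ParamSlot:
    """Byte range of one parameter inside a bucket's flat buffer (hypothetical)."""
    name: str
    offset: int  # byte offset into the flat buffer
    nbytes: int


def pack_bucket(numels: dict[str, int], itemsize: int) -> tuple[list[ParamSlot], int]:
    """Lay parameters out back to back; return the slots and the total byte count.

    A single itemsize is used because each bucket is assumed to hold one dtype.
    """
    slots, offset = [], 0
    for name, numel in numels.items():
        nbytes = numel * itemsize
        slots.append(ParamSlot(name, offset, nbytes))
        offset += nbytes
    return slots, offset


def ragged_ranges(numel: int, units: list[int]) -> list[tuple[int, int]]:
    """Split a flattened tensor into contiguous [start, end) ranges, one per rank.

    Range sizes are proportional to `units` (an assumed per-rank weight).
    Boundaries are computed from cumulative units, so the ranges always
    cover exactly `numel` elements with no gaps or overlap.
    """
    total = sum(units)
    if total <= 0 or any(u < 0 for u in units):
        raise ValueError("units must be non-negative with a positive sum")
    bounds = [numel * c // total for c in [0, *accumulate(units)]]
    return list(zip(bounds[:-1], bounds[1:]))


slots, total_bytes = pack_bucket({"w": 6, "b": 3}, itemsize=2)
ranges = ragged_ranges(10, units=[1, 2, 1])
print(slots, total_bytes)
print(ranges)
```

Computing boundaries from cumulative weights, rather than rounding each rank's size independently, avoids having to redistribute a remainder afterwards.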

Test Plan:

python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py

Update

842452c

weifengpy requested review from SherlockNoMad, aditvenk, fegin, sanketpurandare, tianyu-l, wconstab, wwwjn, xmfan and yiming0416 as code owners May 6, 2026 07:25

pytorch-bot Bot added the ciflow/8gpu label May 6, 2026

meta-cla Bot added the CLA Signed label May 6, 2026

weifengpy marked this pull request as draft May 6, 2026 07:31

Update

f8b76c9

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

c7cd826

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: 9bf864c Pull-Request: #3239

Update

84c76a6

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

8d9d0eb

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: ac840c7 Pull-Request: #3239

Update

b56a0f5

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

db2de26

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: 2a4fe9e Pull-Request: #3239

Update

a67cf9c

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

1220bae

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: 06c050b Pull-Request: #3239

Update

30e5a14

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

49b3636

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: 64a0c16 Pull-Request: #3239

Update

3eb6581

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

27de408

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: f06dbc6 Pull-Request: #3239

Update

d2c90bf

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

7d9caca

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: c5a426a Pull-Request: #3239

Update

d12a741

weifengpy added a commit that referenced this pull request May 6, 2026

[WIP] FlexShard

4d47ba6

Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission. ghstack-source-id: 19640c3 Pull-Request: #3239

Update

9a7e53a

Update

8bfe2d2

This was referenced May 14, 2026

Support grouped FlexShard reshard-after-forward buckets #3348

Draft

Scope FlexShard recompute policy by provenance #3349

Draft

weifengpy added 2 commits May 14, 2026 01:41

Update

74d4524

Update

1afc642

weifengpy mentioned this pull request May 14, 2026

Add Owned placement and placement-owned collectives #3359

Closed

weifengpy added 13 commits May 14, 2026 14:02

Update

7e05cce

Update

7622672

Update

1d9fff2

Update

1b4842f

Update

d997f98

Update

46af13f

Update

d3494af

Update

d545c48

Update

7b3e5d9

Update

ae074af

Update

f89cfcf

Update

53e410a

Update

40836c0

weifengpy mentioned this pull request May 19, 2026

Add FlexShard RaggedShard example #3402

Draft

Update

d5eea28

weifengpy mentioned this pull request May 20, 2026

Add grouped RaggedShard bucket layout #3407

Closed

anshul-si mentioned this pull request May 21, 2026

[flex_shard][flatshard] #3417

Open

This was referenced Jun 3, 2026

Add communication-free Muon for FlexShard #3502

Draft

[flex_shard] Plan: fp8 all-gather on GroupedRaggedShard (block-wise) #3537

Draft

weifengpy added 4 commits June 8, 2026 12:30

Update

c23fd5c

Update

afc129e

Update

0c7218b

Update

97c5c97

weifengpy mentioned this pull request Jun 10, 2026

[flex_shard] Add DeepSeek V3 eager training entry point #3603

Draft

weifengpy wants to merge 106 commits into gh/weifengpy/1/base from gh/weifengpy/1/head

5 participants
