Introduce FlexShard for flexible bucketed parameter sharding #3239

Draft
weifengpy wants to merge 106 commits into gh/weifengpy/1/base from gh/weifengpy/1/head

weifengpy (Contributor) commented May 6, 2026

Stack from ghstack (oldest at bottom):

Introduce experimental FlexShard, a placement-driven API for flexible bucketed parameter sharding. FlexShard lets users group parameters into explicit communication buckets and assign each parameter a pluggable placement implementation. The initial runtime is eager-only.
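
For readers new to the idea, here is a minimal single-process sketch of what a pluggable placement could look like. It is an illustration only, not the FlexShard code: `PlacementSketch`, `EvenShardSketch`, and their method names are hypothetical, plain Python lists stand in for tensors, and the collectives are simulated by list operations (concatenation for all-gather, sum-then-slice for reduce-scatter).

```python
from abc import ABC, abstractmethod


class PlacementSketch(ABC):
    """Hypothetical stand-in for a placement; not the FlexShard class."""

    @abstractmethod
    def local_shard(self, full, rank, world_size):
        """Return the slice of `full` that `rank` stores."""

    @abstractmethod
    def unshard(self, shards):
        """Rebuild the full parameter from every rank's local shard."""

    @abstractmethod
    def reduce_grad(self, grads_per_rank, rank):
        """Return the reduced gradient slice that `rank` keeps."""


class EvenShardSketch(PlacementSketch):
    """Splits a flat list into contiguous chunks of ceil(n / world_size)."""

    def _bounds(self, n, rank, world_size):
        chunk = -(-n // world_size)  # ceiling division
        start = min(rank * chunk, n)
        return start, min(start + chunk, n)

    def local_shard(self, full, rank, world_size):
        start, end = self._bounds(len(full), rank, world_size)
        return full[start:end]

    def unshard(self, shards):
        # Simulates an all-gather: concatenate shards in rank order.
        return [x for shard in shards for x in shard]

    def reduce_grad(self, grads_per_rank, rank):
        # Simulates a reduce-scatter: sum across ranks, keep this rank's slice.
        world_size = len(grads_per_rank)
        summed = [sum(vals) for vals in zip(*grads_per_rank)]
        return self.local_shard(summed, rank, world_size)


placement = EvenShardSketch()
param = [1.0, 2.0, 3.0, 4.0, 5.0]
shards = [placement.local_shard(param, r, 2) for r in range(2)]
```

A different placement would implement the same three methods with its own layout rule, which is the sense in which the placement is pluggable. The sketch sums gradients; whether the real runtime sums or averages is not stated here.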

  • flex_shard() API with explicit BucketSpecs

  • Per-bucket flat byte storage for sharded parameters

  • Pluggable Placement contract for local layout, unshard collectives, and gradient reduction

  • Shard, Owned, and RaggedShard example placements to prove the placement contract

  • RaggedShard supports uneven flattened-prefix sharding, bucket unshard/reduce-grad, and focused CPU/CUDA runtime coverage

  • Eager parameter accessors backed by batched bucket unshard hooks

  • Per-bucket MixedPrecisionPolicy

  • reshard-after-forward path using selective activation checkpointing

  • Runtime support is eager-only; graph capture and torch.compile will be explored in the next PR

  • Each bucket currently requires a uniform parameter dtype and placement tuple

  • Mixed precision is supported per bucket

  • CPU offload is represented in the API but rejected until implemented
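
To make the per-bucket flat byte storage and uneven sharding bullets concrete, here is a rough sketch of how byte offsets could be computed when each rank holds an uneven, contiguous range of every flattened parameter. This is one plausible reading, not the FlexShard implementation: `ParamInfo`, `ragged_split`, `bucket_layout`, and the per-rank `units` weights are all hypothetical names, and alignment padding is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamInfo:
    """Hypothetical description of one parameter in a bucket."""
    name: str
    numel: int     # number of elements in the flattened parameter
    itemsize: int  # bytes per element, e.g. 2 for bfloat16


def ragged_split(numel, units):
    """Split `numel` elements into contiguous per-rank ranges.

    `units[r]` is the relative share of rank r, so ranges may be uneven.
    Returns one (start, end) element range per rank.
    """
    total = sum(units)
    bounds = [numel * sum(units[:r]) // total for r in range(len(units) + 1)]
    return [(bounds[r], bounds[r + 1]) for r in range(len(units))]


def bucket_layout(params, units, rank):
    """Byte range of each parameter's local shard in `rank`'s flat buffer.

    Local shards are packed back to back, so one byte buffer per bucket
    can back every sharded parameter in that bucket.
    """
    layout = {}
    offset = 0
    for p in params:
        start, end = ragged_split(p.numel, units)[rank]
        nbytes = (end - start) * p.itemsize
        layout[p.name] = (offset, offset + nbytes)
        offset += nbytes
    return layout, offset


# Two ranks, with rank 0 holding three quarters of every parameter.
params = [ParamInfo("w", 10, 2), ParamInfo("b", 4, 2)]
layout0, nbytes0 = bucket_layout(params, [3, 1], rank=0)
layout1, nbytes1 = bucket_layout(params, [3, 1], rank=1)
```

Both parameters use the same `itemsize`, matching the uniform-dtype-per-bucket restriction listed above. The two ranks' buffer sizes add up to the bucket's total byte size, so no element is stored twice.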

Test Plan:

  • python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py

[ghstack-poisoned]
meta-cla bot added the CLA Signed label May 6, 2026
@weifengpy weifengpy marked this pull request as draft May 6, 2026 07:31
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: 9bf864c
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: ac840c7
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: 2a4fe9e
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: 06c050b
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: 64a0c16
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: f06dbc6
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: c5a426a
Pull-Request: #3239
weifengpy added a commit that referenced this pull request May 6, 2026
Port the current aggregate FlexShard change from the existing WIP branch onto main as a single commit for ghstack submission.

ghstack-source-id: 19640c3
Pull-Request: #3239
weifengpy added 2 commits May 14, 2026 01:41
weifengpy added 13 commits May 14, 2026 14:02
weifengpy added 4 commits June 8, 2026 12:30

Labels

ciflow/8gpu, CLA Signed

5 participants