Add FlexShard RaggedShard example #3402

Draft
weifengpy wants to merge 4 commits into gh/weifengpy/28/base from gh/weifengpy/28/head

weifengpy (Contributor) commented May 19, 2026

Stack from ghstack (oldest at bottom):

Summary:

  • implement a RaggedShard placement with ragged bucket unshard and grad reduction
  • export RaggedShard placement helpers and add focused coverage
  • add 2-rank CUDA runtime coverage and mixed RaggedShard bucket validation
  • keep local planning docs and repro scripts out of the committed change
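To illustrate the first bullet, here is a minimal single-process sketch of ragged sharding, not the code in this PR. It assumes "ragged" means each rank owns a slice of a flat buffer with its own length, that unshard behaves like an uneven all-gather, and that grad reduction behaves like an uneven reduce-scatter using a sum. The names `ragged_split`, `ragged_unshard`, and `ragged_reduce_grads` are hypothetical and ranks are simulated with plain Python lists; the actual placement presumably runs collectives over `torch.distributed`.

```python
from typing import List, Sequence


def ragged_split(flat: Sequence[float], sizes: Sequence[int]) -> List[List[float]]:
    """Split a flat buffer into per-rank shards of unequal (ragged) length."""
    if any(size < 0 for size in sizes):
        raise ValueError("shard sizes must be non-negative")
    if sum(sizes) != len(flat):
        raise ValueError("shard sizes must sum to the buffer length")
    shards: List[List[float]] = []
    offset = 0
    for size in sizes:
        shards.append(list(flat[offset:offset + size]))
        offset += size
    return shards


def ragged_unshard(shards: Sequence[Sequence[float]]) -> List[float]:
    """Simulated uneven all-gather: concatenate every rank's shard in rank order."""
    return [value for shard in shards for value in shard]


def ragged_reduce_grads(
    per_rank_full_grads: Sequence[Sequence[float]], sizes: Sequence[int]
) -> List[List[float]]:
    """Simulated uneven reduce-scatter (assumption: reduction is a sum).

    Every rank holds a gradient for the full buffer. The gradients are summed
    elementwise, then each rank keeps only its own ragged slice.
    """
    lengths = {len(grad) for grad in per_rank_full_grads}
    if len(lengths) != 1:
        raise ValueError("every rank must hold a full-size gradient")
    summed = [sum(values) for values in zip(*per_rank_full_grads)]
    return ragged_split(summed, sizes)


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0, 5.0]
    sizes = [3, 2]  # rank 0 owns 3 elements, rank 1 owns 2

    shards = ragged_split(params, sizes)
    print(shards)                  # [[1.0, 2.0, 3.0], [4.0, 5.0]]
    print(ragged_unshard(shards))  # [1.0, 2.0, 3.0, 4.0, 5.0]

    grads = [[1.0] * 5, [2.0] * 5]  # one full-size gradient per rank
    print(ragged_reduce_grads(grads, sizes))  # [[3.0, 3.0, 3.0], [3.0, 3.0]]
```

The property worth testing in a sketch like this is the round trip: unsharding the shards recovers the original buffer, and the reduced gradient shards have the same ragged sizes as the parameter shards.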

Test Plan:

  • python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
  • python -m ruff check torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
  • pre-commit run ufmt --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
  • pre-commit run flake8 --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
  • pre-commit run pydoclint --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
  • pre-commit run codespell --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
  • pre-commit run --all-files (fails: local Pyrefly config cannot find torch search path; lychee reports unrelated external link failures)

meta-cla bot added the CLA Signed label May 19, 2026
weifengpy added a commit that referenced this pull request May 19, 2026
Summary:
- implement a RaggedShard placement with ragged bucket unshard and grad reduction
- export RaggedShard placement helpers and add focused coverage
- keep local planning docs out of the committed change

Test Plan:
- python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run --all-files (fails: local Pyrefly config cannot find torch search path; lychee reports unrelated external link failures)

ghstack-source-id: 5d1d9c3
Pull-Request: #3402
weifengpy added a commit that referenced this pull request May 19, 2026
Summary:
- implement a RaggedShard placement with ragged bucket unshard and grad reduction
- export RaggedShard placement helpers and add focused coverage
- keep local planning docs out of the committed change

Test Plan:
- python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run --all-files (fails: local Pyrefly config cannot find torch search path; lychee reports unrelated external link failures)

ghstack-source-id: b2554bb
Pull-Request: #3402
weifengpy added a commit that referenced this pull request May 19, 2026
Summary:
- implement a RaggedShard placement with ragged bucket unshard and grad reduction
- export RaggedShard placement helpers and add focused coverage
- add 2-rank CUDA runtime coverage and mixed RaggedShard bucket validation
- keep local planning docs and repro scripts out of the committed change

Test Plan:
- python -m pytest -q torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- python -m ruff check torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run ufmt --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run flake8 --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run pydoclint --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run codespell --files torchtitan/experiments/flex_shard/tests/test_flex_shard_ragged_shard.py
- pre-commit run --all-files (fails: local Pyrefly config cannot find torch search path; lychee reports unrelated external link failures)

ghstack-source-id: 1cfa793
Pull-Request: #3402

Labels

ciflow/8gpu, CLA Signed
