[graph_trainer] Match Eager FSDP bucket order by SherlockNoMad · Pull Request #3611 · pytorch/torchtitan

SherlockNoMad · 2026-06-10T16:10:53Z

GT Regional packed FSDP collective buckets in graph execution order, whereas
Eager FSDP2 packs bucket payloads in managed-parameter registration order.
For reduce-scatter, changing the payload order changes the byte offsets inside
NCCL buckets. On multinode FSDP groups, NCCL may use a different ring order per
channel, so moving parameter elements between offsets can change the bf16
accumulation order. Because floating-point addition is not associative, a
different accumulation order can produce different reduced values.
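
To make the accumulation-order point concrete, here is a minimal sketch in pure Python. It does not use NCCL or PyTorch. The helper names `to_bf16` and `bf16_sum` are hypothetical and only emulate bf16 rounding (round to nearest even, finite values only) to show that the same three addends can reduce to different results depending on order.

```python
import struct
from functools import reduce


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bf16 value (ties to even).

    Illustrative emulation only: NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to bf16 after every addition."""
    return reduce(lambda acc, v: to_bf16(acc + v), values, 0.0)


# Same addends, two accumulation orders.
large_first = bf16_sum([256.0, 1.0, 1.0])
small_first = bf16_sum([1.0, 1.0, 256.0])

print(large_first, small_first)
```

At magnitude 256 the spacing between adjacent bf16 values is 2, so adding 1.0 to 256.0 rounds back to 256.0, while adding the two small values first yields 2.0, which then survives the final addition.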

Derive module order from traced state FQNs, which preserve FSDP2's first-seen
parameter order, and use it to order every FSDP bucket group before delegating
to the upstream manual overlap bucketer. This keeps the Torchtitan-specific
logic limited to choosing bucket payload order; upstream still owns merging,
insertion, wait remapping, and bucketed-node tagging.

This also updates the existing _StridedShard call in models/utils.py to pass
split_factor by keyword. In CI, pyrefly removes the stale suppression there and
then flags the second positional argument; the keyword form matches the typed
usage elsewhere in GraphTrainer.

Authored by Codex.

Test Plan:

python -m py_compile \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py

git diff --check origin/main...HEAD -- \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py

black --check \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py

flake8 --config=.flake8 \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py

PYTHONPATH=/data/users/ivankobzarev/f/torchtitan-orderfix-main \
  python torchtitan/experiments/graph_trainer/tests/test_passes.py \
    -k TestBucketingPrefetchOrder

lintrunner -a

The lintrunner run failed because this Torchtitan checkout has no
.lintrunner.toml. pre-commit, ufmt, and pyrefly are also not installed in the
local Python environment.

MAST 64-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:

ORDERFIX_GT8B_TP1_BS1_WHASH_Eager_8b-64gpu-64-ivankobzarev-lhpsfjn4
ORDERFIX_GT8B_TP1_BS1_WHASH_GT_Reg_NoCG_8b-64gpu-64-ivankob-t06vk4kk

MAST 128-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:

ORDERFIX_GT8B_TP1_BS1_WHASH_Eager_8b-128gpu-128-ivankobzare-dvg55jz4
ORDERFIX_GT8B_TP1_BS1_WHASH_GT_Reg_NoCG_8b-128gpu-128-ivank-qtz1q3fk

MAST 256-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:

ORDERFIX_GT8B_TP1_BS1_WHASH_Eager_8b-256gpu-256-ivankobzare-jtmfxc4z
ORDERFIX_GT8B_TP1_BS1_WHASH_GT_Reg_NoCG_8b-256gpu-256-ivank-dhm7t4th
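
For readers unfamiliar with this style of check, a per-step weight-hash comparison might look roughly like the sketch below. The PR does not show how the hashes were computed, so everything here is an assumption: the helper names, the use of SHA-256, and the representation of weights as plain lists of floats are illustrative only.

```python
import hashlib
import struct
from typing import Dict, List


def weight_hash(params: Dict[str, List[float]]) -> str:
    """Hash parameter names and float32 bytes in a fixed (sorted) name order."""
    h = hashlib.sha256()
    for name in sorted(params):
        values = params[name]
        h.update(name.encode("utf-8"))
        h.update(struct.pack(f"<{len(values)}f", *values))
    return h.hexdigest()


def first_mismatch(run_a: List[str], run_b: List[str]) -> int:
    """Return the first step whose hashes differ, or -1 if all compared steps match.

    Only the steps present in both runs are compared.
    """
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if a != b:
            return step
    return -1


# Toy data: two steps of "weights" from a reference run and two candidate runs.
reference = [weight_hash({"w": [1.0, 2.0]}), weight_hash({"w": [1.5, 2.0]})]
matching = [weight_hash({"w": [1.0, 2.0]}), weight_hash({"w": [1.5, 2.0]})]
diverged = [weight_hash({"w": [1.0, 2.0]}), weight_hash({"w": [1.5, 2.25]})]

print(first_mismatch(reference, matching), first_mismatch(reference, diverged))
```

A bitwise hash is a strict criterion: any change in accumulation order that alters even one bit of one weight produces a mismatch, which is why it suits validating that two execution paths are numerically identical.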

(cherry picked from commit bd56b38)

Update

08dc406

pytorch-bot Bot added the ciflow/8gpu label Jun 10, 2026

meta-cla Bot added the CLA Signed label Jun 10, 2026

SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/47/base from gh/SherlockNoMad/47/head