[graph_trainer] Match Eager FSDP bucket order #3611
Draft
SherlockNoMad wants to merge 1 commit into
This was referenced Jun 10, 2026
Stack from ghstack (oldest at bottom):
GT Regional was packing FSDP collective buckets in graph execution order.
Eager FSDP2 packs bucket payloads in managed parameter registration order.
For reduce-scatter, changing the payload order changes byte offsets inside
NCCL buckets; on multinode FSDP groups NCCL may use different ring orders per
channel, so moving parameter elements between offsets can produce different
bf16 accumulation order.
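The sensitivity to accumulation order can be seen in a toy model. The sketch below is not NCCL's reduction; it emulates bfloat16 rounding in pure Python (the helpers `to_bf16` and `accumulate_bf16` are hypothetical names introduced here) and sums the same three contributions in two different orders.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    Toy helper for finite values only; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then keep only the upper 16 bits of the float32.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate_bf16(values):
    """Sum values left to right, rounding to bfloat16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


# Same three contributions, two visiting orders (think: two ring orders).
print(accumulate_bf16([256.0, 1.0, 1.0]))  # 256.0
print(accumulate_bf16([1.0, 1.0, 256.0]))  # 258.0
```

At magnitude 256 the bfloat16 spacing is 2, so each lone `+ 1.0` is rounded away in the first order, while the second order adds the small values together first and keeps them. If an element moves to a bucket offset that is reduced in a different ring order, its sum may change in the same way.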
Derive module order from traced state FQNs, which preserve FSDP2's first-seen
parameter order, and use it to order every FSDP bucket group before delegating
to the upstream manual overlap bucketer. This keeps the Torchtitan-specific
logic limited to choosing bucket payload order; upstream still owns merging,
insertion, wait remapping, and bucketed-node tagging.
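As a rough illustration of the ordering step described above, the sketch below reorders one bucket's parameters by first-seen module order and then by registration order. It is a simplified stand-in, not the code in this PR: the helper names, the assumption that a parameter's module is its FQN minus the last component, and the example FQNs are all hypothetical.

```python
def first_seen_module_order(state_fqns):
    """Map each module prefix to the index at which it first appears."""
    order = {}
    for fqn in state_fqns:
        module = fqn.rpartition(".")[0]
        order.setdefault(module, len(order))
    return order


def order_bucket(bucket_fqns, state_fqns):
    """Reorder one bucket's parameters to follow registration order.

    bucket_fqns: parameter FQNs in graph execution order (assumed input).
    state_fqns:  parameter FQNs in traced state order (assumed input).
    """
    module_rank = first_seen_module_order(state_fqns)
    param_rank = {fqn: i for i, fqn in enumerate(state_fqns)}
    return sorted(
        bucket_fqns,
        key=lambda fqn: (module_rank[fqn.rpartition(".")[0]], param_rank[fqn]),
    )


state_fqns = ["layers.0.attn.wq", "layers.0.mlp.w1", "layers.0.attn.wk"]
graph_order = ["layers.0.mlp.w1", "layers.0.attn.wk", "layers.0.attn.wq"]
print(order_bucket(graph_order, state_fqns))
# ['layers.0.attn.wq', 'layers.0.attn.wk', 'layers.0.mlp.w1']
```

Only the payload order changes in this sketch; merging, insertion, and wait remapping are untouched, which mirrors the split of responsibilities described above.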
This also updates the existing _StridedShard call in models/utils.py to pass
split_factor by keyword. CI pyrefly removes the stale suppression there and
then reports the second positional argument, while this form matches the typed
usage elsewhere in GraphTrainer.
Authored by Codex.
Test Plan:
PYTHONPATH=/data/users/ivankobzarev/f/torchtitan-orderfix-main \
  python torchtitan/experiments/graph_trainer/tests/test_passes.py \
  -k TestBucketingPrefetchOrder

The local lint run failed because Torchtitan has no .lintrunner.toml in this checkout. In addition, pre-commit, ufmt, and pyrefly are not installed in the local Python environment.
MAST 64-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:
MAST 128-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:
MAST 256-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:
(cherry picked from commit bd56b38)