[graph_trainer] Match Eager FSDP bucket order #3611

Draft
SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/47/base from gh/SherlockNoMad/47/head

Stack from ghstack (oldest at bottom):

  • (to be filled)

GT Regional was packing FSDP collective buckets in graph execution order.
Eager FSDP2 packs bucket payloads in managed parameter registration order.
For reduce-scatter, changing the payload order changes byte offsets inside
NCCL buckets; on multinode FSDP groups NCCL may use different ring orders per
channel, so moving parameter elements between offsets can change the bf16
accumulation order and, because bf16 addition is not associative, the reduced
values.
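
To illustrate why accumulation order matters in bf16, here is a minimal sketch
that emulates bf16 rounding in pure Python and sums the same three values in
two orders. The helpers `to_bf16` and `bf16_sum` are hypothetical names written
for this example; they do not model NCCL's rings, channels, or actual reduction
kernels, only the rounding effect.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 (ties to even), returned as a float.

    Illustrative emulation only: it keeps the top 16 bits of the float32
    encoding and ignores NaN edge cases.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bf16 drops.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to bf16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


big, tiny = 1.0, 2.0 ** -8

# Same three addends, two accumulation orders.
large_first = bf16_sum([big, tiny, tiny])  # each tiny is rounded away
small_first = bf16_sum([tiny, tiny, big])  # the tinies combine before meeting 1.0

print(large_first, small_first)  # 1.0 1.0078125
```

The two sums differ by one bf16 ulp at 1.0, which is enough to change a weight
hash even though both orders are mathematically equivalent.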

Derive module order from traced state FQNs, which preserve FSDP2's first-seen
parameter order, and use it to order every FSDP bucket group before delegating
to the upstream manual overlap bucketer. This keeps the Torchtitan-specific
logic limited to choosing bucket payload order; upstream still owns merging,
insertion, wait remapping, and bucketed-node tagging.
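
The ordering step can be sketched roughly as follows. This is not the code in
fsdp_passes.py: `first_seen_module_order` and `sort_bucket_payload` are
hypothetical helpers, the FQNs are made up, and the sketch assumes the traced
state FQNs are already listed in FSDP2's first-seen parameter order. It only
shows how a graph-execution-ordered bucket could be re-sorted before being
handed to a bucketer.

```python
from typing import Dict, List


def first_seen_module_order(state_fqns: List[str]) -> Dict[str, int]:
    """Rank each owning module by the first time one of its parameters appears."""
    order: Dict[str, int] = {}
    for fqn in state_fqns:
        module_fqn = fqn.rpartition(".")[0]  # drop the trailing parameter name
        order.setdefault(module_fqn, len(order))
    return order


def sort_bucket_payload(bucket_fqns: List[str], state_fqns: List[str]) -> List[str]:
    """Reorder one bucket's members to follow registration order.

    Sorts by module rank first, then by the parameter's own position in
    state_fqns (assumed here to be the first-seen parameter order).
    """
    module_rank = first_seen_module_order(state_fqns)
    param_rank = {fqn: i for i, fqn in enumerate(state_fqns)}
    return sorted(
        bucket_fqns,
        key=lambda fqn: (module_rank[fqn.rpartition(".")[0]], param_rank[fqn]),
    )


# Hypothetical FQNs in registration (first-seen) order.
state_fqns = [
    "layers.0.wq.weight",
    "layers.0.wk.weight",
    "layers.0.norm.weight",
]
# The same parameters as they might appear in graph execution order.
graph_order_bucket = [
    "layers.0.norm.weight",
    "layers.0.wq.weight",
    "layers.0.wk.weight",
]

ordered = sort_bucket_payload(graph_order_bucket, state_fqns)
print(ordered)
```

Only the payload order changes in this sketch; merging, insertion, and wait
remapping would stay with the upstream bucketer, as described above.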

This also updates the existing _StridedShard call in models/utils.py to pass
split_factor by keyword. With the stale suppression there removed, CI pyrefly
reports the second positional argument; the keyword form matches the typed
usage elsewhere in GraphTrainer.

Authored by Codex.

Test Plan:

python -m py_compile \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py
git diff --check origin/main...HEAD -- \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py
black --check \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py
flake8 --config=.flake8 \
  torchtitan/experiments/graph_trainer/fsdp_passes.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/models/utils.py
PYTHONPATH=/data/users/ivankobzarev/f/torchtitan-orderfix-main \
  python torchtitan/experiments/graph_trainer/tests/test_passes.py \
    -k TestBucketingPrefetchOrder
lintrunner -a

The lintrunner run failed because Torchtitan has no .lintrunner.toml in this
checkout. pre-commit, ufmt, and pyrefly are also not installed in the local
Python environment, so they were not run.

MAST 64-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:

ORDERFIX_GT8B_TP1_BS1_WHASH_Eager_8b-64gpu-64-ivankobzarev-lhpsfjn4
ORDERFIX_GT8B_TP1_BS1_WHASH_GT_Reg_NoCG_8b-64gpu-64-ivankob-t06vk4kk

MAST 128-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:

ORDERFIX_GT8B_TP1_BS1_WHASH_Eager_8b-128gpu-128-ivankobzare-dvg55jz4
ORDERFIX_GT8B_TP1_BS1_WHASH_GT_Reg_NoCG_8b-128gpu-128-ivank-qtz1q3fk

MAST 256-GPU TP=1/BS=1 weight-hash validation matched Eager for all 25 steps:

ORDERFIX_GT8B_TP1_BS1_WHASH_Eager_8b-256gpu-256-ivankobzare-jtmfxc4z
ORDERFIX_GT8B_TP1_BS1_WHASH_GT_Reg_NoCG_8b-256gpu-256-ivank-dhm7t4th

(cherry picked from commit bd56b38)

[ghstack-poisoned]

Labels: ciflow/8gpu, CLA Signed
