[graph_trainer] Replace stable_topological_sort with _move_overlap_nodes in fsdp_passes #3607

Draft
SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/43/base from gh/SherlockNoMad/43/head

SherlockNoMad (Contributor) commented Jun 10, 2026

Stack from ghstack (oldest at bottom):

Replace the two _stable_topological_sort calls in JointManualOverlapScheduler
with _move_overlap_nodes from upstream PyTorch. Instead of re-sorting the
entire graph, this moves only the AG/RS (all-gather/reduce-scatter) chains,
which reduces node displacement from ~97% to ~5% and makes the pass
composable with downstream graph passes.

  • _manual_bucket_collectives: remove the no-op _stable_topological_sort(graph, {})
    call after bucketing — the graph is already valid after manual_bucket_collectives.
  • _manual_reorder_graph: replace _stable_topological_sort(graph, overlap_deps)
    with _move_overlap_nodes(graph, overlap_deps, bucketed_node_types).
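
To illustrate the idea of moving only the affected chain rather than re-sorting everything, here is a minimal sketch on a plain Python list. It is not the upstream `_move_overlap_nodes` implementation: the helper names `move_after` and `is_topological`, the toy node names, and the `users` mapping are all hypothetical, and a real pass would operate on graph nodes instead of strings.

```python
def is_topological(order, users):
    """Return True if every node appears before all of its users."""
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[n] < pos[u] for n in order for u in users.get(n, ()))


def move_after(order, users, node, anchor):
    """Relocate `node` to just after `anchor`, touching as few nodes as possible.

    Only `node` and those of its transitive users that currently sit before
    `anchor` are moved; every other node keeps its relative position.
    Assumes `anchor` is not itself a transitive user of `node`.
    """
    pos = {n: i for i, n in enumerate(order)}
    if pos[node] >= pos[anchor]:
        return list(order)  # constraint already satisfied, nothing to move

    # Collect the chain: the node plus any users that would otherwise end up
    # ahead of it once it is moved past the anchor.
    moved = {node}
    stack = [node]
    while stack:
        cur = stack.pop()
        for user in users.get(cur, ()):
            if user not in moved and pos[user] < pos[anchor]:
                moved.add(user)
                stack.append(user)

    chain = [n for n in order if n in moved]     # original relative order kept
    rest = [n for n in order if n not in moved]  # untouched nodes
    cut = rest.index(anchor) + 1
    return rest[:cut] + chain + rest[cut:]


# Toy graph: an all-gather ("ag") issued too early; we want it after "mm0"
# so that it can overlap with "mm1".
order = ["x", "ag", "mm0", "mm1", "ag_wait", "mm2", "out"]
users = {
    "x": ["ag", "mm0"],
    "ag": ["ag_wait"],
    "mm0": ["mm1"],
    "mm1": ["mm2"],
    "ag_wait": ["mm2"],
    "mm2": ["out"],
}
new_order = move_after(order, users, "ag", "mm0")
print(new_order)  # ['x', 'mm0', 'ag', 'mm1', 'ag_wait', 'mm2', 'out']
```

In this toy case only `ag` changes position and the result is still a valid topological order, which is the property that keeps such a move friendly to passes that run afterwards.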

Depends on pytorch PR #184711.

Authored by Claude.

(cherry picked from commit 504dc71)

[ghstack-poisoned]

Labels

ciflow/8gpu, CLA Signed
