[graph_trainer] Add EP overlap eager chunking scaffolding by sanketpurandare · Pull Request #3363 · pytorch/torchtitan

sanketpurandare · 2026-05-15T03:03:31Z

Stacked PRs:

[graph_trainer] Add EP overlap eager chunking scaffolding

Introduce the public graph_trainer EP-overlap configuration surface and the eager chunking producer for its chunk metadata contract. The new compile options select the logical chunk dimension, the chunking strategy, and one supported module-root pattern: all transformer blocks or all MoE blocks. Sequence chunking is limited to MoE block roots because attention needs full K/V context.
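The constraint in the paragraph above can be illustrated with a small validation sketch. The option names and values here (`chunk_dim`, `chunking`, `module_roots`, `"batch"`, `"transformer_blocks"`, and so on) are hypothetical stand-ins, since the description does not list the actual identifiers; only the rule that sequence chunking requires MoE block roots comes from the PR text.

```python
from dataclasses import dataclass

# Hypothetical option values; the PR description does not name the real ones.
CHUNK_DIMS = {"batch", "sequence"}
CHUNKING_STRATEGIES = {"eager", "graph"}
MODULE_ROOTS = {"transformer_blocks", "moe_blocks"}


@dataclass(frozen=True)
class EPOverlapConfig:
    """Illustrative stand-in for the EP-overlap compile options."""
    chunk_dim: str = "batch"
    chunking: str = "eager"
    module_roots: str = "moe_blocks"


def validate_ep_overlap(cfg: EPOverlapConfig) -> None:
    """Raise ValueError for unsupported option combinations."""
    if cfg.chunk_dim not in CHUNK_DIMS:
        raise ValueError(f"unknown chunk_dim: {cfg.chunk_dim!r}")
    if cfg.chunking not in CHUNKING_STRATEGIES:
        raise ValueError(f"unknown chunking strategy: {cfg.chunking!r}")
    if cfg.module_roots not in MODULE_ROOTS:
        raise ValueError(f"unknown module_roots: {cfg.module_roots!r}")
    # Attention needs the full K/V context, so a whole transformer block
    # cannot be split along the sequence dimension; only MoE roots can.
    if cfg.chunk_dim == "sequence" and cfg.module_roots != "moe_blocks":
        raise ValueError("sequence chunking is only supported for MoE block roots")
```

Rejecting the invalid combination at config-validation time surfaces the error before any module is wrapped or traced.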

Keep EP overlap validation in configs.py next to the other graph_trainer compile config validation. The eager producer wraps selected module forwards during model parallelization, splits tensor inputs into two chunks, calls the original forward once per chunk, and materializes tensor outputs with cat. It emits the same chunk metadata that graph chunking will emit, so the later scheduling pass can consume either producer through one contract.
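A minimal sketch of the wrap, split, call, and concatenate flow described above. To stay dependency-free it uses nested Python lists as a stand-in for tensors, with rows playing the role of the chunked dimension; in PyTorch the split and merge would presumably be `torch.chunk` and `torch.cat`. `ChunkMeta` and `wrap_forward_with_chunking` are hypothetical names, and the metadata fields are assumptions rather than the PR's actual contract.

```python
from dataclasses import dataclass
from typing import Callable, List

Rows = List[List[float]]  # stand-in for a 2-D tensor; rows are the chunked dim


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical chunk metadata record emitted once per chunk call."""
    root: str
    chunk_index: int
    num_chunks: int


def wrap_forward_with_chunking(
    root: str,
    forward: Callable[[Rows], Rows],
    log: List[ChunkMeta],
) -> Callable[[Rows], Rows]:
    """Return a forward that runs `forward` once per chunk and merges outputs."""

    def chunked_forward(x: Rows) -> Rows:
        # Split into two chunks; the first takes the extra row if len(x) is odd.
        mid = (len(x) + 1) // 2
        chunks = [x[:mid], x[mid:]]
        outputs: Rows = []
        for index, chunk in enumerate(chunks):
            log.append(ChunkMeta(root, index, len(chunks)))
            # Call the original forward on this chunk, then concatenate.
            outputs.extend(forward(chunk))
        return outputs

    return chunked_forward
```

For a forward that treats rows independently, the wrapped call returns the same result as the unwrapped one, while `log` records one metadata entry per chunk that a later pass could consume.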

Add shared MoE EP region annotations for dispatcher dispatch/combine bodies, config fingerprinting for the new options, and a generic trace-input-preparation hook in GraphTrainer. This commit does not add graph chunking or communication-overlap scheduling.
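One common way to fingerprint a set of config options is to hash a canonical serialization of them. The sketch below assumes that approach; it is not taken from the PR, and `config_fingerprint` and the option keys in the usage note are hypothetical.

```python
import hashlib
import json


def config_fingerprint(options: dict) -> str:
    """Return a stable hex digest for a JSON-serializable options dict."""
    # Sorting keys makes the digest independent of dict insertion order.
    payload = json.dumps(options, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this scheme, `config_fingerprint({"chunk_dim": "batch", "chunking": "eager"})` equals the fingerprint of the same options given in a different order, and changing any value changes the digest, so a cached artifact keyed on it would not be reused across different EP-overlap settings.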

This pass stack relies on pending PyTorch support for hinted unbacked symbolic dimensions in the tracing and distributed compiler paths:

- FakeTensor folded matmul: [ATen][FakeTensor] Handle unbacked dims in folded matmul (pytorch/pytorch#183397)
- ProxyTensor SDPA tracing: [ATen][ProxyTensor] Preserve unbacked batch dims in SDPA tracing (pytorch/pytorch#183398)
- Inductor bucketing trace isolation from ambient unbacked symbols: [Inductor][Bucketing] Isolate bucketing traces from ambient unbacked symbols (pytorch/pytorch#183495)
- Inductor collective bucketing with hinted unbacked SymInts: [Inductor][Bucketing] Make collective bucketing tolerate hinted unbacked SymInts (pytorch/pytorch#183544)
- DTensor sharding padding for hinted even unbacked shards: [DTensor] Use explicit hints for unbacked sharding (pytorch/pytorch#183545)
- HOP fake traces with discarded unbacked symbols: [HOP][Dynamic Shapes] Ignore discarded unbacked symbols in fake traces (pytorch/pytorch#183837)
- FlexAttention chunked unbacked input extents: [Inductor][HOP] Handle chunked unbacked FlexAttention shapes (pytorch/pytorch#183838)
- FakeTensor trace metadata for hinted symbolic storage: [FakeTensor] Add hinted symbolic storage size metadata (pytorch/pytorch#183839)
- Inductor symbolic stride ordering with unbacked hints: [Inductor] Handle hinted and fallback unbacked symbols (pytorch/pytorch#183840)

Test Plan:

Covered by the full graph_trainer pass, numerics, and H100 integration test runs after the stacked graph chunking and scheduling commits.

stack-info: PR: #3363, branch: sanketpurandare/stack/17

sanketpurandare requested review from SherlockNoMad, aditvenk, tianyu-l, xmfan and yiming0416 as code owners May 15, 2026 03:03

pytorch-bot Bot added the ciflow/8gpu label May 15, 2026

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 7b0065b to 5a5c076 Compare May 15, 2026 03:03

sanketpurandare force-pushed the sanketpurandare/stack/16 branch from 60ac042 to 3178797 Compare May 15, 2026 03:03

This was referenced May 15, 2026

[graph_trainer] Add DeepSeek V3 16B SDPA config #3361

Merged

[graph_trainer] Support hinted symbolic input dims in tracing #3362

Open

[graph_trainer] Add graph EP chunking pass #3325

Open

[graph_trainer] Add EP overlap scheduling pass #3328

Open

meta-cla Bot added the CLA Signed label May 15, 2026

sanketpurandare marked this pull request as draft May 15, 2026 09:36

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 15, 2026 09:36

sanketpurandare force-pushed the sanketpurandare/stack/17 branch 2 times, most recently from f8b4a0c to c1314bb Compare May 15, 2026 09:36

sanketpurandare mentioned this pull request May 15, 2026

[graph_trainer] Use separate EP process groups for overlap #3369

Merged

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 15, 2026 09:36

sanketpurandare marked this pull request as ready for review May 15, 2026 09:37

SherlockNoMad mentioned this pull request May 15, 2026

[graph_trainer] Nightly scout tracking issue #2856

Open

sanketpurandare marked this pull request as draft May 22, 2026 19:02

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 22, 2026 19:02

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from c1314bb to 60a804e Compare May 22, 2026 19:03

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 22, 2026 19:03

sanketpurandare marked this pull request as ready for review May 22, 2026 19:03

sanketpurandare marked this pull request as draft May 26, 2026 05:28

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 26, 2026 05:28

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 60a804e to e62b4d9 Compare May 26, 2026 05:29

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 26, 2026 05:29

sanketpurandare marked this pull request as ready for review May 26, 2026 05:31

sanketpurandare marked this pull request as draft May 27, 2026 03:51

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 27, 2026 03:51

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from e62b4d9 to c7d1c78 Compare May 27, 2026 03:51

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 27, 2026 03:51

sanketpurandare marked this pull request as ready for review May 27, 2026 03:52

sanketpurandare force-pushed the sanketpurandare/stack/16 branch from 7a084ea to dd9fd88 Compare May 27, 2026 04:41

sanketpurandare requested review from fegin, wconstab and wwwjn as code owners May 27, 2026 04:41

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from c7d1c78 to c719e42 Compare May 27, 2026 04:41

sanketpurandare marked this pull request as draft May 27, 2026 15:28

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 27, 2026 15:28

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from c719e42 to 91ebac3 Compare May 27, 2026 15:28

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 27, 2026 15:28

sanketpurandare marked this pull request as ready for review May 27, 2026 15:29

sanketpurandare marked this pull request as draft June 4, 2026 21:36

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main June 4, 2026 21:36

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 91ebac3 to 802b756 Compare June 4, 2026 21:36

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 June 4, 2026 21:36

sanketpurandare marked this pull request as ready for review June 4, 2026 21:36

sanketpurandare marked this pull request as draft June 10, 2026 23:49

sanketpurandare changed the base branch from sanketpurandare/stack/16 to main June 10, 2026 23:49

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 802b756 to 44d3d19 Compare June 10, 2026 23:50

sanketpurandare changed the base branch from main to sanketpurandare/stack/16 June 10, 2026 23:50

sanketpurandare marked this pull request as ready for review June 10, 2026 23:50

sanketpurandare mentioned this pull request Jun 10, 2026

[MoE] Use CPU split-size sum for EP permute output size #3627

Open

sanketpurandare requested a review from IvanKobzarev as a code owner June 10, 2026 23:50

sanketpurandare wants to merge 1 commit into sanketpurandare/stack/16 from sanketpurandare/stack/17
