[graph_trainer] Add EP overlap eager chunking scaffolding #3363

Open
sanketpurandare wants to merge 1 commit into sanketpurandare/stack/16 from sanketpurandare/stack/17


sanketpurandare commented May 15, 2026


Stacked PRs:


[graph_trainer] Add EP overlap eager chunking scaffolding

Introduce the public graph_trainer EP-overlap configuration surface and the eager chunking producer for its chunk metadata contract. The new compile options select the logical chunk dimension, the chunking strategy, and one supported module-root pattern: all transformer blocks or all MoE blocks. Sequence chunking is limited to MoE block roots because attention needs full K/V context.
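The option rules above (a logical chunk dimension, a chunking strategy, a module-root pattern, and the restriction of sequence chunking to MoE block roots) can be sketched as a small validator. This is a minimal sketch, not the code in configs.py: the class `EPOverlapConfig`, the field names, and the string values (`"batch"`, `"sequence"`, `"eager"`, `"graph"`, `"transformer_blocks"`, `"moe_blocks"`) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical option values; the real option names live in configs.py.
CHUNK_DIMS = ("batch", "sequence")
STRATEGIES = ("eager", "graph")
MODULE_ROOTS = ("transformer_blocks", "moe_blocks")


@dataclass(frozen=True)
class EPOverlapConfig:
    chunk_dim: str = "batch"
    strategy: str = "eager"
    module_roots: str = "transformer_blocks"


def validate_ep_overlap(cfg: EPOverlapConfig) -> None:
    """Raise ValueError if the (hypothetical) EP-overlap options are inconsistent."""
    if cfg.chunk_dim not in CHUNK_DIMS:
        raise ValueError(f"unknown chunk_dim {cfg.chunk_dim!r}")
    if cfg.strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy {cfg.strategy!r}")
    if cfg.module_roots not in MODULE_ROOTS:
        raise ValueError(f"unknown module_roots {cfg.module_roots!r}")
    # Attention needs the full K/V context, so a sequence split is only
    # allowed when the chunked roots are MoE blocks (no attention inside).
    if cfg.chunk_dim == "sequence" and cfg.module_roots != "moe_blocks":
        raise ValueError("sequence chunking requires module_roots='moe_blocks'")
```

Keeping this check next to the other compile-config validation means an invalid combination fails at config time rather than during tracing.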

Keep EP overlap validation in configs.py next to the other graph_trainer compile config validation. The eager producer wraps selected module forwards during model parallelization, splits tensor inputs into two chunks, calls the original forward once per chunk, and materializes tensor outputs with cat. It emits the same chunk metadata that graph chunking will emit, so the later scheduling pass can consume either producer through one contract.
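The eager producer's wrap, split into two chunks, call once per chunk, and concatenate flow can be sketched as a forward wrapper. This is a hedged, framework-agnostic illustration: `wrap_forward_with_chunking`, the `split`/`cat` callables, and the metadata dict schema are hypothetical, and plain lists stand in for tensors so the sketch runs without PyTorch. The docstring shows how one might plug in `torch.tensor_split` and `torch.cat`.

```python
from typing import Any, Callable, Dict, List

NUM_CHUNKS = 2  # the eager producer described above splits inputs into two chunks


def wrap_forward_with_chunking(
    forward: Callable[..., Any],
    is_chunkable: Callable[[Any], bool],
    split: Callable[[Any, int], List[Any]],
    cat: Callable[[List[Any]], Any],
    metadata_log: List[Dict[str, int]],
) -> Callable[..., Any]:
    """Hypothetical helper: run `forward` once per chunk and concatenate outputs.

    With PyTorch one might pass, for a chunk dimension `d`:
        is_chunkable=lambda x: isinstance(x, torch.Tensor)
        split=lambda t, n: list(torch.tensor_split(t, n, dim=d))
        cat=lambda parts: torch.cat(parts, dim=d)
    """

    def chunked_forward(*args: Any) -> Any:
        # Split every chunkable positional input; reuse other inputs for each chunk.
        per_arg = [
            split(a, NUM_CHUNKS) if is_chunkable(a) else [a] * NUM_CHUNKS
            for a in args
        ]
        outputs = []
        for i in range(NUM_CHUNKS):
            # Record per-chunk metadata (hypothetical schema) for a later pass.
            metadata_log.append({"chunk_index": i, "num_chunks": NUM_CHUNKS})
            outputs.append(forward(*(chunks[i] for chunks in per_arg)))
        # Assumes a single chunkable output; concatenate back along the chunk dim.
        return cat(outputs)

    return chunked_forward


# Usage with plain lists standing in for tensors chunked along dim 0.
def split_list(xs: List[Any], n: int) -> List[List[Any]]:
    k = (len(xs) + n - 1) // n
    return [xs[i * k:(i + 1) * k] for i in range(n)]


def cat_lists(parts: List[List[Any]]) -> List[Any]:
    return [x for part in parts for x in part]


log: List[Dict[str, int]] = []
scale = wrap_forward_with_chunking(
    lambda xs, s: [s * x for x in xs],
    lambda a: isinstance(a, list),
    split_list,
    cat_lists,
    log,
)
print(scale([1, 2, 3, 4], 2))  # [2, 4, 6, 8]
```

Chunk-then-concatenate only reproduces the unchunked result when the wrapped forward treats slices along the chunk dimension independently. That is why sequence chunking is restricted to MoE roots and excludes attention.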

Add shared MoE EP region annotations for dispatcher dispatch/combine bodies, config fingerprinting for the new options, and a generic trace-input-preparation hook in GraphTrainer. This commit does not add graph chunking or communication-overlap scheduling.
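Config fingerprinting for the new options can be illustrated with a stable digest over the compile options. This is a sketch under assumptions, not the GraphTrainer implementation: `config_fingerprint` and the option keys in the example are hypothetical. It shows the property a fingerprint needs: key order does not matter, but any change to an EP-overlap value produces a different fingerprint.

```python
import hashlib
import json
from typing import Any, Dict


def config_fingerprint(options: Dict[str, Any]) -> str:
    """Stable hex digest over compile options (hypothetical helper).

    Sorting keys makes the digest independent of dict insertion order, so
    only a change in an option's value (e.g. a new EP-overlap setting)
    changes the fingerprint.
    """
    payload = json.dumps(options, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical option keys, for illustration only.
fp = config_fingerprint({"ep_overlap_chunk_dim": "batch", "ep_overlap_strategy": "eager"})
```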

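This pass stack relies on pending PyTorch support for hinted unbacked symbolic dimensions in the tracing and distributed compiler paths:

  • FakeTensor folded matmul: pytorch/pytorch#183397
  • ProxyTensor SDPA tracing: pytorch/pytorch#183398
  • Inductor bucketing trace isolation from ambient unbacked symbols: pytorch/pytorch#183495
  • Inductor collective bucketing with hinted unbacked SymInts: pytorch/pytorch#183544
  • DTensor sharding padding for hinted even unbacked shards: pytorch/pytorch#183545
  • HOP fake traces with discarded unbacked symbols: pytorch/pytorch#183837
  • FlexAttention chunked unbacked input extents: pytorch/pytorch#183838
  • FakeTensor trace metadata for hinted symbolic storage: pytorch/pytorch#183839
  • Inductor symbolic stride ordering with unbacked hints: pytorch/pytorch#183840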

Test Plan:

  • Covered by the full graph_trainer pass, numerics, and H100 integration test runs after the stacked graph chunking and scheduling commits.

sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 7b0065b to 5a5c076 May 15, 2026 03:03
sanketpurandare force-pushed the sanketpurandare/stack/16 branch from 60ac042 to 3178797 May 15, 2026 03:03
meta-cla bot added the CLA Signed label May 15, 2026
sanketpurandare marked this pull request as draft May 15, 2026 09:36
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 15, 2026 09:36
sanketpurandare force-pushed the sanketpurandare/stack/17 branch 2 times, most recently from f8b4a0c to c1314bb May 15, 2026 09:36
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 15, 2026 09:36
sanketpurandare marked this pull request as ready for review May 15, 2026 09:37
sanketpurandare marked this pull request as draft May 22, 2026 19:02
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 22, 2026 19:02
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from c1314bb to 60a804e May 22, 2026 19:03
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 22, 2026 19:03
sanketpurandare marked this pull request as ready for review May 22, 2026 19:03
sanketpurandare marked this pull request as draft May 26, 2026 05:28
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 26, 2026 05:28
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 60a804e to e62b4d9 May 26, 2026 05:29
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 26, 2026 05:29
sanketpurandare marked this pull request as ready for review May 26, 2026 05:31
sanketpurandare marked this pull request as draft May 27, 2026 03:51
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 27, 2026 03:51
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from e62b4d9 to c7d1c78 May 27, 2026 03:51
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 27, 2026 03:51
sanketpurandare marked this pull request as ready for review May 27, 2026 03:52
sanketpurandare force-pushed the sanketpurandare/stack/16 branch from 7a084ea to dd9fd88 May 27, 2026 04:41
sanketpurandare added a commit that referenced this pull request May 27, 2026
Introduce the public graph_trainer EP-overlap configuration surface and the eager chunking producer for its chunk metadata contract. The new compile options select the logical chunk dimension, the chunking strategy, and one supported module-root pattern: all transformer blocks or all MoE blocks. Sequence chunking is limited to MoE block roots because attention needs full K/V context.

Keep EP overlap validation in configs.py next to the other graph_trainer compile config validation. The eager producer wraps selected module forwards during model parallelization, splits tensor inputs into two chunks, calls the original forward once per chunk, and materializes tensor outputs with cat. It emits the same chunk metadata that graph chunking will emit, so the later scheduling pass can consume either producer through one contract.

Add shared MoE EP region annotations for dispatcher dispatch/combine bodies, config fingerprinting for the new options, and a generic trace-input-preparation hook in GraphTrainer. This commit does not add graph chunking or communication-overlap scheduling.

This pass stack relies on pending PyTorch support for hinted unbacked symbolic dimensions in the tracing and distributed compiler paths:
- FakeTensor folded matmul: pytorch/pytorch#183397
- ProxyTensor SDPA tracing: pytorch/pytorch#183398
- Inductor bucketing trace isolation from ambient unbacked symbols: pytorch/pytorch#183495
- Inductor collective bucketing with hinted unbacked SymInts: pytorch/pytorch#183544
- DTensor sharding padding for hinted even unbacked shards: pytorch/pytorch#183545
- HOP fake traces with discarded unbacked symbols: pytorch/pytorch#183837
- FlexAttention chunked unbacked input extents: pytorch/pytorch#183838
- FakeTensor trace metadata for hinted symbolic storage: pytorch/pytorch#183839
- Inductor symbolic stride ordering with unbacked hints: pytorch/pytorch#183840

Test Plan:
- Covered by the full graph_trainer pass, numerics, and H100 integration test runs after the stacked graph chunking and scheduling commits.

stack-info: PR: #3363, branch: sanketpurandare/stack/17
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from c7d1c78 to c719e42 May 27, 2026 04:41
sanketpurandare marked this pull request as draft May 27, 2026 15:28
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main May 27, 2026 15:28
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from c719e42 to 91ebac3 May 27, 2026 15:28
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 May 27, 2026 15:28
sanketpurandare marked this pull request as ready for review May 27, 2026 15:29
sanketpurandare marked this pull request as draft June 4, 2026 21:36
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main June 4, 2026 21:36
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 91ebac3 to 802b756 June 4, 2026 21:36
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 June 4, 2026 21:36
sanketpurandare marked this pull request as ready for review June 4, 2026 21:36
sanketpurandare marked this pull request as draft June 10, 2026 23:49
sanketpurandare changed the base branch from sanketpurandare/stack/16 to main June 10, 2026 23:49
sanketpurandare force-pushed the sanketpurandare/stack/17 branch from 802b756 to 44d3d19 June 10, 2026 23:50
sanketpurandare changed the base branch from main to sanketpurandare/stack/16 June 10, 2026 23:50
sanketpurandare marked this pull request as ready for review June 10, 2026 23:50

Labels

ciflow/8gpu, CLA Signed
