[graph_trainer] Add EP overlap eager chunking scaffolding #3363
Open
sanketpurandare wants to merge 1 commit into
This was referenced May 15, 2026
sanketpurandare added a commit that referenced this pull request on May 27, 2026
Introduce the public graph_trainer EP-overlap configuration surface and the eager chunking producer for its chunk metadata contract. The new compile options select the logical chunk dimension, the chunking strategy, and one supported module-root pattern: all transformer blocks or all MoE blocks. Sequence chunking is limited to MoE block roots because attention needs full K/V context.

Keep EP overlap validation in configs.py next to the other graph_trainer compile config validation. The eager producer wraps selected module forwards during model parallelization, splits tensor inputs into two chunks, calls the original forward once per chunk, and materializes tensor outputs with cat. It emits the same chunk metadata that graph chunking will emit, so the later scheduling pass can consume either producer through one contract.

Add shared MoE EP region annotations for dispatcher dispatch/combine bodies, config fingerprinting for the new options, and a generic trace-input-preparation hook in GraphTrainer. This commit does not add graph chunking or communication-overlap scheduling.

This pass stack relies on pending PyTorch support for hinted unbacked symbolic dimensions in the tracing and distributed compiler paths:
- FakeTensor folded matmul: pytorch/pytorch#183397
- ProxyTensor SDPA tracing: pytorch/pytorch#183398
- Inductor bucketing trace isolation from ambient unbacked symbols: pytorch/pytorch#183495
- Inductor collective bucketing with hinted unbacked SymInts: pytorch/pytorch#183544
- DTensor sharding padding for hinted even unbacked shards: pytorch/pytorch#183545
- HOP fake traces with discarded unbacked symbols: pytorch/pytorch#183837
- FlexAttention chunked unbacked input extents: pytorch/pytorch#183838
- FakeTensor trace metadata for hinted symbolic storage: pytorch/pytorch#183839
- Inductor symbolic stride ordering with unbacked hints: pytorch/pytorch#183840

Test Plan:
- Covered by the full graph_trainer pass, numerics, and H100 integration test runs after the stacked graph chunking and scheduling commits.

stack-info: PR: #3363, branch: sanketpurandare/stack/17
Stacked PRs:
[graph_trainer] Add EP overlap eager chunking scaffolding
Introduce the public graph_trainer EP-overlap configuration surface and the eager chunking producer for its chunk metadata contract. The new compile options select the logical chunk dimension, the chunking strategy, and one supported module-root pattern: all transformer blocks or all MoE blocks. Sequence chunking is limited to MoE block roots because attention needs full K/V context.
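For reviewers who want a concrete picture of the constraint above, here is a minimal sketch of the validation rule. Every name in it (`EPOverlapOptions`, `chunk_dim`, `strategy`, `module_roots`, and the allowed string values) is hypothetical and chosen for illustration; the real option names live in configs.py and are not shown in this description.

```python
from dataclasses import dataclass

# Hypothetical allowed values; the actual option names are not given in this PR text.
CHUNK_DIMS = {"batch", "sequence"}
STRATEGIES = {"eager", "graph"}
MODULE_ROOTS = {"transformer_blocks", "moe_blocks"}


@dataclass(frozen=True)
class EPOverlapOptions:
    """Illustrative stand-in for the EP-overlap compile options."""
    chunk_dim: str = "batch"
    strategy: str = "eager"
    module_roots: str = "transformer_blocks"


def validate_ep_overlap(opts: EPOverlapOptions) -> None:
    """Reject unknown values and the unsupported sequence/attention combination."""
    if opts.chunk_dim not in CHUNK_DIMS:
        raise ValueError(f"unknown chunk_dim: {opts.chunk_dim!r}")
    if opts.strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {opts.strategy!r}")
    if opts.module_roots not in MODULE_ROOTS:
        raise ValueError(f"unknown module_roots: {opts.module_roots!r}")
    # Attention needs the full K/V context, so a sequence split is only
    # allowed when the chunked roots are MoE blocks.
    if opts.chunk_dim == "sequence" and opts.module_roots != "moe_blocks":
        raise ValueError("sequence chunking requires module_roots='moe_blocks'")
```

Under these assumptions, `EPOverlapOptions(chunk_dim="sequence", module_roots="moe_blocks")` passes, while the same chunk dimension with `module_roots="transformer_blocks"` raises `ValueError`.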
Keep EP overlap validation in configs.py next to the other graph_trainer compile config validation. The eager producer wraps selected module forwards during model parallelization, splits tensor inputs into two chunks, calls the original forward once per chunk, and materializes tensor outputs with cat. It emits the same chunk metadata that graph chunking will emit, so the later scheduling pass can consume either producer through one contract.
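The wrap/split/call/cat flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper name `wrap_forward_with_chunking`, the shape of the metadata records, and the restriction to a single tensor output are all hypothetical. It uses `torch.tensor_split` so that exactly two chunks are always produced.

```python
import torch
from torch import nn

NUM_CHUNKS = 2  # the description says inputs are split into two chunks


def wrap_forward_with_chunking(module: nn.Module, dim: int = 0) -> list:
    """Replace module.forward with a chunked version (hypothetical helper).

    Returns a list that receives one metadata record per chunk call.
    The record layout is an assumption made for this sketch.
    """
    original_forward = module.forward
    chunk_log = []

    def split(value, idx):
        # Only tensors are split; every other argument is passed through unchanged.
        if isinstance(value, torch.Tensor):
            return torch.tensor_split(value, NUM_CHUNKS, dim=dim)[idx]
        return value

    def chunked_forward(*args, **kwargs):
        outputs = []
        for idx in range(NUM_CHUNKS):
            chunk_args = tuple(split(a, idx) for a in args)
            chunk_kwargs = {k: split(v, idx) for k, v in kwargs.items()}
            chunk_log.append({"chunk_index": idx, "num_chunks": NUM_CHUNKS, "dim": dim})
            # Call the original forward once per chunk.
            outputs.append(original_forward(*chunk_args, **chunk_kwargs))
        # Materialize the full output by concatenating along the chunk dimension.
        return torch.cat(outputs, dim=dim)

    module.forward = chunked_forward
    return chunk_log
```

For a module whose output rows depend only on the matching input rows (for example `nn.Linear` chunked along the batch dimension), the wrapped module should return the same values as the unwrapped one up to floating-point tolerance.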
Add shared MoE EP region annotations for dispatcher dispatch/combine bodies, config fingerprinting for the new options, and a generic trace-input-preparation hook in GraphTrainer. This commit does not add graph chunking or communication-overlap scheduling.
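As a rough illustration of what fingerprinting the new options could look like, the sketch below hashes a canonical serialization of a config object. The `CompileOptions` fields and the `config_fingerprint` helper are hypothetical, and the choice of sorted-key JSON plus SHA-256 is an assumption made for this example; the PR's actual fingerprinting scheme is not described here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CompileOptions:
    """Illustrative compile config; field names are hypothetical."""
    backend: str = "inductor"
    ep_overlap_chunk_dim: Optional[str] = None
    ep_overlap_strategy: Optional[str] = None
    ep_overlap_module_roots: Optional[str] = None


def config_fingerprint(opts: CompileOptions) -> str:
    """Hash a canonical (sorted-key) JSON dump of every option."""
    payload = json.dumps(asdict(opts), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because every field participates in the hash, two configs that differ only in an EP-overlap option get different fingerprints, which is the property one would want if the fingerprint is used to key anything derived from the compile config.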
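This pass stack relies on pending PyTorch support for hinted unbacked symbolic dimensions in the tracing and distributed compiler paths:
- FakeTensor folded matmul: pytorch/pytorch#183397
- ProxyTensor SDPA tracing: pytorch/pytorch#183398
- Inductor bucketing trace isolation from ambient unbacked symbols: pytorch/pytorch#183495
- Inductor collective bucketing with hinted unbacked SymInts: pytorch/pytorch#183544
- DTensor sharding padding for hinted even unbacked shards: pytorch/pytorch#183545
- HOP fake traces with discarded unbacked symbols: pytorch/pytorch#183837
- FlexAttention chunked unbacked input extents: pytorch/pytorch#183838
- FakeTensor trace metadata for hinted symbolic storage: pytorch/pytorch#183839
- Inductor symbolic stride ordering with unbacked hints: pytorch/pytorch#183840

Test Plan:
- Covered by the full graph_trainer pass, numerics, and H100 integration test runs after the stacked graph chunking and scheduling commits.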