[graph_trainer] Force cudagraph for MinimalAsyncEP on H100 + Run 4 results by SherlockNoMad · Pull Request #3610 · pytorch/torchtitan

SherlockNoMad · 2026-06-10T16:10:50Z

Stack from ghstack (oldest at bottom):

(to be filled)

Enable cudagraph in the dsv3 run script and force cudagraph_pass past the
_grouped_mm '< sm_100' gate in is_cudagraphable, so that MinimalAsyncEP captures
a full cudagraph on H100 (sm_90) instead of skipping capture (the graph contains
312 _grouped_mm nodes). Also record Run 4 in the results doc: the forced
cudagraph runs without crashing and gives a stable ~16.5% MFU / ~9k tps
(vs ~12.7% MFU with cudagraph skipped).
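
For illustration, the gate and a flag-controlled way past it could look roughly like the sketch below. This is not the actual is_cudagraphable code: `check_cudagraphable`, `GateResult`, and `allow_grouped_mm_below_sm100` are hypothetical names, and the graph is reduced to a plain list of op names.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the gated op name; not torchtitan's real check.
GROUPED_MM = "_grouped_mm"


@dataclass
class GateResult:
    ok: bool
    reason: str


def check_cudagraphable(op_names, sm_version, allow_grouped_mm_below_sm100=False):
    """Decide whether a graph, given as a list of op names, may be captured."""
    n_grouped = sum(1 for name in op_names if name == GROUPED_MM)
    gated = n_grouped > 0 and sm_version < 100
    if gated and not allow_grouped_mm_below_sm100:
        # Default behavior: skip capture and report why.
        return GateResult(False, f"{n_grouped} {GROUPED_MM} nodes on sm_{sm_version}")
    if gated:
        # Opt-in override: capture anyway, but say that numerics are unverified.
        return GateResult(True, f"forced past {GROUPED_MM} gate ({n_grouped} nodes); numerics unverified")
    return GateResult(True, "no gate triggered")


# Toy graph shaped like the case described above: 312 _grouped_mm nodes on sm_90.
ops = ["mm"] * 4 + [GROUPED_MM] * 312
print(check_cudagraphable(ops, sm_version=90))
print(check_cudagraphable(ops, sm_version=90, allow_grouped_mm_below_sm100=True))
```

Keeping the override opt-in and scoped to the one gated op leaves every other safety check in place, unlike an unconditional bypass.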

WARNING: the cudagraph.py change is an unconditional DEBUG override (TODO:
revert, gate behind a flag, or relax only the _grouped_mm gate). Forcing past
this safety check is numerically UNVERIFIED: _grouped_mm may perform hidden
CPU<->CUDA copies whose results go stale on cudagraph replay. Needs a
loss_compare against eager before the results can be trusted.
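
The comparison against eager could be as simple as the sketch below, which checks per-step losses from an eager run against those from the forced-cudagraph run. It is an assumption-laden illustration, not the project's loss_compare tooling: `compare_losses` and the tolerances are hypothetical, and the loss values are made up.

```python
import math


def compare_losses(eager, graphed, rel_tol=1e-5, abs_tol=1e-6):
    """Return (step, eager_loss, graphed_loss) for every step that disagrees.

    Tolerances are placeholder assumptions; a deterministic setup may
    justify a stricter (even exact) comparison.
    """
    if len(eager) != len(graphed):
        raise ValueError("loss curves have different lengths")
    mismatches = []
    for step, (a, b) in enumerate(zip(eager, graphed)):
        # math.isclose is False for NaN, so a NaN loss is reported as a mismatch.
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((step, a, b))
    return mismatches


# Made-up values for illustration only; not measurements from this PR.
eager_losses = [12.0, 11.5, 11.0]
graphed_losses = [12.0, 11.5, 11.25]
for step, a, b in compare_losses(eager_losses, graphed_losses):
    print(f"step {step}: eager={a} cudagraph={b}")
```

A stale CPU<->CUDA copy would most likely show up as divergence after the first replayed step, so comparing several steps is more informative than comparing only the first.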

[ghstack-poisoned]

Update

6eeedb3

[ghstack-poisoned]

pytorch-bot Bot added the ciflow/8gpu label Jun 10, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 10, 2026

SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/46/base from gh/SherlockNoMad/46/head
