[graph_trainer] Force cudagraph for MinimalAsyncEP on H100 + Run 4 results #3610

Draft
SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/46/base from gh/SherlockNoMad/46/head

@SherlockNoMad

Contributor

Stack from ghstack (oldest at bottom):

  • (to be filled)

Enable cudagraph in the dsv3 run script and force cudagraph_pass past the
_grouped_mm '< sm_100' gate in is_cudagraphable. With the gate bypassed,
MinimalAsyncEP captures a full cudagraph on H100 (sm_90) instead of skipping
capture because of its 312 _grouped_mm nodes. Also records Run 4 in the results
doc: the forced-cudagraph run completes without crashing and holds a stable
~16.5% MFU / ~9k tps (vs ~12.7% MFU with cudagraph skipped).
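
For illustration, the gate described above could be bypassed through an explicit opt-in rather than unconditionally. The following is a minimal sketch, not the actual `is_cudagraphable` code: the function name `check_grouped_mm_gate`, the `GateDecision` type, and the `GRAPH_TRAINER_FORCE_CUDAGRAPH` environment variable are all hypothetical, and op names are modeled as plain strings instead of real graph nodes.

```python
import os
from dataclasses import dataclass

# Hypothetical opt-in variable; not part of this PR or of any existing config.
FORCE_ENV = "GRAPH_TRAINER_FORCE_CUDAGRAPH"


@dataclass(frozen=True)
class GateDecision:
    cudagraphable: bool
    reason: str


def check_grouped_mm_gate(node_targets, sm_version, env=None):
    """Decide whether a graph containing _grouped_mm may be cudagraph-captured.

    node_targets: iterable of op-name strings (a stand-in for graph node targets).
    sm_version:   compute capability as an int, e.g. 90 for H100 (sm_90).
    env:          mapping used for the opt-in lookup; defaults to os.environ.
    """
    env = os.environ if env is None else env
    n_grouped = sum(1 for target in node_targets if "_grouped_mm" in target)

    # The gate only applies to _grouped_mm on architectures below sm_100.
    if n_grouped == 0 or sm_version >= 100:
        return GateDecision(True, "gate not triggered")

    # Explicit opt-in: capture anyway, but say loudly that numerics are unverified.
    if env.get(FORCE_ENV) == "1":
        return GateDecision(
            True,
            f"forced past gate: {n_grouped} _grouped_mm nodes on sm_{sm_version}, "
            "numerics unverified",
        )

    return GateDecision(
        False,
        f"skipped: {n_grouped} _grouped_mm nodes on sm_{sm_version} (< sm_100)",
    )
```

With a shape like this, the default H100 behavior stays "skip", and a forced run has to be requested per invocation, so a debug override cannot silently ship.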

WARNING: the cudagraph.py change is an unconditional DEBUG override. TODO:
revert it, gate it behind a flag, or relax only the _grouped_mm gate. Forcing
past this safety check is numerically UNVERIFIED: _grouped_mm may do hidden
CPU<->CUDA copies that replay stale data under cudagraph. This needs an eager
loss_compare before the results can be trusted.
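
As a rough idea of what such a check involves, the sketch below compares per-step losses from an eager run against a forced-cudagraph run. It is not the `loss_compare` tool mentioned above: the `compare_losses` helper and its tolerances are assumptions chosen for illustration, and the inputs are plain lists of floats collected from the two runs.

```python
import math


def compare_losses(eager, cudagraph, rel_tol=1e-3, abs_tol=1e-6):
    """Return (step, eager_loss, cudagraph_loss) for every step that diverges.

    The tolerances are illustrative assumptions, not validated thresholds.
    """
    if len(eager) != len(cudagraph):
        raise ValueError("both runs must cover the same number of steps")

    mismatches = []
    for step, (e, c) in enumerate(zip(eager, cudagraph)):
        # math.isclose is False whenever either value is NaN, so a NaN loss
        # in either run is reported as a mismatch.
        if not math.isclose(e, c, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((step, e, c))
    return mismatches
```

An empty result over enough steps would be evidence against stale replayed data, though it would not prove the forced capture is safe in general.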

[ghstack-poisoned]

Labels

ciflow/8gpu, CLA Signed
