Add nightly benchmark CI: canonical FA4 benchmark tracking #2533
Draft
Johnsonms wants to merge 47 commits into
- benchmark_attn.py: add --json-output flag for structured result collection
- bench_nightly.py: thin wrapper running canonical MHA + MLA configs via benchmark_attn.py, with clock locking
- tools/ci/push_results.py: append results to benchmark-data branch
- tools/ci/slack_notify.py: 7-day trend table with regression alerts (>2% below avg)
- .github/workflows/nightly.yml: scheduled full test suite + benchmark, posts to Slack

Canonical configs: MHA (hdim 64/128/256), MLA-absorbed decode/prefill (64/512, nheads=128), DeepSeek shape (192/128).
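For reviewers, the ">2% below avg" regression rule can be sketched roughly as follows. This is a minimal illustration, not the code in tools/ci/slack_notify.py: the function name `find_regressions`, the config keys, and the assumption that each result is a single higher-is-better throughput number are all hypothetical.

```python
from statistics import mean

# Flag a result when it is more than 2% below the trailing average.
REGRESSION_THRESHOLD = 0.02


def find_regressions(history, latest, threshold=REGRESSION_THRESHOLD):
    """Return {config: (latest_value, trailing_avg)} for regressed configs.

    history: list of dicts mapping config name -> throughput, oldest first
             (hypothetical layout; e.g. the last 7 nightly runs).
    latest:  dict mapping config name -> throughput for the newest run.
    """
    regressions = {}
    for cfg, value in latest.items():
        past = [run[cfg] for run in history if cfg in run]
        if not past:
            continue  # no baseline yet, so nothing to compare against
        avg = mean(past)
        if value < avg * (1.0 - threshold):
            regressions[cfg] = (value, avg)
    return regressions
```

A config that first appears in the newest run is skipped rather than flagged, since it has no trailing average to compare against.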
Add nightly benchmark CI: canonical FA4 benchmark tracking
fix push_results: use git checkout --orphan for older git versions
fix slack_notify: unpack 7-element cfg_key tuple, add group/hdim_v co…
slack_notify: per-group sections, bar_chart header, run URL link
nightly: lock GPU clocks on host before Apptainer, reset after
Nightly ci
…(avoids red text)
…eqlen_kv, causal)
Nightly ci
slack_notify: Block Kit rich_text_preformatted for compact monospace …
…iority nightly job
Nightly ci
…emory threshold

Three root causes behind CI failures at 64 xdist workers:

1. cache_utils: LOCK_TIMEOUT_SECONDS 15→300. With 64 workers queuing on the same kernel's exclusive lock (0.2s poll interval), the last worker needed ≥18s; the old 15s limit caused RuntimeError on every cache write.
2. conftest: atomic gpu_ids.json write (temp-file + rename) + retry loop on JSONDecodeError. Workers 1-63 could read a truncated file during worker 0's open(mode="w")+write sequence, crashing pytest_configure.
3. run_fa4_ci: max_used_memory_mb 1000→8000. B200 driver baseline is 626–1254 MiB; the old threshold falsely reported all GPUs as busy.
ci: fix 64-worker test failures from lock timeout, GPU-id race, and m…
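The temp-file + rename pattern from root cause 2 looks roughly like the sketch below. It is an illustration of the general technique, not the actual conftest change; the helper names `write_json_atomic` and `read_json_with_retry` and the retry defaults are assumptions.

```python
import json
import os
import tempfile
import time


def write_json_atomic(path, payload):
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomically swaps in the finished file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_json_with_retry(path, attempts=50, delay=0.1):
    """Read JSON, retrying while the file is missing or not yet parseable."""
    last_error = None
    for _ in range(attempts):
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError) as exc:
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"could not read {path}") from last_error
```

With the atomic write in place the retry loop should rarely trigger; it mainly covers workers that start before the file exists.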
hdim=192 on SM100 requires 2CTA instructions, but softcap injects a score_mod that disables 2CTA, triggering the assertion in FlashAttentionBackwardSm100.__init__. The non-varlen test already gates its backward on softcap==0.0; add the equivalent skip to the varlen backward block.
tests: skip SM100 hd192 bwd with softcap in varlen test
torch.AcceleratorError is the async variant of OOM — the allocation fails in a prior CUDA op and the error surfaces on the next API call. The existing retry_on_oom only caught torch.OutOfMemoryError, so async OOMs caused by concurrent kernel compilation across 64 xdist workers were not retried.
tests: retry on AcceleratorError OOM in addition to OutOfMemoryError
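A rough sketch of the broadened retry logic, kept free of torch so it runs anywhere. The stand-in exception classes, the `retry_on_oom` signature, and the message check in `looks_like_oom` are assumptions for illustration; the real helper in the test suite may differ. In the actual tests the caught types would be torch.OutOfMemoryError and torch.AcceleratorError.

```python
import functools
import time


class FakeOutOfMemoryError(RuntimeError):
    """Stand-in for torch.OutOfMemoryError (hypothetical, for illustration)."""


class FakeAcceleratorError(RuntimeError):
    """Stand-in for torch.AcceleratorError (hypothetical, for illustration)."""


def looks_like_oom(exc):
    # Assumption: an async OOM is recognised by its error message.
    return isinstance(exc, FakeOutOfMemoryError) or "out of memory" in str(exc).lower()


def retry_on_oom(exc_types=(FakeOutOfMemoryError, FakeAcceleratorError),
                 max_retries=3, delay=0.0):
    """Retry the wrapped function when it fails with an OOM-like error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_types as exc:
                    # Re-raise non-OOM accelerator errors and the final failure.
                    if attempt == max_retries or not looks_like_oom(exc):
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

Filtering on the message keeps unrelated accelerator errors (for example an illegal memory access) from being silently retried.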
983f0f2 to d62c178
Useful when Pass 1 already completed successfully in a prior run and only Pass 2 (real GPU execution) needs to be re-run. Exposed as --skip-compile in run_fa4_ci.py, skip-compile input in the gpu-test action, and skip_compile workflow_dispatch input in the Nightly workflow.
Nightly ci
Nightly ci
SM100 varlen kernel hangs when deterministic=True and softcap > 0.0. Skip until the kernel-side bug is fixed.
Nightly ci
fb2a286 to 4b892ba
…and local+softcap
tests: skip SM100 non-varlen kernel hangs with deterministic+softcap …
tests: only print failures to stdout, log test start/all results to file
SM100 varlen backward deadlocks whenever local attention is enabled, regardless of softcap or deterministic setting. Broaden the two existing varlen+local skips into one unconditional IS_SM100+local skip.
Summary
Add a nightly CI pipeline for FA4 that automatically tracks correctness and performance every day.
Problem: The existing CI only runs 2 test cases (fast gate for merges). There's no way to know if a merge caused a
regression until someone manually benchmarks. Benchmark results are also never recorded, so there's no performance
history.
What this adds:
skips for speed
benchmark-data branch as benchmark_history.jsonl
Canonical benchmark configs
Files changed
benchmarks/benchmark_attn.py - add --json-output flag (structured result collection, no behavior change otherwise)
benchmarks/bench_nightly.py - new thin wrapper: defines canonical configs, calls benchmark_attn.py, handles clock locking
tools/ci/push_results.py - appends JSON results to benchmark-data branch (creates orphan branch on first run)
tools/ci/slack_notify.py - reads last 7 runs, posts comparison table + regression alerts to Slack
.github/workflows/nightly.yml - scheduled at 08:00 UTC, runs tests + benchmark in parallel, notifies on failure
Required setup
FA4_NIGHTLY_SLACK_WEBHOOK (Slack incoming webhook URL)
b200
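
As a rough illustration of the benchmark_history.jsonl flow described in this PR (append one record per run, read back the last 7), a minimal sketch might look like the following. The record fields (`timestamp`, `commit`, `results`) and the helper names are assumptions, not the schema used by tools/ci/push_results.py, and the git push to the benchmark-data branch is omitted.

```python
import json
from datetime import datetime, timezone


def append_run(history_path, commit, results):
    """Append one benchmark run as a single JSON line (hypothetical schema)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": commit,
        "results": results,
    }
    with open(history_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_last_runs(history_path, n=7):
    """Return the newest n records, oldest first."""
    with open(history_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    return [json.loads(line) for line in lines[-n:]]
```

One JSON object per line keeps appends cheap and lets the history be read without parsing the whole file as a single document.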