Add nightly benchmark CI: canonical FA4 benchmark tracking by Johnsonms · Pull Request #2533 · Dao-AILab/flash-attention

Johnsonms · 2026-05-02T19:01:52Z

Summary

Add a nightly CI pipeline for FA4 that automatically tracks correctness and performance every day.

Problem: The existing CI only runs 2 test cases (fast gate for merges). There's no way to know if a merge caused a
regression until someone manually benchmarks. Benchmark results are also never recorded, so there's no performance
history.

What this adds:

Full nightly test suite: runs all parametrized tests (all hdim, seqlen, varlen, GQA, score_mod, etc.) that the merge-gate CI skips for speed
Canonical benchmark tracking: fixed configs run every night, with results stored in the benchmark-data branch as benchmark_history.jsonl
7-day trend + regression alerts: Slack notification with the today-vs-yesterday delta, plus an alert if any config drops more than 2% below the 6-day average
Clock locking: GPU clocks are locked to max before benchmarking for reproducible numbers
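
The regression rule described above (alert when a config falls more than 2% below the average of the preceding six runs) can be sketched as follows. This is only an illustration and not the code in tools/ci/slack_notify.py: the function name `find_regressions`, the history layout (a list of per-run dicts, oldest first, mapping a config key to a throughput number), and the higher-is-better metric are all assumptions.

```python
from statistics import mean

THRESHOLD = 0.02  # alert when today is more than 2% below the baseline average


def find_regressions(history, threshold=THRESHOLD, window=6):
    """Compare the latest run against the mean of the preceding `window` runs.

    `history` is a hypothetical layout: a list of dicts (oldest first), each
    mapping a config key to a throughput number where higher is better.
    Returns {config: (today, baseline_avg, relative_delta)} for regressions.
    """
    if len(history) < 2:
        return {}
    # Last run is "today"; the slice takes up to `window` runs before it.
    today, previous = history[-1], history[-1 - window:-1]
    regressions = {}
    for cfg, value in today.items():
        baseline = [run[cfg] for run in previous if cfg in run]
        if not baseline:
            continue  # new config with no history yet
        avg = mean(baseline)
        delta = (value - avg) / avg
        if delta < -threshold:
            regressions[cfg] = (value, avg, delta)
    return regressions
```

With seven stored runs, the first six form the baseline and the seventh is compared against it, which matches the "7-day trend, 6-day average" wording.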

Canonical benchmark configs

Group	Shape	Scenario
MHA	hdim 64/128/256	fwd + bwd, seqlen 4k/16k, causal both
MLA decode	hdim 64/512, nheads=128/1	fwd, seqlen_q=1, seqlen_kv 4k/16k/64k, batch=128
MLA prefill	hdim 64/512, nheads=128/1	fwd, seqlen 4k/16k
DeepSeek shape	hdim 192/128	fwd, seqlen 4k/16k, causal both
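
As a rough illustration of how one row of this table expands into concrete benchmark cases, the sketch below enumerates the MHA row. The `BenchConfig` fields and the `mha_configs` helper are hypothetical and are not taken from benchmarks/bench_nightly.py; it also assumes that 4k/16k mean 4096/16384 tokens and that "causal both" means causal on and off.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchConfig:
    # Hypothetical record type; field names are illustrative only.
    group: str
    hdim: int
    seqlen: int
    causal: bool
    op: str  # "fwd" or "bwd"


def mha_configs():
    """Expand the MHA row: hdim 64/128/256, fwd + bwd, seqlen 4k/16k, causal on/off."""
    return [
        BenchConfig("MHA", hdim, seqlen, causal, op)
        for hdim, seqlen, causal, op in product(
            (64, 128, 256), (4096, 16384), (False, True), ("fwd", "bwd")
        )
    ]
```

Under these assumptions the MHA row alone yields 3 x 2 x 2 x 2 = 24 fixed cases per night.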

Files changed

benchmarks/benchmark_attn.py: add --json-output flag (structured result collection, no behavior change otherwise)
benchmarks/bench_nightly.py: new thin wrapper that defines canonical configs, calls benchmark_attn.py, and handles clock locking
tools/ci/push_results.py: appends JSON results to benchmark-data branch (creates orphan branch on first run)
tools/ci/slack_notify.py: reads last 7 runs, posts comparison table + regression alerts to Slack
.github/workflows/nightly.yml: scheduled at 08:00 UTC, runs tests + benchmark in parallel, notifies on failure
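
The append-then-read-last-7 flow between push_results.py and slack_notify.py can be sketched roughly as below. This is an assumption-laden illustration, not the PR's code: the helper names `append_run` and `load_last_runs` and the record schema are hypothetical, and the git operations on the benchmark-data branch are left out.

```python
import json
from pathlib import Path


def append_run(history_path, record):
    """Append one run as a single JSON line (JSONL keeps history append-only)."""
    path = Path(history_path)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_last_runs(history_path, n=7):
    """Return the most recent `n` runs, oldest first; a missing file means no history."""
    path = Path(history_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    return [json.loads(line) for line in lines[-n:]]
```

One JSON object per line means a nightly run only ever appends, so earlier history is never rewritten.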

Required setup

GitHub secret: FA4_NIGHTLY_SLACK_WEBHOOK (Slack incoming webhook URL)
Self-hosted runner with label b200
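
For orientation, a Slack incoming webhook accepts a JSON body with a `text` field sent by HTTP POST. The sketch below builds such a request from the secret named above without sending it. The helper name `build_slack_request` is hypothetical, and the real tools/ci/slack_notify.py may construct its payload differently (later commits in this PR mention Block Kit).

```python
import json
import os
import urllib.request

WEBHOOK_ENV = "FA4_NIGHTLY_SLACK_WEBHOOK"  # secret name from the setup list above


def build_slack_request(text, webhook_url=None):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    url = webhook_url or os.environ.get(WEBHOOK_ENV)
    if not url:
        raise RuntimeError(f"{WEBHOOK_ENV} is not set")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is one call once the request is built:
#     urllib.request.urlopen(build_slack_request("nightly benchmark failed"), timeout=10)
```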

- benchmark_attn.py: add --json-output flag for structured result collection
- bench_nightly.py: thin wrapper running canonical MHA + MLA configs via benchmark_attn.py, with clock locking
- tools/ci/push_results.py: append results to benchmark-data branch
- tools/ci/slack_notify.py: 7-day trend table with regression alerts (>2% below avg)
- .github/workflows/nightly.yml: scheduled full test suite + benchmark, posts to Slack

Canonical configs: MHA (hdim 64/128/256), MLA-absorbed decode/prefill (64/512, nheads=128), DeepSeek shape (192/128).


ci: fix 64-worker test failures from lock timeout, GPU-id race, and memory threshold

Three root causes behind CI failures at 64 xdist workers:

1. cache_utils: LOCK_TIMEOUT_SECONDS 15→300. With 64 workers queuing on the same kernel's exclusive lock (0.2s poll interval), the last worker needed ≥18s; the old 15s limit caused RuntimeError on every cache write.
2. conftest: atomic gpu_ids.json write (temp-file + rename) + retry loop on JSONDecodeError. Workers 1-63 could read a truncated file during worker 0's open(mode="w")+write sequence, crashing pytest_configure.
3. run_fa4_ci: max_used_memory_mb 1000→8000. B200 driver baseline is 626–1254 MiB; the old threshold falsely reported all GPUs as busy.
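
The gpu_ids.json fix (temp-file + rename on the writer side, retry on JSONDecodeError on the reader side) follows a common pattern that can be sketched as below. The function names, retry count, and delay are assumptions for illustration; the actual conftest code is not shown in this PR description.

```python
import json
import os
import tempfile
import time


def write_json_atomic(path, data):
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_json_with_retry(path, attempts=50, delay=0.1):
    """Retry while the file is missing or only partially written."""
    for attempt in range(attempts):
        try:
            with open(path, encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

The rename removes the window in which a reader can observe a half-written file; the retry loop is a second line of defense for readers that start before the file exists.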

tests: skip SM100 hd192 bwd with softcap in varlen test

hdim=192 on SM100 requires 2CTA instructions, but softcap injects a score_mod that disables 2CTA, triggering the assertion in FlashAttentionBackwardSm100.__init__. The non-varlen test already gates its backward on softcap==0.0; add the equivalent skip to the varlen backward block.

tests: retry on AcceleratorError OOM in addition to OutOfMemoryError

torch.AcceleratorError is the async variant of OOM: the allocation fails in a prior CUDA op and the error surfaces on the next API call. The existing retry_on_oom only caught torch.OutOfMemoryError, so async OOMs caused by concurrent kernel compilation across 64 xdist workers were not retried.
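
A retry helper that treats both error types as OOM might look like the sketch below. To keep it runnable without PyTorch, `FakeOutOfMemoryError` and `FakeAcceleratorError` are stand-ins for torch.OutOfMemoryError and torch.AcceleratorError. The decorator signature and the message check for "out of memory" are assumptions, not the test suite's actual retry_on_oom.

```python
import functools
import time


class FakeOutOfMemoryError(RuntimeError):
    """Stand-in for torch.OutOfMemoryError so the sketch runs without torch."""


class FakeAcceleratorError(RuntimeError):
    """Stand-in for torch.AcceleratorError (asynchronously reported failure)."""


def _is_oom(exc):
    # Assumption: an async OOM can be recognized from its message text.
    if isinstance(exc, FakeOutOfMemoryError):
        return True
    return isinstance(exc, FakeAcceleratorError) and "out of memory" in str(exc).lower()


def retry_on_oom(max_retries=3, delay=0.0):
    """Re-run the wrapped callable when it fails with a sync or async OOM."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except (FakeOutOfMemoryError, FakeAcceleratorError) as exc:
                    # Non-OOM accelerator errors and exhausted retries propagate.
                    if not _is_oom(exc) or attempt == max_retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

Filtering on the message keeps unrelated accelerator errors (for example an illegal memory access) from being silently retried.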


Useful when Pass 1 already completed successfully in a prior run and only Pass 2 (real GPU execution) needs to be re-run. Exposed as --skip-compile in run_fa4_ci.py, skip-compile input in the gpu-test action, and skip_compile workflow_dispatch input in the Nightly workflow.
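
A flag of this kind is typically a boolean switch that drops the first pass from the run plan. The sketch below shows that shape with argparse. The pass names and the `plan_passes` helper are hypothetical, and the assumption that Pass 1 is the compile pass is inferred from the flag name; run_fa4_ci.py itself is not reproduced here.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Two-pass CI runner (sketch)")
    parser.add_argument(
        "--skip-compile",
        action="store_true",
        help="Skip Pass 1 and run only Pass 2 (real GPU execution).",
    )
    return parser


def plan_passes(argv=None):
    """Return the list of passes to run for the given command line."""
    args = build_parser().parse_args(argv)
    # Pass names are illustrative placeholders.
    passes = [] if args.skip_compile else ["pass1_compile"]
    passes.append("pass2_gpu")
    return passes
```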


Johnsonms marked this pull request as draft May 2, 2026 19:02

Johnsonms force-pushed the nightly-ci branch from c6bc93e to fb0b3b2 Compare May 2, 2026 19:02

Johnsonms added 24 commits May 2, 2026 19:03

Merge pull request #2 from Johnsonms/nightly-ci

257617b

Add nightly benchmark CI: canonical FA4 benchmark tracking

fix push_results: use git checkout --orphan for older git versions

8d71259

Merge pull request #3 from Johnsonms/nightly-ci

3be1814

fix push_results: use git checkout --orphan for older git versions

fix slack_notify: unpack 7-element cfg_key tuple, add group/hdim_v columns

62988b8

Merge pull request #4 from Johnsonms/nightly-ci

ecd8409

fix slack_notify: unpack 7-element cfg_key tuple, add group/hdim_v co…

slack_notify: per-group sections, bar_chart header, run URL link

fa09715

Merge pull request #5 from Johnsonms/nightly-ci

28cded1

slack_notify: per-group sections, bar_chart header, run URL link

nightly: lock GPU clocks on host before Apptainer, reset after

42e9716

Merge pull request #6 from Johnsonms/nightly-ci

c5d8747

nightly: lock GPU clocks on host before Apptainer, reset after

slack_notify: use quoted monospace lines instead of code block

a321a24

nightly: record and display locked clock frequency in Slack header

8aa84fe

Merge pull request #7 from Johnsonms/nightly-ci

e3b30b9

Nightly ci

nightly: pass LOCKED_CLOCK_MHZ into Apptainer so clock shows in Slack

969078f

slack_notify: use code block per section instead of inline backticks (avoids red text)

686fe2a

slack_notify: align column names with benchmark_attn (op, seqlen_q, seqlen_kv, causal)

2c97870

slack_notify: use Δ yday / Δ Nd-avg column headers

eac60c9

Merge pull request #8 from Johnsonms/nightly-ci

07113a6

Nightly ci

slack_notify: Block Kit rich_text_preformatted for compact monospace tables

2021c41

Merge pull request #9 from Johnsonms/nightly-ci

5aba3c1

slack_notify: Block Kit rich_text_preformatted for compact monospace …

ci: parallelize Pass 2 across all free GPUs (one worker per GPU)

389baa0

ci: detect truly idle GPUs (util==0 + memory.used<=1000MB) for low-priority nightly job

011b884

ci: wait up to 60min for idle GPUs, post Slack alert if none available

9433c9b

Merge pull request #10 from Johnsonms/nightly-ci

a32e60e

Nightly ci

Johnsonms force-pushed the nightly-ci branch from b685712 to 74ab78b Compare May 3, 2026 01:46

Johnsonms added 3 commits May 3, 2026 01:46

Merge pull request #11 from Johnsonms/nightly-ci

d7390af

ci: fix 64-worker test failures from lock timeout, GPU-id race, and m…

Johnsonms added 3 commits May 2, 2026 20:56

Merge pull request #12 from Johnsonms/nightly-ci

7c8d0fd

tests: skip SM100 hd192 bwd with softcap in varlen test

Merge pull request #13 from Johnsonms/nightly-ci

307f66f

tests: retry on AcceleratorError OOM in addition to OutOfMemoryError

Johnsonms force-pushed the nightly-ci branch 2 times, most recently from 983f0f2 to d62c178 Compare May 3, 2026 06:50

Johnsonms and others added 10 commits May 3, 2026 06:50

tests: fix SM100 hd192/OOM failures and comment out slow seqlen=4096/4224

b1e4ffe

ci: increase test job timeout from 180 to 300 min for large seqlen tests

d62c178

Merge pull request #14 from Johnsonms/nightly-ci

afaacbb

Nightly ci

tests: log per-test name and duration for CI timeout debugging

c6ad56e

ci: truncate regression text to fit Slack 3000-char section limit

8dde6f9

Merge pull request #15 from Johnsonms/nightly-ci

f49622e

Nightly ci

tests: skip SM100 varlen deterministic+softcap hang

f62595c

SM100 varlen kernel hangs when deterministic=True and softcap > 0.0. Skip until the kernel-side bug is fixed.

tests: skip SM100 varlen local-attention+softcap hang

77426ee

Merge pull request #16 from Johnsonms/nightly-ci

3ea0776

Nightly ci

github-actions Bot force-pushed the nightly-ci branch from fb2a286 to 4b892ba Compare May 5, 2026 00:45

Johnsonms added 6 commits May 5, 2026 00:46

tests: skip SM100 non-varlen kernel hangs with deterministic+softcap and local+softcap

4b892ba

Merge pull request #17 from Johnsonms/nightly-ci

4ad56de

tests: skip SM100 non-varlen kernel hangs with deterministic+softcap …

tests: only print failures to stdout, log test start/all results to file

157d74c

Merge pull request #18 from Johnsonms/nightly-ci

cce748b

tests: only print failures to stdout, log test start/all results to file

tests: skip SM100 kernel hangs with local attention and deterministic=True (#19)

8ceb1c7

tests: skip SM100 kernel hangs with local attention and deterministic=True (#20)

e167cba

Johnsonms force-pushed the nightly-ci branch from 46abb18 to 355f862 Compare May 5, 2026 05:52

tests: skip SM100 varlen kernel hangs with local attention

355f862

SM100 varlen backward deadlocks whenever local attention is enabled, regardless of softcap or deterministic setting. Broaden the two existing varlen+local skips into one unconditional IS_SM100+local skip.

Johnsonms wants to merge 47 commits into Dao-AILab:main from Johnsonms:nightly-ci
