Add nightly benchmark CI: canonical FA4 benchmark tracking#2533

Draft
Johnsonms wants to merge 47 commits into
Dao-AILab:mainfrom
Johnsonms:nightly-ci

@Johnsonms
Collaborator

Summary

Add a nightly CI pipeline for FA4 that automatically tracks correctness and performance every day.

Problem: The existing CI only runs 2 test cases (fast gate for merges). There's no way to know if a merge caused a
regression until someone manually benchmarks. Benchmark results are also never recorded, so there's no performance
history.

What this adds:

  • Full nightly test suite — runs all parametrized tests (all hdim, seqlen, varlen, GQA, score_mod, etc.) that CI
    skips for speed
  • Canonical benchmark tracking — fixed configs run every night, results stored in benchmark-data branch as
    benchmark_history.jsonl
  • 7-day trend + regression alerts — Slack notification with today vs yesterday delta and alert if any config
    drops >2% below the 6-day average

  • Clock locking — GPU clocks locked to max before benchmarking for reproducible numbers
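The regression alert rule above can be sketched as follows. This is a minimal illustration, not the code in tools/ci/slack_notify.py: the function name `find_regressions`, the dict-based data layout, and the assumption that the metric is a throughput where higher is better are all hypothetical.

```python
from statistics import mean


def find_regressions(history, today, threshold=0.02, window=6):
    """Flag configs whose value today is more than `threshold` below the
    average of the previous `window` runs.

    history: dict mapping config name -> list of past values, oldest first
             (hypothetical layout, not the real benchmark_history.jsonl schema)
    today:   dict mapping config name -> today's value
    Assumes higher values are better (e.g. TFLOPS).
    """
    alerts = []
    for cfg, value in today.items():
        prior = history.get(cfg, [])[-window:]
        if not prior:
            continue  # no baseline yet for this config
        baseline = mean(prior)
        if baseline <= 0:
            continue  # avoid dividing by a meaningless baseline
        drop = (baseline - value) / baseline
        if drop > threshold:
            alerts.append((cfg, value, baseline, drop))
    return alerts
```

A config with no prior runs is skipped rather than flagged, so newly added canonical configs do not alert on their first night.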

Canonical benchmark configs

| Group | Shape | Scenario |
| --- | --- | --- |
| MHA | hdim 64/128/256 | fwd + bwd, seqlen 4k/16k, causal both |
| MLA decode | hdim 64/512, nheads=128/1 | fwd, seqlen_q=1, seqlen_kv 4k/16k/64k, batch=128 |
| MLA prefill | hdim 64/512, nheads=128/1 | fwd, seqlen 4k/16k |
| DeepSeek shape | hdim 192/128 | fwd, seqlen 4k/16k, causal both |
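As a rough illustration of what "fixed configs" means for the MHA group, the sweep could be enumerated like this. The `BenchConfig` dataclass and `mha_configs` function are hypothetical names, not taken from bench_nightly.py, and the sketch assumes "4k/16k" means 4096/16384 tokens and "causal both" means causal and non-causal runs.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchConfig:
    # Hypothetical record type; field names are assumptions for illustration.
    group: str
    hdim: int
    seqlen: int
    causal: bool
    direction: str  # "fwd" or "bwd"


def mha_configs():
    """Enumerate the MHA sweep: every combination of hdim, seqlen, causal, direction."""
    hdims = (64, 128, 256)
    seqlens = (4096, 16384)      # assumed meaning of 4k / 16k
    causal_modes = (False, True)  # assumed meaning of "causal both"
    directions = ("fwd", "bwd")
    return [
        BenchConfig("MHA", hdim, seqlen, causal, direction)
        for hdim, seqlen, causal, direction in product(
            hdims, seqlens, causal_modes, directions
        )
    ]
```

Keeping the list frozen and hashable makes it straightforward to use a config as a dictionary key when comparing runs across nights.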

Files changed

  • benchmarks/benchmark_attn.py — add --json-output flag (structured result collection, no behavior change
    otherwise)
  • benchmarks/bench_nightly.py — new thin wrapper: defines canonical configs, calls benchmark_attn.py, handles
    clock locking
  • tools/ci/push_results.py — appends JSON results to benchmark-data branch (creates orphan branch on first run)
  • tools/ci/slack_notify.py — reads last 7 runs, posts comparison table + regression alerts to Slack
  • .github/workflows/nightly.yml — scheduled at 08:00 UTC, runs tests + benchmark in parallel, notifies on failure
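The append-then-read-last-7 flow between push_results.py and slack_notify.py can be sketched with plain JSON Lines handling. This is an assumption-laden sketch: `append_run` and `last_runs` are hypothetical helpers, the record fields are made up, and the git operations on the benchmark-data branch are left out entirely.

```python
import json
from pathlib import Path


def append_run(path, record):
    """Append one run as a single JSON line (one object per line, JSONL style)."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def last_runs(path, n=7):
    """Return the most recent `n` runs, oldest first; empty list if no history yet."""
    p = Path(path)
    if not p.exists():
        return []
    lines = [ln for ln in p.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return [json.loads(ln) for ln in lines[-n:]]
```

An append-only file keeps each nightly commit to the data branch a one-line diff, which is likely why a JSONL history suits this kind of tracking.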

Required setup

  • GitHub secret: FA4_NIGHTLY_SLACK_WEBHOOK (Slack incoming webhook URL)
  • Self-hosted runner with label b200

@Johnsonms Johnsonms marked this pull request as draft May 2, 2026 19:02
Johnsonms added 24 commits May 2, 2026 19:03
- benchmark_attn.py: add --json-output flag for structured result collection
- bench_nightly.py: thin wrapper running canonical MHA + MLA configs via benchmark_attn.py, with clock locking
- tools/ci/push_results.py: append results to benchmark-data branch
- tools/ci/slack_notify.py: 7-day trend table with regression alerts (>2% below avg)
- .github/workflows/nightly.yml: scheduled full test suite + benchmark, posts to Slack

Canonical configs: MHA (hdim 64/128/256), MLA-absorbed decode/prefill (64/512, nheads=128), DeepSeek shape (192/128).
Add nightly benchmark CI: canonical FA4 benchmark tracking
fix push_results: use git checkout --orphan for older git versions
fix slack_notify: unpack 7-element cfg_key tuple, add group/hdim_v co…
slack_notify: per-group sections, bar_chart header, run URL link
nightly: lock GPU clocks on host before Apptainer, reset after
slack_notify: Block Kit rich_text_preformatted for compact monospace …
Johnsonms added 3 commits May 3, 2026 01:46
…emory threshold

Three root causes behind CI failures at 64 xdist workers:

1. cache_utils: LOCK_TIMEOUT_SECONDS 15→300.  With 64 workers queuing on
   the same kernel's exclusive lock (0.2s poll interval), the last worker
   needed ≥18s; the old 15s limit caused RuntimeError on every cache write.

2. conftest: atomic gpu_ids.json write (temp-file + rename) + retry loop on
   JSONDecodeError.  Workers 1-63 could read a truncated file during worker
   0's open(mode="w")+write sequence, crashing pytest_configure.

3. run_fa4_ci: max_used_memory_mb 1000→8000.  B200 driver baseline is
   626–1254 MiB; the old threshold falsely reported all GPUs as busy.
ci: fix 64-worker test failures from lock timeout, GPU-id race, and m…
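The second fix (temp-file plus rename, with a retry on JSONDecodeError) follows a common pattern that can be sketched as below. The function names are hypothetical and this is not the conftest code itself; the retry count and delay are arbitrary placeholder values.

```python
import json
import os
import tempfile
import time


def write_json_atomic(path, data):
    """Write JSON to a temp file in the same directory, then rename over `path`.

    Readers see either the old file or the complete new one, never a partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_json_with_retry(path, attempts=10, delay=0.1):
    """Read JSON, retrying if the file is missing or not yet parseable."""
    for attempt in range(attempts):
        try:
            with open(path, encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

The temp file is created in the target directory on purpose: a rename across filesystems is not atomic, so writing to the system temp directory would reintroduce the race.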
hdim=192 on SM100 requires 2CTA instructions, but softcap injects a
score_mod that disables 2CTA, triggering the assertion in
FlashAttentionBackwardSm100.__init__. The non-varlen test already
gates its backward on softcap==0.0; add the equivalent skip to the
varlen backward block.
Johnsonms added 3 commits May 2, 2026 20:56
tests: skip SM100 hd192 bwd with softcap in varlen test
torch.AcceleratorError is the async variant of OOM — the allocation fails
in a prior CUDA op and the error surfaces on the next API call. The existing
retry_on_oom only caught torch.OutOfMemoryError, so async OOMs caused by
concurrent kernel compilation across 64 xdist workers were not retried.
tests: retry on AcceleratorError OOM in addition to OutOfMemoryError
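The shape of that retry change can be sketched without PyTorch by using stand-in exception classes. Everything here is illustrative: `retry_on`, `looks_like_oom`, and the two `Fake...` classes are hypothetical, and matching on the message text is an assumption about how an OOM-flavoured accelerator error might be told apart from other accelerator errors, not a description of the real retry_on_oom helper.

```python
import functools
import time


class FakeOutOfMemoryError(RuntimeError):
    """Stand-in for torch.OutOfMemoryError (synchronous OOM)."""


class FakeAcceleratorError(RuntimeError):
    """Stand-in for torch.AcceleratorError (error surfaced on a later API call)."""


def looks_like_oom(exc):
    # Assumption: an async OOM is recognisable from its message text.
    return isinstance(exc, FakeOutOfMemoryError) or "out of memory" in str(exc).lower()


def retry_on(exc_types, is_retryable=lambda exc: True, attempts=3, delay=0.0):
    """Retry the wrapped call when it raises one of `exc_types` and
    `is_retryable(exc)` is true; re-raise on the last attempt or otherwise."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_types as exc:
                    if attempt == attempts or not is_retryable(exc):
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

In a real test suite the tuple passed to `retry_on` would name the actual PyTorch exception types, and the retry would typically free cached memory before the next attempt.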
@Johnsonms Johnsonms force-pushed the nightly-ci branch 2 times, most recently from 983f0f2 to d62c178 Compare May 3, 2026 06:50
Johnsonms and others added 10 commits May 3, 2026 06:50
Useful when Pass 1 already completed successfully in a prior run and only
Pass 2 (real GPU execution) needs to be re-run. Exposed as --skip-compile
in run_fa4_ci.py, skip-compile input in the gpu-test action, and
skip_compile workflow_dispatch input in the Nightly workflow.
SM100 varlen kernel hangs when deterministic=True and softcap > 0.0.
Skip until the kernel-side bug is fixed.
SM100 varlen backward deadlocks whenever local attention is enabled,
regardless of softcap or deterministic setting. Broaden the two
existing varlen+local skips into one unconditional IS_SM100+local skip.