tests: fix SM100 varlen backward failures on B200 by Johnsonms · Pull Request #2534 · Dao-AILab/flash-attention

Johnsonms · 2026-05-03T05:05:43Z

Summary

Skip SM100 hd192 bwd + softcap: d=192 on SM100 requires 2CTA instructions, but softcap > 0.0 injects a score_mod that forces use_2cta_instrs=False, which trips the assertion in FlashAttentionBackwardSm100.__init__. Added pytest.skip in the varlen backward block, matching the existing pattern for d=256. Fixes 49 CI failures.
Retry on AcceleratorError OOM: retry_on_oom only caught torch.OutOfMemoryError. Async CUDA OOM raises torch.AcceleratorError instead (allocation fails in a prior op, surfaces on next API call). Extended the catch to include both, still guarded by the "out of memory" message check.

Repro

AssertionError: Must use 2CTA for hdim 192 (raised at flash_attn/cute/flash_bwd_sm100.py:93)
Triggered by any test_flash_attn_varlen_output case with d=192, softcap=15.0 on SM100 (B200). Root cause: FlashAttentionBackwardSm100 sets use_2cta_instrs = use_2cta_instrs and ... and score_mod is None, so the score_mod injected by softcap silently disables 2CTA, and the assertion then fires. The non-varlen test (test_flash_attn_output) was already guarded by an "and softcap == 0.0" clause in its backward condition; the varlen test was missing the equivalent guard.
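The interaction described above can be illustrated with a small toy model. This is only a sketch, not the code in flash_bwd_sm100.py: resolve_use_2cta, check_bwd_config, make_softcap_mod, and should_skip_varlen_bwd are hypothetical names, other_conditions stands in for the elided "..." terms, and the model assumes the assertion applies only to hdim 192.

```python
import math


def make_softcap_mod(softcap):
    """Stand-in for the score_mod that softcap injects (tanh capping)."""
    def score_mod(score):
        return softcap * math.tanh(score / softcap)
    return score_mod


def resolve_use_2cta(requested, score_mod, other_conditions=True):
    # Same shape as the expression quoted in the PR; `other_conditions`
    # is a placeholder for the elided terms, which are not reproduced here.
    return requested and other_conditions and score_mod is None


def check_bwd_config(head_dim, requested_2cta, score_mod):
    # Toy version of the constructor check: hdim 192 needs 2CTA.
    use_2cta = resolve_use_2cta(requested_2cta, score_mod)
    if head_dim == 192:
        assert use_2cta, "Must use 2CTA for hdim 192"
    return use_2cta


def should_skip_varlen_bwd(head_dim, softcap, is_sm100):
    # Test-side guard: skip the backward pass for the unsupported combination.
    return is_sm100 and head_dim == 192 and softcap > 0.0


if __name__ == "__main__":
    print(check_bwd_config(192, True, None))       # 2CTA stays enabled
    print(should_skip_varlen_bwd(192, 15.0, True))  # guard says: skip
    try:
        check_bwd_config(192, True, make_softcap_mod(15.0))
    except AssertionError as exc:
        print("AssertionError:", exc)
```

In a test, a guard like should_skip_varlen_bwd would be evaluated before the backward call and followed by pytest.skip, so the unsupported configuration never reaches the kernel constructor.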

For the OOM: torch.AcceleratorError: CUDA error: out of memory surfaces at an innocent call (lengths[i] = 0) because the actual allocation failure happened asynchronously in a prior CUDA op during concurrent kernel compilation across 64 xdist workers.
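A minimal sketch of the retry pattern follows, assuming a decorator-style helper. It is not the repository's retry_on_oom: make_retry_on_oom, max_retries, backoff_s, and cleanup are hypothetical names, and the exception types are passed in so that the sketch runs without torch or a GPU. In the PR's setting the tuple would be (torch.OutOfMemoryError, torch.AcceleratorError).

```python
import functools
import time


def make_retry_on_oom(oom_types, max_retries=3, backoff_s=0.0, cleanup=None):
    """Build a decorator that retries a call when an OOM-like error is raised.

    oom_types: tuple of exception classes that may carry an OOM.
    cleanup:   optional zero-argument callable run before each retry
               (hypothetical hook, for example a cache-clearing function).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except oom_types as exc:
                    # Message guard: a broad accelerator error type can also
                    # carry non-OOM failures, which must not be retried.
                    if "out of memory" not in str(exc).lower():
                        raise
                    # Out of attempts: surface the original error.
                    if attempt == max_retries:
                        raise
                    if cleanup is not None:
                        cleanup()
                    # Linear backoff gives other workers time to free memory.
                    time.sleep(backoff_s * (attempt + 1))
        return wrapper
    return decorator


if __name__ == "__main__":
    class FakeAcceleratorError(RuntimeError):
        """Stand-in for an asynchronously surfaced OOM."""

    attempts = {"n": 0}

    @make_retry_on_oom((FakeAcceleratorError,), max_retries=2)
    def allocate():
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise FakeAcceleratorError("CUDA error: out of memory")
        return "allocated"

    print(allocate(), "after", attempts["n"], "attempts")
```

Keeping the message check in place matters more once the broader exception type is caught, since an error that is not an OOM would otherwise be retried and its real cause hidden.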

Test plan

Ran on B200 (SM100) locally:
pytest tests/cute/test_flash_attn.py -k "test_flash_attn_varlen_output and 192 and 15.0"

Result: 48384 skipped, 0 failed (1:27:33) — all previously failing cases now skip correctly via the new guard.
Full suite result with both fixes applied:
168605 passed, 249112 skipped, 0 failed (0:32:59)

hdim=192 on SM100 requires 2CTA instructions, but softcap injects a score_mod that disables 2CTA, triggering the assertion in FlashAttentionBackwardSm100.__init__. The non-varlen test already gates its backward on softcap==0.0; add the equivalent skip to the varlen backward block.

torch.AcceleratorError is the async variant of OOM — the allocation fails in a prior CUDA op and the error surfaces on the next API call. The existing retry_on_oom only caught torch.OutOfMemoryError, so async OOMs caused by concurrent kernel compilation across 64 xdist workers were not retried.

Johnsonms and others added 4 commits May 3, 2026 04:39

tests: skip SM100 varlen deterministic+softcap hang

7679917

SM100 varlen kernel hangs when deterministic=True and softcap > 0.0. Skip until the kernel-side bug is fixed.

tests: skip SM100 varlen local-attention+softcap hang

f89cb36

github-actions bot force-pushed the fix/varlen-test-sm100 branch from 27d692d to fc577ea on May 5, 2026 00:45

tests: skip SM100 non-varlen kernel hangs with deterministic+softcap and local+softcap

fc577ea

Johnsonms wants to merge 5 commits into Dao-AILab:main from Johnsonms:fix/varlen-test-sm100
