tests: fix SM100 varlen backward failures on B200 #2534

Draft
Johnsonms wants to merge 5 commits into Dao-AILab:main from Johnsonms:fix/varlen-test-sm100

@Johnsonms
Collaborator

Summary

  • Skip SM100 hd192 bwd + softcap: d=192 on SM100 requires 2CTA instructions, but softcap > 0.0 injects a score_mod that forces use_2cta_instrs=False, which trips the assertion in FlashAttentionBackwardSm100.__init__. Added a pytest.skip in the varlen backward block, matching the existing pattern for d=256. Fixes 49 CI failures.

  • Retry on AcceleratorError OOM: retry_on_oom only caught torch.OutOfMemoryError. Async CUDA OOM raises torch.AcceleratorError instead (allocation fails in a prior op, surfaces on next API call). Extended the catch to include both, still guarded by the "out of memory" message check.

Repro

AssertionError: Must use 2CTA for hdim 192
flash_attn/cute/flash_bwd_sm100.py:93
Triggered by any test_flash_attn_varlen_output case with d=192, softcap=15.0 on SM100 (B200). Root cause: FlashAttentionBackwardSm100 sets use_2cta_instrs = use_2cta_instrs and ... and score_mod is None, so softcap's score_mod silently disables 2CTA, then the assertion fires. The non-varlen test (test_flash_attn_output) was already guarded by and softcap == 0.0 in its backward condition; the varlen test was missing the equivalent guard.
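The interaction described above can be modeled in a few lines. This is a toy sketch, not the code in flash_bwd_sm100.py: `resolve_use_2cta`, `should_skip_varlen_bwd`, and `SM100_MAJOR` are hypothetical names, only the `score_mod is None` term quoted above is modeled (the elided terms of the real condition are left out), and it assumes SM100 reports compute capability major version 10.

```python
from typing import Callable, Optional

# Assumption: SM100 (B200) corresponds to compute capability major version 10.
SM100_MAJOR = 10


def resolve_use_2cta(requested: bool, head_dim: int,
                     score_mod: Optional[Callable]) -> bool:
    """Toy model of the failure; not the real FlashAttentionBackwardSm100 logic."""
    # Only the score_mod term quoted in the PR is modeled here; the other
    # terms of the real condition are elided in the PR and omitted.
    use_2cta = requested and score_mod is None
    if head_dim == 192:
        # Mirrors the assertion message from the repro.
        assert use_2cta, "Must use 2CTA for hdim 192"
    return use_2cta


def should_skip_varlen_bwd(sm_major: int, head_dim: int, softcap: float) -> bool:
    """Hypothetical predicate for the guard added to the varlen backward block."""
    return sm_major == SM100_MAJOR and head_dim == 192 and softcap > 0.0
```

In a test, a predicate like this would gate a `pytest.skip(...)` call before the backward pass is launched, so the d=192, softcap=15.0 cases are skipped instead of hitting the assertion.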

For the OOM: torch.AcceleratorError: CUDA error: out of memory surfaces at an innocent call (lengths[i] = 0) because the actual allocation failure happened asynchronously in a prior CUDA op during concurrent kernel compilation across 64 xdist workers.
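A retry helper that accepts more than one exception type but still checks the message could look roughly like the sketch below. The real retry_on_oom implementation is not shown in this PR, so `retry_on_oom_sketch`, its parameters, and `FakeAcceleratorError` are hypothetical; the stand-in exception class only exists so the example runs without torch or a GPU.

```python
import functools
import time
from typing import Callable, Tuple, Type


def retry_on_oom_sketch(
    exc_types: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    delay_s: float = 0.0,
) -> Callable:
    """Retry a callable when one of exc_types carries an OOM message."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_types as exc:
                    # Re-raise errors that are not OOMs, and give up once
                    # the retry budget is exhausted.
                    is_oom = "out of memory" in str(exc).lower()
                    if not is_oom or attempt == max_retries:
                        raise
                    time.sleep(delay_s)

        return wrapper

    return decorator


class FakeAcceleratorError(RuntimeError):
    """Stand-in for torch.AcceleratorError (assumption for this demo only)."""


calls = {"n": 0}


@retry_on_oom_sketch((MemoryError, FakeAcceleratorError), max_retries=2)
def flaky() -> str:
    # Fails twice with an async-style OOM message, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeAcceleratorError("CUDA error: out of memory")
    return "ok"
```

In the test suite itself, the tuple would presumably be `(torch.OutOfMemoryError, torch.AcceleratorError)`, as the summary describes. Keeping the message check means other accelerator errors still fail immediately.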

Test plan

Ran on B200 (SM100) locally:
pytest tests/cute/test_flash_attn.py -k "test_flash_attn_varlen_output and 192 and 15.0"

Result: 48384 skipped, 0 failed (1:27:33) — all previously failing cases now skip correctly via the new guard.
Full suite result with both fixes applied:
168605 passed, 249112 skipped, 0 failed (0:32:59)

Johnsonms and others added 4 commits May 3, 2026 04:39

hdim=192 on SM100 requires 2CTA instructions, but softcap injects a
score_mod that disables 2CTA, triggering the assertion in
FlashAttentionBackwardSm100.__init__. The non-varlen test already
gates its backward on softcap==0.0; add the equivalent skip to the
varlen backward block.

torch.AcceleratorError is the async variant of OOM: the allocation fails
in a prior CUDA op and the error surfaces on the next API call. The existing
retry_on_oom only caught torch.OutOfMemoryError, so async OOMs caused by
concurrent kernel compilation across 64 xdist workers were not retried.

SM100 varlen kernel hangs when deterministic=True and softcap > 0.0.
Skip until the kernel-side bug is fixed.
@github-actions github-actions Bot force-pushed the fix/varlen-test-sm100 branch from 27d692d to fc577ea May 5, 2026 00:45