fix(sm100): make arch check robust against CUTLASS DSL Arch enum changes by lingolin128 · Pull Request #2575 · Dao-AILab/flash-attention

lingolin128 · 2026-05-19T11:36:33Z

Summary

The current SM100 forward kernel uses a range-based comparison for the architecture
check:

  assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, \
      "Only SM 10.x and 11.x are supported"

This relies on the ordering of Arch enum values, which is fragile. Specifically, in
nvidia-cutlass-dsl 4.5.0, certain Arch enum entries have unexpected .value tuples,
which can cause this assertion to misbehave on valid SM 10.x architectures (e.g.
SM 10.3a / B300).

Fix

fix: #25564
Replace the range comparison with a major-version check, which is independent of how
the suffix variants (a, f, etc.) are ordered in the enum:

  arch_major = self.arch.value[0]
  assert arch_major in [10, 11], "Only SM 10.x and 11.x are supported"

This continues to gate the kernel to SM 10.x and 11.x as intended, but no longer
depends on the relative ordering of Arch.sm_10* / Arch.sm_11* variants in the
upstream CUTLASS DSL package.
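The difference between the two checks can be sketched without installing CUTLASS DSL. The snippet below is a minimal illustration only: `MockArch` is a hypothetical stand-in for the real `Arch` enum, its `(major, minor, suffix)` tuples are invented for the example, and the irregular `sm_110f` entry is an assumption chosen to show how an unexpected tuple can break a range comparison. It does not reproduce the actual values shipped in nvidia-cutlass-dsl 4.5.0.

```python
from enum import Enum


class MockArch(Enum):
    # Hypothetical (major, minor, suffix) tuples; NOT the real CUTLASS DSL values.
    sm_100 = (10, 0, "")
    sm_103a = (10, 3, "a")
    # Assumed irregular entry: named 11.0f but carrying a tuple that sorts below 10.3.
    sm_110f = (10, 1, "f")

    # Ordering is defined by plain tuple comparison on .value (an assumption).
    def __le__(self, other):
        return self.value <= other.value

    def __ge__(self, other):
        return self.value >= other.value


def range_check(arch):
    # Same shape as the original assertion: the result depends on enum ordering.
    return arch >= MockArch.sm_100 and arch <= MockArch.sm_110f


def major_check(arch, supported_majors=(10, 11)):
    # Same shape as the proposed fix: only the major version is inspected.
    return arch.value[0] in supported_majors


for arch in MockArch:
    print(arch.name, range_check(arch), major_check(arch))
```

In this mock, the range check rejects `sm_103a` because its tuple sorts above the upper bound, while the major-version check accepts every entry whose first element is 10 or 11.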

Test

Verified on B300 (SM 10.3a) with nvidia-cutlass-dsl==4.5.0:

  import torch
  from flash_attn.cute import flash_attn_func

  q = torch.randn(1, 128, 4, 128, dtype=torch.bfloat16, device='cuda')
  k = torch.randn(1, 128, 4, 128, dtype=torch.bfloat16, device='cuda')
  v = torch.randn(1, 128, 4, 128, dtype=torch.bfloat16, device='cuda')
  out, lse = flash_attn_func(q, k, v, causal=False)
  print(out.shape)  # torch.Size([1, 128, 4, 128])

lingolin128 · 2026-05-19T11:43:55Z

@Johnsonms @jayhshah Hi there, could you please review this PR? I'd really appreciate it!

janbernloehr · 2026-05-20T09:27:57Z

Seems to be a duplicate of #2572

lingolin128 · 2026-05-20T11:20:10Z

Seems to be a duplicate of #2572

@janbernloehr You're right, this duplicates PR #2572. I ran into this issue and verified the fix in my own usage last week, but I didn’t get around to cleaning it up and submitting the PR sooner. I still hope this change can be merged, but it’s totally fine if it isn’t. Thanks a lot for your review!

lingolin128 · 2026-05-24T15:33:20Z

Regardless, my pull request was submitted afterward. Junrong Lin is an exceptional engineer whom I look up to. Kindly merge his PR #2572. I will keep contributing to open source and strive for continuous improvement. ^^

Johnsonms · 2026-05-24T22:41:14Z

Thanks @lingolin128! You’re very welcome to continue contributing to the community.

fix(sm100): relax arch check to support SM 10.3a (B300)

406cb60

Johnsonms requested review from Johnsonms, jayhshah and tridao May 24, 2026 22:38

Johnsonms closed this May 24, 2026

lingolin128 wants to merge 1 commit into Dao-AILab:main from lingolin128:fix_sm103a_bug

3 participants
