[CuTe, SM103] Update architecture assertion for SM 10.x and 11.x by ocss884 · Pull Request #2572 · Dao-AILab/flash-attention

ocss884 · 2026-05-17T11:38:48Z

Fix the architecture check in flash_fwd_sm100.py. This keeps the intended support scope for SM 10.x and SM 11.x while avoiding incorrect behavior caused by different Arch enum mappings between cu13 and non-cu13 CuteDSL.

On a B300 (sm_103) with non-cu13 CuteDSL, the assertion introduced in 463623e narrows the effective supported range, because the Arch class in CuteDSL maps the enum members differently depending on the build:

For non-cu13 CuteDSL: map Arch.sm_110 and Arch.sm_101 to SM101
For cu13 CuteDSL: map Arch.sm_101 and Arch.sm_110 to SM110

As a result, with non-cu13 CuteDSL the effective range check becomes sm100 <= arch <= sm101, which unintentionally excludes SM103.
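The collapse described above can be illustrated with a toy model. This is only a sketch: the rank tables and the `in_enum_range` helper are hypothetical stand-ins invented for illustration, not the real CuteDSL `Arch` class, and the numeric ranks are assumptions chosen to mirror the two mappings listed in the description.

```python
# Toy model of the aliasing described in the PR text.
# These tables are illustrative assumptions, NOT the real CuteDSL Arch enum.

# Non-cu13 build: sm_110 and sm_101 both resolve to SM101.
NON_CU13_RANK = {"sm_100": 100, "sm_101": 101, "sm_103": 103, "sm_110": 101}

# cu13 build: sm_101 and sm_110 both resolve to SM110.
CU13_RANK = {"sm_100": 100, "sm_101": 110, "sm_103": 103, "sm_110": 110}


def in_enum_range(arch: str, rank: dict) -> bool:
    """Range-style check: sm_100 <= arch <= sm_110, compared by resolved rank."""
    return rank["sm_100"] <= rank[arch] <= rank["sm_110"]


if __name__ == "__main__":
    for build, rank in (("non-cu13", NON_CU13_RANK), ("cu13", CU13_RANK)):
        # On the non-cu13 table the upper bound resolves to 101, so sm_103 fails.
        print(build, "accepts sm_103:", in_enum_range("sm_103", rank))
```

Under these assumed tables the same source-level range accepts sm_103 on the cu13 mapping and rejects it on the non-cu13 mapping, which is the behavior the description reports.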

Workaround for the flash_attn v4 cute kernel's sm_103 assertion failure in the Qwen3.5-VL vision encoder (filed as sgl-project/sglang#25564, upstream fix in Dao-AILab/flash-attention#2572). The text decoder still uses --attention-backend trtllm_mha; this only swaps the multi-modal (vision encoder) attention path to triton_attn, bypassing the broken flash_attn cute dispatch on B300. Suggested by upstream sglang reviewer.
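As a rough sketch of the workaround, the snippet below assembles a server launch command that keeps `--attention-backend trtllm_mha` for the text decoder and sets `--mm-attention-backend triton_attn` for the vision encoder. The two flags and their values come from the commit message; the `build_launch_cmd` helper and the model path are hypothetical placeholders, and a real deployment will carry additional arguments.

```python
import shlex


def build_launch_cmd(model_path: str) -> list:
    """Assemble an SGLang launch command with the split attention backends.

    model_path is a placeholder; substitute the actual checkpoint location.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        # Text decoder attention path stays on trtllm_mha.
        "--attention-backend", "trtllm_mha",
        # Vision encoder (multi-modal) attention path is switched to triton_attn.
        "--mm-attention-backend", "triton_attn",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(build_launch_cmd("/path/to/model")))
```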

Same workaround as PR #1422 — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 + Dao-AILab/flash-attention#2572 for the upstream root cause and the in-flight fix.

Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.

#1422)

* Update qwen3.5-bf16-b300-sglang and qwen3.5-bf16-b300-sglang-mtp SGLang image to v0.5.12-cu130

Ref #1154

Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>

* fix(qwen3.5_bf16_b300): use --mm-attention-backend triton_attn

Workaround for the flash_attn v4 cute kernel's sm_103 assertion failure in the Qwen3.5-VL vision encoder (filed as sgl-project/sglang#25564, upstream fix in Dao-AILab/flash-attention#2572). The text decoder still uses --attention-backend trtllm_mha; this only swaps the multi-modal (vision encoder) attention path to triton_attn, bypassing the broken flash_attn cute dispatch on B300. Suggested by upstream sglang reviewer.

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
Co-authored-by: claude-rebase-bot <claude-rebase-bot@local>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>

….5.12-cu130 (#1475)

* Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130

Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.5_fp4_b300): use --mm-attention-backend triton_attn

Same workaround as #1422 (bf16) and #1451 (fp8): bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.

* Re-trigger sweep (previous Run Sweep run stuck pending with 0 jobs)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>

@jayhshah

…2590)

* Fix bwd postprocess 2CTA gating to include sm_11x

The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`, which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor). The rest of the codebase (e.g. interface.py:549, 563, 834) consistently gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`. Bring the two postprocess sites in line with that convention.

Flagged by @jayhshah in #2572 follow-up discussion.

* Include sm_110 in interface.py Blackwell-family heuristics

Three sites in interface.py gate Blackwell-family behavior using `arch // 10 == 10`, which appears inconsistent with the rest of the file's `arch // 10 in [10, 11]` convention (used at lines 549, 563, 834, 974, 1035, etc.):

- L533: `q_stage` heuristic for Blackwell forward
- L579: `use_dedicated_hd256_kernel` (forward)
- L1335: `use_dedicated_hd256_kernel` (backward)

The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x through the same `FlashAttentionForwardSm100` / MLA classes, so these gates likely should treat them the same.

NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional SM100-only paths. If any of them is intentional, please flag so I can revert just that hunk. The FP8 assert at L480 is left untouched on purpose: its error message reads as deliberate.

* Apply ruff format to flash_bwd_sm100.py

Pre-existing format drift surfaced by pre-commit. Not in the cute_exclude pattern, so it gets auto-fixed when other files in flash_attn/cute/ are touched in the same commit chain.
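The difference between the two gating expressions quoted in this commit message can be checked directly. The sketch below assumes `arch` is an integer such as 100, 103, or 110 (that encoding is an assumption made for illustration), and the two function names are hypothetical; only the expressions `arch // 10 == 10` and `arch // 10 in [10, 11]` come from the commit text.

```python
def gate_sm10x_only(arch: int) -> bool:
    """Older gating form: matches SM 10.x only."""
    return arch // 10 == 10


def gate_blackwell_family(arch: int) -> bool:
    """Convention cited in the commit message: matches SM 10.x and SM 11.x."""
    return arch // 10 in [10, 11]


if __name__ == "__main__":
    # arch values are assumed integer encodings (e.g. 103 for sm_103).
    for arch in (90, 100, 103, 110, 120):
        print(arch, gate_sm10x_only(arch), gate_blackwell_family(arch))
```

With integer division, 103 falls in the 10.x bucket under both forms, while 110 is accepted only by the second form, which is the gap the follow-up commit closes.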

Update architecture assertion for SM 10.x and 11.x

8d9203b

functionstackx mentioned this pull request May 18, 2026

[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update qwen3.5-fp8-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 SemiAnalysisAI/InferenceX#1451

Merged

1 task

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 SemiAnalysisAI/InferenceX#1475

Merged

1 task

functionstackx mentioned this pull request May 18, 2026

[AI Generated] [Handoff] out of 70+ image updates, 13 stuck Klaud Cold PRs need upstream coordination / scope decisions SemiAnalysisAI/InferenceX#1511

Open

janbernloehr mentioned this pull request May 20, 2026

fix(sm100): make arch check robust against CUTLASS DSL Arch enum changes #2575

Closed

Kangyan-Zhou mentioned this pull request May 21, 2026

[Revert] nvidia-cutlass-dsl[cu13] 4.5.1 -> 4.5.0 sgl-project/sglang#25938

Merged

Johnsonms approved these changes May 24, 2026

Johnsonms merged commit 2d5d5a1 into Dao-AILab:main May 24, 2026

This was referenced May 25, 2026

Use is_family_of for sm_90 and sm_103 arch checks #2589

Open

Include sm_110 in Blackwell-family arch gating (follow-up to #2572) #2590

Merged

Johnsonms merged 1 commit into Dao-AILab:main from ocss884:patch-1
