[CuTe, SM103] Update architecture assertion for SM 10.x and 11.x #2572
Merged
This was referenced May 18, 2026
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 18, 2026
Workaround for the flash_attn v4 cute kernel's sm_103 assertion failure in the Qwen3.5-VL vision encoder (filed as sgl-project/sglang#25564, upstream fix in Dao-AILab/flash-attention#2572). The text decoder still uses --attention-backend trtllm_mha; this only swaps the multi-modal (vision encoder) attention path to triton_attn, bypassing the broken flash_attn cute dispatch on B300. Suggested by upstream sglang reviewer.
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 18, 2026
Same workaround as PR #1422 — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 + Dao-AILab/flash-attention#2572 for the upstream root cause and the in-flight fix.
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 18, 2026
Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 18, 2026
#1422)

* Update qwen3.5-bf16-b300-sglang and qwen3.5-bf16-b300-sglang-mtp SGLang image to v0.5.12-cu130

  Ref #1154

  Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>

* fix(qwen3.5_bf16_b300): use --mm-attention-backend triton_attn

  Workaround for the flash_attn v4 cute kernel's sm_103 assertion failure in the Qwen3.5-VL vision encoder (filed as sgl-project/sglang#25564, upstream fix in Dao-AILab/flash-attention#2572). The text decoder still uses --attention-backend trtllm_mha; this only swaps the multi-modal (vision encoder) attention path to triton_attn, bypassing the broken flash_attn cute dispatch on B300. Suggested by upstream sglang reviewer.

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
Co-authored-by: claude-rebase-bot <claude-rebase-bot@local>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
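The workaround described above amounts to passing two different backend flags to the SGLang server. A minimal sketch of how such a command line could be assembled is below; the two backend flags and their values come from the commit message, while the helper name `build_sglang_args` and the `<model-path>` placeholder are hypothetical and only there for illustration.

```python
import shlex


def build_sglang_args(model_path: str) -> list[str]:
    """Assemble an SGLang launch command using the workaround flags.

    `model_path` is a placeholder supplied by the caller; this helper is
    illustrative and not part of SGLang or InferenceX.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        # Text decoder keeps its existing backend.
        "--attention-backend", "trtllm_mha",
        # Vision encoder (multi-modal) path is switched to Triton,
        # which avoids the flash_attn cute dispatch on sm_103.
        "--mm-attention-backend", "triton_attn",
    ]


cmd = build_sglang_args("<model-path>")
print(shlex.join(cmd))
```

Only the multi-modal flag changes; the decoder backend is left as it was.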
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 20, 2026
Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 20, 2026
Same workaround as PR #1422 — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 + Dao-AILab/flash-attention#2572 for the upstream root cause and the in-flight fix.
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 20, 2026
….5.12-cu130 (#1475)

* Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130

  Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.5_fp4_b300): use --mm-attention-backend triton_attn

  Same workaround as #1422 (bf16) and #1451 (fp8): bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.

* Re-trigger sweep (previous Run Sweep run stuck pending with 0 jobs)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
Johnsonms approved these changes on May 24, 2026
This was referenced May 25, 2026
jayhshah pushed a commit that referenced this pull request on May 26, 2026
…2590)

* Fix bwd postprocess 2CTA gating to include sm_11x

  The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`, which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor). The rest of the codebase (e.g. interface.py:549, 563, 834) consistently gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`. Bring the two postprocess sites in line with that convention.

  Flagged by @jayhshah in #2572 follow-up discussion.

* Include sm_110 in interface.py Blackwell-family heuristics

  Three sites in interface.py gate Blackwell-family behavior using `arch // 10 == 10`, which appears inconsistent with the rest of the file's `arch // 10 in [10, 11]` convention (used at lines 549, 563, 834, 974, 1035, etc.):

  - L533: `q_stage` heuristic for Blackwell forward
  - L579: `use_dedicated_hd256_kernel` (forward)
  - L1335: `use_dedicated_hd256_kernel` (backward)

  The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x through the same `FlashAttentionForwardSm100` / MLA classes, so these gates likely should treat them the same.

  NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional SM100-only paths. If any of them is intentional, please flag so I can revert just that hunk. The FP8 assert at L480 is left untouched on purpose; its error message reads as deliberate.

* Apply ruff format to flash_bwd_sm100.py

  Pre-existing format drift surfaced by pre-commit. Not in the cute_exclude pattern, so it gets auto-fixed when other files in flash_attn/cute/ are touched in the same commit chain.
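The difference between the two gating expressions quoted in this commit message can be seen in a small sketch. It assumes `arch` is a plain integer such as 100, 103, or 110, which is what the `arch // 10` expressions imply; the function names are hypothetical and do not appear in flash_attn.

```python
def is_sm10x_only(arch: int) -> bool:
    # The narrower gate: matches SM 10.x only.
    return arch // 10 == 10


def is_blackwell_family(arch: int) -> bool:
    # The convention cited in the commit message: SM 10.x and SM 11.x.
    return arch // 10 in [10, 11]


for arch in (90, 100, 103, 110, 120):
    print(arch, is_sm10x_only(arch), is_blackwell_family(arch))
```

The two gates agree for every value except the SM 11.x ones, which is the gap the commit closes.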
Fix the architecture check in `flash_fwd_sm100.py`. This keeps the intended support scope for SM 10.x and SM 11.x while avoiding incorrect behavior caused by different `Arch` enum mappings between cu13 and non-cu13 CuteDSL.

When using B300 (sm_103) with non-cu13 CuteDSL, the assertion introduced in 463623e narrows the effective supported range, because the `Arch` class in CuteDSL maps:

- non-cu13: `Arch.sm_110` and `Arch.sm_101` to SM101
- cu13: `Arch.sm_101` and `Arch.sm_110` to SM110

As a result, with non-cu13 CuteDSL the effective range check becomes `sm100 <= arch <= sm101`, which unintentionally excludes SM103.
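The collapse can be reproduced with a pair of toy enums. This is a sketch under stated assumptions, not CuteDSL's actual `Arch` class: the enum names `ArchNonCu13` and `ArchCu13`, the integer values, and the helper `in_supported_range` are hypothetical, and the upper bound uses `sm_110` as a simplification of the range endpoint.

```python
from enum import IntEnum


class ArchNonCu13(IntEnum):
    # Hypothetical stand-in for the non-cu13 mapping: sm_110 aliases sm_101.
    sm_100 = 100
    sm_101 = 101
    sm_103 = 103
    sm_110 = 101  # duplicate value, so this is an alias of sm_101


class ArchCu13(IntEnum):
    # Hypothetical stand-in for the cu13 mapping: sm_101 aliases sm_110.
    sm_100 = 100
    sm_103 = 103
    sm_110 = 110
    sm_101 = 110  # duplicate value, so this is an alias of sm_110


def in_supported_range(arch: int, arch_enum) -> bool:
    # Range-style check whose upper bound depends on what sm_110 resolves to.
    return arch_enum.sm_100 <= arch <= arch_enum.sm_110


print("cu13, sm_103:", in_supported_range(103, ArchCu13))
print("non-cu13, sm_103:", in_supported_range(103, ArchNonCu13))
```

With the non-cu13 stand-in the upper bound resolves to 101, so 103 falls outside the range even though the check reads the same in both cases.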