[CuTe, SM103] Update architecture assertion for SM 10.x and 11.x#2572

Merged
Johnsonms merged 1 commit into
Dao-AILab:mainfrom
ocss884:patch-1
May 24, 2026

@ocss884
Contributor

@ocss884 ocss884 commented May 17, 2026

Fix the architecture check in flash_fwd_sm100.py. This keeps the intended support scope for SM 10.x and SM 11.x while avoiding incorrect behavior caused by different Arch enum mappings between cu13 and non-cu13 CuteDSL.

When running on B300 (sm103) with a non-cu13 CuteDSL build, the assertion introduced in 463623e narrows the effective supported range, because the Arch class in CuteDSL aliases its enum members differently per build:

  • Non-cu13 CuteDSL: Arch.sm_110 and Arch.sm_101 both map to SM101
  • cu13 CuteDSL: Arch.sm_101 and Arch.sm_110 both map to SM110

As a result, with non-cu13 CuteDSL the effective range check becomes sm100 <= arch <= sm101, which unintentionally excludes SM103.
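
To illustrate the aliasing effect, here is a minimal sketch using plain Python enums. The class names `ArchNonCu13` and `ArchCu13`, the integer values, and the helper `in_supported_range` are hypothetical stand-ins chosen for this example; they are not the real CuteDSL `Arch` class or the exact assertion in flash_fwd_sm100.py.

```python
from enum import Enum


# Hypothetical stand-ins for the two CuteDSL builds. The integer values are
# illustrative only; the real Arch enum may encode architectures differently.
class ArchNonCu13(Enum):
    sm_100 = 100
    sm_101 = 101
    sm_103 = 103
    sm_110 = 101  # same value as sm_101, so Python makes it an alias of sm_101


class ArchCu13(Enum):
    sm_100 = 100
    sm_103 = 103
    sm_110 = 110
    sm_101 = 110  # same value as sm_110, so Python makes it an alias of sm_110


def in_supported_range(arch_enum, name):
    """Simplified range check: sm_100 <= arch <= sm_110 (assumed form)."""
    lower = arch_enum.sm_100.value
    upper = arch_enum.sm_110.value
    return lower <= arch_enum[name].value <= upper


# On the non-cu13 stand-in the upper bound collapses to 101, so sm_103 fails.
print(in_supported_range(ArchNonCu13, "sm_103"))
# On the cu13 stand-in the upper bound stays at 110, so sm_103 passes.
print(in_supported_range(ArchCu13, "sm_103"))
```

The same source-level bound (`sm_110`) resolves to different values in the two stand-in classes, which is why a range check written against enum members can behave differently across builds.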

functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 18, 2026
Workaround for the flash_attn v4 cute kernel's sm_103 assertion failure
in the Qwen3.5-VL vision encoder (filed as sgl-project/sglang#25564,
upstream fix in Dao-AILab/flash-attention#2572).

The text decoder still uses --attention-backend trtllm_mha; this only
swaps the multi-modal (vision encoder) attention path to triton_attn,
bypassing the broken flash_attn cute dispatch on B300.

Suggested by upstream sglang reviewer.
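
For reference, the workaround described above could be sketched as the following argument list. Only the two backend flags and their values come from the commit message; the `build_launch_args` helper, the `sglang.launch_server` entry point with `--model-path`, and the placeholder model path are assumptions made for this illustration, and the real InferenceX launch scripts may look different.

```python
import shlex


def build_launch_args(model_path: str) -> list[str]:
    """Build a hypothetical SGLang launch command with the workaround applied."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,              # placeholder path, not from the PR
        "--attention-backend", "trtllm_mha",     # text decoder backend, unchanged
        "--mm-attention-backend", "triton_attn", # vision encoder path only
    ]


# Print the command as a shell-quoted string for inspection.
print(shlex.join(build_launch_args("/path/to/qwen3.5-checkpoint")))
```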
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 18, 2026
Same workaround as PR #1422 — bypass the broken flash-attn cute kernel
sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only
the multi-modal attention path to triton_attn. Text decoder still uses
--attention-backend trtllm_mha.

See sgl-project/sglang#25564 + Dao-AILab/flash-attention#2572 for the
upstream root cause and the in-flight fix.
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 18, 2026
Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken
flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision
encoder by switching only the multi-modal attention path to triton_attn.
Text decoder still uses --attention-backend trtllm_mha.

See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on
non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and
Dao-AILab/flash-attention#2572 for the upstream fix in flight.
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 18, 2026
#1422)

* Update qwen3.5-bf16-b300-sglang and qwen3.5-bf16-b300-sglang-mtp SGLang image to v0.5.12-cu130

Ref #1154

Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>

* fix(qwen3.5_bf16_b300): use --mm-attention-backend triton_attn

Workaround for the flash_attn v4 cute kernel's sm_103 assertion failure
in the Qwen3.5-VL vision encoder (filed as sgl-project/sglang#25564,
upstream fix in Dao-AILab/flash-attention#2572).

The text decoder still uses --attention-backend trtllm_mha; this only
swaps the multi-modal (vision encoder) attention path to triton_attn,
bypassing the broken flash_attn cute dispatch on B300.

Suggested by upstream sglang reviewer.

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
Co-authored-by: claude-rebase-bot <claude-rebase-bot@local>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 20, 2026
Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken
flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision
encoder by switching only the multi-modal attention path to triton_attn.
Text decoder still uses --attention-backend trtllm_mha.

See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on
non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and
Dao-AILab/flash-attention#2572 for the upstream fix in flight.
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 20, 2026
Same workaround as PR #1422 — bypass the broken flash-attn cute kernel
sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only
the multi-modal attention path to triton_attn. Text decoder still uses
--attention-backend trtllm_mha.

See sgl-project/sglang#25564 + Dao-AILab/flash-attention#2572 for the
upstream root cause and the in-flight fix.
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 20, 2026
….5.12-cu130 (#1475)

* Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130

Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.5_fp4_b300): use --mm-attention-backend triton_attn

Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken
flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision
encoder by switching only the multi-modal attention path to triton_attn.
Text decoder still uses --attention-backend trtllm_mha.

See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on
non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and
Dao-AILab/flash-attention#2572 for the upstream fix in flight.

* Re-trigger sweep (previous Run Sweep run stuck pending with 0 jobs)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
@Johnsonms Johnsonms merged commit 2d5d5a1 into Dao-AILab:main May 24, 2026
jayhshah pushed a commit that referenced this pull request May 26, 2026
…2590)

* Fix bwd postprocess 2CTA gating to include sm_11x

The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`,
which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor).
The rest of the codebase (e.g. interface.py:549, 563, 834) consistently
gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`.

Bring the two postprocess sites in line with that convention.

Flagged by @jayhshah in #2572 follow-up discussion.
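
The difference between the two gating expressions can be sketched as follows. The function names are hypothetical, and `arch` is assumed to be an integer such as 100, 103, or 110, as the `arch // 10` expressions quoted above imply; this is not the code from flash_bwd_postprocess.py.

```python
def gate_sm10x_only(arch: int) -> bool:
    """Old gating: matches SM 10.x only (assumed integer arch, e.g. 103)."""
    return arch // 10 == 10


def gate_blackwell_family(arch: int) -> bool:
    """Convention used elsewhere: matches both SM 10.x and SM 11.x."""
    return arch // 10 in [10, 11]


for arch in (90, 100, 103, 110):
    print(arch, gate_sm10x_only(arch), gate_blackwell_family(arch))
```

With the old expression, an SM 11.x value such as 110 yields `11` after integer division and is rejected; the list-membership form accepts both major versions.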

* Include sm_110 in interface.py Blackwell-family heuristics

Three sites in interface.py gate Blackwell-family behavior using
`arch // 10 == 10`, which appears inconsistent with the rest of the
file's `arch // 10 in [10, 11]` convention (used at lines 549, 563,
834, 974, 1035, etc.):

- L533: `q_stage` heuristic for Blackwell forward
- L579: `use_dedicated_hd256_kernel` (forward)
- L1335: `use_dedicated_hd256_kernel` (backward)

The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x
through the same `FlashAttentionForwardSm100` / MLA classes, so these
gates likely should treat them the same.

NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional
SM100-only paths. If any of them is intentional, please flag so I can
revert just that hunk. The FP8 assert at L480 is left untouched on
purpose — its error message reads as deliberate.

* Apply ruff format to flash_bwd_sm100.py

Pre-existing format drift surfaced by pre-commit. Not in the
cute_exclude pattern, so it gets auto-fixed when other files in
flash_attn/cute/ are touched in the same commit chain.