
[Bug] Qwen-3.5 on B300 (sm_103) crashes in flash-attn-4 cute kernel — assertion at flash_fwd_sm100.py:162 (fix exists in Dao-AILab/flash-attention#2572; sglang needs to bump flash-attn-4) #25564

@functionstackx

Human

TL;DR: on v0.5.12 on B300 (sm_103), I'm hitting:

assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
AssertionError: Only SM 10.x and 11.x are supported

This is due to this line in flash-attn, where sm_103 doesn't satisfy both conditions: assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f. A patch for it exists in Dao-AILab/flash-attention#2572.

AI Slop below

Summary

On NVIDIA B300 (Blackwell Ultra, sm_103), lmsysorg/sglang:v0.5.12-cu130 fails to serve any Qwen-3.5-VL recipe (Qwen/Qwen3.5-397B-A17B) because the bundled flash-attn-4 cute kernel's architecture check in flash_attn/cute/flash_fwd_sm100.py:162 rejects sm_103. All TP ranks crash on the first warmup forward pass through the vision encoder; warmup eventually times out (600s) and the server is killed.

The root-cause fix already exists upstream, in Dao-AILab/flash-attention#2572 by @ocss884 — sglang's pinned flash-attn-4>=4.0.0b9 needs to be bumped to whatever Tri Dao tags after that PR merges.

Environment

sglang image | lmsysorg/sglang:v0.5.12-cu130
Hardware | NVIDIA B300 (Blackwell Ultra, sm_103, compute capability 10.3), 4× / 8× GPU per node
Model | Qwen/Qwen3.5-397B-A17B (Qwen-3.5-VL; the vision encoder is the failing path)
Tensor parallelism | TP=4 and TP=8 (identical failure on both)
Spec decoding | Both off and EAGLE-MTP variants hit the same assertion (the vision encoder runs regardless of MTP)
flash-attn version | flash-attn-4>=4.0.0b9 per pyproject.toml
Known-good image | lmsysorg/sglang:v0.5.11-cu130 (does not bundle the buggy assertion)

Note: sm_120 is consumer Blackwell (RTX 50 series / GB20x dies), not B300. B300 is sm_103. Don't propagate the wrong arch ID.
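
To double-check which arch ID a node actually reports, something like the following sketch can help. The helper name sm_name is made up for illustration; the only real API used is torch.cuda.get_device_capability, and the GPU query is skipped if PyTorch or a CUDA device isn't available.

```python
def sm_name(major: int, minor: int) -> str:
    """Format a compute capability pair as an sm_XY label (hypothetical helper)."""
    return f"sm_{major}{minor}"


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query a real GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the device.
        major, minor = torch.cuda.get_device_capability(0)
        print(f"GPU 0 reports {sm_name(major, minor)}")
    else:
        print("No CUDA device visible; nothing to report.")
```

Going by the arch IDs above, a B300 should come back as (10, 3), i.e. sm_103, while a consumer Blackwell card would be (12, 0), i.e. sm_120.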

Exception

[TP0..3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 4041, in run_scheduler_process
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 1210, in get_image_feature
    return self.visual(pixel_values, grid_thw=image_grid_thw)
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 838, in forward
    x = blk(
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 231, in forward
    attn = self.attn(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/vision.py", line 1173, in forward
    output = self.qkv_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/vision.py", line 491, in forward
    output = flash_attn_varlen_func(
  File "/sgl-workspace/sglang/python/sglang/jit_kernel/flash_attention.py", line 273, in flash_attn_varlen_func
    return fa4_flash_attn_varlen_func(
  File "/sgl-workspace/sglang/python/sglang/jit_kernel/flash_attention_v4.py", line 65, in flash_attn_varlen_func
    result = _flash_attn_varlen_func(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 2217, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 2039, in forward
    out, lse = _flash_attn_fwd(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 882, in _flash_attn_fwd
    fa_fwd = flash_fwd_obj_cls(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/flash_fwd_sm100.py", line 162, in __init__
    assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
AssertionError: Only SM 10.x and 11.x are supported

Why a nominally in-range arch trips the assertion

The check is self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f. B300 is sm_103, which is numerically inside [sm_100, sm_110f], so naively the assertion should pass.

Dao-AILab/flash-attention#2572 explains why it doesn't (quoting):

When using B300 (sm103) with non-cu13 cutedsl, current assertion introduced in 463623e will narrow the effective supported range as the arch class in cutedsl will do:

  • For non-cu13 cutedsl: map Arch.sm_110 and Arch.sm_101 to SM101
  • For cu13 cutedsl: map Arch.sm_101 and Arch.sm_110 to SM110

As a result when using non-cu13 cutedsl, the current the effective range check will be sm100 <= ... <= sm101 which unintentionally excludes SM103.

So the comparison operators on the enum don't behave like raw integers — depending on the bundled cutedsl, sm_110f resolves to a value that's less than sm_103, collapsing the range.
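
A toy model of the collapse, to make the arithmetic concrete. The integers here are stand-ins I picked for illustration, not the real cutedsl enum values, and I'm assuming sm_110f resolves the same way the quoted PR text describes for sm_110.

```python
# Toy integers standing in for how the upper bound resolves. These are NOT
# the real cutedsl enum values, only a model of the mapping quoted above.
def upper_bound(cu13: bool) -> int:
    # cu13 cutedsl: the upper bound resolves on the SM110 side.
    # non-cu13 cutedsl: it resolves on the SM101 side instead (assumed).
    return 110 if cu13 else 101


def range_check(arch: int, cu13: bool) -> bool:
    """Mimic `sm_100 <= arch <= sm_110f` using the resolved upper bound."""
    return 100 <= arch <= upper_bound(cu13)


for cu13 in (True, False):
    for arch in (100, 101, 103):
        print(f"cu13={cu13} sm_{arch}: {range_check(arch, cu13)}")
```

In this model sm_103 only fails when the upper bound lands on the SM101 side, while sm_100 passes either way, which matches the behaviour described in this issue.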

The fix in PR #2572 is a one-line change:

-        assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
+        assert self.arch.is_family_of(Arch.sm_100f) or self.arch.is_family_of(Arch.sm_110f), \
+            "Only SM 10.x and 11.x are supported"

is_family_of() correctly matches the SM 10.x / 11.x family regardless of cutedsl version, so sm_103 / sm_103a (and sm_100, sm_100f, sm_101, sm_110, sm_110f) all pass.
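
For intuition only, here is a sketch of why a membership-style check doesn't care how enum values are ordered. This is a toy keyed on the major compute capability; the real Arch.is_family_of() in cutedsl may be implemented quite differently, and in_supported_family is a made-up name.

```python
# Toy model only: the real cutedsl Arch.is_family_of() may work differently.
SUPPORTED_MAJORS = frozenset({10, 11})


def in_supported_family(major: int, minor: int) -> bool:
    """Membership test on the major version; no ordering between values is used."""
    return major in SUPPORTED_MAJORS


for cc in [(9, 0), (10, 0), (10, 3), (11, 0), (12, 0)]:
    print(cc, in_supported_family(*cc))
```

Because nothing is compared with <= or >=, remapping the upper bound cannot shrink the accepted set.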

What sglang needs to do

python/pyproject.toml pins flash-attn-4>=4.0.0b9. Once @tridao reviews & merges Dao-AILab/flash-attention#2572 and tags a new flash-attn-4 release, the sglang Dockerfiles (and pyproject) should bump that floor to the fixed release so B300 stops crashing on Qwen-3.5-VL (and any other vision-encoder path that flows through flash_attn_v4).
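
To confirm which flash-attn-4 build an image actually ships, a small standard-library check can be run inside the container. This is a sketch: it assumes the installed distribution is named flash-attn-4 as in the pyproject pin, and installed_version is a hypothetical helper. It only reports the version, since the fixed release hasn't been tagged yet.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution name assumed from the pyproject.toml pin quoted above.
print("flash-attn-4:", installed_version("flash-attn-4") or "not installed")
```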

Workarounds (until the bump lands)

Workaround | Notes
Pin recipe to lmsysorg/sglang:v0.5.11-cu130 | ✅ works; v0.5.11 didn't bundle the broken assertion
Disable flash-attn-4 in the vision encoder path | Not investigated; might be possible via --mm-attention-backend if a non-FA4 backend is available for Qwen-3.5-VL
Run on B100/B200 (sm_100) instead of B300 | Not affected because the int comparison luckily holds for sm_100 itself

Happy to attach a full server.log artifact if useful.
