[Bug] Qwen-3.5 on B300 (sm_103) crashes in flash-attn-4 cute kernel — assertion at flash_fwd_sm100.py:162 (fix exists in Dao-AILab/flash-attention#2572; sglang needs to bump flash-attn-4)

# Human

TLDR: in v0.5.12 on B300 (sm_103), hitting:

```
assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
AssertionError: Only SM 10.x and 11.x are supported
```

due to this line in `flash-attn`, where `sm_103` doesn't fit within these 2 conditions: `assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f`. There is a patch for it here: https://github.com/Dao-AILab/flash-attention/pull/2572

# AI Slop below
## Summary

On NVIDIA B300 (Blackwell Ultra, `sm_103`), `lmsysorg/sglang:v0.5.12-cu130` fails to serve any Qwen-3.5-VL recipe (Qwen/Qwen3.5-397B-A17B) because the bundled `flash-attn-4` cute kernel's architecture check in `flash_attn/cute/flash_fwd_sm100.py:162` rejects `sm_103`. All TP ranks crash on the **first warmup forward pass** through the vision encoder; warmup eventually times out (600s) and the server is killed.

**The root-cause fix already exists upstream**, in [Dao-AILab/flash-attention#2572](https://github.com/Dao-AILab/flash-attention/pull/2572) by @ocss884 — sglang's pinned `flash-attn-4>=4.0.0b9` needs to be bumped to whatever Tri Dao tags after that PR merges.

## Environment

| | |
|---|---|
| sglang image | `lmsysorg/sglang:v0.5.12-cu130` |
| Hardware | NVIDIA B300 (Blackwell Ultra, `sm_103` — compute capability 10.3), 4× / 8× GPU per node |
| Model | `Qwen/Qwen3.5-397B-A17B` (Qwen-3.5-VL — vision encoder is the failing path) |
| Tensor parallelism | TP=4 and TP=8 (identical failure on both) |
| Spec decoding | Both off and EAGLE-MTP variants hit the same assertion (the vision encoder runs regardless of MTP) |
| flash-attn version | `flash-attn-4>=4.0.0b9` per `pyproject.toml` |
| Known-good image | `lmsysorg/sglang:v0.5.11-cu130` (does not bundle the buggy assertion) |

Note: `sm_120` is **consumer Blackwell** (RTX 50 series / GB20x dies), **not** B300. B300 is `sm_103`. Don't propagate the wrong arch ID.

## Exception

```
[TP0..3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 4041, in run_scheduler_process
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 1210, in get_image_feature
    return self.visual(pixel_values, grid_thw=image_grid_thw)
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 838, in forward
    x = blk(
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 231, in forward
    attn = self.attn(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/vision.py", line 1173, in forward
    output = self.qkv_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/vision.py", line 491, in forward
    output = flash_attn_varlen_func(
  File "/sgl-workspace/sglang/python/sglang/jit_kernel/flash_attention.py", line 273, in flash_attn_varlen_func
    return fa4_flash_attn_varlen_func(
  File "/sgl-workspace/sglang/python/sglang/jit_kernel/flash_attention_v4.py", line 65, in flash_attn_varlen_func
    result = _flash_attn_varlen_func(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 2217, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 2039, in forward
    out, lse = _flash_attn_fwd(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 882, in _flash_attn_fwd
    fa_fwd = flash_fwd_obj_cls(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/flash_fwd_sm100.py", line 162, in __init__
    assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
AssertionError: Only SM 10.x and 11.x are supported
```

## Why a nominally in-range arch trips the assertion

The check is `self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f`. B300 is `sm_103`, which is *numerically inside* `[sm_100, sm_110f]`, so naively the assertion should pass.

[Dao-AILab/flash-attention#2572](https://github.com/Dao-AILab/flash-attention/pull/2572) explains why it doesn't (quoting):

> When using B300 (sm103) with non-cu13 cutedsl, current assertion introduced in 463623e will narrow the effective supported range as the arch class in cutedsl will do:
>
> - For non-cu13 cutedsl: map `Arch.sm_110` and `Arch.sm_101` to `SM101`
> - For cu13 cutedsl: map `Arch.sm_101` and `Arch.sm_110` to `SM110`
>
> As a result when using non-cu13 cutedsl, the current the effective range check will be `sm100 <= ... <= sm101` which unintentionally excludes SM103.

So the comparison operators on the enum don't behave like raw integers — depending on the bundled cutedsl, `sm_110f` resolves to a value that's *less than* `sm_103`, collapsing the range.
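
The collapse can be reproduced with a toy model. This is only a sketch: `ToyArch`, its integer values, and `upper_bound()` are hypothetical stand-ins and do not reproduce the real cutedsl `Arch` class. The only behavior taken from the PR description is that the upper bound resolves to SM110 on cu13 cutedsl and to SM101 otherwise.

```python
from enum import IntEnum


class ToyArch(IntEnum):
    """Hypothetical stand-in for the cutedsl Arch enum; values are illustrative."""
    sm_100 = 100
    sm_101 = 101
    sm_103 = 103
    sm_110 = 110


def upper_bound(cu13: bool) -> ToyArch:
    # Per the PR description, sm_101 and sm_110 alias each other:
    # they resolve to SM110 on cu13 cutedsl and to SM101 otherwise.
    return ToyArch.sm_110 if cu13 else ToyArch.sm_101


def range_check(arch: ToyArch, cu13: bool) -> bool:
    # Same shape as the failing assertion: lower <= arch <= upper.
    return ToyArch.sm_100 <= arch <= upper_bound(cu13)


for cu13 in (True, False):
    accepted = range_check(ToyArch.sm_103, cu13)
    print(f"cu13 cutedsl={cu13}: sm_103 accepted={accepted}")
```

In this model `sm_103` is accepted only when the upper bound resolves to SM110, while `sm_100` passes either way, which matches the behavior reported in this issue.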

The fix in PR #2572 is a one-line change:

```diff
-        assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
+        assert self.arch.is_family_of(Arch.sm_100f) or self.arch.is_family_of(Arch.sm_110f), \
+            "Only SM 10.x and 11.x are supported"
```

`is_family_of()` correctly matches the SM 10.x / 11.x family regardless of cutedsl version, so `sm_103` / `sm_103a` (and `sm_100`, `sm_100f`, `sm_101`, `sm_110`, `sm_110f`) all pass.
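
To illustrate why a membership test is robust where an ordered comparison is not, here is a minimal sketch. The `FAMILY` table and this `is_family_of()` function are hypothetical; the real method lives on the cutedsl `Arch` class and its implementation is not shown in this issue. Placing `sm_101` in the `sm_110f` family is an assumption based on the PR describing the two as aliases.

```python
# Hypothetical family table for illustration only.
FAMILY = {
    "sm_100": "sm_100f",
    "sm_100f": "sm_100f",
    "sm_103": "sm_100f",
    "sm_103a": "sm_100f",
    "sm_101": "sm_110f",  # assumption: treated as an alias of sm_110
    "sm_110": "sm_110f",
    "sm_110f": "sm_110f",
    "sm_120": "sm_120f",  # consumer Blackwell, outside both accepted families
}


def is_family_of(arch: str, family: str) -> bool:
    # Membership lookup: no ordering between arch values is involved.
    return FAMILY.get(arch) == family


def family_check(arch: str) -> bool:
    # Same shape as the patched assertion in PR #2572.
    return is_family_of(arch, "sm_100f") or is_family_of(arch, "sm_110f")


for arch in ("sm_100", "sm_103", "sm_103a", "sm_110", "sm_120"):
    print(f"{arch}: accepted={family_check(arch)}")
```

Because the check never compares against an upper bound, it does not matter which value `sm_110f` resolves to under a given cutedsl build.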

## What sglang needs to do

`python/pyproject.toml` pins `flash-attn-4>=4.0.0b9`. Once @tridao reviews & merges [Dao-AILab/flash-attention#2572](https://github.com/Dao-AILab/flash-attention/pull/2572) and tags a new flash-attn-4 release, the sglang Dockerfiles (and pyproject) should bump that floor to the fixed release so B300 stops crashing on Qwen-3.5-VL (and any other vision-encoder path that flows through `flash_attn_v4`).
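
Since the pin is a floor (`>=`) rather than an exact version, it can help to confirm which build actually ended up in an image. A minimal sketch using only the standard library follows; it assumes the installed distribution is named `flash-attn-4`, as in `pyproject.toml`, and the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name matches the pyproject.toml requirement.
found = installed_version("flash-attn-4")
print("flash-attn-4:", found if found is not None else "not installed")
```

Run inside the container, this prints the resolved version, which can then be compared against whichever release ends up containing the fix.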

## Workarounds (until the bump lands)

| Workaround | Notes |
|---|---|
| Pin recipe to `lmsysorg/sglang:v0.5.11-cu130` | ✅ works — v0.5.11 didn't bundle the broken assertion |
| Disable flash-attn-4 in the vision encoder path | Not investigated; might be possible via `--mm-attention-backend` if a non-FA4 backend is available for Qwen-3.5-VL |
| Run on B100/B200 (`sm_100`) instead of B300 | Not affected because the int comparison luckily holds for `sm_100` itself |
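
A recipe could guard against this before launching the server. The sketch below is a hypothetical preflight check, not part of sglang: `hits_fa4_assertion` and `local_capability` are names invented here, and the check only covers the one arch reported in this issue. It uses `torch.cuda.get_device_capability()` when PyTorch and a GPU are available and does nothing otherwise.

```python
from typing import Optional, Tuple

# B300 / sm_103, per the Environment table in this issue.
AFFECTED_CAPABILITY = (10, 3)


def hits_fa4_assertion(capability: Tuple[int, int]) -> bool:
    """True when the GPU is the arch this issue reports as rejected."""
    return capability == AFFECTED_CAPABILITY


def local_capability() -> Optional[Tuple[int, int]]:
    """Compute capability of the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


cap = local_capability()
if cap is None:
    print("No CUDA device visible; nothing to check.")
elif hits_fa4_assertion(cap):
    print("sm_103 detected: pin lmsysorg/sglang:v0.5.11-cu130 until flash-attn-4 is bumped.")
else:
    print(f"Compute capability {cap} is not the arch reported in this issue.")
```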

## Sources

- Failing CI run (every TP-0/1/2/3 traceback identical): https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25980018232
- Reproducer recipe (script): https://github.com/SemiAnalysisAI/InferenceX/blob/main/benchmarks/single_node/qwen3.5_bf16_b300.sh
- Master config: `qwen3.5-bf16-b300-sglang` / `qwen3.5-bf16-b300-sglang-mtp` in https://github.com/SemiAnalysisAI/InferenceX/blob/main/.github/configs/nvidia-master.yaml
- Tracking PR (image bump that exposed the regression): https://github.com/SemiAnalysisAI/InferenceX/pull/1422
- Sibling B300 regression in the same family (trtllm GEMM dispatcher selects `sm100f` kernel for `sm_103`): sgl-project/sglang#25563
- Upstream flash-attention fix: [Dao-AILab/flash-attention#2572](https://github.com/Dao-AILab/flash-attention/pull/2572)

Happy to attach a full server.log artifact if useful.

