Human
TLDR: in v0.5.12 on B300 (sm103), hitting
assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
AssertionError: Only SM 10.x and 11.x are supported
due to this line in flash-attn, where sm103 doesn't satisfy the two conditions in assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f. There is a patch for it here: Dao-AILab/flash-attention#2572
AI Slop below
Summary
On NVIDIA B300 (Blackwell Ultra, sm_103), lmsysorg/sglang:v0.5.12-cu130 fails to serve any Qwen-3.5-VL recipe (Qwen/Qwen3.5-397B-A17B) because the bundled flash-attn-4 cute kernel's architecture check in flash_attn/cute/flash_fwd_sm100.py:162 rejects sm_103. All TP ranks crash on the first warmup forward pass through the vision encoder; warmup eventually times out (600s) and the server is killed.
The root-cause fix already exists upstream, in Dao-AILab/flash-attention#2572 by @ocss884 — sglang's pinned flash-attn-4>=4.0.0b9 needs to be bumped to whatever Tri Dao tags after that PR merges.
Environment
| Component | Value |
|---|---|
| sglang image | lmsysorg/sglang:v0.5.12-cu130 |
| Hardware | NVIDIA B300 (Blackwell Ultra, sm_103, compute capability 10.3), 4× / 8× GPU per node |
| Model | Qwen/Qwen3.5-397B-A17B (Qwen-3.5-VL; the vision encoder is the failing path) |
| Tensor parallelism | TP=4 and TP=8 (identical failure on both) |
| Spec decoding | Both off and EAGLE-MTP variants hit the same assertion (the vision encoder runs regardless of MTP) |
| flash-attn version | flash-attn-4>=4.0.0b9 per pyproject.toml |
| Known-good image | lmsysorg/sglang:v0.5.11-cu130 (does not bundle the buggy assertion) |
Note: sm_120 is consumer Blackwell (RTX 50 series / GB20x dies), not B300. B300 is sm_103. Don't propagate the wrong arch ID.
Exception
[TP0..3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 4041, in run_scheduler_process
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 1210, in get_image_feature
    return self.visual(pixel_values, grid_thw=image_grid_thw)
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 838, in forward
    x = blk(
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 231, in forward
    attn = self.attn(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/vision.py", line 1173, in forward
    output = self.qkv_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/vision.py", line 491, in forward
    output = flash_attn_varlen_func(
  File "/sgl-workspace/sglang/python/sglang/jit_kernel/flash_attention.py", line 273, in flash_attn_varlen_func
    return fa4_flash_attn_varlen_func(
  File "/sgl-workspace/sglang/python/sglang/jit_kernel/flash_attention_v4.py", line 65, in flash_attn_varlen_func
    result = _flash_attn_varlen_func(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 2217, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 2039, in forward
    out, lse = _flash_attn_fwd(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/interface.py", line 882, in _flash_attn_fwd
    fa_fwd = flash_fwd_obj_cls(
  File "/usr/local/lib/python3.12/dist-packages/flash_attn/cute/flash_fwd_sm100.py", line 162, in __init__
    assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
AssertionError: Only SM 10.x and 11.x are supported
Why a nominally in-range arch trips the assertion
The check is self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f. B300 is sm_103, which is numerically inside [sm_100, sm_110f], so naively the assertion should pass.
Dao-AILab/flash-attention#2572 explains why it doesn't (quoting):
When using B300 (sm103) with non-cu13 cutedsl, current assertion introduced in 463623e will narrow the effective supported range as the arch class in cutedsl will do:
- For non-cu13 cutedsl: map Arch.sm_110 and Arch.sm_101 to SM101
- For cu13 cutedsl: map Arch.sm_101 and Arch.sm_110 to SM110
As a result when using non-cu13 cutedsl, the current the effective range check will be sm100 <= ... <= sm101 which unintentionally excludes SM103.
So the comparison operators on the enum don't behave like raw integers — depending on the bundled cutedsl, sm_110f resolves to a value that's less than sm_103, collapsing the range.
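To make the collapse concrete, here is a toy model of the range check as the PR describes it. It is only a sketch: the integer values, the names SM100/SM101/SM103/SM110, and the helper functions are illustrative assumptions, not the real cutedsl Arch class.

```python
# Toy model only: these integers and helper names are illustrative
# assumptions, not the real cutedsl Arch implementation.
SM100, SM101, SM103, SM110 = 100, 101, 103, 110


def resolve_upper_bound(cu13: bool) -> int:
    """Value the upper bound (Arch.sm_110f) effectively compares as.

    Per the PR description: cu13 cutedsl maps sm_101/sm_110 to SM110,
    non-cu13 cutedsl maps them to SM101.
    """
    return SM110 if cu13 else SM101


def range_check(arch: int, cu13: bool) -> bool:
    """Mirror of the failing assertion: sm_100 <= arch <= sm_110f."""
    return SM100 <= arch <= resolve_upper_bound(cu13)


if __name__ == "__main__":
    for cu13 in (True, False):
        for arch in (SM100, SM103):
            print(f"cu13={cu13} arch=sm_{arch} passes={range_check(arch, cu13)}")
```

In this model sm_103 only fails when the upper bound resolves to SM101, which matches the "sm100 <= ... <= sm101" effective range quoted above.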
The fix in PR #2572 is a one-line change:
- assert self.arch >= Arch.sm_100 and self.arch <= Arch.sm_110f, "Only SM 10.x and 11.x are supported"
+ assert self.arch.is_family_of(Arch.sm_100f) or self.arch.is_family_of(Arch.sm_110f), \
+ "Only SM 10.x and 11.x are supported"
is_family_of() correctly matches the SM 10.x / 11.x family regardless of cutedsl version, so sm_103 / sm_103a (and sm_100, sm_100f, sm_101, sm_110, sm_110f) all pass.
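For contrast with the ordered comparison, a family-style membership test can be sketched as below. This is a hypothetical stand-in, not the actual is_family_of() implementation: it assumes an arch is a plain integer such as 103 and that "family" means the major version (arch // 10).

```python
# Hypothetical stand-in for a family-based check. It does NOT reproduce
# cutedsl's is_family_of(); it assumes arch is an int like 103 and that
# the family is identified by the major version.
SUPPORTED_MAJORS = (10, 11)


def major_of(arch: int) -> int:
    """Major compute-capability version, e.g. 103 -> 10, 110 -> 11."""
    return arch // 10


def family_check(arch: int) -> bool:
    """True if arch belongs to the SM 10.x or SM 11.x family."""
    return major_of(arch) in SUPPORTED_MAJORS


if __name__ == "__main__":
    for arch in (90, 100, 101, 103, 110, 120):
        print(f"sm_{arch}: supported={family_check(arch)}")
```

Because membership does not depend on how any upper-bound constant resolves, sm_103 passes in this sketch while sm_90 and sm_120 are still rejected.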
What sglang needs to do
python/pyproject.toml pins flash-attn-4>=4.0.0b9. Once @tridao reviews & merges Dao-AILab/flash-attention#2572 and tags a new flash-attn-4 release, the sglang Dockerfiles (and pyproject) should bump that floor to the fixed release so B300 stops crashing on Qwen-3.5-VL (and any other vision-encoder path that flows through flash_attn_v4).
Workarounds (until the bump lands)
| Workaround | Notes |
|---|---|
| Pin recipe to lmsysorg/sglang:v0.5.11-cu130 | ✅ works; v0.5.11 didn't bundle the broken assertion |
| Disable flash-attn-4 in the vision encoder path | Not investigated; might be possible via --mm-attention-backend if a non-FA4 backend is available for Qwen-3.5-VL |
| Run on B100/B200 (sm_100) instead of B300 | Not affected: sm_100 equals the lower bound of the range check, so the comparison holds no matter how sm_110f resolves |
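The first workaround could be automated in a launch script along these lines. This is a minimal sketch: pick_image is a hypothetical helper, and treating only compute capability (10, 3) as affected is an assumption based on this report alone. The (major, minor) tuple is the shape returned by torch.cuda.get_device_capability().

```python
# Hypothetical launch-time helper; the "only (10, 3) is affected" rule is
# an assumption drawn from this report, not a verified compatibility matrix.
KNOWN_GOOD_IMAGE = "lmsysorg/sglang:v0.5.11-cu130"
AFFECTED_IMAGE = "lmsysorg/sglang:v0.5.12-cu130"


def pick_image(compute_capability: tuple[int, int]) -> str:
    """Choose an sglang image for a GPU with the given (major, minor) capability.

    B300 reports (10, 3) and gets the known-good image; everything else
    keeps the newer image.
    """
    if compute_capability == (10, 3):
        return KNOWN_GOOD_IMAGE
    return AFFECTED_IMAGE


if __name__ == "__main__":
    # On a real node this tuple would come from
    # torch.cuda.get_device_capability(); hard-coded here for illustration.
    print(pick_image((10, 3)))
    print(pick_image((10, 0)))
```

The pin should be dropped once sglang bumps its flash-attn-4 floor to a release that contains the fix.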
Sources
- qwen3.5-bf16-b300-sglang / qwen3.5-bf16-b300-sglang-mtp in https://github.com/SemiAnalysisAI/InferenceX/blob/main/.github/configs/nvidia-master.yaml
- Related issue (sm100f kernel for sm_103): [Bug] GLM-5-NVFP4 + EAGLE on B300 (sm_103): trtllm_batched_gemm_runner.cu:276 dispatches sm100f kernel — crashes at bs=128 draft graph capture (v0.5.12-cu130; v0.5.11 works) #25563
Happy to attach a full server.log artifact if useful.