[Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 by functionstackx · Pull Request #1475 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T02:16:46Z

Summary

Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130

Recipes touched: `qwen3.5-fp4-b300-sglang`, `qwen3.5-fp4-b300-sglang-mtp`

Test plan

full-sweep-enabled sweep passes.

🤖 Generated with Claude Code

github-actions · 2026-05-18T02:16:54Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-18T02:17:19Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009982423
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26009982423

claude

LGTM, straightforward SGLang image version bump (v0.5.11-cu130 → v0.5.12-cu130) with matching changelog entry.

Extended reasoning...

Overview

This PR bumps the SGLang container image for two recipes (qwen3.5-fp4-b300-sglang and qwen3.5-fp4-b300-sglang-mtp) from v0.5.11-cu130 to v0.5.12-cu130 in .github/configs/nvidia-master.yaml, and appends a corresponding entry to perf-changelog.yaml linking back to this PR.
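
A minimal sketch of what a tag bump like this amounts to, written in Python for illustration. The layout of `.github/configs/nvidia-master.yaml` is not shown on this page, so the `SAMPLE_CONFIG` keys and the `bump_image_tag` helper are hypothetical; only the image repository and the two tags come from the PR itself.

```python
import re

# Hypothetical excerpt: the real structure of nvidia-master.yaml is not
# shown in this PR page, so these keys are illustrative only.
SAMPLE_CONFIG = """\
qwen3.5-fp4-b300-sglang:
  image: lmsysorg/sglang:v0.5.11-cu130
qwen3.5-fp4-b300-sglang-mtp:
  image: lmsysorg/sglang:v0.5.11-cu130
other-recipe:
  image: lmsysorg/sglang:v0.5.10-cu130
"""


def bump_image_tag(text: str, repo: str, old_tag: str, new_tag: str) -> tuple[str, int]:
    """Replace every `repo:old_tag` with `repo:new_tag`.

    Returns the updated text and the number of replacements made.
    """
    # Require whitespace or end of string after the old tag so that a longer
    # tag sharing the same prefix is left alone.
    pattern = re.compile(re.escape(f"{repo}:{old_tag}") + r"(?=\s|$)")
    # A callable replacement avoids any escape processing of the new tag.
    return pattern.subn(lambda _match: f"{repo}:{new_tag}", text)


updated, count = bump_image_tag(
    SAMPLE_CONFIG, "lmsysorg/sglang", "v0.5.11-cu130", "v0.5.12-cu130"
)
print(count)
print(updated)
```

Note that this sketch rewrites every reference to the old tag in the text it is given. If other recipes in the same file still pin the old tag on purpose, the replacement would need to be scoped to the two recipes touched here.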

Security risks

None. This is a pinned upstream image version tag bump in CI config. The image source (lmsysorg/sglang) is unchanged; only the version tag advances by a patch release.

Level of scrutiny

Low. This is a mechanical, config-only change with no logic, no auth/crypto, and no production code paths. It follows the exact same pattern as numerous prior image-bump PRs in this repo (e.g. the recent #1458 entry directly above it in the changelog).

Other factors

The PR carries the full-sweep-enabled label so CI sweeps will validate the new image. No bugs were found by the bug hunting system, and there are no outstanding human reviewer comments — only the standard recipe-reminder bot messages.

functionstackx · 2026-05-18T16:27:49Z

Same vision-encoder workaround as PRs #1422 / #1451

Confirmed same flash-attn-4 cute sm_103 assertion in the Qwen-3.5-VL vision encoder — all 24 8k1k jobs failed with flash_fwd_sm100.py:162 AssertionError (1k1k may survive since vision encoder isn't exercised during warmup). Same family as the bf16 (#1422) and fp8 (#1451) siblings.

Added --mm-attention-backend triton_attn to both qwen3.5_fp4_b300.sh and qwen3.5_fp4_b300_mtp.sh. Pushed as 108982c2. Text decoder stays on trtllm_mha.
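
As a rough illustration of the flag combination described above, here is a Python sketch that assembles a launch command. The recipe scripts themselves are shell and their full contents are not shown on this page, so `build_launch_args` and the model path are hypothetical; the two backend flags and their values are the ones named in this PR, and everything else the real scripts pass is omitted.

```python
import shlex


def build_launch_args(model_path: str, vision_workaround: bool = True) -> list[str]:
    """Assemble an SGLang server command line (illustrative subset of flags)."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        # The text decoder keeps its existing attention backend.
        "--attention-backend", "trtllm_mha",
    ]
    if vision_workaround:
        # Only the multi-modal (vision encoder) attention path is switched,
        # which sidesteps the flash-attn cute kernel assertion on sm_103.
        args += ["--mm-attention-backend", "triton_attn"]
    return args


# Placeholder path: the actual model location used by the recipes is not
# given in this PR.
cmd = build_launch_args("/path/to/model")
print(shlex.join(cmd))
```

Keeping the workaround behind a single switch makes it easy to drop once an upstream fix is available.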

Upstream / root cause cross-links

Issue: sgl-project/sglang#25564 — cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f to exclude sm_103
Suggested durable fix: bump sglang's nvidia-cutlass-dsl dep to [cu13] extra (@mmangkad comment)
Alternate fix in flight: Dao-AILab/flash-attention#2572 (awaiting @tridao)

Once either upstream fix lands, this workaround can be removed.

github-actions · 2026-05-18T16:51:10Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009984417
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26009984417

github-actions · 2026-05-18T19:08:18Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26047575440
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26047575440

functionstackx · 2026-05-18T19:56:08Z

Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one.

AI-generated via Claude Code /loop.

Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.

github-actions · 2026-05-20T06:52:03Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26144009563
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26144009563

functionstackx · 2026-05-20T07:00:17Z

/reuse-sweep-run

github-actions · 2026-05-20T07:00:59Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26146822534
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26146822534

functionstackx requested a review from a team May 18, 2026 02:16

functionstackx added the full-sweep-enabled label May 18, 2026

functionstackx requested review from jgangani and kedarpotdar-nv as code owners May 18, 2026 02:16

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx added a commit that referenced this pull request May 18, 2026

chore: fill pr-link for #1475

2562dd5

claude Bot reviewed May 18, 2026

functionstackx added full-sweep-enabled and removed full-sweep-enabled labels May 18, 2026

functionstackx mentioned this pull request May 18, 2026

[Bug] Qwen-3.5 on B300 (sm_103) crashes in flash-attn-4 cute kernel — assertion at flash_fwd_sm100.py:162 (fix exists in Dao-AILab/flash-attention#2572; sglang needs to bump flash-attn-4) sgl-project/sglang#25564

Closed

functionstackx mentioned this pull request May 18, 2026

[AI Generated] [Handoff] out of 70+ image updates, 13 stuck Klaud Cold PRs need upstream coordination / scope decisions #1511

Open

functionstackx changed the title ~~[Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130~~ [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 May 18, 2026

functionstackx and others added 3 commits May 20, 2026 01:47

Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130

f6a1048

Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-trigger sweep (previous Run Sweep run stuck pending with 0 jobs)

b8f0cd5

functionstackx force-pushed the update-qwen3.5-fp4-b300-sglang-v0.5.12 branch from bfac96d to b8f0cd5 Compare May 20, 2026 05:47

functionstackx changed the title ~~[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130~~ [Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 May 20, 2026

Merge branch 'main' into update-qwen3.5-fp4-b300-sglang-v0.5.12

8b4885e

functionstackx merged commit 4f63034 into main May 20, 2026
3 of 5 checks passed

github-project-automation Bot moved this to Done in InferenceMAX Board May 20, 2026

functionstackx deleted the update-qwen3.5-fp4-b300-sglang-v0.5.12 branch May 20, 2026 07:00

This was referenced May 20, 2026

Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend #1526

Closed

Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520 #1528

Merged
