[GB300][SGLang] Bump SGLang image for dsv4-fp4-gb300-dynamo-sglang-mtp #1559
Open: Fridge003 wants to merge 6 commits into main from sgl_image_bump_dsv4
Commits
f66004e [GB300][SGLang] Bump SGLang image for dsv4-fp4-gb300-dynamo-sglang-mtp (Fridge003)
a6666cc Update perf-changelog.yaml with PR #1559 link (Fridge003)
27e4d9b Clean up obsolete sglang envs in dsv4 8k1k disagg recipes (Fridge003)
0be6580 Switch moe-a2a-backend from deepep to megamoe in MegaMoE blocks (Fridge003)
beb0fe5 Bump dynamo hash, quote megamoe backend, drop deepep-config (Fridge003)
925a974 Merge branch 'main' into sgl_image_bump_dsv4 (Fridge003)
🔴 The 6 recipe YAMLs are bumped to `lmsysorg/sglang:nightly-dev-20260522-c9153da5`, but the matching `image:` field on the `dsv4-fp4-gb300-dynamo-sglang-mtp` block in `.github/configs/nvidia-master.yaml` (line 9073) is left at the stale `lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034`. Per AGENTS.md the two must be bumped in lockstep: the launcher uses `image:` as the container-alias key, so without this update CI will still import/run the old image and the perf-changelog claim is untrue. Fix: bump line 9073 of nvidia-master.yaml to the same `nightly-dev-20260522-c9153da5` tag.

Extended reasoning...
What's wrong

This PR bumps `model.container` in all six `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/*-mtp.yaml` files from `lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e` to `lmsysorg/sglang:nightly-dev-20260522-c9153da5`, and adds a perf-changelog entry that explicitly claims the image was updated for `dsv4-fp4-gb300-dynamo-sglang-mtp`. However, `.github/configs/nvidia-master.yaml` line 9073 (the `image:` field on the `dsv4-fp4-gb300-dynamo-sglang-mtp` block) still reads `lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034`, an even older 20260509 tag from before the previous bump.

Why this matters

`AGENTS.md` line 115 documents the invariant explicitly: multi-node srt-slurm changes must edit the recipe YAML AND nvidia-master.yaml together, and for image bumps `model.container` must equal `image:` because the launcher uses the latter as the container-alias key. Concretely, `.github/workflows/profile.yml` reads `matrix.config.image` from nvidia-master.yaml into the `IMAGE` env var, and `runners/launch_gb300-cw.sh` uses it both to build/import the enroot squash file (`enroot import -o ... docker://$image`) and to register the alias in the generated `srtslurm.yaml` containers map (`${IMAGE}: ${SQUASH_FILE}`). The recipe's `container:` is then matched against that alias by srtctl.

Precedent

The sibling non-MTP PR #1528 (commit `59980fe`) for `dsv4-fp4-gb300-dynamo-sglang` updated BOTH `.github/configs/nvidia-master.yaml` AND the recipe YAMLs in lockstep. After that PR, the non-MTP block at line 8760 sits at `nightly-dev-cu13-20260520-425dffbd`, matching its recipe. The MTP variant has now diverged 13 days from its recipe, and the new tag has dropped the `cu13` prefix.

Step-by-step proof of impact

1. `matrix.config.image` from nvidia-master.yaml → `IMAGE=lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034`.
2. `runners/launch_gb300-cw.sh` runs `enroot import -o $SQUASH_FILE docker://$IMAGE`, squashing the 20260509 image.
3. `srtslurm.yaml` registers `containers: { "${IMAGE}": ${SQUASH_FILE} }`, keyed by the 20260509 tag.
4. `model.container: lmsysorg/sglang:nightly-dev-20260522-c9153da5` does not match the alias.

Fix

Bump `.github/configs/nvidia-master.yaml` line 9073 from `lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034` to `lmsysorg/sglang:nightly-dev-20260522-c9153da5` in this PR, matching the recipe `container:` values and the lockstep pattern established by PR #1528.
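
To make the mismatch concrete, here is a minimal Python sketch of the alias lookup this review describes. It is a model of the behaviour, not the actual launcher or srtctl code: the function names (`build_containers_map`, `resolve_container`, `check_lockstep`) and the squash-file path are hypothetical, and the sketch does not try to model what srtctl does when the alias is missing. Only the two image tags are taken from the PR.

```python
from typing import Dict, List, Optional

# Tags quoted in the review; everything else in this sketch is illustrative.
STALE_IMAGE = "lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034"  # image: in nvidia-master.yaml
NEW_IMAGE = "lmsysorg/sglang:nightly-dev-20260522-c9153da5"         # model.container in the recipes
SQUASH_FILE = "/hypothetical/path/sglang.sqsh"                       # assumed path, not from the repo


def build_containers_map(image: str, squash_file: str) -> Dict[str, str]:
    """Model the generated srtslurm.yaml entry: containers: { "${IMAGE}": ${SQUASH_FILE} }."""
    return {image: squash_file}


def resolve_container(containers: Dict[str, str], recipe_container: str) -> Optional[str]:
    """Return the squash file registered under the recipe's container value, or None if no alias matches."""
    return containers.get(recipe_container)


def check_lockstep(master_image: str, recipe_containers: List[str]) -> List[str]:
    """Return every recipe container value that differs from the master config image."""
    return [c for c in recipe_containers if c != master_image]


# Before the fix: the alias map is keyed by the stale tag, so the recipe value finds nothing.
print(resolve_container(build_containers_map(STALE_IMAGE, SQUASH_FILE), NEW_IMAGE))

# After bumping image: to the new tag, the recipe value resolves to the squash file.
print(resolve_container(build_containers_map(NEW_IMAGE, SQUASH_FILE), NEW_IMAGE))

# A lockstep check over the six recipes flags all of them while image: is stale.
print(len(check_lockstep(STALE_IMAGE, [NEW_IMAGE] * 6)))
```

A comparison along the lines of `check_lockstep` could run as a pre-merge step over the values extracted from nvidia-master.yaml and the recipe files, so that a one-sided bump fails before any benchmark job is launched.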