Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend#1526
Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend#1526yhyang201 wants to merge 1 commit into
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
1 similar comment
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26151012719 |
There was a problem hiding this comment.
LGTM — straightforward config restore of SGLang image, deepep backend, and prior env vars; only documentation nits flagged.
Extended reasoning...
Overview
This PR is a configuration-only restore of the dsv4-fp4-gb300-dynamo-sglang non-MTP disagg recipes back to the staging container image (lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev), the deepep moe-a2a-backend, and the prior environment variable set. It touches one image alias in nvidia-master.yaml, five GB300 disagg recipe YAMLs (1p1d, 4p1d, 8p1d, 10p1d, 12p1d wideep + the 1p1d-tp4-tp4 c=1 smoke test), and adds a perf-changelog entry. The changes effectively revert PR #1492.
Security risks
None. All modifications are benchmark recipe YAMLs and a changelog file — no auth, crypto, permissions, network exposure, or input-handling code is touched. The container image alias change points to a previously-used internal staging image.
Level of scrutiny
Low. This is benchmark/recipe config under benchmarks/multi_node/srt-slurm-recipes/, not production-critical infrastructure. The pattern of changes (image swap + env-var rename + moe-a2a-backend swap) is uniform across the five disagg files, mechanical, and matches the stated PR description. No logic changes, no schema changes, no new files.
Other factors
The two bugs surfaced are both classified as nits: (1) a missing pr-link field in the new perf-changelog entry — purely a traceability gap, the YAML still parses cleanly; (2) a CAR_V2 comment whose rationale ('corrupts in 2-node decode setups') doesn't logically apply to the single-node decode topology of 1p1d-tp4-tp4-2-c1.yaml, but that comment is being restored verbatim to its pre-#1492 form, so the PR isn't introducing the mismatch — it's just reinstating it. Neither nit blocks the restore, and the inline comments already document them for any follow-up. Recent commit history shows the same restore pattern was already applied via 12c9d3c, consistent with intentional cleanup of #1492.
| - config-keys: | ||
| - dsv4-fp4-gb300-dynamo-sglang | ||
| description: | ||
| - "Restore `sglang-staging:deepseek-v4-grace-blackwell-dev` image, `deepep` moe-a2a-backend, and previous env vars for all non-MTP disagg configs" |
There was a problem hiding this comment.
🟡 The new perf-changelog.yaml entry (lines 3047-3050) for dsv4-fp4-gb300-dynamo-sglang omits a pr-link field. This is the only one of 380 entries in the file missing this field — every other entry, including the immediately-preceding PR #1475 entry and the PR #1492 entry this one effectively reverts, includes a pr-link. Suggest adding pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1526 for consistency and traceability.
Extended reasoning...
What the bug is
The new perf-changelog.yaml entry added by this PR (lines 3047-3050) is missing the pr-link field that every other entry in the file carries. The new entry currently reads:
- config-keys:
- dsv4-fp4-gb300-dynamo-sglang
description:
- "Restore `sglang-staging:deepseek-v4-grace-blackwell-dev` image, `deepep` moe-a2a-backend, and previous env vars for all non-MTP disagg configs"No pr-link: line follows.
Step-by-step proof
- The file has 380
- config-keys:entries (one per changelog item). - Counting
pr-link:lines yields 379. - An entry-by-entry scan shows the single missing entry is the one introduced by this PR at lines 3047-3050.
- The immediately-preceding entry (PR [Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 #1475, ending at line 3045) follows the standard pattern:
- config-keys: - qwen3.5-fp4-b300-sglang ... description: - "Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475
- The PR Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492 entry that this PR effectively reverts also had a
pr-link(to PR Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492). So the convention is established across the entire file, including the entries closest in topic and time.
Why existing code doesn't prevent it
perf-changelog.yaml is metadata and only validated as YAML — there's no schema enforcing required fields per entry, so an entry without pr-link still parses successfully. This is why the omission slipped through without any tooling complaint.
Impact
Functional impact: none. The changelog still parses and conveys the change. The cost is traceability — pr-link is the field downstream consumers (manual readers, any tooling that builds a release/changelog view, anyone hunting for the PR that introduced a regression) use to navigate back to the originating PR. With this entry only, a reader has to git-blame the YAML to find the PR.
How to fix
Add a single line under description::
- config-keys:
- dsv4-fp4-gb300-dynamo-sglang
description:
- "Restore `sglang-staging:deepseek-v4-grace-blackwell-dev` image, `deepep` moe-a2a-backend, and previous env vars for all non-MTP disagg configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1526This is a documentation / consistency nit, not a functional bug — recommended fix but not blocking.
| SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1" | ||
| SGLANG_OPT_FP8_WO_A_GEMM: "0" | ||
| # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 |
There was a problem hiding this comment.
🟡 Nit: the CAR_V2 comment restored here doesn't match this file's topology. The resources block above declares decode_nodes: 1, gpus_per_decode: 4 (single-node decode), but the comment justifies omitting SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 because "CAR_V2 is single-node only and corrupts results in 2-node decode setups" — by the comment's own logic, this is exactly the case where CAR_V2 is safe and intended. The comment was copy-pasted unchanged from the peer multi-node files (4p1d/8p1d/10p1d/12p1d, all with decode_nodes ≥ 3). Since this PR is actively rewriting this exact comment block, it'd be a good time to either (a) enable CAR_V2 in decode here to match the rationale, or (b) replace the comment with the real reason it's omitted in this c=1 smoke-test config.
Extended reasoning...
What is the bug
In disagg-gb300-1p1d-tp4-tp4-2-c1.yaml, the decode_environment now ends with:
SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
# SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
# is single-node only and corrupts results in 2-node decode setups.The resources block earlier in the same file declares:
resources:
gpu_type: "gb300"
gpus_per_node: 4
prefill_nodes: 1
prefill_workers: 1
gpus_per_prefill: 4
decode_nodes: 1
decode_workers: 1
gpus_per_decode: 4This is a single-node decode (decode_nodes=1, gpus_per_decode=4). The comment, however, justifies omitting CAR_V2 on the grounds that CAR_V2 is "single-node only and corrupts results in 2-node decode setups." By the comment's own rationale, single-node is the safe case for CAR_V2 — so the comment's justification is the opposite of what this file's topology calls for.
How it got here (step-by-step proof)
- In the peer files in this PR —
4p1d-dep4-dep16,8p1d-dep4-dep16,10p1d-dep4-dep16,12p1d-dep4-dep12— the decode is genuinely multi-node (decode_nodes: 4or3,gpus_per_decode: 16or12). In those files, the CAR_V2 comment is accurate: CAR_V2 would corrupt their multi-node decodes, so it's correctly omitted. - The
1p1d-tp4-tp4-2-c1.yamlis a c=1 smoke-test config with a 1-node, 4-GPU decode (decode_nodes: 1,gpus_per_decode: 4). - This PR restores the decode_environment block of
1p1d-tp4-tp4-2-c1.yamlto its pre-Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492 form, including reinstating the first line of the CAR_V2 comment. The result is a fully-formed two-line comment whose rationale doesn't logically apply to this file's actual topology. - Verified independently: a peer file like
4p1d-dep4-dep16-8-c1024.yamlshows the comment in its correct context (multi-node decode, comment accurate). The 1p1d-tp4 file shows the comment in its incorrect context (single-node decode, comment's logic implies CAR_V2 should be enabled).
Why existing code doesn't prevent it
This is a YAML config comment — nothing parses or validates it. The mismatch is purely a readability/intent concern that only a human reviewer can catch.
Impact
- No functional bug: env var behavior (CAR_V2 not set) is unchanged from the pre-PR state. The runtime sees the same configuration.
- Readability: future readers will be confused — the comment's rationale ("2-node decode setups") doesn't apply to a 1-node-decode file. They'll be left wondering whether the author intended CAR_V2 to be enabled here.
- Possible perf-on-the-table: if the omission was unintentional (i.e. carried in by copy-paste rather than deliberate choice for this smoke-test), enabling CAR_V2 in single-node decode is the documented intent. That said, since this is the c=1 smoke-test config, the perf angle is weak.
Addressing the refutation
The refuting verifier correctly notes that this comment existed in the file prior to PR #1492 — at commit 006bfab, the same two-line comment block was already present, and PR #1492 partially clobbered it (leaving an orphan dangling second line). This PR restores the original two-line comment intact.
That's true, but it doesn't mean the issue isn't worth flagging here:
- The PR is actively touching this exact line (diff line 115-116 of the file in question reinstates the first half of the comment). This PR is, materially, the act of restoring a misapplied comment to its full form. The fact that the misapplication is pre-existing doesn't change that this PR is the one putting the complete (and inapplicable) comment back in front of readers.
- It's the right moment to flag, even if not the right moment to require a fix. Hence: severity nit, not normal — purely a comment/documentation issue, no functional impact, no need to block the restore.
- The PR's stated goal of "verbatim restoration" is consistent with not fixing this in the same PR; the author may reasonably want to keep this PR scope-minimal. But a follow-up cleanup is worthwhile so future readers of this single-node file aren't left confused.
How to fix
Three reasonable options for a follow-up:
- Enable CAR_V2 here by adding
SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"to this file's decode_environment, matching what the comment's rationale implies should happen in single-node decodes. (This would also align with the peer files that do enable CAR_V2 in their prefill environments, which are similarly single-node TP=4.) - Remove the comment entirely if CAR_V2 absence is intentional for this debug-shape c=1 smoke-test.
- Rewrite the comment to state the real reason it's omitted here (e.g. "kept off for parity with debug-shape configs" or whatever the actual rationale is).
Summary
lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-devfor all non-MTP disagg configsmoe-a2a-backendback todeepepfor wideep configs