Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend by yhyang201 · Pull Request #1526 · SemiAnalysisAI/InferenceX

yhyang201 · 2026-05-20T08:30:55Z

Summary

Restore SGLang container image to lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev for all non-MTP disagg configs
Switch moe-a2a-backend back to deepep for wideep configs
Restore previous environment variables

github-actions · 2026-05-20T08:31:05Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-20T08:31:05Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-20T08:32:19Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26151012719
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26151012719

claude

LGTM — straightforward config restore of SGLang image, deepep backend, and prior env vars; only documentation nits flagged.

Extended reasoning...

Overview

This PR is a configuration-only restore of the dsv4-fp4-gb300-dynamo-sglang non-MTP disagg recipes back to the staging container image (lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev), the deepep moe-a2a-backend, and the prior environment variable set. It touches one image alias in nvidia-master.yaml, five GB300 disagg recipe YAMLs (1p1d, 4p1d, 8p1d, 10p1d, 12p1d wideep + the 1p1d-tp4-tp4 c=1 smoke test), and adds a perf-changelog entry. The changes effectively revert PR #1492.

Security risks

None. All modifications are benchmark recipe YAMLs and a changelog file — no auth, crypto, permissions, network exposure, or input-handling code is touched. The container image alias change points to a previously-used internal staging image.

Level of scrutiny

Low. This is benchmark/recipe config under benchmarks/multi_node/srt-slurm-recipes/, not production-critical infrastructure. The pattern of changes (image swap + env-var rename + moe-a2a-backend swap) is uniform across the five disagg files, mechanical, and matches the stated PR description. No logic changes, no schema changes, no new files.

Other factors

The two bugs surfaced are both classified as nits: (1) a missing pr-link field in the new perf-changelog entry — purely a traceability gap, the YAML still parses cleanly; (2) a CAR_V2 comment whose rationale ('corrupts in 2-node decode setups') doesn't logically apply to the single-node decode topology of 1p1d-tp4-tp4-2-c1.yaml, but that comment is being restored verbatim to its pre-#1492 form, so the PR isn't introducing the mismatch — it's just reinstating it. Neither nit blocks the restore, and the inline comments already document them for any follow-up. Recent commit history shows the same restore pattern was already applied via 12c9d3c, consistent with intentional cleanup of #1492.

claude · 2026-05-20T08:42:55Z

+- config-keys:
+    - dsv4-fp4-gb300-dynamo-sglang
+  description:
+    - "Restore `sglang-staging:deepseek-v4-grace-blackwell-dev` image, `deepep` moe-a2a-backend, and previous env vars for all non-MTP disagg configs"


🟡 The new perf-changelog.yaml entry (lines 3047-3050) for dsv4-fp4-gb300-dynamo-sglang omits a pr-link field. This is the only one of 380 entries in the file missing this field — every other entry, including the immediately-preceding PR #1475 entry and the PR #1492 entry this one effectively reverts, includes a pr-link. Suggest adding pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1526 for consistency and traceability.

Extended reasoning...

What the bug is

The new perf-changelog.yaml entry added by this PR (lines 3047-3050) is missing the pr-link field that every other entry in the file carries. The new entry currently reads:

- config-keys: - dsv4-fp4-gb300-dynamo-sglang description: - "Restore `sglang-staging:deepseek-v4-grace-blackwell-dev` image, `deepep` moe-a2a-backend, and previous env vars for all non-MTP disagg configs"

No pr-link: line follows.

Step-by-step proof

The file has 380 - config-keys: entries (one per changelog item).

Counting pr-link: lines yields 379.

An entry-by-entry scan shows the single missing entry is the one introduced by this PR at lines 3047-3050.

The immediately-preceding entry (PR [Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 #1475, ending at line 3045) follows the standard pattern:
- config-keys: - qwen3.5-fp4-b300-sglang ... description: - "Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475

The PR Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492 entry that this PR effectively reverts also had a pr-link (to PR Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492). So the convention is established across the entire file, including the entries closest in topic and time.

Why existing code doesn't prevent it

perf-changelog.yaml is metadata and only validated as YAML — there's no schema enforcing required fields per entry, so an entry without pr-link still parses successfully. This is why the omission slipped through without any tooling complaint.

Impact

Functional impact: none. The changelog still parses and conveys the change. The cost is traceability — pr-link is the field downstream consumers (manual readers, any tooling that builds a release/changelog view, anyone hunting for the PR that introduced a regression) use to navigate back to the originating PR. With this entry only, a reader has to git-blame the YAML to find the PR.

How to fix

Add a single line under description::

- config-keys: - dsv4-fp4-gb300-dynamo-sglang description: - "Restore `sglang-staging:deepseek-v4-grace-blackwell-dev` image, `deepep` moe-a2a-backend, and previous env vars for all non-MTP disagg configs" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1526

This is a documentation / consistency nit, not a functional bug — recommended fix but not blocking.

claude · 2026-05-20T08:42:56Z

    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
-    SGLANG_OPT_FP8_WO_A_GEMM: "0"
+    # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2


🟡 Nit: the CAR_V2 comment restored here doesn't match this file's topology. The resources block above declares decode_nodes: 1, gpus_per_decode: 4 (single-node decode), but the comment justifies omitting SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 because "CAR_V2 is single-node only and corrupts results in 2-node decode setups" — by the comment's own logic, this is exactly the case where CAR_V2 is safe and intended. The comment was copy-pasted unchanged from the peer multi-node files (4p1d/8p1d/10p1d/12p1d, all with decode_nodes ≥ 3). Since this PR is actively rewriting this exact comment block, it'd be a good time to either (a) enable CAR_V2 in decode here to match the rationale, or (b) replace the comment with the real reason it's omitted in this c=1 smoke-test config.

Extended reasoning...

What is the bug

In disagg-gb300-1p1d-tp4-tp4-2-c1.yaml, the decode_environment now ends with:

SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1" # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 # is single-node only and corrupts results in 2-node decode setups.

The resources block earlier in the same file declares:

resources: gpu_type: "gb300" gpus_per_node: 4 prefill_nodes: 1 prefill_workers: 1 gpus_per_prefill: 4 decode_nodes: 1 decode_workers: 1 gpus_per_decode: 4

This is a single-node decode (decode_nodes=1, gpus_per_decode=4). The comment, however, justifies omitting CAR_V2 on the grounds that CAR_V2 is "single-node only and corrupts results in 2-node decode setups." By the comment's own rationale, single-node is the safe case for CAR_V2 — so the comment's justification is the opposite of what this file's topology calls for.

How it got here (step-by-step proof)

In the peer files in this PR — 4p1d-dep4-dep16, 8p1d-dep4-dep16, 10p1d-dep4-dep16, 12p1d-dep4-dep12 — the decode is genuinely multi-node (decode_nodes: 4 or 3, gpus_per_decode: 16 or 12). In those files, the CAR_V2 comment is accurate: CAR_V2 would corrupt their multi-node decodes, so it's correctly omitted.

The 1p1d-tp4-tp4-2-c1.yaml is a c=1 smoke-test config with a 1-node, 4-GPU decode (decode_nodes: 1, gpus_per_decode: 4).

This PR restores the decode_environment block of 1p1d-tp4-tp4-2-c1.yaml to its pre-Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492 form, including reinstating the first line of the CAR_V2 comment. The result is a fully-formed two-line comment whose rationale doesn't logically apply to this file's actual topology.

Verified independently: a peer file like 4p1d-dep4-dep16-8-c1024.yaml shows the comment in its correct context (multi-node decode, comment accurate). The 1p1d-tp4 file shows the comment in its incorrect context (single-node decode, comment's logic implies CAR_V2 should be enabled).

Why existing code doesn't prevent it

This is a YAML config comment — nothing parses or validates it. The mismatch is purely a readability/intent concern that only a human reviewer can catch.

Impact

No functional bug: env var behavior (CAR_V2 not set) is unchanged from the pre-PR state. The runtime sees the same configuration.

Readability: future readers will be confused — the comment's rationale ("2-node decode setups") doesn't apply to a 1-node-decode file. They'll be left wondering whether the author intended CAR_V2 to be enabled here.

Possible perf-on-the-table: if the omission was unintentional (i.e. carried in by copy-paste rather than deliberate choice for this smoke-test), enabling CAR_V2 in single-node decode is the documented intent. That said, since this is the c=1 smoke-test config, the perf angle is weak.

Addressing the refutation

The refuting verifier correctly notes that this comment existed in the file prior to PR #1492 — at commit 006bfab, the same two-line comment block was already present, and PR #1492 partially clobbered it (leaving an orphan dangling second line). This PR restores the original two-line comment intact.

That's true, but it doesn't mean the issue isn't worth flagging here:

The PR is actively touching this exact line (diff line 115-116 of the file in question reinstates the first half of the comment). This PR is, materially, the act of restoring a misapplied comment to its full form. The fact that the misapplication is pre-existing doesn't change that this PR is the one putting the complete (and inapplicable) comment back in front of readers.

It's the right moment to flag, even if not the right moment to require a fix. Hence: severity nit, not normal — purely a comment/documentation issue, no functional impact, no need to block the restore.

The PR's stated goal of "verbatim restoration" is consistent with not fixing this in the same PR; the author may reasonably want to keep this PR scope-minimal. But a follow-up cleanup is worthwhile so future readers of this single-node file aren't left confused.

How to fix

Three reasonable options for a follow-up:

Enable CAR_V2 here by adding SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1" to this file's decode_environment, matching what the comment's rationale implies should happen in single-node decodes. (This would also align with the peer files that do enable CAR_V2 in their prefill environments, which are similarly single-node TP=4.)

Remove the comment entirely if CAR_V2 absence is intentional for this debug-shape c=1 smoke-test.

Rewrite the comment to state the real reason it's omitted here (e.g. "kept off for parity with debug-shape configs" or whatever the actual rationale is).

Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend

12c9d3c

yhyang201 requested a review from a team May 20, 2026 08:30

yhyang201 requested review from jgangani and kedarpotdar-nv as code owners May 20, 2026 08:30

github-project-automation Bot added this to InferenceMAX Board May 20, 2026

yhyang201 added the full-sweep-enabled label May 20, 2026

claude Bot reviewed May 20, 2026

View reviewed changes

yhyang201 closed this May 22, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend#1526

Restore dpskv4 GB300 non-MTP disagg to staging image + deepep backend#1526
yhyang201 wants to merge 1 commit into
mainfrom
revert-pr-1492

yhyang201 commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot May 20, 2026

Uh oh!

claude Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yhyang201 commented May 20, 2026

Summary

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

claude Bot May 20, 2026

Choose a reason for hiding this comment

What the bug is

Step-by-step proof

Why existing code doesn't prevent it

Impact

How to fix

Uh oh!

claude Bot May 20, 2026

Choose a reason for hiding this comment

What is the bug

How it got here (step-by-step proof)

Why existing code doesn't prevent it

Impact

Addressing the refutation

How to fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant