[MoRI short term temp patch] GLM-5 FP8 MI355X SGLang disaggregated#1572
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
Claude finished @ChangLiu0709's task in 1m 45s —— View job PR Review: [AMD] Add GLM-5 FP8 MI355X SGLang disaggregated benchmark
SummaryOne blocking issue: |
| # Summary | ||
| # ============================================================================= | ||
|
|
||
| _SETUP_END=$(date +%s) | ||
| if [[ ${#_SETUP_INSTALLED[@]} -eq 0 ]]; then | ||
| echo "[SETUP] All patches already applied (${_SETUP_END}s wallclock)" | ||
| else | ||
| echo "[SETUP] Applied: ${_SETUP_INSTALLED[*]} in $(( _SETUP_END - _SETUP_START ))s" | ||
| fi |
There was a problem hiding this comment.
🟡 Three related logging/observability nits in the new setup_deps.sh:
- L96:
_SETUP_INSTALLED+=("gluon-instr-shape-fix")runs unconditionally even when the Python heredoc takes one of its threesys.exit(0)early-return paths (target file missing, sentinel already present, pattern not found). The final "Applied: gluon-instr-shape-fix" summary therefore lies on every restart after the first run. - L135: prints the absolute Unix epoch (e.g.
1779895831s wallclock) instead of the elapsed duration — should be$(( _SETUP_END - _SETUP_START ))sto match the else branch on L137. Currently latent because of [NVIDIA] Add TRT-LLM 70B FP8 via slurm #1. - L117-119:
pip installexit status is never checked, so a network/dep/SHA-GC failure still appendstransformers-glm5to_SETUP_INSTALLEDand the script reports success; the job then crashes much later at sglang model load with a confusingunknown model type: glm_moe_dsa.install_transformers_glm5would do well to follow the explicit-return-code pattern it almost already uses.
Extended reasoning...
What the bug is
This is a small bundle of observability defects in the brand-new benchmarks/multi_node/amd_utils/setup_deps.sh (sourced from server.sh on every disagg server start). None of them break the happy path, but together they degrade the diagnostic value of the [SETUP] log — which is the only thing an operator has when an image bump or pinned-SHA drift breaks the patches.
(1) _SETUP_INSTALLED appended unconditionally for the gluon patch (L96)
patch_gluon_pa_mqa_logits_instr_shape runs a Python heredoc that has three sys.exit(0) early returns:
- L38 target file missing (
/sgl-workspace/aiter/aiter/ops/triton/gluon/pa_mqa_logits.pynot found) - L44 sentinel
[PATCHED] 3D instr_shape for base gluon variantalready in source — patch already applied - L88
oldblock not in source — aiter version drift, prints WARN
All three exit with code 0, so bash sees no failure. The very next line in bash — _SETUP_INSTALLED+=("gluon-instr-shape-fix") at L96 — runs regardless. After the first server start on a given container, the sentinel is in place, so case (2) is the common case, and yet every subsequent restart will log:
[SETUP] Applied: gluon-instr-shape-fix in 0s
…even though the function was a pure no-op. Compare with install_transformers_glm5 (L106) which only appends after pip install actually runs (its two early return 0 paths happen before the array append).
(2) Epoch printed as wallclock (L135)
_SETUP_END=$(date +%s)
if [[ ${#_SETUP_INSTALLED[@]} -eq 0 ]]; then
echo "[SETUP] All patches already applied (${_SETUP_END}s wallclock)" # ← BUG
else
echo "[SETUP] Applied: ${_SETUP_INSTALLED[*]} in $(( _SETUP_END - _SETUP_START ))s"
fi${_SETUP_END} is the absolute Unix epoch — when this branch fires it will print something like 1779895831s wallclock, which is gibberish. The else branch already does the subtraction correctly. Currently latent thanks to (1) — the array is always non-empty — but the moment (1) is fixed this branch starts firing and the log line is wrong. Should also be $(( _SETUP_END - _SETUP_START ))s.
(3) pip install exit status not checked (L117-119)
pip install --quiet -U --no-cache-dir \
"git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2"
_SETUP_INSTALLED+=("transformers-glm5")The function header is bare #!/bin/bash (no set -e/set -o pipefail), so a non-zero exit from pip install just continues. Plausible failure modes: cloning github.com over a flaky link, dep resolution conflict with packages already baked into the SGLang container, the pinned commit SHA being rewritten/GC'd upstream, or git missing from the image. In all of these the function returns success, [SETUP] Applied: ... transformers-glm5 is printed, server.sh launches sglang, and the job dies at model load with AutoConfig: unknown model type: glm_moe_dsa — several minutes later, in a completely different layer of the stack.
Step-by-step proof
Imagine a clean container start where the aiter base image already contains the gluon fix (e.g. a future image bump that picks it up upstream) and we restart the server:
patch_gluon_pa_mqa_logits_instr_shaperuns.- Python heredoc reads the file, sees the sentinel
[PATCHED] 3D instr_shape for base gluon variantis not there (upstream uses different wording) — actually, let me pick the more common scenario: this is the second server start on the same container. The sentinel is in the source (we wrote it on the first start). Python printsgluon pa_mqa_logits 3D instr_shape patch already appliedandsys.exit(0)(L44 of the heredoc). - Bash returns from
python3 -c '...'with $?=0. _SETUP_INSTALLED+=("gluon-instr-shape-fix")executes (L96).install_transformers_glm5runs. IfMODEL_NAME != GLM-5-FP8it returns immediately. Otherwise the AutoConfig probe succeeds (transformers was installed on first start),echo transformers already supports GLM-5is printed, and the function returns without appending.- Final summary:
_SETUP_INSTALLED=("gluon-instr-shape-fix")— length 1, not 0, so the else branch fires:
[SETUP] Applied: gluon-instr-shape-fix in 0s— misleading: nothing was applied. - If we ever did manage to hit the if-branch (e.g. someone reorders or removes patches), we would get
[SETUP] All patches already applied (1779895831s wallclock)— gibberish. - For (3): on a hypothetical first-start where pip install fails (DNS hiccup, GC'd SHA, …), pip writes its error to stderr, returns non-zero, bash unconditionally appends
transformers-glm5, and the summary logs success. Thenserver.shcontinues, sglang launches, model load fails withunknown model_type: glm_moe_dsa.
Why the existing pattern doesn't protect us
- The Python sentinel check makes the patch itself idempotent, but doesn't constrain the bash-side array.
--quietonpip installonly suppresses progress, not errors, so a refutation argument might say "the error is visible in stderr". That's fair, but the summary line lies, and the lie comes after the error in the log — operators triaging from the end of the log (Applied: ... transformers-glm5) reasonably assume the install succeeded. Plus, the downstream sglang error (unknown model type: glm_moe_dsa) does eventually point an experienced operator at transformers compatibility, but conflating "pip failed" with "transformers shipped doesn't have glm_moe_dsa" wastes time on the wrong hypothesis (image bump? upstream model code change? need a newer pin?). Failing fast atsetup_deps.shwith a clearERROR: transformers install failedis strictly more informative.
How to fix
All three are one-line nits and worth fixing together since the file is brand new:
- Gate the gluon append on a real success signal. Easiest: have the Python script print a known token (e.g.
APPLIED) only on success, capture it in bash, and append conditionally. Or use distinct non-zero exits for the no-op cases. (sys.exit(0)everywhere is the wrong abstraction here.) - L135: change
${_SETUP_END}s wallclock→$(( _SETUP_END - _SETUP_START ))s wallclock. - L117-119:
pip install ... || { echo "ERROR: transformers install failed" >&2; return 1; }— and haveserver.shpropagate the failure (e.g.source setup_deps.sh || exit 1).
Addressing the refutation
A verifier noted that pip errors aren't fully silent (true — they go to stderr), that the downstream sglang failure is itself somewhat diagnostic (true — for someone already familiar with the stack), and that style-consistency favors leaving things as-is (the gluon patch also doesn't check its sub-process). I agree (3) on its own is nit severity — pip will scream into stderr and the job will eventually die. But: (a) the same file already has a working explicit-return-code pattern in install_transformers_glm5's AutoConfig probe, so adding a one-line guard after the pip call is consistent with this function's own style, not a foreign import; (b) the three issues compound — the summary line currently lies on the happy path (#1) and would print gibberish in the dead-code branch (#2) and would lie on the pip-failure path (#3) — together they undermine the only signal an operator has when an image bump breaks the patches. Worth bundling into one cleanup pass on a brand-new file.
There was a problem hiding this comment.
Sub-issue #2 (epoch → elapsed at L766) was picked up in 8315abd, thanks. The other two are still latent:
- [NVIDIA] Add TRT-LLM 70B FP8 via slurm #1:
_SETUP_INSTALLED+=("gluon-instr-shape-fix")at L712 still runs after everysys.exit(0)path in the heredoc — so[SETUP] Installed: gluon-instr-shape-fixwill be logged on every container restart even when the patch was a no-op (sentinel already present, the common case). One-line fix: append only when the heredoc actually wrote the file (e.g. exit non-zero on no-op paths, or print a sentinel token and grep for it). - [NVIDIA] update vllm b200 image. TODO: add logic for docker runner. #3:
pip install …at L733-734 still has its exit status discarded; a network/SHA-GC failure appendstransformers-glm5and the summary line lies.pip install … || { echo "ERROR: transformers install failed" >&2; return 1; }matches the explicit-return-code style this function already uses for the AutoConfig probe.
Both are still nit severity individually, but together with #2 fixed the file is one cleanup pass away from a clean observability story.
98f1645 to
8315abd
Compare
|
Claude finished @functionstackx's task in 2m 56s —— View job PR Review: [AMD] Add GLM-5 FP8 MI355X SGLang disaggregated benchmark
SummaryPrevious blocking issues from the first review have been fixed: Two remaining warnings posted as inline comments on
Everything else looks correct: config aligns with existing |
| open(target, "w").write(new_src) | ||
| print("[SETUP] Patched: gluon pa_mqa_logits 3D instr_shape for base variant") | ||
| ' | ||
| _SETUP_INSTALLED+=("gluon-instr-shape-fix") |
There was a problem hiding this comment.
🟡 WARNING: _SETUP_INSTALLED+=("gluon-instr-shape-fix") runs unconditionally — even when the Python heredoc exits early (file not found, sentinel present, pattern missing). All three early-return paths use sys.exit(0), so bash always reaches this line.
Why it matters: On every container restart after the first run, the summary falsely reports Applied: gluon-instr-shape-fix when nothing was actually patched. This also means the "All dependencies already present" branch on L765 can never fire for SGLang runs.
Fix: Gate the append on actual patch application. Easiest approach — check Python's stdout:
| _SETUP_INSTALLED+=("gluon-instr-shape-fix") | |
| if python3 -c '...' 2>&1 | grep -q "^\\[SETUP\\] Patched:"; then | |
| _SETUP_INSTALLED+=("gluon-instr-shape-fix") | |
| fi |
Or use distinct exit codes (e.g., sys.exit(0) for success, sys.exit(2) for no-op).
Note: This matches a pre-existing pattern from the vLLM patches (L518, L629), so not blocking, but worth cleaning up since this file is being substantially reworked in this PR anyway.
| pip install --quiet -U --no-cache-dir \ | ||
| "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2" | ||
| _SETUP_INSTALLED+=("transformers-glm5") |
There was a problem hiding this comment.
🟡 WARNING: pip install exit status is not checked. If it fails (network issue, GC'd commit SHA, dep conflict), transformers-glm5 is still appended to _SETUP_INSTALLED and the summary reports success. The job then crashes at model load with unknown model type: glm_moe_dsa — minutes later, in a different layer.
Why it matters: Operators triaging from the end of the log see Applied: transformers-glm5 and reasonably assume the install succeeded, wasting time on the wrong hypothesis.
Fix:
| pip install --quiet -U --no-cache-dir \ | |
| "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2" | |
| _SETUP_INSTALLED+=("transformers-glm5") | |
| pip install --quiet -U --no-cache-dir \ | |
| "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2" \ | |
| || { echo "[SETUP] ERROR: transformers install failed" >&2; return 1; } | |
| _SETUP_INSTALLED+=("transformers-glm5") |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26540490749 |
8315abd to
0a39b9e
Compare
| $PREFILL_NUM_WORKERS \ | ||
| $DECODE_NODES \ | ||
| $DECODE_NUM_WORKERS \ | ||
| $ISL $OSL "${CONC_LIST// /x}" inf \ | ||
| ${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \ | ||
| ${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \ | ||
| ${PREFILL_TP} ${DECODE_TP} \ | ||
| ${RANDOM_RANGE_RATIO} \ | ||
| ${NODE_LIST:-}) |
There was a problem hiding this comment.
🟡 Two copy-paste inconsistencies vs sister disagg launch scripts: (1) line 76 passes ${NODE_LIST:-} (unquoted, with underscore) while sister kimik2.5/minimaxm2.5 vllm-disagg scripts use quoted "${NODELIST:-}" — a caller that exports the documented NODELIST env var will have it silently dropped here because this script reads the (undefined) $NODE_LIST; (2) the check_env_vars list at lines 5-22 omits FRAMEWORK, which both dsr1_fp8_mi355x_sglang-disagg.sh:23 and dsr1_fp4_mi355x_sglang-disagg.sh:23 include as the 18th required var. Fix: rename to "${NODELIST:-}" (quoted) and append FRAMEWORK to the check_env_vars list to match the byte-near-identical siblings.
Extended reasoning...
What the bug is
The new glm5_fp8_mi355x_sglang-disagg.sh is essentially copy-pasted from the sister disagg launch scripts but drifts in two small ways at the file boundaries:
Issue 1 — NODE_LIST vs NODELIST (L76). This file passes the optional 16th positional arg to submit.sh as ${NODE_LIST:-} (unquoted, underscored). The two existing vllm-disagg sisters in the same directory both pass it as "${NODELIST:-}" (quoted, no underscore):
benchmarks/multi_node/kimik2.5_fp4_mi355x_vllm-disagg.sh:73→"${NODELIST:-}"benchmarks/multi_node/minimaxm2.5_fp8_mi355x_vllm-disagg.sh:71→"${NODELIST:-}"
The dsr1 sglang-disagg sisters don't pass arg 16 at all, so the only established convention for arg-16 in this directory is NODELIST. submit.sh:91 reads positional arg 16 into a local var also called NODE_LIST, which is why this is purely a launch-script-level mismatch — the caller-facing env var is what differs.
Issue 2 — FRAMEWORK omitted from check_env_vars (L5-L22). Both sglang-disagg sisters include FRAMEWORK as the 18th required var:
dsr1_fp8_mi355x_sglang-disagg.sh:23→FRAMEWORKdsr1_fp4_mi355x_sglang-disagg.sh:23→FRAMEWORK
submit.sh:70 enforces check_env FRAMEWORK and submit.sh:97 uses it to drive export ENGINE="${FRAMEWORK:-sglang}", so a missing FRAMEWORK will fail anyway — just deeper in the stack, defeating the early-validation purpose of check_env_vars.
Step-by-step proof of impact
For issue 1, imagine a CI runner or operator following the documented sister-script pattern and exporting NODELIST=mia1-01,mia1-02 before invoking the recipe:
runnerexportsNODELIST=mia1-01,mia1-02.- Launch script kimik2.5/minimaxm2.5 → expands to
"mia1-01,mia1-02"as positional arg 16 →submit.shreads it into its internalNODE_LIST→ sbatch gets the node pinning. - Launch script glm5 → expands
${NODE_LIST:-}→ empty string (becauseNODE_LISTwas never exported, onlyNODELISTwas) → arg 16 is empty → sbatch never gets the pinning.
The user's intent silently disappears. The unquoted form would also word-split a whitespace-containing value into multiple positional args, vs the quoted sister form which preserves it as a single arg.
For issue 2, if FRAMEWORK is unset for this new recipe, the failure mode is the same (job fails) but the error message originates from amd_utils/submit.sh:70 instead of the launch script. The launch script's whole purpose of doing early environment validation is bypassed for this variable.
Why existing code doesn't prevent it
- For (1),
submit.shhappens to use the same internal nameNODE_LIST, which is why this is easy to miss in code review — both names "work" at the submit boundary. The caller-side mismatch only surfaces when someone actually exports the env var. - For (2),
submit.sh:70does enforceFRAMEWORKdownstream, so the job will fail closed, not silently corrupt output. But the launch script's purpose is to surface missing config at the earliest, most-targeted location.
Addressing the verifier note on current latency
A verifier observed that no code in the repo currently exports either NODELIST or NODE_LIST, so today both kimi/minimax and glm5 disagg jobs run without explicit node pinning — the inconsistency is dormant. That's fair: this is a latent bug, not a current-CI breakage. But the file is brand new in this PR, the sister-script pattern is right there to copy, and the fix is a one-character rename plus quotes plus one line in check_env_vars. Best fixed at introduction time so future runners that adopt node-pinning don't get a silent feature-loss bug on this particular recipe.
How to fix
check_env_vars \\
...
- RANDOM_RANGE_RATIO
+ RANDOM_RANGE_RATIO \\
+ FRAMEWORK
...
${RANDOM_RANGE_RATIO} \\
- ${NODE_LIST:-})
+ "${NODELIST:-}")|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26543294366 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26544255929 |
functionstackx
left a comment
There was a problem hiding this comment.
hi @ChangLiu0709 @chunfangamd @HaiShaw thanks for the SGLang GLM5 mi355 disagg PR but unfortunately it doesn't work yet, can u look into fixing this?
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26544255929/job/78192785656?pr=1572
[2026-05-27 23:38:59.096 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 4049, in run_scheduler_process
scheduler.run_event_loop()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1547, in run_event_loop
dispatch_event_loop(self)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3932, in dispatch_event_loop
scheduler.event_loop_overlap_disagg_decode()
File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1606, in event_loop_overlap_disagg_decode
self.process_decode_queue()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1773, in process_decode_queue
req_conns, _ = self.disagg_decode_prealloc_queue.pop_preallocated()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 735, in pop_preallocated
self._resolve_pending_reqs()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 729, in _resolve_pending_reqs
decode_req.kv_receiver.init(prefill_dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1407, in init
super().init(prefill_dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/common/conn.py", line 783, in init
self._setup_bootstrap_infos()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/common/conn.py", line 835, in _setup_bootstrap_infos
self._register_kv_args()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1423, in _register_kv_args
packed_state_item_lens = b"".join(
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1424, in <genexpr>
struct.pack("I", item_len)
struct.error: required argument is not an integer
[2026-05-27 23:38:59.097 TP7] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 4049, in run_scheduler_process
scheduler.run_event_loop()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1547, in run_event_loop
dispatch_event_loop(self)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3932, in dispatch_event_loop
scheduler.event_loop_overlap_disagg_decode()
File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1606, in event_loop_overlap_disagg_decode
self.process_decode_queue()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1773, in process_decode_queue
req_conns, _ = self.disagg_decode_prealloc_queue.pop_preallocated()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 735, in pop_preallocated
self._resolve_pending_reqs()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 729, in _resolve_pending_reqs
decode_req.kv_receiver.init(prefill_dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1407, in init
super().init(prefill_dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/common/conn.py", line 783, in init
self._setup_bootstrap_infos()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/common/conn.py", line 835, in _setup_bootstrap_infos
self._register_kv_args()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1423, in _register_kv_args
packed_state_item_lens = b"".join(
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1424, in <genexpr>
struct.pack("I", item_len)
struct.error: required argument is not an integer
02e5a2b to
70ee756
Compare
| echo "[SETUP] Installing transformers with GLM-5 (glm_moe_dsa) support..." | ||
| pip install --quiet -U --no-cache-dir \ | ||
| "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2" | ||
| _SETUP_INSTALLED+=("transformers-glm5") |
There was a problem hiding this comment.
Missing post-install verification for transformers pip install
Medium Severity
install_transformers_glm5 does not verify that the pip install from the pinned git commit actually succeeded before recording it in _SETUP_INSTALLED. The analogous install_amd_quark function in the same file includes a post-install check (python3 -c "import quark"), but this function blindly proceeds. If the git clone or pip install fails (network issue, transient error), the GLM-5 model will fail to load later with a cryptic error that's hard to debug in CI.
Reviewed by Cursor Bugbot for commit c3bef8f. Configure here.
…p crash (#1578) sglang v0.5.12.post1 ships an unmigrated MoRI PD-disagg backend (legacy singular `state_type` + flat-int wire format) that crashes hybrid-attention models at PD-disagg startup. PR #1572's CI run (actions/runs/26544255929/job/78192785656) hit this exact failure: File ".../disaggregation/mori/conn.py", line 1424, in <genexpr> struct.pack("I", item_len) struct.error: required argument is not an integer Root cause: KVArgs.state_item_lens is List[List[int]] for any model with state_types: List[StateType] (Qwen3.5-MoE, GLM-5 NSA, etc.), but MoRI's `_register_kv_args` iterates and packs each element with naked `struct.pack("I", item_len)`, expecting flat List[int]. Two further sites in `send_state` and `_send_swa_dsa_state` have the same legacy-API assumption. This PR ports the validated overlay from Chun Fang's working branch (chun-chang/sglang-disagg-qwen3.5, commits 48e459b + c4e397d) and stacks it on top of PR #1572: - benchmarks/multi_node/amd_utils/patches/mori_conn.py (new, 1665 lines) Drop-in replacement for sglang v0.5.12.post1's disaggregation/mori/conn.py with four conservative patches: 1. Sender flatten — handle nested state_item_lens 2. state_type plural-API fallback (matches Mooncake/NIXL) 3. Consumer normalize state_item_lens at send_state entry 4. SWA/DSA rank+length normalize before group_concurrent_contiguous (fixes GLM-5 DSA single-component np.diff broadcast crash) - benchmarks/multi_node/amd_utils/patches/README.md (new) Bug analysis, when-to-use table, opt-out knob documentation. - benchmarks/multi_node/amd_utils/job.slurm (+25) Auto-bind-mount the overlay when DOCKER_IMAGE_NAME contains "v0.5.12.post1". Opt-out via MORI_CONN_PATCH=skip. Appends to ${EXTRA_DOCKER_MOUNTS:-} so callers can still inject other mounts. - .github/configs/amd-master.yaml (+1/-1) Image bump v0.5.12-...-20260517 → v0.5.12.post1-...-20260523 to unlock the auto-apply gate (matches chun-chang lineage). - perf-changelog.yaml (+2/-1) Document the image bump and overlay rationale. Validated on chun-chang at the same image tag: glm5-fp8-mi355x-sglang-disagg GSM8K strict-match = 0.9712 ± 0.0046 glm5-fp8-mi355x-sglang-disagg GSM8K flexible-extract = 0.9704 ± 0.0047 Stop-gap until sglang migrates MoRI to the plural state_types API that Mooncake (mooncake/conn.py:912) and NIXL (nixl/conn.py:1381) already use. Tracks sgl-project/sglang#21886 and #22665. Co-authored-by: chunfangamd <chun.fang@amd.com> Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
Introduce glm5-fp8-mi355x-sglang-disagg CI config, model server flags, launch script, setup_deps.sh image patches, and GLM-5 env tuning for MoRI PD disaggregation on MI355X. Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: chunfangamd <chun.fang@amd.com> Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Co-authored-by: chunfangamd <chun.fang@amd.com> Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
- Add missing FRAMEWORK to check_env_vars list to match sister sglang-disagg scripts (dsr1_fp8, dsr1_fp4) - Rename NODE_LIST to NODELIST (quoted) to match the convention used by kimik2.5/minimaxm2.5 vllm-disagg sisters Co-authored-by: functionstackx <functionstackx@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: chunfangamd <chun.fang@amd.com> Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
ca5efd9 to
f3dcff0
Compare
…p crash (#1578) sglang v0.5.12.post1 ships an unmigrated MoRI PD-disagg backend (legacy singular `state_type` + flat-int wire format) that crashes hybrid-attention models at PD-disagg startup. PR #1572's CI run (actions/runs/26544255929/job/78192785656) hit this exact failure: File ".../disaggregation/mori/conn.py", line 1424, in <genexpr> struct.pack("I", item_len) struct.error: required argument is not an integer Root cause: KVArgs.state_item_lens is List[List[int]] for any model with state_types: List[StateType] (Qwen3.5-MoE, GLM-5 NSA, etc.), but MoRI's `_register_kv_args` iterates and packs each element with naked `struct.pack("I", item_len)`, expecting flat List[int]. Two further sites in `send_state` and `_send_swa_dsa_state` have the same legacy-API assumption. This PR ports the validated overlay from Chun Fang's working branch (chun-chang/sglang-disagg-qwen3.5, commits 48e459b + c4e397d) and stacks it on top of PR #1572: - benchmarks/multi_node/amd_utils/patches/mori_conn.py (new, 1665 lines) Drop-in replacement for sglang v0.5.12.post1's disaggregation/mori/conn.py with four conservative patches: 1. Sender flatten — handle nested state_item_lens 2. state_type plural-API fallback (matches Mooncake/NIXL) 3. Consumer normalize state_item_lens at send_state entry 4. SWA/DSA rank+length normalize before group_concurrent_contiguous (fixes GLM-5 DSA single-component np.diff broadcast crash) - benchmarks/multi_node/amd_utils/patches/README.md (new) Bug analysis, when-to-use table, opt-out knob documentation. - benchmarks/multi_node/amd_utils/job.slurm (+25) Auto-bind-mount the overlay when DOCKER_IMAGE_NAME contains "v0.5.12.post1". Opt-out via MORI_CONN_PATCH=skip. Appends to ${EXTRA_DOCKER_MOUNTS:-} so callers can still inject other mounts. - .github/configs/amd-master.yaml (+1/-1) Image bump v0.5.12-...-20260517 → v0.5.12.post1-...-20260523 to unlock the auto-apply gate (matches chun-chang lineage). - perf-changelog.yaml (+2/-1) Document the image bump and overlay rationale. Validated on chun-chang at the same image tag: glm5-fp8-mi355x-sglang-disagg GSM8K strict-match = 0.9712 ± 0.0046 glm5-fp8-mi355x-sglang-disagg GSM8K flexible-extract = 0.9704 ± 0.0047 Stop-gap until sglang migrates MoRI to the plural state_types API that Mooncake (mooncake/conn.py:912) and NIXL (nixl/conn.py:1381) already use. Tracks sgl-project/sglang#21886 and #22665. Co-authored-by: chunfangamd <chun.fang@amd.com> Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
merged short term patch via chun's patch #1578 |
| len(self.state_mem_descs), | ||
| inferred, | ||
| ) | ||
| except Exception: |
| len(dst_state_indices), | ||
| new_common, | ||
| ) | ||
| except Exception: |
| len(_raw_state_dim_per_tensor), | ||
| _raw_state_dim_per_tensor[:3], | ||
| ) | ||
| except Exception: |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit a8722db. Configure here.
| pass | ||
| self.__class__._logged_dsa_index_flatten = True | ||
| src_state_indices = src_state_indices[:new_common] | ||
| dst_state_indices = dst_state_indices[:new_common] |
There was a problem hiding this comment.
Ravel after truncation causes silent state data loss
High Severity
PATCH #4's .ravel() normalization is placed after the initial length-based truncation at lines 1135–1149, but that truncation uses len() which returns the outer dimension of a 2D array. When src_state_indices is 2D shape (1, N) (from _normalize_state_indices) and dst_state_indices is 1D shape (N,) (from np.frombuffer in TransferInfo.from_zmq), the pre-truncation computes common_len = min(1, N) = 1 and truncates dst_state_indices to a single element. The subsequent ravel then finds mismatched lengths and re-truncates both to 1 element, silently discarding N−1 state entries. The ravel needs to happen before the initial length comparison, not after it.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a8722db. Configure here.
| out.append(int(x)) | ||
| else: | ||
| out.append(int(entry)) | ||
| return out |
There was a problem hiding this comment.
Duplicated flatten helpers across two methods
Low Severity
_flatten_state_seq (nested inside send_state) and _flatten_skip_none (nested inside _register_kv_args) implement identical flattening logic for List[List[int]] → List[int]. Having two copies means a bug fix in one won't automatically propagate to the other. A single module-level helper would eliminate this duplication.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a8722db. Configure here.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26555773349 |
|
/reuse-sweep-run |


Summary
glm5-fp8-mi355x-sglang-disagginamd-master.yaml: 1P+1D TP8/EP1 smoke sweep for ISL 1K1K and 8K1K, concurrency 8–512, imagelmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517(same as aggregatedglm5-fp8-mi355x-sglang)GLM-5-FP8entry inbenchmarks/multi_node/amd_utils/models.yaml(MoRI transfer, glm tool/reasoning parsers, multithread load)benchmarks/multi_node/glm5_fp8_mi355x_sglang-disagg.shsetup_deps.sh: idempotent in-container patches required for GLM-5 on ROCm imagespa_mqa_logits3Dinstr_shapefix (NSA / Triton ≥ 3.5)glm_moe_dsawhenMODEL_NAME=GLM-5-FP8env.sh: GLM-5-specific exports (SGLANG_ROCM_FUSED_DECODE_MLA=0, quick reduce, safetensors fast GPU)server.sh:source setup_deps.shbeforeenv.sh(patches run once per container start)Test plan
amd-master.yamlvalidatesglm5-fp8-mi355x-sglang-disagg(scenarios.fixed-seq-len)setup_deps.shis a no-op or fast on non–GLM-5 disagg jobs (transformers install gated onMODEL_NAME)glm5-fp8-mi355x-sglang-disagg1P+1D, ISL/OSL 1024/1024, conc=64/it-share/data/GLM-5-FP8(runner setsMODEL_NAMEfromzai-org/GLM-5-FP8)Dependencies
main(and optionally after PR [AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark #1570 for Qwen3.5 disagg)benchmark_servingtokenizer fallback, or MoRIconn.pyoverlay (follow-up PRs)Co-authors
@ChangLiu0709
@chunfangamd
@Duyi-Wang
@functionstackx
Note
Risk
Medium:
server.shnow sourcessetup_deps.shfor all SGLang disagg runs; gluon patch is global but idempotent; transformers install only runs forGLM-5-FP8.Overview
Enables GLM-5 PD disaggregation in InferenceX CI. Image patches address gaps not yet in upstream aiter/transformers/docker images; runtime env tweaks reflect GLM-5 NSA architecture vs MLA-based models like DeepSeek-R1.
Note
Medium Risk
Touches all SGLang disagg container starts via
setup_deps.shand can auto-overlay upstreammori/conn.pyfor v0.5.12.post1 images; changes affect PD state transfer correctness, not auth or data stores.Overview
Adds GLM-5-FP8 MI355X SGLang prefill–decode disaggregation to CI: new
glm5-fp8-mi355x-sglang-disagginamd-master.yaml(imagev0.5.12.post1-rocm720-mi35x-20260523, 1P+1D TP8/EP1,dp-attn: false, conc 8–512 for 1k1k and 8k1k),GLM-5-FP8inmodels.yaml, andglm5_fp8_mi355x_sglang-disagg.sh.Runtime fixes so GLM-5 (NSA / hybrid state) can run on that image: in-tree
patches/mori_conn.pybind-mounted fromjob.slurmwhen the image tag containsv0.5.12.post1(opt-outMORI_CONN_PATCH=skip);env.shGLM-5 exports;setup_deps.shgated forsglang-disagg(aiter gluon 3Dinstr_shape, transformersglm_moe_dsaonly forGLM-5-FP8);server_sglang.shsourcessetup_deps.shbeforeenv.sh.Reviewed by Cursor Bugbot for commit a8722db. Bugbot is set up for automated code reviews on this repo. Configure here.