38 changes: 38 additions & 0 deletions .github/configs/amd-master.yaml
@@ -1933,3 +1933,41 @@ glm5-fp8-mi325x-sglang-mtp:
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

glm5-fp8-mi300x-sglang:
  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: mi300x
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64 }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64 }

glm5-fp8-mi300x-sglang-mtp:
  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: mi300x
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
Comment on lines +1955 to +1973

🔴 The new glm5-fp8-mi300x-sglang-mtp recipe (with spec-decoding: mtp) will never invoke the new benchmarks/single_node/glm5_fp8_mi300x_mtp.sh script: runners/launch_mi300x-amds.sh:41 hardcodes bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh and never appends a SPEC_SUFFIX, so both glm5-fp8-mi300x-sglang and glm5-fp8-mi300x-sglang-mtp resolve to glm5_fp8_mi300x.sh (the vanilla decode path). Because benchmark-tmpl.yml:180 bakes spec-${SPEC_DECODING} into RESULT_FILENAME, the run is recorded as MTP data while actually executing without speculative decoding — silently misattributed perf numbers. Fix by adding SPEC_SUFFIX dispatch to launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228, or drop the -mtp recipe + _mtp.sh script from this PR until the launcher supports it.

Extended reasoning...

Bug: MI300X launcher does not dispatch to _mtp.sh

What is broken

The PR adds two recipes in .github/configs/amd-master.yaml:

  • glm5-fp8-mi300x-sglang → expected to run benchmarks/single_node/glm5_fp8_mi300x.sh
  • glm5-fp8-mi300x-sglang-mtp (with spec-decoding: mtp) → expected to run benchmarks/single_node/glm5_fp8_mi300x_mtp.sh

The second one is the one that breaks. The MI300X launcher does not know how to route to the _mtp.sh variant.

The misrouting code

runners/launch_mi300x-amds.sh:41 hardcodes:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh

There is no SPEC_SUFFIX/FRAMEWORK_SUFFIX computation anywhere in this file (verified by re-reading the whole 43-line script — every line is shown above the bash invocation, and nothing computes SPEC_SUFFIX).

Compare runners/launch_mi355x-amds.sh:182-228, which is the working pattern:

FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "atom" ]] && printf '_atom' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
SCRIPT_BASE="${EXP_NAME%%_*}_${PRECISION}_mi355x"
SCRIPT_FW="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
SCRIPT_FALLBACK="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"

Step-by-step proof for glm5-fp8-mi300x-sglang-mtp at isl=1024, osl=1024, tp=8, conc=4:

  1. Sweep generator (utils/matrix_logic/generate_sweep_configs.py:362) produces EXP_NAME = f"{model_code}_{seq_len_str}", i.e. EXP_NAME = "glm5_1k1k".
  2. Env vars passed to launcher: MODEL=zai-org/GLM-5-FP8, PRECISION=fp8, FRAMEWORK=sglang, SPEC_DECODING=mtp, EXP_NAME=glm5_1k1k, EP_SIZE=1.
  3. In launch_mi300x-amds.sh:41, ${EXP_NAME%%_*} strips at the first underscore → "glm5".
  4. The composed path is benchmarks/single_node/glm5_fp8_mi300x.sh — the vanilla decode script, not the new glm5_fp8_mi300x_mtp.sh.
  5. glm5_fp8_mi300x.sh is invoked. It has none of --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens, --ep-size, or --use-chat-template. The server starts in vanilla decode mode.
  6. .github/workflows/benchmark-tmpl.yml:180 defines RESULT_FILENAME = "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-${DISAGG}_spec-${SPEC_DECODING}_conc${CONC}_${RUNNER_NAME}", so the result file name still contains spec-mtp. The datapoint is filed in perf history as MTP data, but it was produced by the vanilla decode path. Silent misattribution.
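
The path composition in steps 3 and 4 can be reproduced in isolation. The snippet below is a minimal sketch: `compose_mi300x_path` is a hypothetical helper written for illustration, not a function in the repo. It copies only the expansion quoted from `launch_mi300x-amds.sh:41` and shows that `SPEC_DECODING` never enters the result.

```shell
#!/usr/bin/env bash
# Illustration only: compose_mi300x_path is a hypothetical helper that
# mirrors the path expression quoted from launch_mi300x-amds.sh:41.
compose_mi300x_path() {
    local exp_name="$1" precision="$2" scenario_subdir="${3:-}"
    # ${exp_name%%_*} drops everything from the first underscore onward,
    # so "glm5_1k1k" becomes "glm5".
    printf 'benchmarks/single_node/%s%s_%s_mi300x.sh\n' \
        "$scenario_subdir" "${exp_name%%_*}" "$precision"
}

# SPEC_DECODING is set for the MTP recipe, but nothing above reads it.
SPEC_DECODING=mtp
compose_mi300x_path "glm5_1k1k" "fp8"
# prints: benchmarks/single_node/glm5_fp8_mi300x.sh
```

The same output is produced whether `SPEC_DECODING` is `mtp` or unset, which is the misrouting described above.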

Why nothing catches this

  • glm5_fp8_mi300x.sh's check_env_vars requires only MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME (no EP_SIZE), so the misrouted run does not fail loudly. It produces a plausible result file under a misleading name.
  • The new glm5_fp8_mi300x_mtp.sh does require EP_SIZE, which would have surfaced the misrouting — but it is never executed.
  • grep -n 'mi300x.*mtp' .github/configs/amd-master.yaml returns only this PR's new entry (line 1840). This is the first MI300X recipe to depend on a SPEC_SUFFIX dispatch, so the absence has gone unnoticed.
  • runners/launch_mi300x*.sh glob returns exactly one file — there is no alternate launcher that could pick up the slack.
  • .github/workflows/benchmark-tmpl.yml dispatches via bash ./runners/launch_${RUNNER_NAME%%_*}.sh, so MI300X jobs go through this one launcher only.
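
Independently of the routing fix, a launcher could refuse to run when the requested decoding mode and the resolved script disagree. The following is a sketch under assumptions: `assert_spec_script_matches` is a hypothetical guard, not something that exists in the repo, and it assumes MTP scripts keep the `_mtp.sh` filename suffix used by the scripts cited in this review.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not in the repo): fail loudly when SPEC_DECODING=mtp
# but the resolved benchmark script is not an *_mtp.sh variant.
assert_spec_script_matches() {
    local spec_decoding="$1" script_path="$2"
    case "$spec_decoding" in
        mtp)
            if [[ "$script_path" != *_mtp.sh ]]; then
                echo "ERROR: SPEC_DECODING=mtp but resolved script is $script_path" >&2
                return 1
            fi
            ;;
    esac
    return 0
}
```

Called just before the `bash benchmarks/single_node/...` invocation, a check of this shape would turn the silent misattribution into a failed job.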

Impact

The -mtp variant of the new recipe is non-functional end-to-end on the targeted runner, and the resulting spec-mtp perf datapoints would be vanilla decode numbers in disguise. This is exactly the failure mode the bug description calls out, and it should block the PR.

Fix options

  1. Add SPEC_SUFFIX dispatch to runners/launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228 (compute SPEC_SUFFIX, append it to the script path, optionally with a fallback chain). This is the proper fix and lets the new _mtp.sh actually run.
  2. Drop the glm5-fp8-mi300x-sglang-mtp recipe and glm5_fp8_mi300x_mtp.sh from this PR and re-add them after the launcher gains MTP support. The non-MTP glm5-fp8-mi300x-sglang recipe is unaffected and can ship as-is.
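
A minimal sketch of option 1, assuming the launcher receives `EXP_NAME`, `PRECISION`, `SPEC_DECODING` and optionally `SCENARIO_SUBDIR` as environment variables (as listed in the proof above). `resolve_mi300x_script` is a hypothetical name; the suffix logic follows the `SPEC_SUFFIX` line quoted from `launch_mi355x-amds.sh`, without that file's framework fallback chain.

```shell
#!/usr/bin/env bash
# Hypothetical helper for runners/launch_mi300x-amds.sh: append "_mtp"
# to the script name when SPEC_DECODING=mtp, otherwise keep the vanilla path.
resolve_mi300x_script() {
    local spec_suffix=""
    if [[ "${SPEC_DECODING:-}" == "mtp" ]]; then
        spec_suffix="_mtp"
    fi
    printf 'benchmarks/single_node/%s%s_%s_mi300x%s.sh\n' \
        "${SCENARIO_SUBDIR:-}" "${EXP_NAME%%_*}" "$PRECISION" "$spec_suffix"
}

# Intended use inside the launcher (left commented so the sketch has no side effects):
# bash "$(resolve_mi300x_script)"
```

With `EXP_NAME=glm5_1k1k`, `PRECISION=fp8` and `SPEC_DECODING=mtp`, this resolves to `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh`; without `SPEC_DECODING=mtp` it resolves to the existing `glm5_fp8_mi300x.sh`, so the non-MTP recipe is unaffected.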

79 changes: 79 additions & 0 deletions benchmarks/single_node/glm5_fp8_mi300x.sh
@@ -0,0 +1,79 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

start_gpu_monitor

# Launch args follow sglang issue #25672 comment 4485916205:
# tilelang NSA backends + fp8_e4m3 KV cache + multithread model load.
python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --data-parallel-size 1 \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --tokenizer-worker-num 6 \
    --cuda-graph-max-bs $CONC \
    --disable-radix-cache \
    --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.80 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang \
    --kv-cache-dtype fp8_e4m3 \
    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
89 changes: 89 additions & 0 deletions benchmarks/single_node/glm5_fp8_mi300x_mtp.sh
@@ -0,0 +1,89 @@
#!/usr/bin/env bash

# GLM-5 FP8 on MI300X with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_mi300x.sh and adds the speculative-* flags.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
Comment on lines +1 to +30

🔴 The new glm5_fp8_mi300x_mtp.sh passes the --speculative-* CLI flags but does not export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP sglang recipe in this repo sets (glm5_fp8_mi355x_mtp.sh:25, glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29). perf-changelog.yaml repeatedly describes the GLM-5 EAGLE codepath as gated "behind SGLANG_ENABLE_SPEC_V2=1" — without that env var the MTP datapoints for glm5-fp8-mi300x-sglang-mtp will run a different codepath than every other GLM-5 MTP datapoint in perf history, breaking cross-runner comparability. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the top of the script, mirroring glm5_fp8_mi355x_mtp.sh:25.

Extended reasoning...

What goes wrong

The new `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh` launches sglang with the standard EAGLE knobs:

--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

...but it never sets `SGLANG_ENABLE_SPEC_V2=1`. Every other GLM-5 MTP launch script in this repo does set it, right next to the same `--speculative-*` flags:

| Script | Line |
| --- | --- |
| `benchmarks/single_node/glm5_fp8_mi355x_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b300_mtp.sh` | 29 |
| `benchmarks/single_node/glm5_fp4_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp4_b300_mtp.sh` | 29 |

Why this matters

`perf-changelog.yaml` documents the GLM-5 EAGLE path as gated on this env var on the sglang versions used here (v0.5.10–v0.5.12):

  • line 1623 / 1633 / 1643 — "Mirrors the glm5-fp8-XXX-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"
  • line 2219 — "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"

The wording "behind" indicates the new spec-v2 codepath is selected only when the env var is set; without it, sglang falls back to the v1 spec path (or in some builds, ignores the flags entirely). Either way the codepath is different from the one every other GLM-5 MTP datapoint already in perf history was collected on.

Step-by-step proof

  1. PR adds `glm5_fp8_mi300x_mtp.sh` modeled after the mi300x non-MTP script + EAGLE knobs.
  2. Compare line-by-line against `glm5_fp8_mi355x_mtp.sh` (the canonical GLM-5 MTP sibling): mi355x sets `export SGLANG_ENABLE_SPEC_V2=1` at line 25; mi300x has no such export anywhere in the file.
  3. `perf-changelog.yaml` lines 1623/1633/1643/1653/1663/2219 all describe GLM-5 EAGLE as living **behind** `SGLANG_ENABLE_SPEC_V2=1`.
  4. Therefore on `lmsysorg/sglang:v0.5.12-rocm720-mi30x` (the image this recipe uses), starting sglang without `SGLANG_ENABLE_SPEC_V2=1` will route the `--speculative-*` flags through the spec-v1 path (or no-op), not the spec-v2 EAGLE path that mi355x/b200/b300 MTP numbers were collected on.
  5. Result: the `glm5-fp8-mi300x-sglang-mtp` datapoints recorded by the full-sweep CI will measure a different (and per the changelog, less performant) codepath than the rest of the GLM-5 MTP perf history, so cross-runner comparisons are invalid.
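
The line-by-line comparison in step 2 can be automated. The sketch below is an assumption-laden illustration: `find_mtp_scripts_missing_spec_v2` is a hypothetical lint helper, not part of the repo, and it assumes the export is written on its own line in the form `export SGLANG_ENABLE_SPEC_V2=1`, as the sibling scripts are described to do.

```shell
#!/usr/bin/env bash
# Hypothetical lint helper (not in the repo): list GLM-5 MTP scripts in a
# directory that do not export SGLANG_ENABLE_SPEC_V2=1.
find_mtp_scripts_missing_spec_v2() {
    local dir="$1" script
    for script in "$dir"/glm5_*_mtp.sh; do
        # Skip the unexpanded glob pattern when no script matches.
        [[ -e "$script" ]] || continue
        if ! grep -Eq '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1' "$script"; then
            printf '%s\n' "$script"
        fi
    done
}

# Example invocation from the repo root (commented out; the path is taken from this review):
# find_mtp_scripts_missing_spec_v2 benchmarks/single_node
```

Given the script lines cited in the table above, a check of this shape would be expected to flag only the new `glm5_fp8_mi300x_mtp.sh`.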

Fix

Add the env-var export near the top of `glm5_fp8_mi300x_mtp.sh`, mirroring `glm5_fp8_mi355x_mtp.sh:25`:

export SGLANG_ENABLE_SPEC_V2=1

A natural place is right after the `source ../benchmark_lib.sh` line, alongside the other env-driven knobs.

    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

start_gpu_monitor

# Launch args follow sglang issue #25672 comment 4485916205:
# tilelang NSA backends + fp8_e4m3 KV cache + multithread model load,
# plus EAGLE/MTP speculative decoding.
python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --ep-size $EP_SIZE \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --tokenizer-worker-num 6 \
    --cuda-graph-max-bs $CONC \
    --disable-radix-cache \
    --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.80 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang \
    --kv-cache-dtype fp8_e4m3 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -3050,3 +3050,10 @@
  description:
    - "Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475

- config-keys:
    - glm5-fp8-mi300x-sglang
    - glm5-fp8-mi300x-sglang-mtp
  description:
    - "Add GLM-5 FP8 SGLang ROCm recipes (off + MTP/EAGLE) for MI300X on lmsysorg/sglang:v0.5.12-rocm720-mi30x"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1486