[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes #1486
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
| CONTEXT_LENGTH=$((ISL + OSL + 20)) | ||
| MAX_PREFILL_TOKENS=32768 | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH" | ||
| fi | ||
|
|
||
| start_gpu_monitor | ||
|
|
||
| # Launch args follow sglang issue #25672 comment 4485916205: | ||
| # tilelang NSA backends + fp8_e4m3 KV cache + multithread model load. | ||
| python3 -m sglang.launch_server \ | ||
| --model-path $MODEL \ | ||
| --host=0.0.0.0 \ | ||
| --port $PORT \ | ||
| --tensor-parallel-size $TP \ | ||
| --data-parallel-size 1 \ | ||
| --trust-remote-code \ | ||
| --tool-call-parser glm47 \ | ||
| --reasoning-parser glm45 \ | ||
| --tokenizer-worker-num 6 \ | ||
| --cuda-graph-max-bs $CONC \ | ||
| --disable-radix-cache \ | ||
| --max-prefill-tokens $MAX_PREFILL_TOKENS \ | ||
| --scheduler-recv-interval 30 \ | ||
| --mem-fraction-static 0.80 \ | ||
| --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \ | ||
| --nsa-prefill-backend tilelang \ | ||
| --nsa-decode-backend tilelang \ | ||
| --kv-cache-dtype fp8_e4m3 \ | ||
| $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x |
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,89 @@ | ||||||||||||||
| #!/usr/bin/env bash | ||||||||||||||
|
|
||||||||||||||
| # GLM-5 FP8 on MI300X with EAGLE / MTP speculative decoding. | ||||||||||||||
| # Mirrors glm5_fp8_mi300x.sh and adds the speculative-* flags. | ||||||||||||||
|
|
||||||||||||||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||||||||||||||
|
|
||||||||||||||
| check_env_vars \ | ||||||||||||||
| MODEL \ | ||||||||||||||
| TP \ | ||||||||||||||
| CONC \ | ||||||||||||||
| ISL \ | ||||||||||||||
| OSL \ | ||||||||||||||
| RANDOM_RANGE_RATIO \ | ||||||||||||||
| RESULT_FILENAME \ | ||||||||||||||
| EP_SIZE | ||||||||||||||
|
|
||||||||||||||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||||||||||||||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||||||||||||||
| fi | ||||||||||||||
|
|
||||||||||||||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||||||||||||||
|
|
||||||||||||||
| SERVER_LOG=/workspace/server.log | ||||||||||||||
| PORT=${PORT:-8888} | ||||||||||||||
| CONTEXT_LENGTH=$((ISL + OSL + 20)) | ||||||||||||||
| MAX_PREFILL_TOKENS=32768 | ||||||||||||||
|
|
||||||||||||||
| EVAL_CONTEXT_ARGS="" | ||||||||||||||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||||||||||||||
|
Comment on lines +1 to +30 (Contributor)

🔴 What goes wrong

The new `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh` launches sglang with the standard EAGLE knobs (the `--speculative-*` flags), but it never sets `SGLANG_ENABLE_SPEC_V2`.

Why this matters

`perf-changelog.yaml` documents the GLM-5 EAGLE path as gated "behind" this env var on the sglang versions used here (v0.5.10–v0.5.12). That wording indicates the new spec-v2 codepath is selected only when the env var is set; without it, sglang falls back to the v1 spec path (or, in some builds, ignores the flags entirely). Either way, the codepath differs from the one on which every other GLM-5 MTP datapoint already in perf history was collected.

Fix

Add the env-var export near the top of the script: `export SGLANG_ENABLE_SPEC_V2=1`. A natural place is right after the `source ../benchmark_lib.sh` line, alongside the other env-driven knobs. |
||||||||||||||
| setup_eval_context | ||||||||||||||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||||||||||||||
| else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH" | ||||||||||||||
| fi | ||||||||||||||
|
|
||||||||||||||
| start_gpu_monitor | ||||||||||||||
|
|
||||||||||||||
| # Launch args follow sglang issue #25672 comment 4485916205: | ||||||||||||||
| # tilelang NSA backends + fp8_e4m3 KV cache + multithread model load, | ||||||||||||||
| # plus EAGLE/MTP speculative decoding. | ||||||||||||||
| python3 -m sglang.launch_server \ | ||||||||||||||
| --model-path $MODEL \ | ||||||||||||||
| --host=0.0.0.0 \ | ||||||||||||||
| --port $PORT \ | ||||||||||||||
| --tensor-parallel-size $TP \ | ||||||||||||||
| --ep-size $EP_SIZE \ | ||||||||||||||
| --trust-remote-code \ | ||||||||||||||
| --tool-call-parser glm47 \ | ||||||||||||||
| --reasoning-parser glm45 \ | ||||||||||||||
| --tokenizer-worker-num 6 \ | ||||||||||||||
| --cuda-graph-max-bs $CONC \ | ||||||||||||||
| --disable-radix-cache \ | ||||||||||||||
| --max-prefill-tokens $MAX_PREFILL_TOKENS \ | ||||||||||||||
| --scheduler-recv-interval 30 \ | ||||||||||||||
| --mem-fraction-static 0.80 \ | ||||||||||||||
| --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \ | ||||||||||||||
| --nsa-prefill-backend tilelang \ | ||||||||||||||
| --nsa-decode-backend tilelang \ | ||||||||||||||
| --kv-cache-dtype fp8_e4m3 \ | ||||||||||||||
| --speculative-algorithm EAGLE \ | ||||||||||||||
| --speculative-num-steps 3 \ | ||||||||||||||
| --speculative-eagle-topk 1 \ | ||||||||||||||
| --speculative-num-draft-tokens 4 \ | ||||||||||||||
| $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & | ||||||||||||||
|
|
||||||||||||||
| SERVER_PID=$! | ||||||||||||||
|
|
||||||||||||||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||||||||||||||
|
|
||||||||||||||
| run_benchmark_serving \ | ||||||||||||||
| --model "$MODEL" \ | ||||||||||||||
| --port "$PORT" \ | ||||||||||||||
| --backend vllm \ | ||||||||||||||
| --input-len "$ISL" \ | ||||||||||||||
| --output-len "$OSL" \ | ||||||||||||||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||||||||||||||
| --num-prompts "$((CONC * 10))" \ | ||||||||||||||
| --max-concurrency "$CONC" \ | ||||||||||||||
| --result-filename "$RESULT_FILENAME" \ | ||||||||||||||
| --result-dir /workspace/ \ | ||||||||||||||
| --use-chat-template | ||||||||||||||
|
|
||||||||||||||
| if [ "${RUN_EVAL}" = "true" ]; then | ||||||||||||||
| run_eval --framework lm-eval --port "$PORT" | ||||||||||||||
| append_lm_eval_summary | ||||||||||||||
| fi | ||||||||||||||
|
|
||||||||||||||
| stop_gpu_monitor | ||||||||||||||
| set +x | ||||||||||||||
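The inline comment on lines +1 to +30 of this file recommends exporting `SGLANG_ENABLE_SPEC_V2=1` before the server launch. A minimal sketch of that change follows, with a fail-fast guard added. The guard function `require_spec_v2` is hypothetical and not part of the PR or of `benchmark_lib.sh`; the env-var name is taken from the review comment, not verified against sglang here.

```shell
#!/usr/bin/env bash
# Assumption: SGLANG_ENABLE_SPEC_V2 is the gate named in the review comment.
export SGLANG_ENABLE_SPEC_V2=1

# Hypothetical guard: refuse to continue if the gate is not set to 1,
# so an MTP run cannot silently fall back to a different spec codepath.
require_spec_v2() {
  if [[ "${SGLANG_ENABLE_SPEC_V2:-0}" != "1" ]]; then
    echo "SGLANG_ENABLE_SPEC_V2 must be 1 for the MTP recipe" >&2
    return 1
  fi
}

require_spec_v2
```

Placing the guard just before `python3 -m sglang.launch_server` would make a missing export fail the job instead of producing a datapoint from an unintended codepath.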
🔴 The new `glm5-fp8-mi300x-sglang-mtp` recipe (with `spec-decoding: mtp`) will never invoke the new `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh` script: `runners/launch_mi300x-amds.sh:41` hardcodes `bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh` and never appends a `SPEC_SUFFIX`, so both `glm5-fp8-mi300x-sglang` and `glm5-fp8-mi300x-sglang-mtp` resolve to `glm5_fp8_mi300x.sh` (the vanilla decode path). Because `benchmark-tmpl.yml:180` bakes `spec-${SPEC_DECODING}` into `RESULT_FILENAME`, the run is recorded as MTP data while actually executing without speculative decoding, so the perf numbers are silently misattributed. Fix by adding `SPEC_SUFFIX` dispatch to `launch_mi300x-amds.sh`, mirroring `launch_mi355x-amds.sh:182-228`, or drop the `-mtp` recipe and the `_mtp.sh` script from this PR until the launcher supports it.

Extended reasoning

Bug: MI300X launcher does not dispatch to `_mtp.sh`

What is broken

The PR adds two recipes in `.github/configs/amd-master.yaml`:

- `glm5-fp8-mi300x-sglang` → expected to run `benchmarks/single_node/glm5_fp8_mi300x.sh`
- `glm5-fp8-mi300x-sglang-mtp` (with `spec-decoding: mtp`) → expected to run `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh`

The second recipe is the one that breaks: the MI300X launcher does not know how to route to the `_mtp.sh` variant.

The misrouting code

`runners/launch_mi300x-amds.sh:41` hardcodes the script path quoted above. There is no `SPEC_SUFFIX`/`FRAMEWORK_SUFFIX` computation anywhere in this file (verified by re-reading the whole 43-line script: nothing before the `bash` invocation computes `SPEC_SUFFIX`).

Compare `runners/launch_mi355x-amds.sh:182-228`, which implements the working pattern.

Step-by-step proof for `glm5-fp8-mi300x-sglang-mtp` at `isl=1024, osl=1024, tp=8, conc=4`:

1. `utils/matrix_logic/generate_sweep_configs.py:362` produces `EXP_NAME = f"{model_code}_{seq_len_str}"`, i.e. `EXP_NAME = "glm5_1k1k"`.
2. The job runs with `MODEL=zai-org/GLM-5-FP8`, `PRECISION=fp8`, `FRAMEWORK=sglang`, `SPEC_DECODING=mtp`, `EXP_NAME=glm5_1k1k`, `EP_SIZE=1`.
3. At `launch_mi300x-amds.sh:41`, `${EXP_NAME%%_*}` strips everything from the first underscore onward, leaving `"glm5"`.
4. The resolved path is `benchmarks/single_node/glm5_fp8_mi300x.sh`: the vanilla decode script, not the new `glm5_fp8_mi300x_mtp.sh`.
5. `glm5_fp8_mi300x.sh` is invoked. It has none of `--speculative-algorithm EAGLE`, `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens`, `--ep-size`, or `--use-chat-template`. The server starts in vanilla decode mode.
6. `.github/workflows/benchmark-tmpl.yml:180` defines `RESULT_FILENAME = "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-${DISAGG}_spec-${SPEC_DECODING}_conc${CONC}_${RUNNER_NAME}"`, so the result file name still contains `spec-mtp`. The datapoint is filed in perf history as MTP data, but it was produced by the vanilla decode path. This is silent misattribution.

Why nothing catches this

- `glm5_fp8_mi300x.sh`'s `check_env_vars` requires only `MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME` (no `EP_SIZE`), so the misrouted run does not fail loudly. It produces a plausible result file under a misleading name.
- `glm5_fp8_mi300x_mtp.sh` does require `EP_SIZE`, which would have surfaced the misrouting, but it is never executed.
- `grep -n 'mi300x.*mtp' .github/configs/amd-master.yaml` returns only this PR's new entry (line 1840). This is the first MI300X recipe to depend on a `SPEC_SUFFIX` dispatch, so the absence has gone unnoticed.
- The `runners/launch_mi300x*.sh` glob returns exactly one file, so there is no alternate launcher that could pick up the slack.
- `.github/workflows/benchmark-tmpl.yml` dispatches via `bash ./runners/launch_${RUNNER_NAME%%_*}.sh`, so MI300X jobs go through this one launcher only.

Impact

The `-mtp` variant of the new recipe is non-functional end-to-end on the targeted runner, and the resulting `spec-mtp` perf datapoints would be vanilla decode numbers in disguise. This is exactly the failure mode the bug description calls out, and it should block the PR.

Fix options

1. Add `SPEC_SUFFIX` dispatch to `runners/launch_mi300x-amds.sh`, mirroring `launch_mi355x-amds.sh:182-228` (compute `SPEC_SUFFIX`, append it to the script path, optionally with a fallback chain). This is the proper fix and lets the new `_mtp.sh` actually run.
2. Drop the `glm5-fp8-mi300x-sglang-mtp` recipe and `glm5_fp8_mi300x_mtp.sh` from this PR and re-add them after the launcher gains MTP support. The non-MTP `glm5-fp8-mi300x-sglang` recipe is unaffected and can ship as-is.
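The first fix option can be sketched roughly as below. This is not the actual `launch_mi355x-amds.sh` logic, which is not shown in this review; the function name `resolve_benchmark_script` and the convention that `none` (or an empty value) means "no speculative decoding" are assumptions made for illustration. Only the path template and the `${EXP_NAME%%_*}` expansion come from the comment above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the benchmark script path, appending a
# spec-decoding suffix when one is requested.
resolve_benchmark_script() {
  local exp_name="$1" precision="$2" spec_decoding="${3:-none}" subdir="${4:-}"
  local spec_suffix=""

  # Assumption: "none" or empty means vanilla decode; anything else
  # maps to a "_<value>" suffix, e.g. mtp -> _mtp.
  if [[ -n "$spec_decoding" && "$spec_decoding" != "none" ]]; then
    spec_suffix="_${spec_decoding}"
  fi

  # ${exp_name%%_*} drops everything from the first underscore onward:
  # glm5_1k1k -> glm5
  printf 'benchmarks/single_node/%s%s_%s_mi300x%s.sh\n' \
    "$subdir" "${exp_name%%_*}" "$precision" "$spec_suffix"
}

resolve_benchmark_script glm5_1k1k fp8 mtp   # -> benchmarks/single_node/glm5_fp8_mi300x_mtp.sh
resolve_benchmark_script glm5_1k1k fp8 none  # -> benchmarks/single_node/glm5_fp8_mi300x.sh
```

A launcher using this would run `bash "$(resolve_benchmark_script "$EXP_NAME" "$PRECISION" "$SPEC_DECODING" "$SCENARIO_SUBDIR")"`. Checking that the resolved file exists before invoking it would turn any future misrouting into a hard failure rather than a mislabeled datapoint.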