-
Notifications
You must be signed in to change notification settings - Fork 181
[NV] add minimax fp4 h100 vllm #1517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| EP_SIZE \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| hf download "$MODEL" | ||
|
|
||
| nvidia-smi | ||
|
|
||
| export PYTHONNOUSERSITE=1 | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
|
|
||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| fi | ||
|
|
||
| if [ "$EP_SIZE" -gt 1 ]; then | ||
| EP=" --enable-expert-parallel" | ||
| else | ||
| EP=" " | ||
| fi | ||
|
|
||
| # Start GPU monitoring (power, temperature, clocks every second) | ||
| start_gpu_monitor | ||
|
|
||
| set -x | ||
| vllm serve $MODEL --host 0.0.0.0 --port $PORT \ | ||
| --tensor-parallel-size=$TP \ | ||
| $EP \ | ||
| --trust-remote-code \ | ||
| --enable-auto-tool-choice \ | ||
| --tool-call-parser minimax_m2 \ | ||
| --reasoning-parser minimax_m2_append_think \ | ||
| --max-num-seqs $CONC \ | ||
| > $SERVER_LOG 2>&1 & | ||
|
Comment on lines
+41
to
+50
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 This new vLLM script omits four flags/handling that every sibling Extended reasoning...What the bug is
The siblings —
This new fp4_h100 script omits all of (1)–(5). Step-by-step proof of impactBenchmark phase (e.g.
EVAL_ONLY phase:
How to fixBring the script in line with check_env_vars \
MODEL \
TP \
EP_SIZE \
CONC \
ISL \
OSL \
+ MAX_MODEL_LEN \
RANDOM_RANGE_RATIO \
RESULT_FILENAME if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
+ MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--tensor-parallel-size=$TP \
$EP \
+--gpu-memory-utilization 0.90 \
+--max-model-len $MAX_MODEL_LEN \
+--no-enable-prefix-caching \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-num-seqs $CONC \Why this isn't pre-existingThis script is added in this PR (the diff shows it created from |
||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| # Wait for server to be ready | ||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code | ||
|
|
||
| # After throughput, run evaluation only if RUN_EVAL is true | ||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| # Stop GPU monitoring | ||
| stop_gpu_monitor | ||
| set +x | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Line 21 unconditionally invokes
hf download "$MODEL", but every sibling minimaxm2.5 script (fp8_h100.sh:20, fp4_b200.sh:23, fp4_b300.sh:27, fp4_mi355x.sh:20, fp8_b300.sh:26, fp8_mi300x.sh:19, fp8_mi325x.sh:20, fp8_mi355x.sh:20) wraps it inif [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi. When CI pre-stages the model and passes an absolute local path asMODEL(a documented pattern — see the explicit comment in dsv4_fp4_b300_sglang.sh:19-23),hf downloadwill treat the path as a HuggingFace repo id and either fail outright or trigger a redundant network attempt. Please add the same guard for consistency.Extended reasoning...
Bug
benchmarks/single_node/minimaxm2.5_fp4_h100.shline 21 callshf download "$MODEL"unconditionally. Acrossbenchmarks/single_node/, the dominant convention — and specifically the one used by every otherminimaxm2.5_*script — is to guard this call with a path check:The guard exists because CI runners (and EVAL_ONLY flows) routinely pre-stage models to local absolute paths and set
MODELto that path rather than to a HuggingFace repo id. The rationale is spelled out explicitly inbenchmarks/single_node/dsv4_fp4_b300_sglang.sh:19-23: "The B300 runner overrides MODEL to a pre-staged /data/models path, so skip hf download. Only fetch when MODEL looks like a HF repo ID."How it manifests
When the runner sets
MODEL=/data/models/nvidia/MiniMax-M2.5-NVFP4(an absolute path), execution reaches line 21 and invokeshf download /data/models/nvidia/MiniMax-M2.5-NVFP4. The Hugging Face CLI treats its positional argument as a repo id of the formnamespace/name, so it will either reject the path (since/data/models/nvidia/MiniMax-M2.5-NVFP4is not a valid repo id) and abort the script with a non-zero exit, or — if it interprets some prefix as a valid id — perform an unnecessary network download and shadow the pre-staged copy. Either way it breaks the pre-staged-model contract that every sibling script honors.Step-by-step proof
MODEL=/data/models/MiniMax-M2.5-NVFP4,EVAL_ONLY=true, etc.benchmarks/single_node/minimaxm2.5_fp4_h100.shis invoked.check_env_varspasses becauseMODELis set.hf download /data/models/MiniMax-M2.5-NVFP4.repo_idregex (contains leading/and slashes beyond the one allowed separator), and exits non-zero — the script aborts beforevllm serveis reached.minimaxm2.5_fp8_h100.sh:20, where the guard[[ "$MODEL" != /* ]]evaluates false, thehf downloadis skipped, and execution proceeds tovllm serveusing the pre-staged path.Why existing code does not prevent it
There is no upstream early-exit for absolute-path
MODELvalues in this script.check_env_varsonly verifies presence, not shape. TheEVAL_ONLYbranch on line 28 runs after line 21, so it cannot rescue an already-failed download.vllm serve $MODELwould happily accept a local path, but the script never reaches it.Impact
Any CI lane that pre-stages
nvidia/MiniMax-M2.5-NVFP4locally (the same pattern used for B300, MI355x, and other large-model runners) will fail at line 21 the first time this benchmark runs. Lanes that always use HF repo ids will work, so the bug is silent until a pre-staged path is wired in.Fix
Replace line 21:
This matches the convention in every other minimaxm2.5 sibling script and the explicit guidance in
dsv4_fp4_b300_sglang.sh.