[NV] add minimax fp4 h100 vllm#1517
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
2 similar comments
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| set -x | ||
| vllm serve $MODEL --host 0.0.0.0 --port $PORT \ | ||
| --tensor-parallel-size=$TP \ | ||
| $EP \ | ||
| --trust-remote-code \ | ||
| --enable-auto-tool-choice \ | ||
| --tool-call-parser minimax_m2 \ | ||
| --reasoning-parser minimax_m2_append_think \ | ||
| --max-num-seqs $CONC \ | ||
| > $SERVER_LOG 2>&1 & |
There was a problem hiding this comment.
🔴 This new vLLM script omits four flags/handling that every sibling minimaxm2.5 vLLM script sets: --max-model-len $MAX_MODEL_LEN (with MAX_MODEL_LEN in check_env_vars), --gpu-memory-utilization 0.90, --no-enable-prefix-caching, and the MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" assignment inside the EVAL_ONLY block. Without these, the server is likely to OOM at conc-end: 512 on tp=8 H100 (MiniMax-M2.5 has a 192k+ default context), default-on prefix caching will inflate benchmark throughput once CONC*10 prompts share prefixes (making numbers incomparable to the other minimaxm2.5 configs), and EVAL_ONLY mode is silently broken because setup_eval_context's output is never consumed.
Extended reasoning...
What the bug is
benchmarks/single_node/minimaxm2.5_fp4_h100.sh was written as a new minimaxm2.5 vLLM benchmark script, but it diverges from every sibling minimaxm2.5 vLLM script in four ways that all affect correctness/comparability of the benchmark, not just style.
The siblings — minimaxm2.5_fp8_h100.sh, minimaxm2.5_fp4_b200.sh, and minimaxm2.5_fp4_b300.sh — uniformly:
- Include
MAX_MODEL_LENincheck_env_vars(fp8_h100 line 12, fp4_b200 line 13). - Inside the
EVAL_ONLYblock, assignMAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"immediately aftersetup_eval_context(fp8_h100 line 31, fp4_b200 line 40). - Pass
--gpu-memory-utilization 0.90tovllm serve(fp8_h100 line 47, fp4_b200 line 48). - Pass
--max-model-len $MAX_MODEL_LEN(fp8_h100 line 48, fp4_b200 line 49). - Pass
--no-enable-prefix-caching(fp8_h100 line 50, fp4_b200 line 53).
This new fp4_h100 script omits all of (1)–(5).
Step-by-step proof of impact
Benchmark phase (e.g. isl: 1024, osl: 1024, tp: 8, ep: 1, conc-end: 512 from nvidia-master.yaml):
- CI invokes the script with
CONC=512,ISL=1024,OSL=1024.MAX_MODEL_LENis not incheck_env_vars, so the script does not require it and proceeds. vllm serveis launched without--max-model-len. vLLM falls back to the model config'smax_position_embeddings. MiniMax-M2.5's published config is well over 192k.- KV cache memory scales with
max_model_len * max_num_seqs. With--max-num-seqs 512and a 192k+ model length, KV cache allocation on 8×H100 is likely to OOM or aggressively pre-empt. Sibling scripts dodge this by pinning--max-model-lento the env-supplied value (typically ~ISL+OSL+margin). - Even if the server boots,
run_benchmark_servingissuesCONC*10 = 5120random prompts. With prefix caching defaulted to on in recent vLLM, the random prompts share the chat-template prefix and common tokens, so cache hits accumulate over the run, inflating TPS in a way that the sibling configs (which set--no-enable-prefix-caching) explicitly avoid for benchmark consistency. The recorded number is not directly comparable to the fp8_h100 or fp4_b200/b300 numbers inperf-changelog.yaml.
EVAL_ONLY phase:
- Operator sets
EVAL_ONLY=true. Lines 28–30 callsetup_eval_context, which (per sibling pattern) populatesEVAL_MAX_MODEL_LEN. - The sibling pattern is
MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"— this script omits that line. - Because
MAX_MODEL_LENis also never passed tovllm servehere, the missing assignment doesn't change behavior in this file, but it does meansetup_eval_contextis a no-op — eval mode runs against whatever default context vLLM picks. Combined with the missing--max-model-lenflag, eval-mode runs use the full model context and have the same OOM risk as benchmark mode, defeating the purpose of having a separate eval context.
How to fix
Bring the script in line with minimaxm2.5_fp8_h100.sh:
check_env_vars \
MODEL \
TP \
EP_SIZE \
CONC \
ISL \
OSL \
+ MAX_MODEL_LEN \
RANDOM_RANGE_RATIO \
RESULT_FILENAME if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
+ MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--tensor-parallel-size=$TP \
$EP \
+--gpu-memory-utilization 0.90 \
+--max-model-len $MAX_MODEL_LEN \
+--no-enable-prefix-caching \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-num-seqs $CONC \Why this isn't pre-existing
This script is added in this PR (the diff shows it created from /dev/null), so the divergence from the sibling minimaxm2.5 scripts is introduced here.
|
|
||
| hf download "$MODEL" | ||
|
|
||
| nvidia-smi |
There was a problem hiding this comment.
🔴 Line 21 unconditionally invokes hf download "$MODEL", but every sibling minimaxm2.5 script (fp8_h100.sh:20, fp4_b200.sh:23, fp4_b300.sh:27, fp4_mi355x.sh:20, fp8_b300.sh:26, fp8_mi300x.sh:19, fp8_mi325x.sh:20, fp8_mi355x.sh:20) wraps it in if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi. When CI pre-stages the model and passes an absolute local path as MODEL (a documented pattern — see the explicit comment in dsv4_fp4_b300_sglang.sh:19-23), hf download will treat the path as a HuggingFace repo id and either fail outright or trigger a redundant network attempt. Please add the same guard for consistency.
Extended reasoning...
Bug
benchmarks/single_node/minimaxm2.5_fp4_h100.sh line 21 calls hf download "$MODEL" unconditionally. Across benchmarks/single_node/, the dominant convention — and specifically the one used by every other minimaxm2.5_* script — is to guard this call with a path check:
if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fiThe guard exists because CI runners (and EVAL_ONLY flows) routinely pre-stage models to local absolute paths and set MODEL to that path rather than to a HuggingFace repo id. The rationale is spelled out explicitly in benchmarks/single_node/dsv4_fp4_b300_sglang.sh:19-23: "The B300 runner overrides MODEL to a pre-staged /data/models path, so skip hf download. Only fetch when MODEL looks like a HF repo ID."
How it manifests
When the runner sets MODEL=/data/models/nvidia/MiniMax-M2.5-NVFP4 (an absolute path), execution reaches line 21 and invokes hf download /data/models/nvidia/MiniMax-M2.5-NVFP4. The Hugging Face CLI treats its positional argument as a repo id of the form namespace/name, so it will either reject the path (since /data/models/nvidia/MiniMax-M2.5-NVFP4 is not a valid repo id) and abort the script with a non-zero exit, or — if it interprets some prefix as a valid id — perform an unnecessary network download and shadow the pre-staged copy. Either way it breaks the pre-staged-model contract that every sibling script honors.
Step-by-step proof
- CI runner pre-stages the model and exports
MODEL=/data/models/MiniMax-M2.5-NVFP4,EVAL_ONLY=true, etc. benchmarks/single_node/minimaxm2.5_fp4_h100.shis invoked.check_env_varspasses becauseMODELis set.- Line 21 executes
hf download /data/models/MiniMax-M2.5-NVFP4. - The HF CLI parses the argument as a repo id, fails the
repo_idregex (contains leading/and slashes beyond the one allowed separator), and exits non-zero — the script aborts beforevllm serveis reached. - Contrast with the sibling
minimaxm2.5_fp8_h100.sh:20, where the guard[[ "$MODEL" != /* ]]evaluates false, thehf downloadis skipped, and execution proceeds tovllm serveusing the pre-staged path.
Why existing code does not prevent it
There is no upstream early-exit for absolute-path MODEL values in this script. check_env_vars only verifies presence, not shape. The EVAL_ONLY branch on line 28 runs after line 21, so it cannot rescue an already-failed download. vllm serve $MODEL would happily accept a local path, but the script never reaches it.
Impact
Any CI lane that pre-stages nvidia/MiniMax-M2.5-NVFP4 locally (the same pattern used for B300, MI355x, and other large-model runners) will fail at line 21 the first time this benchmark runs. Lanes that always use HF repo ids will work, so the bug is silent until a pre-staged path is wired in.
Fix
Replace line 21:
if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fiThis matches the convention in every other minimaxm2.5 sibling script and the explicit guidance in dsv4_fp4_b300_sglang.sh.
| - "Model: nvidia/MiniMax-M2.5-NVFP4" | ||
| - "TP=8, EP=1, --tool-call-parser minimax_m2, --reasoning-parser minimax_m2_append_think" | ||
| - "Configs: 1k1k conc 4-512, 8k1k conc 4-512" | ||
| pr-link: https://github.com/NVIDIA/InferenceMAX/pull/47 |
There was a problem hiding this comment.
🔴 The pr-link for the new minimaxm2.5-fp4-h100-vllm entry points to https://github.com/NVIDIA/InferenceMAX/pull/47, but all 366 other entries in perf-changelog.yaml use https://github.com/SemiAnalysisAI/InferenceX/pull/<N>. This looks like a copy-paste leftover from an upstream/mirror PR — please change it to https://github.com/SemiAnalysisAI/InferenceX/pull/1517 (this PR) so the changelog stays internally consistent and the link resolves.
Extended reasoning...
What the bug is
In perf-changelog.yaml at line 2961, the new entry added by this PR sets:
pr-link: https://github.com/NVIDIA/InferenceMAX/pull/47This is the only entry in the file that does not point to a SemiAnalysisAI/InferenceX PR.
Step-by-step proof
- The file has 367
pr-link:entries in total (366 pre-existing + 1 new in this PR). - A grep of
perf-changelog.yamlshows 366 entries match the patternpr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<N>. - The one outlier is the new entry on line 2961, which points to
NVIDIA/InferenceMAX/pull/47. - This PR is
SemiAnalysisAI/InferenceX#1517(per the PR metadata), so both the repo and the PR number are wrong. The correct value ishttps://github.com/SemiAnalysisAI/InferenceX/pull/1517.
Why this matters
The pr-link field is the changelog's traceability hook — given a config-key, a reader should be able to click straight through to the PR that introduced or changed it. The current value:
- Breaks the convention used by every other entry in the file (cross-repo link instead of in-repo).
- Will 404 for anyone without access to the (apparently internal/mirror)
NVIDIA/InferenceMAXrepository, severing the changelog → PR linkage for this entry. - The
pull/47number is also unrelated to this PR, so even readers with access to the mirror would land in the wrong place.
Root cause
This appears to be a copy-paste leftover from an upstream/mirror PR (NVIDIA/InferenceMAX#47) that this contribution was originally drafted against. When the PR was opened against SemiAnalysisAI/InferenceX, the pr-link value was not updated.
How to fix
Change line 2961 from:
pr-link: https://github.com/NVIDIA/InferenceMAX/pull/47to:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1517This is a one-line change to changelog metadata; no runtime or benchmark behavior is affected.
functionstackx
left a comment
There was a problem hiding this comment.
idk how this works, is this fp4 weights?
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26126396884 |
|
@functionstackx Yes, its FP4. |
|
Recipe: vllm-project/recipes#484 |
No description provided.