Skip to content

[NV] add minimax fp4 h100 vllm#1517

Closed
hshrivastava-droid wants to merge 2 commits into
mainfrom
nv/minimaxm2.5-fp4-h100-vllm-add
Closed

[NV] add minimax fp4 h100 vllm#1517
hshrivastava-droid wants to merge 2 commits into
mainfrom
nv/minimaxm2.5-fp4-h100-vllm-add

Conversation

@hshrivastava-droid
Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

2 similar comments
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +41 to +50
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--tensor-parallel-size=$TP \
$EP \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-num-seqs $CONC \
> $SERVER_LOG 2>&1 &
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 This new vLLM script omits four flags/handling that every sibling minimaxm2.5 vLLM script sets: --max-model-len $MAX_MODEL_LEN (with MAX_MODEL_LEN in check_env_vars), --gpu-memory-utilization 0.90, --no-enable-prefix-caching, and the MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" assignment inside the EVAL_ONLY block. Without these, the server is likely to OOM at conc-end: 512 on tp=8 H100 (MiniMax-M2.5 has a 192k+ default context), default-on prefix caching will inflate benchmark throughput once CONC*10 prompts share prefixes (making numbers incomparable to the other minimaxm2.5 configs), and EVAL_ONLY mode is silently broken because setup_eval_context's output is never consumed.

Extended reasoning...

What the bug is

benchmarks/single_node/minimaxm2.5_fp4_h100.sh was written as a new minimaxm2.5 vLLM benchmark script, but it diverges from every sibling minimaxm2.5 vLLM script in four ways that all affect correctness/comparability of the benchmark, not just style.

The siblings — minimaxm2.5_fp8_h100.sh, minimaxm2.5_fp4_b200.sh, and minimaxm2.5_fp4_b300.sh — uniformly:

  1. Include MAX_MODEL_LEN in check_env_vars (fp8_h100 line 12, fp4_b200 line 13).
  2. Inside the EVAL_ONLY block, assign MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" immediately after setup_eval_context (fp8_h100 line 31, fp4_b200 line 40).
  3. Pass --gpu-memory-utilization 0.90 to vllm serve (fp8_h100 line 47, fp4_b200 line 48).
  4. Pass --max-model-len $MAX_MODEL_LEN (fp8_h100 line 48, fp4_b200 line 49).
  5. Pass --no-enable-prefix-caching (fp8_h100 line 50, fp4_b200 line 53).

This new fp4_h100 script omits all of (1)–(5).

Step-by-step proof of impact

Benchmark phase (e.g. isl: 1024, osl: 1024, tp: 8, ep: 1, conc-end: 512 from nvidia-master.yaml):

  1. CI invokes the script with CONC=512, ISL=1024, OSL=1024. MAX_MODEL_LEN is not in check_env_vars, so the script does not require it and proceeds.
  2. vllm serve is launched without --max-model-len. vLLM falls back to the model config's max_position_embeddings. MiniMax-M2.5's published config is well over 192k.
  3. KV cache memory scales with max_model_len * max_num_seqs. With --max-num-seqs 512 and a 192k+ model length, KV cache allocation on 8×H100 is likely to OOM or aggressively pre-empt. Sibling scripts dodge this by pinning --max-model-len to the env-supplied value (typically ~ISL+OSL+margin).
  4. Even if the server boots, run_benchmark_serving issues CONC*10 = 5120 random prompts. With prefix caching defaulted to on in recent vLLM, the random prompts share the chat-template prefix and common tokens, so cache hits accumulate over the run, inflating TPS in a way that the sibling configs (which set --no-enable-prefix-caching) explicitly avoid for benchmark consistency. The recorded number is not directly comparable to the fp8_h100 or fp4_b200/b300 numbers in perf-changelog.yaml.

EVAL_ONLY phase:

  1. Operator sets EVAL_ONLY=true. Lines 28–30 call setup_eval_context, which (per sibling pattern) populates EVAL_MAX_MODEL_LEN.
  2. The sibling pattern is MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" — this script omits that line.
  3. Because MAX_MODEL_LEN is also never passed to vllm serve here, the missing assignment doesn't change behavior in this file, but it does mean setup_eval_context is a no-op — eval mode runs against whatever default context vLLM picks. Combined with the missing --max-model-len flag, eval-mode runs use the full model context and have the same OOM risk as benchmark mode, defeating the purpose of having a separate eval context.

How to fix

Bring the script in line with minimaxm2.5_fp8_h100.sh:

 check_env_vars \
     MODEL \
     TP \
     EP_SIZE \
     CONC \
     ISL \
     OSL \
+    MAX_MODEL_LEN \
     RANDOM_RANGE_RATIO \
     RESULT_FILENAME
 if [ "${EVAL_ONLY}" = "true" ]; then
     setup_eval_context
+    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
 fi
 vllm serve $MODEL --host 0.0.0.0 --port $PORT \
 --tensor-parallel-size=$TP \
 $EP \
+--gpu-memory-utilization 0.90 \
+--max-model-len $MAX_MODEL_LEN \
+--no-enable-prefix-caching \
 --trust-remote-code \
 --enable-auto-tool-choice \
 --tool-call-parser minimax_m2 \
 --reasoning-parser minimax_m2_append_think \
 --max-num-seqs $CONC \

Why this isn't pre-existing

This script is added in this PR (the diff shows it created from /dev/null), so the divergence from the sibling minimaxm2.5 scripts is introduced here.


hf download "$MODEL"

nvidia-smi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Line 21 unconditionally invokes hf download "$MODEL", but every sibling minimaxm2.5 script (fp8_h100.sh:20, fp4_b200.sh:23, fp4_b300.sh:27, fp4_mi355x.sh:20, fp8_b300.sh:26, fp8_mi300x.sh:19, fp8_mi325x.sh:20, fp8_mi355x.sh:20) wraps it in if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi. When CI pre-stages the model and passes an absolute local path as MODEL (a documented pattern — see the explicit comment in dsv4_fp4_b300_sglang.sh:19-23), hf download will treat the path as a HuggingFace repo id and either fail outright or trigger a redundant network attempt. Please add the same guard for consistency.

Extended reasoning...

Bug

benchmarks/single_node/minimaxm2.5_fp4_h100.sh line 21 calls hf download "$MODEL" unconditionally. Across benchmarks/single_node/, the dominant convention — and specifically the one used by every other minimaxm2.5_* script — is to guard this call with a path check:

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

The guard exists because CI runners (and EVAL_ONLY flows) routinely pre-stage models to local absolute paths and set MODEL to that path rather than to a HuggingFace repo id. The rationale is spelled out explicitly in benchmarks/single_node/dsv4_fp4_b300_sglang.sh:19-23: "The B300 runner overrides MODEL to a pre-staged /data/models path, so skip hf download. Only fetch when MODEL looks like a HF repo ID."

How it manifests

When the runner sets MODEL=/data/models/nvidia/MiniMax-M2.5-NVFP4 (an absolute path), execution reaches line 21 and invokes hf download /data/models/nvidia/MiniMax-M2.5-NVFP4. The Hugging Face CLI treats its positional argument as a repo id of the form namespace/name, so it will either reject the path (since /data/models/nvidia/MiniMax-M2.5-NVFP4 is not a valid repo id) and abort the script with a non-zero exit, or — if it interprets some prefix as a valid id — perform an unnecessary network download and shadow the pre-staged copy. Either way it breaks the pre-staged-model contract that every sibling script honors.

Step-by-step proof

  1. CI runner pre-stages the model and exports MODEL=/data/models/MiniMax-M2.5-NVFP4, EVAL_ONLY=true, etc.
  2. benchmarks/single_node/minimaxm2.5_fp4_h100.sh is invoked. check_env_vars passes because MODEL is set.
  3. Line 21 executes hf download /data/models/MiniMax-M2.5-NVFP4.
  4. The HF CLI parses the argument as a repo id, fails the repo_id regex (contains leading / and slashes beyond the one allowed separator), and exits non-zero — the script aborts before vllm serve is reached.
  5. Contrast with the sibling minimaxm2.5_fp8_h100.sh:20, where the guard [[ "$MODEL" != /* ]] evaluates false, the hf download is skipped, and execution proceeds to vllm serve using the pre-staged path.

Why existing code does not prevent it

There is no upstream early-exit for absolute-path MODEL values in this script. check_env_vars only verifies presence, not shape. The EVAL_ONLY branch on line 28 runs after line 21, so it cannot rescue an already-failed download. vllm serve $MODEL would happily accept a local path, but the script never reaches it.

Impact

Any CI lane that pre-stages nvidia/MiniMax-M2.5-NVFP4 locally (the same pattern used for B300, MI355x, and other large-model runners) will fail at line 21 the first time this benchmark runs. Lanes that always use HF repo ids will work, so the bug is silent until a pre-staged path is wired in.

Fix

Replace line 21:

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

This matches the convention in every other minimaxm2.5 sibling script and the explicit guidance in dsv4_fp4_b300_sglang.sh.

Comment thread perf-changelog.yaml Outdated
- "Model: nvidia/MiniMax-M2.5-NVFP4"
- "TP=8, EP=1, --tool-call-parser minimax_m2, --reasoning-parser minimax_m2_append_think"
- "Configs: 1k1k conc 4-512, 8k1k conc 4-512"
pr-link: https://github.com/NVIDIA/InferenceMAX/pull/47
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The pr-link for the new minimaxm2.5-fp4-h100-vllm entry points to https://github.com/NVIDIA/InferenceMAX/pull/47, but all 366 other entries in perf-changelog.yaml use https://github.com/SemiAnalysisAI/InferenceX/pull/<N>. This looks like a copy-paste leftover from an upstream/mirror PR — please change it to https://github.com/SemiAnalysisAI/InferenceX/pull/1517 (this PR) so the changelog stays internally consistent and the link resolves.

Extended reasoning...

What the bug is

In perf-changelog.yaml at line 2961, the new entry added by this PR sets:

pr-link: https://github.com/NVIDIA/InferenceMAX/pull/47

This is the only entry in the file that does not point to a SemiAnalysisAI/InferenceX PR.

Step-by-step proof

  1. The file has 367 pr-link: entries in total (366 pre-existing + 1 new in this PR).
  2. A grep of perf-changelog.yaml shows 366 entries match the pattern pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<N>.
  3. The one outlier is the new entry on line 2961, which points to NVIDIA/InferenceMAX/pull/47.
  4. This PR is SemiAnalysisAI/InferenceX#1517 (per the PR metadata), so both the repo and the PR number are wrong. The correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/1517.

Why this matters

The pr-link field is the changelog's traceability hook — given a config-key, a reader should be able to click straight through to the PR that introduced or changed it. The current value:

  • Breaks the convention used by every other entry in the file (cross-repo link instead of in-repo).
  • Will 404 for anyone without access to the (apparently internal/mirror) NVIDIA/InferenceMAX repository, severing the changelog → PR linkage for this entry.
  • The pull/47 number is also unrelated to this PR, so even readers with access to the mirror would land in the wrong place.

Root cause

This appears to be a copy-paste leftover from an upstream/mirror PR (NVIDIA/InferenceMAX#47) that this contribution was originally drafted against. When the PR was opened against SemiAnalysisAI/InferenceX, the pr-link value was not updated.

How to fix

Change line 2961 from:

  pr-link: https://github.com/NVIDIA/InferenceMAX/pull/47

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1517

This is a one-line change to changelog metadata; no runtime or benchmark behavior is affected.

Copy link
Copy Markdown
Collaborator

@functionstackx functionstackx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk how this works, is this fp4 weights?

@github-actions
Copy link
Copy Markdown
Contributor

@Ankur-singh
Copy link
Copy Markdown
Collaborator

@functionstackx Yes, its FP4.

@Ankur-singh
Copy link
Copy Markdown
Collaborator

Recipe: vllm-project/recipes#484

@Ankur-singh Ankur-singh changed the title [WIP][NV] add minimax fp4 h100 vllm [NV] add minimax fp4 h100 vllm May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

3 participants