19 changes: 19 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -4497,6 +4497,25 @@ minimaxm2.5-fp8-h100-vllm:
# - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
- { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }

minimaxm2.5-fp4-h100-vllm:
  image: vllm/vllm-openai:v0.19.1-cu130
  model: nvidia/MiniMax-M2.5-NVFP4
  model-prefix: minimaxm2.5
  runner: h100
  precision: fp4
  framework: vllm
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 512 }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 512 }

# Diverged from minimaxm2.5-fp8-h100-vllm (agentic-coding sibling). Metadata is
# identical to origin/main's minimaxm2.5-fp8-h100-vllm; the split exists because this
# PR adds an agentic-coding scenarios block that differs from main
78 changes: 78 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp4_h100.sh
@@ -0,0 +1,78 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

🔴 Line 19 unconditionally invokes hf download "$MODEL", but every sibling minimaxm2.5 script (fp8_h100.sh:20, fp4_b200.sh:23, fp4_b300.sh:27, fp4_mi355x.sh:20, fp8_b300.sh:26, fp8_mi300x.sh:19, fp8_mi325x.sh:20, fp8_mi355x.sh:20) wraps it in if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi. When CI pre-stages the model and passes an absolute local path as MODEL (a documented pattern; see the explicit comment in dsv4_fp4_b300_sglang.sh:19-23), hf download will treat the path as a Hugging Face repo id and either fail outright or trigger a redundant network attempt. Please add the same guard for consistency.

Extended reasoning...

Bug

benchmarks/single_node/minimaxm2.5_fp4_h100.sh line 19 calls hf download "$MODEL" unconditionally. Across benchmarks/single_node/, the dominant convention, and specifically the one used by every other minimaxm2.5_* script, is to guard this call with a path check:

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

The guard exists because CI runners (and EVAL_ONLY flows) routinely pre-stage models to local absolute paths and set MODEL to that path rather than to a HuggingFace repo id. The rationale is spelled out explicitly in benchmarks/single_node/dsv4_fp4_b300_sglang.sh:19-23: "The B300 runner overrides MODEL to a pre-staged /data/models path, so skip hf download. Only fetch when MODEL looks like a HF repo ID."

How it manifests

When the runner sets MODEL=/data/models/MiniMax-M2.5-NVFP4 (an absolute path), execution reaches line 19 and invokes hf download /data/models/MiniMax-M2.5-NVFP4. The Hugging Face CLI treats its positional argument as a repo id of the form namespace/name, so it will either reject the path (it is not a valid repo id) and abort the script with a non-zero exit, or, if it accepts some prefix as a valid id, perform an unnecessary network download that shadows the pre-staged copy. Either way it breaks the pre-staged-model contract that every sibling script honors.

Step-by-step proof

  1. CI runner pre-stages the model and exports MODEL=/data/models/MiniMax-M2.5-NVFP4, EVAL_ONLY=true, etc.
  2. benchmarks/single_node/minimaxm2.5_fp4_h100.sh is invoked. check_env_vars passes because MODEL is set.
  3. Line 19 executes hf download /data/models/MiniMax-M2.5-NVFP4.
  4. The HF CLI parses the argument as a repo id, fails the repo_id regex (the path has a leading / and more slashes than the single allowed separator), and exits non-zero, so the script aborts before vllm serve is reached.
  5. Contrast with the sibling minimaxm2.5_fp8_h100.sh:20, where the guard [[ "$MODEL" != /* ]] evaluates false, the hf download is skipped, and execution proceeds to vllm serve using the pre-staged path.
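
The contrast between steps 3 and 5 can be sketched in isolation. This is a minimal illustration rather than code from the repository: needs_hf_download is a hypothetical helper name, and the two MODEL values are the repo id and the pre-staged path used as examples in this comment.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of benchmark_lib.sh): succeeds only when the
# argument does not start with "/", i.e. when it looks like a HF repo id.
needs_hf_download() {
    [[ "$1" != /* ]]
}

for model in "nvidia/MiniMax-M2.5-NVFP4" "/data/models/MiniMax-M2.5-NVFP4"; do
    if needs_hf_download "$model"; then
        echo "download: $model"
    else
        echo "skip (pre-staged path): $model"
    fi
done
```

The check is purely lexical: it never touches the filesystem, so a relative local path would still be treated as a repo id.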

Why existing code does not prevent it

There is no upstream early-exit for absolute-path MODEL values in this script. check_env_vars only verifies presence, not shape. The EVAL_ONLY branch on line 28 runs after line 19, so it cannot rescue an already-failed download. vllm serve $MODEL would happily accept a local path, but the script never reaches it.
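
To make the "presence, not shape" point concrete, here is a rough sketch of a presence-only check. check_present is a hypothetical stand-in written for this comment; the real check_env_vars lives in benchmark_lib.sh and may be implemented differently.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for check_env_vars: fails on the first variable that
# is unset or empty, and says nothing about what the value looks like.
check_present() {
    local name
    for name in "$@"; do
        if [[ -z "${!name}" ]]; then
            echo "missing: $name"
            return 1
        fi
    done
    echo "all present"
}

MODEL=/data/models/MiniMax-M2.5-NVFP4   # absolute path, not a repo id
TP=8
check_present MODEL TP
```

An absolute path passes this kind of check just as a repo id would, which is why the guard has to sit at the hf download call itself.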

Impact

Any CI lane that pre-stages nvidia/MiniMax-M2.5-NVFP4 locally (the same pattern used for B300, MI355x, and other large-model runners) will fail at line 19 the first time this benchmark runs. Lanes that always use HF repo ids will work, so the bug stays hidden until a pre-staged path is wired in.

Fix

Replace line 19 with:

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

This matches the convention in every other minimaxm2.5 sibling script and the explicit guidance in dsv4_fp4_b300_sglang.sh.


export PYTHONNOUSERSITE=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
fi

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--tensor-parallel-size=$TP \
$EP \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-num-seqs $CONC \
> $SERVER_LOG 2>&1 &
Comment on lines +41 to +50

🔴 This new vLLM script omits three flags and the related MAX_MODEL_LEN handling that every sibling minimaxm2.5 vLLM script sets: --max-model-len $MAX_MODEL_LEN (with MAX_MODEL_LEN in check_env_vars), --gpu-memory-utilization 0.90, --no-enable-prefix-caching, and the MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" assignment inside the EVAL_ONLY block. Without these, the server is likely to OOM at conc-end: 512 on tp=8 H100 (MiniMax-M2.5 has a 192k+ default context), default-on prefix caching will inflate benchmark throughput once CONC*10 prompts share prefixes (making numbers incomparable to the other minimaxm2.5 configs), and EVAL_ONLY mode is silently broken because setup_eval_context's output is never consumed.

Extended reasoning...

What the bug is

benchmarks/single_node/minimaxm2.5_fp4_h100.sh was written as a new minimaxm2.5 vLLM benchmark script, but it diverges from every sibling minimaxm2.5 vLLM script in the five ways listed below, all of which affect the correctness or comparability of the benchmark, not just style.

The siblings (minimaxm2.5_fp8_h100.sh, minimaxm2.5_fp4_b200.sh, and minimaxm2.5_fp4_b300.sh) uniformly:

  1. Include MAX_MODEL_LEN in check_env_vars (fp8_h100 line 12, fp4_b200 line 13).
  2. Inside the EVAL_ONLY block, assign MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" immediately after setup_eval_context (fp8_h100 line 31, fp4_b200 line 40).
  3. Pass --gpu-memory-utilization 0.90 to vllm serve (fp8_h100 line 47, fp4_b200 line 48).
  4. Pass --max-model-len $MAX_MODEL_LEN (fp8_h100 line 48, fp4_b200 line 49).
  5. Pass --no-enable-prefix-caching (fp8_h100 line 50, fp4_b200 line 53).

This new fp4_h100 script omits all of (1)–(5).

Step-by-step proof of impact

Benchmark phase (e.g. isl: 1024, osl: 1024, tp: 8, ep: 1, conc-end: 512 from nvidia-master.yaml):

  1. CI invokes the script with CONC=512, ISL=1024, OSL=1024. MAX_MODEL_LEN is not in check_env_vars, so the script does not require it and proceeds.
  2. vllm serve is launched without --max-model-len. vLLM falls back to the model config's max_position_embeddings. MiniMax-M2.5's published config lists a value well over 192k.
  3. Worst-case KV cache demand scales with max_model_len * max_num_seqs. With --max-num-seqs 512 and a 192k+ model length, the server on 8×H100 is likely to OOM or pre-empt aggressively. Sibling scripts avoid this by pinning --max-model-len to the env-supplied value (typically ~ISL+OSL+margin).
  4. Even if the server boots, run_benchmark_serving issues CONC*10 = 5120 random prompts. With prefix caching defaulted to on in recent vLLM, the random prompts share the chat-template prefix and common tokens, so cache hits accumulate over the run, inflating TPS in a way that the sibling configs (which set --no-enable-prefix-caching) explicitly avoid for benchmark consistency. The recorded number is not directly comparable to the fp8_h100 or fp4_b200/b300 numbers in perf-changelog.yaml.
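
The scale of the gap in step 3 can be estimated with shell arithmetic. This is a back-of-the-envelope sketch only: 192 * 1024 is an assumed lower bound for the default context (the config value is stated above only as "well over 192k"), the pinned length ignores any margin, and the product is a worst-case token count, not a measured memory figure.

```bash
#!/usr/bin/env bash
CONC=512
ISL=1024
OSL=1024
DEFAULT_LEN=$((192 * 1024))   # assumed lower bound for the default context
PINNED_LEN=$((ISL + OSL))     # assumed pin of ISL+OSL, ignoring any margin

echo "worst-case tokens, default context: $((DEFAULT_LEN * CONC))"
echo "worst-case tokens, pinned context:  $((PINNED_LEN * CONC))"
echo "ratio: $((DEFAULT_LEN / PINNED_LEN))x"
echo "prompts issued (CONC * 10): $((CONC * 10))"
```

Under these assumptions the unpinned server has to plan for roughly two orders of magnitude more tokens than the 1k1k benchmark can ever generate.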

EVAL_ONLY phase:

  1. Operator sets EVAL_ONLY=true. Lines 28–30 call setup_eval_context, which (per sibling pattern) populates EVAL_MAX_MODEL_LEN.
  2. The sibling pattern is MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" — this script omits that line.
  3. Because MAX_MODEL_LEN is also never passed to vllm serve here, the missing assignment doesn't change behavior in this file, but it does mean setup_eval_context is a no-op — eval mode runs against whatever default context vLLM picks. Combined with the missing --max-model-len flag, eval-mode runs use the full model context and have the same OOM risk as benchmark mode, defeating the purpose of having a separate eval context.
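
The eval-mode gap can be reproduced with stubs. Everything in this sketch is hypothetical: the real setup_eval_context is defined in benchmark_lib.sh, the value 16384 is illustrative, and build_len_flag exists only to show whether a context limit would reach the vllm serve command line.

```bash
#!/usr/bin/env bash
# Hypothetical stub: stands in for the real helper, which (per the sibling
# pattern) is expected to populate EVAL_MAX_MODEL_LEN.
setup_eval_context() {
    EVAL_MAX_MODEL_LEN=16384   # illustrative value only
}

# Hypothetical helper: emits the flag only when MAX_MODEL_LEN is non-empty.
build_len_flag() {
    if [[ -n "${MAX_MODEL_LEN:-}" ]]; then
        echo "--max-model-len $MAX_MODEL_LEN"
    fi
}

unset MAX_MODEL_LEN
setup_eval_context
echo "without assignment: '$(build_len_flag)'"

MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"   # the line the sibling scripts include
echo "with assignment:    '$(build_len_flag)'"
```

Without the assignment, the value computed by the eval setup never leaves EVAL_MAX_MODEL_LEN, so the server starts with whatever context length it picks by default.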

How to fix

Bring the script in line with minimaxm2.5_fp8_h100.sh:

 check_env_vars \
     MODEL \
     TP \
     EP_SIZE \
     CONC \
     ISL \
     OSL \
+    MAX_MODEL_LEN \
     RANDOM_RANGE_RATIO \
     RESULT_FILENAME
 if [ "${EVAL_ONLY}" = "true" ]; then
     setup_eval_context
+    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
 fi
 vllm serve $MODEL --host 0.0.0.0 --port $PORT \
 --tensor-parallel-size=$TP \
 $EP \
+--gpu-memory-utilization 0.90 \
+--max-model-len $MAX_MODEL_LEN \
+--no-enable-prefix-caching \
 --trust-remote-code \
 --enable-auto-tool-choice \
 --tool-call-parser minimax_m2 \
 --reasoning-parser minimax_m2_append_think \
 --max-num-seqs $CONC \

Why this isn't pre-existing

This script is added in this PR (the diff shows it created from /dev/null), so the divergence from the sibling minimaxm2.5 scripts is introduced here.


SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -2949,3 +2949,13 @@
    - "Following recipe from https://github.com/vllm-project/recipes/pull/433"
    - "Add DEP8 dp-attn=true validation probes at conc=64 for 1k1k and 8k1k"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1374

- config-keys:
    - minimaxm2.5-fp4-h100-vllm
  description:
    - "Add MiniMax-M2.5-NVFP4 H100 vLLM benchmark"
    - "Image: vllm/vllm-openai:v0.19.1-cu130"
    - "Model: nvidia/MiniMax-M2.5-NVFP4"
    - "TP=8, EP=1, --tool-call-parser minimax_m2, --reasoning-parser minimax_m2_append_think"
    - "Configs: 1k1k conc 4-512, 8k1k conc 4-512"
  pr-link: https://github.com/NVIDIA/InferenceMAX/pull/1517