Update MiniMax M2.5 FP8 H200 vLLM agg recipes by anish-shanbhag · Pull Request #1354 · SemiAnalysisAI/InferenceX

anish-shanbhag · 2026-05-12T22:49:38Z

(Identical to #1298 except the source branch is no longer from a fork so that CI can run)

Set vLLM serving knobs in benchmarks/single_node/minimaxm2.5_fp8_h200.sh: generated benchmark max-model-len, previous eval max-model-len handling, fp8 KV cache, FlashInfer attention/autotune, Triton MoE, and MiniMax QK norm fusion.

Note

Low Risk
Benchmark and CI config only—no production auth or app logic; changes affect how perf jobs run and what hardware knobs are swept.

Overview
Tightens MiniMax-M2.5 FP8 on H200 (vLLM) benchmark recipes by changing how the server is launched and how CI sweeps concurrency/tensor parallelism.

benchmarks/single_node/minimaxm2.5_fp8_h200.sh now sets runtime env (e.g. disabling Deep GEMM / blockscale FlashInfer FP8 GEMM), defaults for max-num-seqs, max-num-batched-tokens, and a compilation-config with MiniMax QK-norm fusion, and passes new vLLM flags: fp8 KV cache, Triton MoE, FlashInfer attention with autotune, plus safer quoting and array-style expert-parallel args.

In .github/configs/nvidia-master.yaml, fixed-seq-len search for minimaxm2.5-fp8-h200-vllm shifts from tp 8 / conc 4–128 to tp 4 / conc 1–256 (two ISL/OSL points). perf-changelog.yaml documents the recipe update (image v0.20.1-ubuntu2404 and serving knobs).

Reviewed by Cursor Bugbot for commit 111df5b. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-05-12T22:49:51Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-13T05:17:34Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25772346949
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25772346949

github-actions · 2026-05-13T18:32:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25814571481
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25814571481

github-actions · 2026-05-13T19:13:24Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25814571481
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25814571481

kedarpotdar-nv · 2026-05-17T23:47:26Z

@claude please review

Klaud-Cold · 2026-05-17T23:47:50Z

Claude finished @kedarpotdar-nv's task in 2m 38s

Review: Update MiniMax M2.5 FP8 H200 vLLM agg recipes

Gather context and read changed files
Review .github/configs/nvidia-master.yaml changes
Review benchmarks/single_node/minimaxm2.5_fp8_h200.sh changes
Review perf-changelog.yaml changes
Post review feedback

Overall: Looks good. The PR modernizes the MiniMax M2.5 FP8 H200 vLLM benchmark script with proper quoting, FP8 KV cache, FlashInfer attention with autotune, Triton MoE backend, and MiniMax QK norm fusion via compilation config. CI is green.

`.github/configs/nvidia-master.yaml`

TP 8 → TP 4: Makes sense — TP=4 on 8×H200 leaves more memory per shard and allows wider concurrency sweeps. The H100 config (minimaxm2.5-fp8-h100-vllm) also uses TP=4 with EP=4.
Concurrency range widened from 4–128 to 1–256: Provides broader coverage, consistent with other configs that use conc-start: 1.
No ep field: Without ep in the search-space, EP_SIZE defaults to 1, so the script's if [ "$EP_SIZE" -gt 1 ] guard correctly skips --enable-expert-parallel. However, the H100 config uses ep: 4 with TP=4 — was EP intentionally omitted here, or should it be { tp: 4, ep: 4, conc-start: 1, conc-end: 256 } to match? With TP=4 on an 8-GPU node, you have 4 unused GPUs unless EP fills them. Worth confirming this is intentional.
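
The default-and-guard behaviour described in the last point can be sketched in a few lines of shell. This is an illustration under stated assumptions rather than the script's actual code: the function name `ep_args` is hypothetical, while `EP_SIZE` falling back to 1, the `-gt 1` test, and `--enable-expert-parallel` are taken from the review.

```shell
# Sketch: ep_args is a hypothetical helper, not part of the benchmark script.
# It prints the expert-parallel flag only when the EP size is greater than 1.
ep_args() {
  ep_size="${1:-1}"   # mirrors EP_SIZE falling back to 1 when no ep is set
  if [ "$ep_size" -gt 1 ]; then
    printf '%s\n' "--enable-expert-parallel"
  fi
}

ep_args      # no ep in the search space: prints nothing
ep_args 4    # ep: 4: prints --enable-expert-parallel
```

With a `-ge 1` test instead, the first call would also print the flag, which is the behaviour the guard is meant to avoid.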

`benchmarks/single_node/minimaxm2.5_fp8_h200.sh`

Proper quoting: All variable expansions are now quoted ("$MODEL", "$PORT", etc.) — good shell hygiene.
EP guard fix: Changed from -ge 1 (always true since EP_SIZE defaults to 1) to -gt 1 and switched from string concatenation (EP=" --enable-expert-parallel") to array syntax (EP=(--enable-expert-parallel) / "${EP[@]}"). This matches the canonical pattern from the repo guidelines.
New vLLM flags:
- --kv-cache-dtype fp8 — FP8 KV cache for memory savings
- --moe-backend triton — Triton MoE kernels
- --attention-backend FLASHINFER + --enable-flashinfer-autotune — FlashInfer attention with autotuning
- --compilation-config with fuse_minimax_qk_norm — model-specific QK norm fusion
- --max-num-seqs 512 and --max-num-batched-tokens 32768 — configurable via env with reasonable defaults
New env vars: PYTHONNOUSERSITE=1, SAFETENSORS_FAST_GPU=1, VLLM_USE_DEEP_GEMM=0, VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0, VLLM_FLOAT32_MATMUL_PRECISION=high — these align with typical vLLM tuning knobs for FP8 MoE workloads.
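
To make the list above concrete, here is a minimal sketch of how such a launch command might be assembled with array-style arguments. It is not the script from the PR. The helper name `build_serve_args`, the model ID, the port, and the JSON shape passed to `--compilation-config` are assumptions; the flags, environment variables, and the 512 / 32768 defaults are the ones named in the review. The sketch only prints the argument list and never starts a server.

```shell
#!/usr/bin/env bash
# Sketch only: builds and prints a vLLM argument list; it never starts a server.

# Environment knobs listed in the review.
export PYTHONNOUSERSITE=1
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0
export VLLM_FLOAT32_MATMUL_PRECISION=high

# ASSUMPTION: the JSON shape below is a guess; only the key name
# fuse_minimax_qk_norm appears in the review.
DEFAULT_COMPILATION_CONFIG='{"pass_config": {"fuse_minimax_qk_norm": true}}'
COMPILATION_CONFIG="${COMPILATION_CONFIG:-$DEFAULT_COMPILATION_CONFIG}"

# build_serve_args MODEL PORT TP [EP_SIZE]  (hypothetical helper)
build_serve_args() {
  local model="$1" port="$2" tp="$3" ep_size="${4:-1}"

  # Array-style optional flag: stays empty unless EP_SIZE is greater than 1.
  local -a ep=()
  if [ "$ep_size" -gt 1 ]; then
    ep=(--enable-expert-parallel)
  fi

  local -a args=(
    serve "$model"
    --port "$port"
    --tensor-parallel-size "$tp"
    --kv-cache-dtype fp8
    --moe-backend triton
    --attention-backend FLASHINFER
    --enable-flashinfer-autotune
    --max-num-seqs "${MAX_NUM_SEQS:-512}"
    --max-num-batched-tokens "${MAX_NUM_BATCHED_TOKENS:-32768}"
    --compilation-config "$COMPILATION_CONFIG"
    "${ep[@]}"
  )

  # One argument per line, so quoting problems are easy to spot.
  printf '%s\n' "${args[@]}"
}

# Placeholder model ID and port, chosen only for illustration.
build_serve_args "example-org/example-model" 8000 4
```

In a real script the array would be passed straight to the command (for example `vllm "${args[@]}"`) instead of being printed. Keeping optional flags in an array means an empty `ep` contributes no argument at all, which string concatenation cannot guarantee once values contain spaces.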

`perf-changelog.yaml`

Entry looks well-formed with the correct config key and PR link.
Minor note: description mentions updating to v0.20.1-ubuntu2404 but the actual image in the config is v0.20.2. This was likely true in an earlier iteration (PR #1298, since replaced by #1354) but is now stale.

Questions / Nits

Missing ep in H200 config (see above): The H100 MiniMax FP8 config uses { tp: 4, ep: 4 }. If TP=4 without EP on 8×H200, that means only 4 of 8 GPUs are utilized. Is this the intended configuration, or should EP=4 be added?
Stale perf-changelog description: The changelog mentions v0.20.1-ubuntu2404 but the config uses v0.20.2.
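
A mismatch like the stale changelog tag could in principle be caught mechanically. The following is a rough sketch, assuming the version appears as `vX.Y.Z` in both strings. The helper names `extract_version` and `changelog_is_current`, the description wording, and the image name are all hypothetical; only the two version tags come from the review.

```shell
# Sketch: hypothetical helpers for comparing the version named in a changelog
# description with the version in a container image tag.

# Print the first vX.Y.Z token found in the given string.
extract_version() {
  printf '%s\n' "$1" | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeed only when both strings carry the same vX.Y.Z token.
changelog_is_current() {
  [ "$(extract_version "$1")" = "$(extract_version "$2")" ]
}

# Illustrative inputs; the wording and the image name are made up.
description="Update image to v0.20.1-ubuntu2404 and tune serving knobs"
image="example/image:v0.20.2"

if changelog_is_current "$description" "$image"; then
  echo "changelog matches config"
else
  echo "changelog is stale"
fi
```

Suffixes such as `-ubuntu2404` are ignored on purpose, so the comparison only flags a difference in the version number itself.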

CI run (green)

functionstackx

update recipes docs plz

anish-shanbhag · 2026-05-18T21:03:45Z

Opened a recipes PR here: vllm-project/recipes#474

functionstackx

lgtm

anish-shanbhag · 2026-05-20T02:26:43Z

@functionstackx vllm-project/recipes#474 was merged, could you please approve this PR? thanks

github-actions · 2026-05-20T02:54:43Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26137455891
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26137455891

functionstackx

lgtm

github-actions · 2026-05-26T19:18:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26468137754
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26468137754

anish-shanbhag requested a review from a team May 12, 2026 22:49

anish-shanbhag requested review from jgangani and kedarpotdar-nv as code owners May 12, 2026 22:49

github-project-automation Bot added this to InferenceMAX Board May 12, 2026

anish-shanbhag mentioned this pull request May 12, 2026

(Replaced by #1354) Update MiniMax M2.5 FP8 H200 vLLM agg recipes #1298

Closed

kedarpotdar-nv added the NVIDIA label May 13, 2026

anish-shanbhag added the full-sweep-enabled label May 13, 2026

anish-shanbhag force-pushed the ashanbhag/minimax-h200 branch from 3dea91d to c3d1ef6 Compare May 13, 2026 17:11

kedarpotdar-nv approved these changes May 17, 2026

functionstackx requested changes May 17, 2026

anish-shanbhag mentioned this pull request May 18, 2026

Update MiniMax M2.5 H200 recipe vllm-project/recipes#474

Merged

functionstackx requested changes May 18, 2026

Tune MiniMax M2.5 FP8 H200 vLLM agg

8e3037d

anish-shanbhag force-pushed the ashanbhag/minimax-h200 branch from c3d1ef6 to 8e3037d Compare May 20, 2026 02:25

functionstackx self-requested a review May 26, 2026 18:25

functionstackx approved these changes May 26, 2026

Merge branch 'main' into ashanbhag/minimax-h200

111df5b

Ankur-singh merged commit 8014ee6 into main May 26, 2026
6 checks passed

Ankur-singh deleted the ashanbhag/minimax-h200 branch May 26, 2026 18:45

github-project-automation Bot moved this to Done in InferenceMAX Board May 26, 2026

5 participants
