Update MiniMax M2.5 FP8 H200 vLLM agg recipes #1354

Merged
Ankur-singh merged 2 commits into main from ashanbhag/minimax-h200
May 26, 2026

Conversation

@anish-shanbhag
Collaborator

@anish-shanbhag anish-shanbhag commented May 12, 2026

(Identical to #1298 except the source branch is no longer from a fork so that CI can run)

Set vLLM serving knobs in benchmarks/single_node/minimaxm2.5_fp8_h200.sh: generated benchmark max-model-len, previous eval max-model-len handling, fp8 KV cache, FlashInfer attention/autotune, Triton MoE, and MiniMax QK norm fusion.


Note

Low Risk
Benchmark and CI config only—no production auth or app logic; changes affect how perf jobs run and what hardware knobs are swept.

Overview
Tightens MiniMax-M2.5 FP8 on H200 (vLLM) benchmark recipes by changing how the server is launched and how CI sweeps concurrency/tensor parallelism.

benchmarks/single_node/minimaxm2.5_fp8_h200.sh now sets runtime env (e.g. disabling Deep GEMM / blockscale FlashInfer FP8 GEMM), defaults for max-num-seqs, max-num-batched-tokens, and a compilation-config with MiniMax QK-norm fusion, and passes new vLLM flags: fp8 KV cache, Triton MoE, FlashInfer attention with autotune, plus safer quoting and array-style expert-parallel args.
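
A minimal sketch of how the knobs listed above might be wired together. It is not the contents of minimaxm2.5_fp8_h200.sh. The environment values, flag names, and the 512 / 32768 defaults come from this PR's description and review; the MAX_NUM_SEQS and MAX_NUM_BATCHED_TOKENS variable names and the SERVE_ARGS array are assumptions for illustration. The --compilation-config value is left out because its exact JSON is not shown here. The sketch prints the arguments instead of launching a server.

```shell
#!/usr/bin/env bash
# Illustrative only: variable and array names below are assumed, not taken
# from the actual benchmark script.

# Runtime environment named in the PR: disable Deep GEMM and the
# blockscale FlashInfer FP8 GEMM path.
export VLLM_USE_DEEP_GEMM=0
export VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0

# Defaults that a caller can override from the environment.
MAX_NUM_SEQS="${MAX_NUM_SEQS:-512}"
MAX_NUM_BATCHED_TOKENS="${MAX_NUM_BATCHED_TOKENS:-32768}"

# Collect the serving flags in an array so every argument stays one word,
# even if a value later contains spaces.
SERVE_ARGS=(
  --kv-cache-dtype fp8
  --moe-backend triton
  --attention-backend FLASHINFER
  --enable-flashinfer-autotune
  --max-num-seqs "$MAX_NUM_SEQS"
  --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
)

# Print one argument per line; a real script would pass "${SERVE_ARGS[@]}"
# to the server command instead.
printf '%s\n' "${SERVE_ARGS[@]}"
```

Keeping the defaults in `${VAR:-default}` form lets a CI job override one value for a single run without editing the script.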

In .github/configs/nvidia-master.yaml, fixed-seq-len search for minimaxm2.5-fp8-h200-vllm shifts from tp 8 / conc 4–128 to tp 4 / conc 1–256 (two ISL/OSL points). perf-changelog.yaml documents the recipe update (image v0.20.1-ubuntu2404 and serving knobs).

Reviewed by Cursor Bugbot for commit 111df5b.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will usually fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get an approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@anish-shanbhag anish-shanbhag force-pushed the ashanbhag/minimax-h200 branch from 3dea91d to c3d1ef6 Compare May 13, 2026 17:11

@kedarpotdar-nv
Collaborator

@claude please review

@Klaud-Cold
Collaborator

Klaud-Cold commented May 17, 2026

Claude finished @kedarpotdar-nv's task in 2m 38s


Review: Update MiniMax M2.5 FP8 H200 vLLM agg recipes

  • Gather context and read changed files
  • Review .github/configs/nvidia-master.yaml changes
  • Review benchmarks/single_node/minimaxm2.5_fp8_h200.sh changes
  • Review perf-changelog.yaml changes
  • Post review feedback

Overall: Looks good. The PR modernizes the MiniMax M2.5 FP8 H200 vLLM benchmark script with proper quoting, FP8 KV cache, FlashInfer attention with autotune, Triton MoE backend, and MiniMax QK norm fusion via compilation config. CI is green.

.github/configs/nvidia-master.yaml

  • TP 8 → TP 4: Makes sense — TP=4 on 8×H200 leaves more memory per shard and allows wider concurrency sweeps. The H100 config (minimaxm2.5-fp8-h100-vllm) also uses TP=4 with EP=4.
  • Concurrency range widened from 4–128 to 1–256: Provides broader coverage, consistent with other configs that use conc-start: 1.
  • No ep field: Without ep in the search-space, EP_SIZE defaults to 1, so the script's if [ "$EP_SIZE" -gt 1 ] guard correctly skips --enable-expert-parallel. However, the H100 config uses ep: 4 with TP=4 — was EP intentionally omitted here, or should it be { tp: 4, ep: 4, conc-start: 1, conc-end: 256 } to match? With TP=4 on an 8-GPU node, you have 4 unused GPUs unless EP fills them. Worth confirming this is intentional.
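
To make the widened range concrete, here is a rough sketch that enumerates the concurrency points for the old and new search spaces. The doubling step is an assumption: this thread gives only the start and end values, not how the CI generator steps between them. The conc_points helper is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list concurrency values from start to end,
# doubling each time. The doubling step is an assumption, not something
# stated in this PR.
conc_points() {
  local c="$1" end="$2"
  while [ "$c" -le "$end" ]; do
    printf '%s\n' "$c"
    c=$((c * 2))
  done
}

# Old search space: tp 8, conc 4 to 128.
echo "old (tp 8): $(conc_points 4 128 | tr '\n' ' ')"
# New search space: tp 4, conc 1 to 256.
echo "new (tp 4): $(conc_points 1 256 | tr '\n' ' ')"
```

Under that doubling assumption the sweep grows from six points to nine per ISL/OSL pair, adding the low-concurrency end (1 and 2) and one higher point (256).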

benchmarks/single_node/minimaxm2.5_fp8_h200.sh

  • Proper quoting: All variable expansions are now quoted ("$MODEL", "$PORT", etc.) — good shell hygiene.
  • EP guard fix: Changed from -ge 1 (always true since EP_SIZE defaults to 1) to -gt 1 and switched from string concatenation (EP=" --enable-expert-parallel") to array syntax (EP=(--enable-expert-parallel) / "${EP[@]}"). This matches the canonical pattern from the repo guidelines.
  • New vLLM flags:
    • --kv-cache-dtype fp8 — FP8 KV cache for memory savings
    • --moe-backend triton — Triton MoE kernels
    • --attention-backend FLASHINFER + --enable-flashinfer-autotune — FlashInfer attention with autotuning
    • --compilation-config with fuse_minimax_qk_norm — model-specific QK norm fusion
    • --max-num-seqs 512 and --max-num-batched-tokens 32768 — configurable via env with reasonable defaults
  • New env vars: PYTHONNOUSERSITE=1, SAFETENSORS_FAST_GPU=1, VLLM_USE_DEEP_GEMM=0, VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0, VLLM_FLOAT32_MATMUL_PRECISION=high — these align with typical vLLM tuning knobs for FP8 MoE workloads.
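
The EP guard change described above can be sketched as follows. This is an illustration of the pattern, not the script's actual code: the build_ep_args helper, the CMD array, and the model placeholder are hypothetical, while the -gt 1 test, the EP array, and the --enable-expert-parallel flag are the ones named in the review.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global EP array based on the EP size.
build_ep_args() {
  local ep_size="${1:-1}"
  EP=()
  # -gt 1 skips the flag when EP_SIZE is left at its default of 1.
  # The earlier -ge 1 test was always true, so the flag was always added.
  if [ "$ep_size" -gt 1 ]; then
    EP=(--enable-expert-parallel)
  fi
}

build_ep_args "${EP_SIZE:-1}"

# "${EP[@]}" expands to zero words when the array is empty, so no stray
# empty argument reaches the command line.
CMD=(vllm serve "${MODEL:-model-placeholder}" "${EP[@]}")

# Print the command instead of running it.
printf '%s ' "${CMD[@]}"
printf '\n'
```

Compared with building a string such as EP=" --enable-expert-parallel" and expanding it unquoted, the array form keeps working once every other expansion on the command line is quoted.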

perf-changelog.yaml

Questions / Nits

  1. Missing ep in H200 config (see above): The H100 MiniMax FP8 config uses { tp: 4, ep: 4 }. If TP=4 without EP on 8×H200, that means only 4 of 8 GPUs are utilized. Is this the intended configuration, or should EP=4 be added?

  2. Stale perf-changelog description: The changelog mentions v0.20.1-ubuntu2404 but the config uses v0.20.2.


Collaborator

@functionstackx functionstackx left a comment

update recipes docs plz

@anish-shanbhag
Collaborator Author

Opened a recipes PR here: vllm-project/recipes#474

Collaborator

@functionstackx functionstackx left a comment

lgtm

@anish-shanbhag anish-shanbhag force-pushed the ashanbhag/minimax-h200 branch from c3d1ef6 to 8e3037d Compare May 20, 2026 02:25
@anish-shanbhag
Collaborator Author

@functionstackx vllm-project/recipes#474 was merged, could you please approve this PR? thanks

@functionstackx functionstackx self-requested a review May 26, 2026 18:25
Collaborator

@functionstackx functionstackx left a comment

lgtm

@Ankur-singh Ankur-singh merged commit 8014ee6 into main May 26, 2026
6 checks passed
@Ankur-singh Ankur-singh deleted the ashanbhag/minimax-h200 branch May 26, 2026 18:45