Add B200 workload variants for all H200-only models by khluu · Pull Request #12 · vllm-project/perf-eval

khluu · 2026-05-28T22:55:29Z

Summary

- Adds B200 variants for 7 H200-only workloads: DeepSeek-V3.2, DeepSeek-V4-Pro, GLM-5.1, Kimi-K2.5, MiniMax-M2.5, Nemotron-3-Super, Qwen3.5
- B200-specific optimizations: --kv-cache-dtype fp8, --block-size 256, --moe-backend deep_gemm_mega_moe (MoE models), --attention_config.use_fp4_indexer_cache=True (DeepSeek architecture)
- TP/DP/EP strategy kept identical to H200 counterparts

Test plan

- All 7 workloads pass parser smoke tests locally
- Buildkite build #131 validates on real B200 hardware

This PR was authored with assistance from Claude Code.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

deep_gemm_mega_moe is not yet available in the current nightly image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use NVIDIA NVFP4 model variants and official vLLM recipe serve args for B200 workloads (recipes.vllm.ai).

Key changes:
- deepseek_v3_2: NVFP4 + FLASHINFER_MLA attention
- qwen3_5: NVFP4 + flashinfer MoE FP4
- nemotron_3_super: NVFP4, TP=1
- glm_5_1: NVFP4 + flashinfer MoE FP4
- kimi_k2_5: NVFP4 + Eagle3 speculative decoding
- minimax_m2_5: match B200 recipe (TP=4, compilation config)
- deepseek_v4_flash: unchanged (already correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

server.sh expands $serve_args unquoted, so bash word-splits the string but performs no quote removal on it: the single quotes wrapping the --speculative-config / --compilation-config JSON survive as literal characters. vLLM received the literal '{...}' (quotes included) and rejected it with "cannot be converted". The compact JSON has no spaces, so dropping the surrounding single quotes lets word-splitting yield one valid token. AI-assisted change. Co-Authored-By: Claude <noreply@anthropic.com>
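The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the actual server.sh code: the flag name `--some-config` and the JSON payload `{"k":1}` are hypothetical placeholders chosen for illustration.

```shell
# Hypothetical stand-ins for a serve_args string (not taken from the real workloads).
quoted="--some-config '{\"k\":1}'"   # JSON wrapped in single quotes
bare="--some-config {\"k\":1}"       # same compact JSON, no wrapping quotes

# Unquoted expansion splits on whitespace, but quotes that come from the
# expanded value are NOT removed: quote removal only applies to quotes
# written literally in the script source.
set -- $quoted
quoted_token=$2

set -- $bare
bare_token=$2

printf 'quoted: %s\n' "$quoted_token"
printf 'bare:   %s\n' "$bare_token"
```

The first token still carries the single quotes, which is what the server would receive; the second is plain JSON. This only works because the compact JSON contains no whitespace, so a space-bearing value would still be split into several tokens.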

DeepSeek-V3.2 uses sparse attention; forcing --attention-backend FLASHINFER_MLA crashes engine init with "Selected backend AttentionBackendEnum.FLASHINFER_MLA is not valid ... ['sparse not supported']". The vLLM recipe sets no --attention-backend, letting vLLM auto-select a sparse-capable MLA backend. Remove the override. AI-assisted change. Co-Authored-By: Claude <noreply@anthropic.com>

/mnt/shared is NOT a shared mount on the DGX B200 nodes — it's a plain directory on each node's ~1.8T root disk. With HF_HOME=/mnt/shared/hf_cache, model weights accumulated on root until the kubelet hit disk-pressure and evicted job pods ~20s into setup (seen as "signal: terminated" on dgxb200-11). Point the B200 HF cache at the 28T /raid volume (already mounted into the pod) in both lib/gpu_profiles.yaml (authoritative, exported as HF_HOME by parse_workload) and the K8s pod env in generate_pipeline.py. AI-assisted change. Co-Authored-By: Claude <noreply@anthropic.com>
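One way to catch this kind of misconfiguration early is to check which filesystem actually backs a cache directory. The sketch below is an assumption about how such a check could look, not part of this PR; the helper names `mount_point_of` and `on_root_disk` are hypothetical, and the parsing assumes mount points without spaces.

```shell
# Print the mount point backing a path (last column of POSIX-format df output).
mount_point_of() {
  df -P "$1" | awk 'NR==2 {print $NF}'
}

# Succeed if the path lives on the root filesystem rather than a separate mount.
on_root_disk() {
  [ "$(mount_point_of "$1")" = "/" ]
}

# Report on the two locations named in the commit message, if they exist.
for dir in /mnt/shared /raid; do
  if [ -e "$dir" ]; then
    if on_root_disk "$dir"; then
      echo "$dir is a plain directory on the root disk"
    else
      echo "$dir is backed by mount $(mount_point_of "$dir")"
    fi
  fi
done
```

A directory that reports the root disk despite its name suggesting shared storage is a candidate for the disk-pressure failure described above.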

kimi: --attention-config.use_trtllm_ragged_deepseek_prefill is not a valid AttentionConfig field. The TRT-LLM ragged DeepSeek prefill path is selected via the MLA prefill backend, so use --attention-config.mla_prefill_backend=TRTLLM_RAGGED (MLAPrefillBackendEnum.TRTLLM_RAGGED).

bench: native `vllm bench serve` runs under the container's system Python (the interpreter backing the vllm CLI), not the job .venv, so the pandas dependency that SpeedBench needs was not installed there; only the docker branch handled it. Add a native branch that installs pandas into that interpreter.

AI-assisted change. Co-Authored-By: Claude <noreply@anthropic.com>
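For the bench fix, one possible way to find "the interpreter backing the vllm CLI" is to read the shebang of the installed entry-point script. This is a sketch under that assumption and may differ from what the PR actually does; `interpreter_of` is a hypothetical helper, and a `#!/usr/bin/env python3` shebang would yield `/usr/bin/env` rather than the interpreter itself.

```shell
# Print the interpreter path from a script's shebang line
# (prints nothing if the first line is not a shebang).
interpreter_of() {
  head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Intended use, left commented out because it installs a package:
#   py=$(interpreter_of "$(command -v vllm)")
#   [ -n "$py" ] && "$py" -m pip install pandas
```

Installing through `"$py" -m pip` rather than a bare `pip` keeps the package in the same environment the CLI runs in, which is the mismatch the commit describes.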

Add B200 workload variants for all H200-only models

6cdb1cf

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

khluu force-pushed the worktree-add-b200-workloads branch from 9886c4e to 6cdb1cf Compare May 28, 2026 23:40

khluu and others added 7 commits May 29, 2026 02:11

Use deep_gemm MoE backend for B200 workloads

47f0c16

docs: note serve_args word-split is unquoted (no quoted/space-bearing…

c96769b

… JSON) AI-assisted change. Co-Authored-By: Claude <noreply@anthropic.com>

khluu wants to merge 8 commits into main from worktree-add-b200-workloads
