
Add B200 workload variants for all H200-only models#12

Open
khluu wants to merge 8 commits into main from worktree-add-b200-workloads

khluu commented May 28, 2026

Member

Summary

  • Adds B200 variants for 7 H200-only workloads: DeepSeek-V3.2, DeepSeek-V4-Pro, GLM-5.1, Kimi-K2.5, MiniMax-M2.5, Nemotron-3-Super, Qwen3.5
  • B200-specific optimizations: --kv-cache-dtype fp8, --block-size 256, --moe-backend deep_gemm_mega_moe (MoE models), --attention_config.use_fp4_indexer_cache=True (DeepSeek architecture)
  • TP/DP/EP strategy kept identical to H200 counterparts

Test plan

  • All 7 workloads pass parser smoke tests locally
  • Buildkite build #131 validates on real B200 hardware

This PR was authored with assistance from Claude Code.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
khluu force-pushed the worktree-add-b200-workloads branch from 9886c4e to 6cdb1cf on May 28, 2026 23:40
khluu and others added 7 commits May 29, 2026 02:11
deep_gemm_mega_moe is not yet available in the current nightly image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use NVIDIA NVFP4 model variants and official vLLM recipe serve args
for B200 workloads (recipes.vllm.ai). Key changes:
- deepseek_v3_2: NVFP4 + FLASHINFER_MLA attention
- qwen3_5: NVFP4 + flashinfer MoE FP4
- nemotron_3_super: NVFP4, TP=1
- glm_5_1: NVFP4 + flashinfer MoE FP4
- kimi_k2_5: NVFP4 + Eagle3 speculative decoding
- minimax_m2_5: match B200 recipe (TP=4, compilation config)
- deepseek_v4_flash: unchanged (already correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
server.sh word-splits $serve_args unquoted, so bash does not strip the
single quotes wrapping the --speculative-config / --compilation-config
JSON. vLLM received the literal '{...}' (quotes included) and rejected it
with "cannot be converted". The compact JSON has no spaces, so dropping
the surrounding single quotes lets word-splitting yield one valid token.
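
The failure mode described above can be reproduced outside bash. The sketch below is only an emulation: it assumes that splitting on whitespace is a fair stand-in for bash's unquoted `$serve_args` expansion, and the JSON payload is a hypothetical example, not the value used in the workload files.

```python
import json


def word_split(serve_args: str) -> list[str]:
    """Emulate bash's unquoted $serve_args expansion: split on whitespace.

    Quote characters that come out of a variable expansion are not removed
    by bash, so they stay part of the resulting tokens here as well.
    """
    return serve_args.split()


def is_valid_json(token: str) -> bool:
    """Return True if the token parses as JSON, False otherwise."""
    try:
        json.loads(token)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical payload; compact separators guarantee there are no spaces.
compact = json.dumps(
    {"method": "eagle3", "num_speculative_tokens": 3}, separators=(",", ":")
)
quoted = f"--speculative-config '{compact}'"  # what the workload had before
unquoted = f"--speculative-config {compact}"  # what the fix switches to

for label, serve_args in (("quoted", quoted), ("unquoted", unquoted)):
    tokens = word_split(serve_args)
    print(label, tokens, "valid JSON:", is_valid_json(tokens[1]))
```

In the quoted case the second token still starts and ends with a literal single quote, so it is not valid JSON. In the unquoted case the compact JSON survives as one token because it contains no whitespace.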

AI-assisted change.

Co-Authored-By: Claude <noreply@anthropic.com>
… JSON)

AI-assisted change.

Co-Authored-By: Claude <noreply@anthropic.com>
DeepSeek-V3.2 uses sparse attention; forcing --attention-backend
FLASHINFER_MLA crashes engine init with "Selected backend
AttentionBackendEnum.FLASHINFER_MLA is not valid ... ['sparse not
supported']". The vLLM recipe sets no --attention-backend, letting vLLM
auto-select a sparse-capable MLA backend. Remove the override.
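
As a rough illustration of "remove the override", here is a minimal sketch of stripping one flag from a serve-argument list. The helper name `drop_flag` and the sample argument list are hypothetical; the actual change in this PR is an edit to the workload definition, not a runtime filter.

```python
def drop_flag(args: list[str], flag: str) -> list[str]:
    """Return args without `flag`, in `--flag value` or `--flag=value` form."""
    out = []
    skip_next = False
    for arg in args:
        if skip_next:
            skip_next = False  # this is the value belonging to the flag
            continue
        if arg == flag:
            skip_next = True  # drop the flag and the value that follows it
            continue
        if arg.startswith(flag + "="):
            continue  # drop the combined `--flag=value` token
        out.append(arg)
    return out


# Hypothetical argument list, for illustration only.
serve_args = [
    "--tensor-parallel-size", "8",
    "--attention-backend", "FLASHINFER_MLA",
    "--kv-cache-dtype", "fp8",
]
print(drop_flag(serve_args, "--attention-backend"))
```

With the flag gone, backend selection is left to vLLM, which is the behaviour the commit message relies on.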

AI-assisted change.

Co-Authored-By: Claude <noreply@anthropic.com>
/mnt/shared is NOT a shared mount on the DGX B200 nodes — it's a plain
directory on each node's ~1.8T root disk. With HF_HOME=/mnt/shared/hf_cache,
model weights accumulated on root until the kubelet hit disk-pressure and
evicted job pods ~20s into setup (seen as "signal: terminated" on dgxb200-11).

Point the B200 HF cache at the 28T /raid volume (already mounted into the
pod) in both lib/gpu_profiles.yaml (authoritative, exported as HF_HOME by
parse_workload) and the K8s pod env in generate_pipeline.py.
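
A pre-flight check along the following lines could catch this kind of problem before weights start downloading. It is a sketch under assumptions: the function names and the report layout are hypothetical and are not part of lib/gpu_profiles.yaml or generate_pipeline.py.

```python
import os
import shutil


def nearest_existing(path: str) -> str:
    """Walk up from path until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def cache_volume_report(hf_home: str, min_free_bytes: int) -> dict:
    """Describe the volume that would back an HF cache directory."""
    probe = nearest_existing(hf_home)
    usage = shutil.disk_usage(probe)
    return {
        "probe": probe,
        # False for a plain directory, even if its name suggests a mount.
        "is_mount_point": os.path.ismount(probe),
        # True means cache writes land on the same device as the root disk.
        "same_device_as_root": os.stat(probe).st_dev == os.stat("/").st_dev,
        "free_bytes": usage.free,
        "enough_space": usage.free >= min_free_bytes,
    }


if __name__ == "__main__":
    # The 500 GiB threshold is an arbitrary illustrative value.
    print(cache_volume_report("/raid/hf_cache", 500 * 1024**3))
```

A report with `same_device_as_root` set to True for the cache path would have flagged the /mnt/shared situation described above.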

AI-assisted change.

Co-Authored-By: Claude <noreply@anthropic.com>
kimi: --attention-config.use_trtllm_ragged_deepseek_prefill is not a valid
AttentionConfig field. The TRT-LLM ragged DeepSeek prefill path is selected
via the MLA prefill backend, so use --attention-config.mla_prefill_backend=
TRTLLM_RAGGED (MLAPrefillBackendEnum.TRTLLM_RAGGED).
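
Invalid dotted overrides like this one can be caught early by checking the field name against the config class. The sketch below uses a stand-in dataclass with two field names taken from this PR's text; it is not vLLM's real AttentionConfig, and `check_override` is a hypothetical helper.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AttentionConfigStandIn:
    # Stand-in only: field names copied from this PR, not from vLLM source.
    mla_prefill_backend: Optional[str] = None
    use_fp4_indexer_cache: bool = False


def check_override(flag: str, config_cls=AttentionConfigStandIn) -> str:
    """Return the field named by `--attention-config.<field>[=value]`.

    Raises ValueError if the prefix is wrong or the field does not exist.
    """
    prefix = "--attention-config."
    if not flag.startswith(prefix):
        raise ValueError(f"{flag!r} is not an attention-config override")
    name = flag[len(prefix):].split("=", 1)[0]
    valid = {f.name for f in fields(config_cls)}
    if name not in valid:
        raise ValueError(
            f"{name!r} is not a field of {config_cls.__name__}; "
            f"valid fields: {sorted(valid)}"
        )
    return name


print(check_override("--attention-config.mla_prefill_backend=TRTLLM_RAGGED"))
try:
    check_override("--attention-config.use_trtllm_ragged_deepseek_prefill")
except ValueError as err:
    print("rejected:", err)
```

Running a check of this kind in the parser smoke tests would surface a misspelled or nonexistent field without needing GPU hardware.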

bench: native `vllm bench serve` runs under the container's system Python
(the interpreter backing the vllm CLI), not the job .venv, so the pandas
dependency that SpeedBench needs was not installed there; only the docker
branch handled it. Add a native branch that installs pandas into that
interpreter.
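
One way to find "the interpreter backing the vllm CLI" is to read the shebang line of the `vllm` console script. The sketch below assumes the CLI is an ordinary Python entry-point script with a shebang; the helper names are hypothetical and this is not the code added to the bench script.

```python
import shlex
import shutil


def parse_shebang(first_line: str) -> list[str]:
    """Turn a shebang line into the interpreter command it names."""
    line = first_line.strip()
    if not line.startswith("#!"):
        raise ValueError("no shebang line found")
    return shlex.split(line[2:])


def interpreter_for_script(script_path: str) -> list[str]:
    """Read the first line of a console script and parse its shebang."""
    with open(script_path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", errors="replace")
    return parse_shebang(first_line)


def pip_install_command(interpreter: list[str], package: str) -> list[str]:
    """Build a pip command that targets that exact interpreter."""
    return [*interpreter, "-m", "pip", "install", package]


if __name__ == "__main__":
    vllm_cli = shutil.which("vllm")
    if vllm_cli is None:
        print("vllm CLI not found on PATH")
    else:
        try:
            interpreter = interpreter_for_script(vllm_cli)
            print(shlex.join(pip_install_command(interpreter, "pandas")))
        except ValueError as err:
            print(f"cannot determine interpreter: {err}")
```

Invoking pip through `<interpreter> -m pip` rather than a bare `pip` makes sure the package lands in the environment the CLI actually imports from, not in the job .venv.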

AI-assisted change.

Co-Authored-By: Claude <noreply@anthropic.com>
