Add B200 workload variants for all H200-only models #12
Open
khluu wants to merge 8 commits into
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9886c4e to 6cdb1cf
deep_gemm_mega_moe is not yet available in the current nightly image.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use NVIDIA NVFP4 model variants and official vLLM recipe serve args for B200 workloads (recipes.vllm.ai).
Key changes:
- deepseek_v3_2: NVFP4 + FLASHINFER_MLA attention
- qwen3_5: NVFP4 + flashinfer MoE FP4
- nemotron_3_super: NVFP4, TP=1
- glm_5_1: NVFP4 + flashinfer MoE FP4
- kimi_k2_5: NVFP4 + Eagle3 speculative decoding
- minimax_m2_5: match B200 recipe (TP=4, compilation config)
- deepseek_v4_flash: unchanged (already correct)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
server.sh word-splits $serve_args unquoted, so bash does not strip the
single quotes wrapping the --speculative-config / --compilation-config
JSON. vLLM received the literal '{...}' (quotes included) and rejected it
with "cannot be converted". The compact JSON has no spaces, so dropping
the surrounding single quotes lets word-splitting yield one valid token.
AI-assisted change.
Co-Authored-By: Claude <noreply@anthropic.com>
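The quoting failure described in this commit can be reproduced outside bash. The sketch below is a rough Python approximation, not the code in server.sh: `unquoted_expand` is a hypothetical helper that mimics only the whitespace-splitting part of an unquoted `$serve_args` expansion, the JSON payload is illustrative, and `json.loads` stands in for whatever conversion vLLM applies to the argument.

```python
import json


def unquoted_expand(value: str) -> list[str]:
    """Approximate bash's unquoted $var expansion: split on whitespace only.

    Quote characters stored inside the variable are ordinary data at this
    point, so they stay attached to the resulting words. Real bash would also
    attempt pathname expansion on each word; that step is ignored here.
    """
    return value.split()


def parses_as_json(token: str) -> bool:
    """Return True if the token is valid JSON (stand-in for vLLM's parsing)."""
    try:
        json.loads(token)
        return True
    except json.JSONDecodeError:
        return False


# Illustrative payload; the real JSON in the workload files may differ.
payload = '{"method":"eagle3","num_speculative_tokens":3}'

quoted = f"--speculative-config '{payload}'"  # before the fix
bare = f"--speculative-config {payload}"      # after the fix

quoted_token = unquoted_expand(quoted)[1]
bare_token = unquoted_expand(bare)[1]

print(quoted_token, parses_as_json(quoted_token))
print(bare_token, parses_as_json(bare_token))
```

The bare form only works because the compact JSON contains no whitespace; a payload with spaces would be split into several words, and one containing glob characters could be rewritten by pathname expansion in real bash.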
… JSON) AI-assisted change. Co-Authored-By: Claude <noreply@anthropic.com>
DeepSeek-V3.2 uses sparse attention; forcing --attention-backend FLASHINFER_MLA crashes engine init with "Selected backend AttentionBackendEnum.FLASHINFER_MLA is not valid ... ['sparse not supported']".
The vLLM recipe sets no --attention-backend, letting vLLM auto-select a sparse-capable MLA backend. Remove the override.
AI-assisted change.
Co-Authored-By: Claude <noreply@anthropic.com>
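For illustration only, removing such an override from a list of serve arguments might look like the sketch below. The helper name `drop_flag` and the sample argument list are hypothetical; the actual workload files in this PR may store the arguments differently (for example as a single string).

```python
def drop_flag(args: list[str], flag: str) -> list[str]:
    """Return args without `flag`, handling `--flag value` and `--flag=value`."""
    kept: list[str] = []
    skip_next = False
    for token in args:
        if skip_next:
            # This token is the value belonging to the flag just dropped.
            skip_next = False
            continue
        if token == flag:
            skip_next = True
            continue
        if token.startswith(flag + "="):
            continue
        kept.append(token)
    return kept


# Hypothetical argument list, not the real deepseek_v3_2 workload.
args = [
    "--tensor-parallel-size", "8",
    "--attention-backend", "FLASHINFER_MLA",
    "--kv-cache-dtype", "fp8",
]
cleaned = drop_flag(args, "--attention-backend")
print(cleaned)
```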
/mnt/shared is NOT a shared mount on the DGX B200 nodes; it is a plain directory on each node's ~1.8T root disk. With HF_HOME=/mnt/shared/hf_cache, model weights accumulated on root until the kubelet hit disk-pressure and evicted job pods ~20s into setup (seen as "signal: terminated" on dgxb200-11).
Point the B200 HF cache at the 28T /raid volume (already mounted into the pod) in both lib/gpu_profiles.yaml (authoritative, exported as HF_HOME by parse_workload) and the K8s pod env in generate_pipeline.py.
AI-assisted change.
Co-Authored-By: Claude <noreply@anthropic.com>
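A quick way to check whether a cache directory is a separate volume or just a folder on the root disk is sketched below. `describe_cache_dir` is a hypothetical helper, not part of this repository, and the device comparison is only meaningful when run on the node itself; inside a container the root filesystem is usually an overlay, so results there can differ.

```python
import os
import shutil


def describe_cache_dir(path: str) -> dict:
    """Report whether `path` is a mount point, shares a device with /, and its free space."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "is_mount_point": os.path.ismount(path),
        # Same device ID as "/" means writes to this path fill the root disk.
        "on_root_filesystem": os.stat(path).st_dev == os.stat("/").st_dev,
        "free_gib": usage.free / 2**30,
        "total_gib": usage.total / 2**30,
    }


# Paths taken from the commit message; any that do not exist are skipped.
for candidate in ("/mnt/shared", "/raid"):
    if os.path.isdir(candidate):
        print(describe_cache_dir(candidate))
```

A directory that reports `on_root_filesystem` as true and `is_mount_point` as false is the situation described above: cached weights written there consume the root disk.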
kimi: --attention-config.use_trtllm_ragged_deepseek_prefill is not a valid AttentionConfig field. The TRT-LLM ragged DeepSeek prefill path is selected via the MLA prefill backend, so use --attention-config.mla_prefill_backend=TRTLLM_RAGGED (MLAPrefillBackendEnum.TRTLLM_RAGGED).
bench: native `vllm bench serve` runs under the container's system Python (the interpreter backing the vllm CLI), not the job .venv, so the pandas dependency that SpeedBench needs wasn't installed there; only the docker branch handled it. Add a native branch that installs pandas into that interpreter.
AI-assisted change.
Co-Authored-By: Claude <noreply@anthropic.com>
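One way to target the interpreter behind a CLI, rather than whichever Python happens to be active, is to read the shebang of the installed entry-point script. The sketch below assumes the `vllm` command is a pip-generated console script with a shebang line; both helper names are hypothetical, and this is not necessarily how the native branch in this PR is implemented.

```python
import shlex
import shutil
import subprocess
from typing import Optional


def interpreter_from_shebang(script_path: str) -> Optional[list[str]]:
    """Return the interpreter command from a script's shebang line, or None."""
    with open(script_path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    # Handles both "#!/usr/bin/python3" and "#!/usr/bin/env python3".
    return shlex.split(first_line[2:]) or None


def install_into_cli_interpreter(cli: str, package: str) -> None:
    """Install `package` with pip into the interpreter that runs `cli`."""
    cli_path = shutil.which(cli)
    if cli_path is None:
        raise FileNotFoundError(f"{cli} not found on PATH")
    interpreter = interpreter_from_shebang(cli_path)
    if interpreter is None:
        raise RuntimeError(f"{cli_path} has no shebang line")
    subprocess.run([*interpreter, "-m", "pip", "install", package], check=True)


# Example call (not executed here because it modifies the environment):
# install_into_cli_interpreter("vllm", "pandas")
```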
Summary
--kv-cache-dtype fp8, --block-size 256, --moe-backend deep_gemm_mega_moe (MoE models), --attention_config.use_fp4_indexer_cache=True (DeepSeek architecture)
Test plan
This PR was authored with assistance from Claude Code.
🤖 Generated with Claude Code