dsv4-fp4-mi355x-vllm and adopt recipes#433#1374
…#433
The deepseek-ai/DeepSeek-V4-Pro checkpoint is FP4+FP8 mixed (FP4 MoE expert weights dominate the ~960 GB footprint, with FP8 only on attention/norm/router and FP8 KV cache). Reclassify the vLLM MI355X benchmark as fp4, matching dsv4-fp4-mi355x-sglang and dsv4-fp4-mi355x-atom, which use the same checkpoint.
Also apply the validated MI355X serving recipe from vllm-project/recipes#433 (DeepSeek-V4-Pro, TP=8):
* Rename benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh -> dsv4_fp4_mi355x_vllm.sh; remove dsv4-fp8-mi355x-vllm from amd-master.yaml; add dsv4-fp4-mi355x-vllm next to its fp4 siblings
* Add VLLM_ROCM_USE_AITER_LINEAR=1 env var
* Add --distributed-executor-backend mp, --max-num-batched-tokens 8192, --async-scheduling server flags
* Tune --gpu-memory-utilization 0.90 -> 0.6 and --max-num-seqs 32 -> 128
* Drop --tool-call-parser / --enable-auto-tool-choice (not in recipe, not exercised by throughput benchmarks)
* Expand sweep from conc=1 to conc 4-64 to match dsv4-fp4-mi355x-sglang for vLLM<->SGLang comparability now that max-num-seqs=128 allows it
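To make the flag changes concrete, here is a minimal sketch that assembles only the env var and server flags named in this commit message. The helper name `build_serve_command` is hypothetical, `--tensor-parallel-size` is assumed to be how the script expresses TP=8, and the real script passes further flags that are not reproduced here.

```python
import shlex


def build_serve_command(model="deepseek-ai/DeepSeek-V4-Pro", tp=8):
    """Return (env, argv) for the flags listed in the commit message.

    Hypothetical helper for illustration; not the benchmark script itself.
    """
    env = {"VLLM_ROCM_USE_AITER_LINEAR": "1"}
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),        # assumed spelling of TP=8
        "--distributed-executor-backend", "mp",
        "--max-num-batched-tokens", "8192",
        "--async-scheduling",
        "--gpu-memory-utilization", "0.6",        # was 0.90
        "--max-num-seqs", "128",                  # was 32
        # --tool-call-parser / --enable-auto-tool-choice intentionally absent
    ]
    return env, argv


env, argv = build_serve_command()
prefix = " ".join(f"{key}={value}" for key, value in env.items())
print(prefix, shlex.join(argv))
```

Printing the joined command gives a line that can be compared by eye against the recipe.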
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833234121 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833238956 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833259497 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833728949 |
With the image switched to vllm/vllm-openai-rocm:nightly (which already includes vllm-project/vllm#40871 DSv4 base ROCm support), the rebuild overlay and the workarounds that propped up the rocm/atom base image are no longer needed:
* Remove the vLLM PR #40889 clone + editable rebuild block
* Remove sanitize_stale_triton_test_metadata() (/triton-test was an atom-image metadata quirk; the new /install/torch...whl bug exposed in run 25833728949 stems entirely from this rebuild path, so dropping the rebuild removes both)
* Remove ensure_amdsmi_python(): nightly ships the amdsmi Python wheel
* Remove install_tilelang_runtime_deps(): only the rebuilt vLLM needed it
* Remove patch_vllm_rocm_platform_detection(): nightly detects ROCm correctly without the amdsmi/torch fallback patches
* Remove triton_kernels install: only needed by PR #40889's MoE path
* Drop VLLM_TARGET_DEVICE / VLLM_PLUGINS env vars (atom-specific)
Keep env vars (VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_LINEAR, VLLM_ENGINE_READY_TIMEOUT_S for the slow cold-cache load), the recipe vllm serve invocation, and the benchmark/eval driver calls.
Also refresh the amd-master.yaml comment block above the entry to drop the rebuild references.
Script: 539 -> 94 lines.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25834532618 |
The recipe (vllm-project/recipes#433) specifies --moe-backend triton_unfused, but that choice was never accepted into vLLM main — likely it lived on the #40871 PR branch and was renamed/removed before merge. In vllm/vllm-openai-rocm:nightly (which the recipe itself uses), the legal choices are: aiter, auto, cutlass, deep_gemm, emulation, flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, marlin, triton. Drop the flag entirely and let vLLM's `auto` selector pick the backend. With VLLM_ROCM_USE_AITER=1 set, that resolves to the AITER MoE path on ROCm — the same kernel family the recipe was steering toward. All other remaining flags and env vars verified valid in vLLM 0.20.2.
The previous run errored with:
Model architectures ['DeepseekV4ForCausalLM'] are not supported for now.
Supported architectures: dict_keys([..., 'DeepseekV32ForCausalLM', ...])
even though vllm-project/vllm#40871 (which registers DeepseekV4ForCausalLM) merged on 2026-05-05 and vllm/vllm-openai-rocm:nightly has been bumped multiple times since.
Root cause: runners/launch_mi355x-amds.sh caches enroot squashfs files keyed on the image string and short-circuits re-import if the squash already exists. The runner's cached squash for ':nightly' predates the #40871 merge (the container reported vllm 0.19.2rc1.dev212 ~ Apr 25), so Docker Hub updates never reached the runner.
Switch to an immutable digest-suffixed tag: the squash cache key now changes whenever we bump, forcing a fresh import. Picking nightly-dcacdf9a8860a86401127d1c8f93ebf3cfbfd026 (2026-05-13, most recent at time of pin), which is well past the #40871 merge.
Also update the script header and yaml comment block to document the caching pitfall so the next bumper doesn't revert to ':nightly'.
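The caching pitfall can be sketched in a few lines. This is an illustration under assumptions, not the logic of runners/launch_mi355x-amds.sh: the key derivation (`squash_cache_key`) and the `.sqsh` suffix are hypothetical, and the only property taken from the commit message is that the cache is keyed on the image string and skips re-import when a squash already exists.

```python
import re


def squash_cache_key(image: str) -> str:
    """Hypothetical cache key: the image string with unsafe characters replaced."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", image) + ".sqsh"


def needs_import(image: str, cached: set) -> bool:
    """Re-import only when no squash file exists for this exact image string."""
    return squash_cache_key(image) not in cached


# A squash imported weeks ago under the mutable ':nightly' tag.
cached = {squash_cache_key("vllm/vllm-openai-rocm:nightly")}

# The registry has moved on, but the key is unchanged, so the stale squash is reused.
print(needs_import("vllm/vllm-openai-rocm:nightly", cached))  # False

# A commit-suffixed tag yields a new key and forces a fresh import.
pinned = "vllm/vllm-openai-rocm:nightly-dcacdf9a8860a86401127d1c8f93ebf3cfbfd026"
print(needs_import(pinned, cached))  # True
```

Any scheme that keys purely on the tag string has this property, which is why bumping the pin is enough to invalidate the cache.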
I dropped --moe-backend triton_unfused based on a stale error message
("invalid choice ... choose from aiter, auto, ...") from the previous
run, but that error came from the cached squashfs of an April 25 build
that pre-dated #40871. The pinned nightly-dcacdf9a8860a8640 DOES have
triton_unfused in MoEBackend — verified by reading vllm/config/kernel.py
at that exact commit on GitHub.
Without --moe-backend triton_unfused, vLLM's auto selector picks a
backend that doesn't register w13_weight_scale / w2_weight_scale on the
FP4 expert layers, so safetensors loading throws:
KeyError: 'layers.0.ffn.experts.w13_weight_scale'
at vllm/model_executor/models/deepseek_v4.py:1492
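The failure mode is a plain dictionary lookup miss. The toy loader below is not vLLM's code; `load_weights` and the registered parameter name are hypothetical stand-ins chosen to show how a checkpoint tensor with no matching registered parameter surfaces as a KeyError.

```python
def load_weights(params: dict, checkpoint: dict) -> None:
    """Copy each checkpoint tensor into the parameter registered under the same name."""
    for name, tensor in checkpoint.items():
        param = params[name]  # raises KeyError if the layer never registered `name`
        param["data"] = tensor


# Hypothetical registry produced by a backend that set up a different scale parameter.
registered = {"layers.0.ffn.experts.w13_weight_scale_inv": {"data": None}}

# The FP4 checkpoint ships the scale under the name from the traceback.
checkpoint = {"layers.0.ffn.experts.w13_weight_scale": [1.0]}

try:
    load_weights(registered, checkpoint)
except KeyError as exc:
    print(f"KeyError: {exc}")
```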
This matches the recipe (vllm-project/recipes#433) line-for-line now,
with the only intentional deviations being InferenceX conventions:
* --max-model-len $MAX_MODEL_LEN (sized to ISL+OSL+256)
* --no-enable-prefix-caching (fair benchmark comparisons)
* VLLM_ENGINE_READY_TIMEOUT_S=3600 (cold HF-cache tolerance)
None of those interact with weight loading; they were not implicated
in either failure.
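As a quick check of the --max-model-len sizing rule (ISL+OSL+256), here is the arithmetic for two sequence-length shapes. It assumes "1k" means 1024 tokens and "8k" means 8192 tokens; the function name is illustrative.

```python
def max_model_len(isl: int, osl: int, margin: int = 256) -> int:
    """Context length sized to input + output tokens plus a fixed margin."""
    return isl + osl + margin


# Assumes 1k = 1024 and 8k = 8192 tokens.
print(max_model_len(1024, 1024))  # 2304
print(max_model_len(8192, 1024))  # 9472
```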
The previous run errored with:
ValueError: moe_backend='triton_unfused' is not supported for FP8 MoE.
Expected one of ['triton','deep_gemm','cutlass','flashinfer_trtllm',
'flashinfer_cutlass','marlin','aiter']
even though the DeepSeek-V4-Pro config explicitly declares
`expert_dtype: "fp4"`. The cause is vLLM's auto-detection of the
DSv4-aware quant config:
DeepseekV4FP8Config.override_quantization_method returns
"deepseek_v4_fp8" only when:
hf_quant_cfg.quant_method in ("fp8","deepseek_v4_fp8") AND
(hf_config.model_type == "deepseek_v4" OR
user_quant == "deepseek_v4_fp8")
The HF config has model_type=deepseek_v4, but the sister SGLang
script (dsv4_fp8_mi355x.sh) documents that the bundled transformers
in these container images does NOT recognize that model_type and the
cached config has to be patched. When the auto-detection silently
fails, vLLM falls back to plain Fp8Config, which:
* Treats the FusedMoE layer as FP8 block-quantized (registers
weight_scale_inv params instead of FP4 w13_weight_scale /
w2_weight_scale → KeyError on load_weights — the prior failure)
* Routes through select_fp8_moe_backend, which doesn't accept
triton_unfused as a valid choice (the current failure)
Pass --quantization deepseek_v4_fp8 to take the user_quant branch
explicitly and bypass the model_type check entirely. This is the only
remaining recipe-vs-runtime deviation needed to make recipes#433 work
on this container; document the why in the script header.
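The condition described above can be restated as a small pure function. This is a sketch of the decision as the commit message describes it, not a copy of vLLM's DeepseekV4FP8Config; the free-function form and argument names are assumptions.

```python
from typing import Optional

DSV4_QUANT = "deepseek_v4_fp8"


def override_quantization_method(
    hf_quant_method: str,
    model_type: Optional[str],
    user_quant: Optional[str],
) -> Optional[str]:
    """Return the DSv4-aware method name, or None to fall back to the default config."""
    if hf_quant_method not in ("fp8", DSV4_QUANT):
        return None
    if model_type == "deepseek_v4" or user_quant == DSV4_QUANT:
        return DSV4_QUANT
    return None


# model_type recognized: auto-detection succeeds.
print(override_quantization_method("fp8", "deepseek_v4", None))  # deepseek_v4_fp8
# model_type not recognized and no user override: falls back (plain Fp8Config).
print(override_quantization_method("fp8", None, None))           # None
# --quantization deepseek_v4_fp8 takes the user_quant branch regardless of model_type.
print(override_quantization_method("fp8", None, DSV4_QUANT))     # deepseek_v4_fp8
```

The third case is the one the explicit --quantization flag relies on: the model_type check is never consulted.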
|
Blocked by vllm-project/vllm#41946, waiting for image |
…recipe-433 # Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26011426044 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26109152317 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26110188026 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26110663181 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26122410203 |
…xups
Resolutions:
- perf-changelog.yaml: took main verbatim.
- runners/launch_b300-nv.sh: took main (drops --nodelist pin entirely;
supersedes our narrower 017-019 fix).
- benchmarks/single_node/fixed_seq_len/dsv4_fp8_mi355x{,_vllm}.sh:
accepted main's deletes (orphan recipes removed in #1374, #1501).
- .github/configs/amd-master.yaml: took main as the base, then re-applied
our agentic-only additions on top:
* qwen3.5-fp8-mi355x-sglang-agentic-hicache (new entry)
* dsv4-fp4-mi355x-vllm-agentic (new entry)
* dsv4-fp4-mi355x-sglang-agentic (new entry)
* kimik2.5-fp4-mi355x-vllm-agentic (cpu -> lmcache)
Dropped our comment-path edit for dsv4_fp8_mi355x_vllm.sh since main
deleted that entry.
Fixed_seq_len reorg fixups for files added on main during our branch's
lifetime:
- git mv 14 stranded scripts from benchmarks/single_node/*.sh into
benchmarks/single_node/fixed_seq_len/ (dsr1_fp4_b200_mtp,
dsr1_fp4_mi355x_mtp, dsr1_fp8_h200_mtp, dsr1_fp8_mi325x_mtp,
dsr1_fp8_mi355x_mtp, dsv4_fp4_mi355x_vllm, glm5_fp8_h200_mtp,
glm5_fp8_mi325x, glm5_fp8_mi325x_mtp, qwen3.5_bf16_mi325x_mtp,
qwen3.5_fp4_mi355x_mtp, qwen3.5_fp8_h100, qwen3.5_fp8_h100_mtp,
qwen3.5_fp8_mi325x_mtp). Patched their source paths from
../benchmark_lib.sh to ../../benchmark_lib.sh.
- runners/launch_mi355x-amds.sh: multinode-non-disagg BENCHMARK_SUBDIR
bumped from `single_node` to `single_node/fixed_seq_len`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
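The source-path patch for the moved scripts amounts to a guarded string rewrite. The sketch below assumes each script references the library as `../benchmark_lib.sh` somewhere in a `source` line (the exact line shape is hypothetical); the lookbehind keeps the rewrite idempotent so a second pass does not produce `../../../`.

```python
import re

# Match "../benchmark_lib.sh" only when it is not already preceded by another "../".
_OLD_PATH = re.compile(r"(?<!\.\./)\.\./benchmark_lib\.sh")


def patch_source_path(script_text: str) -> str:
    """Rewrite the library path for a script moved one directory deeper."""
    return _OLD_PATH.sub("../../benchmark_lib.sh", script_text)


before = 'source "$(dirname "$0")/../benchmark_lib.sh"\n'  # hypothetical source line
after = patch_source_path(before)
print(after, end="")
print(patch_source_path(after) == after)  # True: already-patched text is left alone
```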
Summary
The DeepSeek-V4-Pro checkpoint (deepseek-ai/DeepSeek-V4-Pro) used by the dsv4-fp8-mi355x-vllm benchmark actually contains FP4+FP8 mixed-precision weights (FP4 MoE expert weights dominate the ~960 GB footprint; FP8 only on attention/norm/router and FP8 KV cache at runtime). The matching sister benchmarks already use precision: fp4 on this same checkpoint:

| Benchmark | Checkpoint | Precision |
| --- | --- | --- |
| dsv4-fp4-mi355x-sglang | deepseek-ai/DeepSeek-V4-Pro | fp4 |
| dsv4-fp4-mi355x-atom | deepseek-ai/DeepSeek-V4-Pro | fp4 |
| dsv4-fp4-mi355x-vllm (was dsv4-fp8-mi355x-vllm) | deepseek-ai/DeepSeek-V4-Pro | fp4 |
| dsv4-fp8-mi355x-sglang | sgl-project/DeepSeek-V4-Pro-FP8 (re-quantized full FP8) | fp8 (unchanged) |

This PR also applies the validated MI355X serving recipe from vllm-project/recipes#433 (DeepSeek-V4-Pro, TP=8), which sources its config from vllm-project/vllm#40871. The base vLLM build (PR #40889 with AITER-accelerated sparse MLA decode, pinned in the script) is unchanged; only serving flags / env vars / sweep range are updated.
File / config changes
* git mv benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh -> dsv4_fp4_mi355x_vllm.sh
* amd-master.yaml: remove dsv4-fp8-mi355x-vllm block, add dsv4-fp4-mi355x-vllm block next to its dsv4-fp4-* MI355X siblings with precision: fp4
* perf-changelog.yaml: append entry documenting the rename + recipe adoption
Server-side changes
* Added: VLLM_ROCM_USE_AITER_LINEAR=1 (env var)
* Added: --distributed-executor-backend mp
* Added: --max-num-batched-tokens 8192
* Added: --async-scheduling
* Changed: --gpu-memory-utilization 0.90 -> 0.6
* Changed: --max-num-seqs 32 -> 128
* Removed: --tool-call-parser deepseek_v4
* Removed: --enable-auto-tool-choice
Tool-call flags were removed because the recipe omits them and throughput benchmarks here don't exercise tool calling. All other existing flags (--kv-cache-dtype fp8, --moe-backend triton_unfused, --enforce-eager, --no-enable-prefix-caching, --tokenizer-mode deepseek_v4, --reasoning-parser deepseek_v4) are preserved.
Sweep changes
dsv4-fp4-mi355x-vllm (formerly dsv4-fp8-mi355x-vllm) was previously pinned to conc=1 only. With --max-num-seqs=128 validated by the recipe, the sweep is expanded to conc 4-64 for both 1k1k and 8k1k, matching dsv4-fp4-mi355x-sglang so vLLM↔SGLang results are directly comparable on the same MI355X runner.
Validated locally:
Notes
dsv4-fp8-mi355x-sglang is unchanged: it uses the genuinely FP8 sgl-project/DeepSeek-V4-Pro-FP8 checkpoint, not the mixed-precision one.
Test plan
* sweep-enabled label so run-sweep.yml exercises the renamed key + new flags end-to-end at trimmed concurrency.
* b3a4a44) still installs cleanly with the new server flags.
* agg_bmk.json for dsv4 mi355x-vllm entries; compare throughput vs. dsv4-fp4-mi355x-sglang and dsv4-fp4-mi355x-atom at matching concurrencies.