dsv4-fp4-mi355x-vllm and adopt recipes#433 by Oseltamivir · Pull Request #1374 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-05-13T23:53:57Z

Summary

The deepseek-ai/DeepSeek-V4-Pro checkpoint used by the dsv4-fp8-mi355x-vllm benchmark is actually FP4+FP8 mixed-precision: FP4 MoE expert weights dominate the ~960 GB footprint, while FP8 is used only for the attention/norm/router weights and for the KV cache at runtime. The sister benchmarks on this same checkpoint already use precision: fp4:

| key | model | precision (after this PR) |
|---|---|---|
| `dsv4-fp4-mi355x-sglang` | `deepseek-ai/DeepSeek-V4-Pro` | `fp4` |
| `dsv4-fp4-mi355x-atom` | `deepseek-ai/DeepSeek-V4-Pro` | `fp4` |
| `dsv4-fp4-mi355x-vllm` (was `dsv4-fp8-mi355x-vllm`) | `deepseek-ai/DeepSeek-V4-Pro` | `fp4` |
| `dsv4-fp8-mi355x-sglang` | `sgl-project/DeepSeek-V4-Pro-FP8` (re-quantized full FP8) | `fp8` (unchanged) |

This PR also applies the validated MI355X serving recipe from vllm-project/recipes#433 (DeepSeek-V4-Pro, TP=8), which sources its config from vllm-project/vllm#40871. The base vLLM build (PR #40889 with AITER-accelerated sparse MLA decode, pinned in the script) is unchanged; only serving flags / env vars / sweep range are updated.

File / config changes

git mv benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh -> dsv4_fp4_mi355x_vllm.sh
amd-master.yaml: remove dsv4-fp8-mi355x-vllm block, add dsv4-fp4-mi355x-vllm block next to its dsv4-fp4-* MI355X siblings with precision: fp4
perf-changelog.yaml: append entry documenting the rename + recipe adoption

Server-side changes

| Flag / env var | Before | After (recipe #433) |
|---|---|---|
| `VLLM_ROCM_USE_AITER_LINEAR` | unset | `1` |
| `--distributed-executor-backend` | (default) | `mp` |
| `--max-num-batched-tokens` | (default) | `8192` |
| `--async-scheduling` | off | on |
| `--gpu-memory-utilization` | `0.90` | `0.6` |
| `--max-num-seqs` | `32` | `128` |
| `--tool-call-parser deepseek_v4` | on | dropped |
| `--enable-auto-tool-choice` | on | dropped |
Tool-call flags were removed because the recipe omits them and throughput benchmarks here don't exercise tool calling. All other existing flags (--kv-cache-dtype fp8, --moe-backend triton_unfused, --enforce-eager, --no-enable-prefix-caching, --tokenizer-mode deepseek_v4, --reasoning-parser deepseek_v4) are preserved.
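
As a rough illustration of the resulting configuration, the sketch below assembles the environment override and the `vllm serve` argument list in Python without launching anything. The helper name `build_serve_command` and the `--tensor-parallel-size 8` spelling of TP=8 are assumptions for illustration; the real benchmark script is a shell script and may order or spell things differently.

```python
from typing import Dict, List, Tuple

MODEL = "deepseek-ai/DeepSeek-V4-Pro"


def build_serve_command() -> Tuple[Dict[str, str], List[str]]:
    """Return (env overrides, argv) for the post-PR server settings.

    Hypothetical helper: it only builds the command, it never runs it.
    """
    env = {"VLLM_ROCM_USE_AITER_LINEAR": "1"}
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "8",  # assumed spelling of TP=8
        # Flags added or retuned by this PR
        "--distributed-executor-backend", "mp",
        "--max-num-batched-tokens", "8192",
        "--async-scheduling",
        "--gpu-memory-utilization", "0.6",
        "--max-num-seqs", "128",
        # Flags preserved from the previous script
        "--kv-cache-dtype", "fp8",
        "--moe-backend", "triton_unfused",
        "--enforce-eager",
        "--no-enable-prefix-caching",
        "--tokenizer-mode", "deepseek_v4",
        "--reasoning-parser", "deepseek_v4",
    ]
    return env, argv


env, argv = build_serve_command()
# Print a copy-pasteable command line for inspection.
print(" ".join(f"{k}={v}" for k, v in env.items()), " ".join(argv))
```

Keeping the command as data makes it easy to diff the "Before" and "After" settings or to confirm that the dropped tool-call flags are really gone.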

Sweep changes

dsv4-fp4-mi355x-vllm (formerly dsv4-fp8-mi355x-vllm) was previously pinned to conc=1 only. With --max-num-seqs=128 validated by the recipe, the sweep is expanded to conc 4-64 for both 1k1k and 8k1k, matching dsv4-fp4-mi355x-sglang so vLLM↔SGLang results are directly comparable on the same MI355X runner.

Validated locally:

$ python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/amd-master.yaml \
    --framework vllm --runner-type mi355x
# 10 dsv4-fp4-mi355x-vllm configs generated (5 per ISL/OSL, tp=8, conc 4..64)
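
The count in the comment above is consistent with a doubling concurrency ladder. A small sketch, assuming concurrency doubles from 4 to 64 and that 1k1k / 8k1k mean ISL/OSL pairs of (1024, 1024) and (8192, 1024); the doubling step and the `expand_sweep` helper are assumptions, not the logic of generate_sweep_configs.py:

```python
from typing import Dict, List


def expand_sweep(conc_start: int, conc_end: int, tp: int = 8) -> List[Dict[str, int]]:
    """Enumerate one config per (ISL/OSL pair, concurrency) combination."""
    seq_lens = [(1024, 1024), (8192, 1024)]  # assumed meaning of 1k1k and 8k1k
    configs = []
    for isl, osl in seq_lens:
        conc = conc_start
        while conc <= conc_end:
            configs.append({"isl": isl, "osl": osl, "tp": tp, "conc": conc})
            conc *= 2  # assumed doubling step: 4, 8, 16, 32, 64
    return configs


configs = expand_sweep(4, 64)
print(len(configs))  # 10
```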

Notes

dsv4-fp8-mi355x-sglang is unchanged — it uses the genuinely-FP8 sgl-project/DeepSeek-V4-Pro-FP8 checkpoint, not the mixed-precision one.
Supersedes #1373 (Improve dsv4-fp8-mi355x-vllm with vllm-project/recipes#433 MI355X recipe).

Test plan

Apply sweep-enabled label so run-sweep.yml exercises the renamed key + new flags end-to-end at trimmed concurrency.
Confirm the AITER MLA decode build (vLLM PR #40889 SHA b3a4a44) still installs cleanly with the new server flags.
Inspect agg_bmk.json for dsv4 mi355x-vllm entries; compare throughput vs. dsv4-fp4-mi355x-sglang and dsv4-fp4-mi355x-atom at matching concurrencies.

…#433

The deepseek-ai/DeepSeek-V4-Pro checkpoint is FP4+FP8 mixed (FP4 MoE expert weights dominate the ~960 GB footprint, with FP8 only on attention/norm/router and FP8 KV cache). Reclassify the vLLM MI355X benchmark as fp4, matching dsv4-fp4-mi355x-sglang and dsv4-fp4-mi355x-atom, which use the same checkpoint.

Also apply the validated MI355X serving recipe from vllm-project/recipes#433 (DeepSeek-V4-Pro, TP=8):

* Rename benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh -> dsv4_fp4_mi355x_vllm.sh; remove dsv4-fp8-mi355x-vllm from amd-master.yaml; add dsv4-fp4-mi355x-vllm next to its fp4 siblings
* Add VLLM_ROCM_USE_AITER_LINEAR=1 env var
* Add --distributed-executor-backend mp, --max-num-batched-tokens 8192, --async-scheduling server flags
* Tune --gpu-memory-utilization 0.90 -> 0.6 and --max-num-seqs 32 -> 128
* Drop --tool-call-parser / --enable-auto-tool-choice (not in recipe, not exercised by throughput benchmarks)
* Expand sweep from conc=1 to conc 4-64 to match dsv4-fp4-mi355x-sglang for vLLM<->SGLang comparability now that max-num-seqs=128 allows it

github-actions · 2026-05-13T23:54:05Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-13T23:54:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833234121
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25833234121

github-actions · 2026-05-13T23:55:18Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833238956
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25833238956

github-actions · 2026-05-14T00:05:09Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833259497
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25833259497

github-actions · 2026-05-14T00:31:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25833728949
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25833728949

With the image switched to vllm/vllm-openai-rocm:nightly (which already includes vllm-project/vllm#40871 DSv4 base ROCm support), the rebuild overlay and the workarounds that propped up the rocm/atom base image are no longer needed:

* Remove the vLLM PR #40889 clone + editable rebuild block
* Remove sanitize_stale_triton_test_metadata() (/triton-test was an atom-image metadata quirk; the new /install/torch...whl bug exposed in run 25833728949 stems entirely from this rebuild path, so dropping the rebuild removes both)
* Remove ensure_amdsmi_python(): nightly ships the amdsmi Python wheel
* Remove install_tilelang_runtime_deps(): only the rebuilt vLLM needed it
* Remove patch_vllm_rocm_platform_detection(): nightly detects ROCm correctly without the amdsmi/torch fallback patches
* Remove triton_kernels install: only needed by PR #40889's MoE path
* Drop VLLM_TARGET_DEVICE / VLLM_PLUGINS env vars (atom-specific)

Keep env vars (VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_LINEAR, VLLM_ENGINE_READY_TIMEOUT_S for the slow cold-cache load), the recipe vllm serve invocation, and the benchmark/eval driver calls. Also refresh the amd-master.yaml comment block above the entry to drop the rebuild references.

Script: 539 -> 94 lines.

github-actions · 2026-05-14T00:36:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25834532618
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25834532618

The recipe (vllm-project/recipes#433) specifies --moe-backend triton_unfused, but that choice was never accepted into vLLM main — likely it lived on the #40871 PR branch and was renamed/removed before merge. In vllm/vllm-openai-rocm:nightly (which the recipe itself uses), the legal choices are: aiter, auto, cutlass, deep_gemm, emulation, flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, marlin, triton. Drop the flag entirely and let vLLM's `auto` selector pick the backend. With VLLM_ROCM_USE_AITER=1 set, that resolves to the AITER MoE path on ROCm — the same kernel family the recipe was steering toward. All other remaining flags and env vars verified valid in vLLM 0.20.2.

The previous run errored with:

Model architectures ['DeepseekV4ForCausalLM'] are not supported for now. Supported architectures: dict_keys([..., 'DeepseekV32ForCausalLM', ...])

even though vllm-project/vllm#40871 (which registers DeepseekV4ForCausalLM) merged on 2026-05-05 and vllm/vllm-openai-rocm:nightly has been bumped multiple times since.

Root cause: runners/launch_mi355x-amds.sh caches enroot squashfs files keyed on the image string and short-circuits re-import if the squash already exists. The runner's cached squash for ':nightly' predates the #40871 merge (the container reported vllm 0.19.2rc1.dev212 ~ Apr 25), so docker hub updates never reached the runner.

Switch to an immutable digest-suffixed tag: the squash cache key now changes whenever we bump, forcing a fresh import. Picking nightly-dcacdf9a8860a86401127d1c8f93ebf3cfbfd026 (2026-05-13, most recent at time of pin), which is well past the #40871 merge.

Also update the script header and yaml comment block to document the caching pitfall so the next bumper doesn't revert to ':nightly'.
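
The pitfall can be reproduced in miniature. The sketch below assumes a cache keyed only on the image string, as described above; `squash_path`, `import_image`, and the in-memory `cache` dict are hypothetical stand-ins for the enroot logic in runners/launch_mi355x-amds.sh, not its actual code.

```python
import re
from typing import Dict


def squash_path(image: str) -> str:
    """Derive a cache filename from the image string alone (assumed scheme)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", image) + ".sqsh"


def import_image(image: str, registry_build: str, cache: Dict[str, str]) -> str:
    """Return the build actually used; skip the import when a squash exists."""
    key = squash_path(image)
    if key not in cache:  # short-circuit: only import on a cache miss
        cache[key] = registry_build
    return cache[key]


cache: Dict[str, str] = {}
mutable = "vllm/vllm-openai-rocm:nightly"
pinned = "vllm/vllm-openai-rocm:nightly-dcacdf9a8860a86401127d1c8f93ebf3cfbfd026"

# First import happens while the registry still serves the April build.
import_image(mutable, "0.19.2rc1.dev212", cache)
# The registry moves on, but the mutable tag maps to the same key: stale hit.
stale = import_image(mutable, "post-40871 build", cache)
# A pinned tag changes the key, so a fresh import is forced.
fresh = import_image(pinned, "post-40871 build", cache)
print(stale, "|", fresh)
```

The same reasoning applies to any cache keyed on a mutable tag: the key has to change when the content does, or the cache needs an explicit invalidation step.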

I dropped --moe-backend triton_unfused based on a stale error message ("invalid choice ... choose from aiter, auto, ...") from the previous run, but that error came from the cached squashfs of an April 25 build that pre-dated #40871. The pinned nightly-dcacdf9a8860a8640 DOES have triton_unfused in MoEBackend, verified by reading vllm/config/kernel.py at that exact commit on GitHub.

Without --moe-backend triton_unfused, vLLM's auto selector picks a backend that doesn't register w13_weight_scale / w2_weight_scale on the FP4 expert layers, so safetensors loading throws:

KeyError: 'layers.0.ffn.experts.w13_weight_scale' at vllm/model_executor/models/deepseek_v4.py:1492

This matches the recipe (vllm-project/recipes#433) line-for-line now, with the only intentional deviations being InferenceX conventions:

* --max-model-len $MAX_MODEL_LEN (sized to ISL+OSL+256)
* --no-enable-prefix-caching (fair benchmark comparisons)
* VLLM_ENGINE_READY_TIMEOUT_S=3600 (cold HF-cache tolerance)

None of those interact with weight loading; they were not implicated in either failure.
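
For concreteness, the `--max-model-len` convention mentioned above (ISL + OSL + 256) works out as follows, assuming 1k = 1024 and 8k = 8192 tokens; `max_model_len` is an illustrative helper name, not something defined in the benchmark scripts.

```python
def max_model_len(isl: int, osl: int, headroom: int = 256) -> int:
    """Context length sized to input + output tokens plus fixed headroom."""
    return isl + osl + headroom


print(max_model_len(1024, 1024))  # 1k1k -> 2304
print(max_model_len(8192, 1024))  # 8k1k -> 9472
```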

The previous run errored with:

ValueError: moe_backend='triton_unfused' is not supported for FP8 MoE. Expected one of ['triton','deep_gemm','cutlass','flashinfer_trtllm', 'flashinfer_cutlass','marlin','aiter']

even though the DeepSeek-V4-Pro config explicitly declares `expert_dtype: "fp4"`.

The cause is vLLM's auto-detection of the DSv4-aware quant config: DeepseekV4FP8Config.override_quantization_method returns "deepseek_v4_fp8" only when:

hf_quant_cfg.quant_method in ("fp8","deepseek_v4_fp8") AND (hf_config.model_type == "deepseek_v4" OR user_quant == "deepseek_v4_fp8")

The HF config has model_type=deepseek_v4, but the sister SGLang script (dsv4_fp8_mi355x.sh) documents that the bundled transformers in these container images does NOT recognize that model_type and the cached config has to be patched. When the auto-detection silently fails, vLLM falls back to plain Fp8Config, which:

* Treats the FusedMoE layer as FP8 block-quantized (registers weight_scale_inv params instead of FP4 w13_weight_scale / w2_weight_scale → KeyError on load_weights, the prior failure)
* Routes through select_fp8_moe_backend, which doesn't accept triton_unfused as a valid choice (the current failure)

Pass --quantization deepseek_v4_fp8 to take the user_quant branch explicitly and bypass the model_type check entirely. This is the only remaining recipe-vs-runtime deviation needed to make recipes#433 work on this container; document the why in the script header.
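
The two branches of that condition can be illustrated with a standalone predicate. This is a paraphrase of the condition quoted above, not vLLM's actual code: the flat argument list is a simplification, and modelling an unrecognized model_type as `None` is an assumption.

```python
from typing import Optional


def override_quantization_method(
    hf_quant_method: Optional[str],
    model_type: Optional[str],
    user_quant: Optional[str],
) -> Optional[str]:
    """Return the DSv4-aware method name, or None to signal the plain-FP8 fallback."""
    if hf_quant_method in ("fp8", "deepseek_v4_fp8") and (
        model_type == "deepseek_v4" or user_quant == "deepseek_v4_fp8"
    ):
        return "deepseek_v4_fp8"
    return None


# model_type not recognized (modelled as None) and no explicit flag: detection fails.
print(override_quantization_method("fp8", None, None))
# Explicit --quantization deepseek_v4_fp8 takes the user_quant branch.
print(override_quantization_method("fp8", None, "deepseek_v4_fp8"))
```

Passing the flag explicitly makes the outcome independent of whether the container's transformers build recognizes the model type.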

Oseltamivir · 2026-05-14T03:11:12Z

Blocked by vllm-project/vllm#41946, waiting for image

…recipe-433 # Conflicts: # perf-changelog.yaml

github-actions · 2026-05-18T17:32:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26011426044
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26011426044

github-actions · 2026-05-19T16:19:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26109152317
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26109152317

github-actions · 2026-05-19T16:29:09Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26110188026
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26110188026

github-actions · 2026-05-19T19:15:25Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26110663181
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26110663181

Oseltamivir · 2026-05-19T20:10:41Z

/reuse-sweep-run

github-actions · 2026-05-19T20:11:32Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26122410203
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26122410203

…xups

Resolutions:

- perf-changelog.yaml: took main verbatim.
- runners/launch_b300-nv.sh: took main (drops --nodelist pin entirely; supersedes our narrower 017-019 fix).
- benchmarks/single_node/fixed_seq_len/dsv4_fp8_mi355x{,_vllm}.sh: accepted main's deletes (orphan recipes removed in #1374, #1501).
- .github/configs/amd-master.yaml: took main as the base, then re-applied our agentic-only additions on top:
  * qwen3.5-fp8-mi355x-sglang-agentic-hicache (new entry)
  * dsv4-fp4-mi355x-vllm-agentic (new entry)
  * dsv4-fp4-mi355x-sglang-agentic (new entry)
  * kimik2.5-fp4-mi355x-vllm-agentic (cpu -> lmcache)

  Dropped our comment-path edit for dsv4_fp8_mi355x_vllm.sh since main deleted that entry.

Fixed_seq_len reorg fixups for files added on main during our branch's lifetime:

- git mv 14 stranded scripts from benchmarks/single_node/*.sh into benchmarks/single_node/fixed_seq_len/ (dsr1_fp4_b200_mtp, dsr1_fp4_mi355x_mtp, dsr1_fp8_h200_mtp, dsr1_fp8_mi325x_mtp, dsr1_fp8_mi355x_mtp, dsv4_fp4_mi355x_vllm, glm5_fp8_h200_mtp, glm5_fp8_mi325x, glm5_fp8_mi325x_mtp, qwen3.5_bf16_mi325x_mtp, qwen3.5_fp4_mi355x_mtp, qwen3.5_fp8_h100, qwen3.5_fp8_h100_mtp, qwen3.5_fp8_mi325x_mtp). Patched their source paths from ../benchmark_lib.sh to ../../benchmark_lib.sh.
- runners/launch_mi355x-amds.sh: multinode-non-disagg BENCHMARK_SUBDIR bumped from `single_node` to `single_node/fixed_seq_len`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Oseltamivir requested a review from a team May 13, 2026 23:53

Oseltamivir requested a review from billishyahao as a code owner May 13, 2026 23:53

github-project-automation Bot added this to InferenceMAX Board May 13, 2026

Oseltamivir requested review from 1am9trash, chunfangamd, seungrokj and yctseng0211 as code owners May 13, 2026 23:53

Oseltamivir added the sweep-enabled label May 13, 2026

Backfill PR #1374 link in perf-changelog

5480cf5

Oseltamivir added the sweep-enabled label May 13, 2026

Oseltamivir changed the title ~~Rename dsv4-fp8-mi355x-vllm to dsv4-fp4-mi355x-vllm and adopt recipes#433~~ dsv4-fp4-mi355x-vllm and adopt recipes#433 May 13, 2026

Update amd-master.yaml

1509732

Oseltamivir added 5 commits May 13, 2026 17:46

Merge branch 'main' into dsv4-fp4-mi355x-vllm-recipe-433

abb8acf

Oseltamivir closed this May 14, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board May 14, 2026

Oseltamivir reopened this May 14, 2026

Oseltamivir changed the title ~~dsv4-fp4-mi355x-vllm and adopt recipes#433~~ [waiting for image update] dsv4-fp4-mi355x-vllm and adopt recipes#433 May 14, 2026

Oseltamivir added 2 commits May 17, 2026 20:07

Merge remote-tracking branch 'origin/main' into dsv4-fp4-mi355x-vllm-…

65c3c0f

…recipe-433 # Conflicts: # perf-changelog.yaml

Update image

3c761bf

Oseltamivir changed the title ~~[waiting for image update] dsv4-fp4-mi355x-vllm and adopt recipes#433~~ dsv4-fp4-mi355x-vllm and adopt recipes#433 May 18, 2026

Oseltamivir added 2 commits May 18, 2026 13:35

Update DSv4 MI355X vLLM ROCm nightly image

7f4874e

Merge branch 'main' into dsv4-fp4-mi355x-vllm-recipe-433

963c5f9

Oseltamivir removed the sweep-enabled label May 19, 2026

Oseltamivir added 2 commits May 19, 2026 08:54

Merge branch 'main' into dsv4-fp4-mi355x-vllm-recipe-433

c5e282d

Add MI355X DSv4 vLLM DEP validation probes

ec268e1

Oseltamivir added the full-sweep-enabled label May 19, 2026

SemiAnalysisAI deleted a comment from github-actions Bot May 19, 2026

disable DPA

5ed629e

disable EP

ac04da0

final

7c2f1f4

Oseltamivir merged commit 3091785 into main May 19, 2026
4 of 5 checks passed

Oseltamivir deleted the dsv4-fp4-mi355x-vllm-recipe-433 branch May 19, 2026 20:10
