
dsv4-fp4-mi355x-vllm and adopt recipes#433 #1374

Merged
Oseltamivir merged 18 commits into main from dsv4-fp4-mi355x-vllm-recipe-433
May 19, 2026

@Oseltamivir
Collaborator

Summary

The deepseek-ai/DeepSeek-V4-Pro checkpoint used by the dsv4-fp8-mi355x-vllm benchmark actually holds FP4+FP8 mixed-precision weights: FP4 MoE expert weights dominate the ~960 GB footprint, FP8 is used only for the attention/norm/router weights, and the KV cache is FP8 at runtime. The matching sister benchmarks already use precision: fp4 on this same checkpoint:

| key | model | precision (after this PR) |
| --- | --- | --- |
| dsv4-fp4-mi355x-sglang | deepseek-ai/DeepSeek-V4-Pro | fp4 |
| dsv4-fp4-mi355x-atom | deepseek-ai/DeepSeek-V4-Pro | fp4 |
| dsv4-fp4-mi355x-vllm (was dsv4-fp8-mi355x-vllm) | deepseek-ai/DeepSeek-V4-Pro | fp4 |
| dsv4-fp8-mi355x-sglang | sgl-project/DeepSeek-V4-Pro-FP8 (re-quantized full FP8) | fp8 (unchanged) |

This PR also applies the validated MI355X serving recipe from vllm-project/recipes#433 (DeepSeek-V4-Pro, TP=8), which sources its config from vllm-project/vllm#40871. The base vLLM build (PR #40889 with AITER-accelerated sparse MLA decode, pinned in the script) is unchanged; only serving flags / env vars / sweep range are updated.

File / config changes

  • git mv benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh -> dsv4_fp4_mi355x_vllm.sh
  • amd-master.yaml: remove dsv4-fp8-mi355x-vllm block, add dsv4-fp4-mi355x-vllm block next to its dsv4-fp4-* MI355X siblings with precision: fp4
  • perf-changelog.yaml: append entry documenting the rename + recipe adoption

Server-side changes

| Flag / env var | Before | After (recipe #433) |
| --- | --- | --- |
| VLLM_ROCM_USE_AITER_LINEAR | unset | 1 |
| --distributed-executor-backend | (default) | mp |
| --max-num-batched-tokens | (default) | 8192 |
| --async-scheduling | off | on |
| --gpu-memory-utilization | 0.90 | 0.6 |
| --max-num-seqs | 32 | 128 |
| --tool-call-parser deepseek_v4 | on | dropped |
| --enable-auto-tool-choice | on | dropped |

Tool-call flags were removed because the recipe omits them and throughput benchmarks here don't exercise tool calling. All other existing flags (--kv-cache-dtype fp8, --moe-backend triton_unfused, --enforce-eager, --no-enable-prefix-caching, --tokenizer-mode deepseek_v4, --reasoning-parser deepseek_v4) are preserved.
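
As a rough illustration, the flags and env var listed above could be assembled into a `vllm serve` invocation as follows. This is a sketch and not the benchmark script itself: the helper names `build_serve_command` and `build_env` are hypothetical, and `--tensor-parallel-size 8` is an assumption about how the recipe's TP=8 is expressed on the command line.

```python
import os
import shlex
from typing import Optional

MODEL = "deepseek-ai/DeepSeek-V4-Pro"


def build_serve_command() -> list:
    """Return an argv list for `vllm serve` using the flags named in this PR."""
    return [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "8",  # assumption: how TP=8 is passed
        # Added or retuned when adopting recipe #433
        "--distributed-executor-backend", "mp",
        "--max-num-batched-tokens", "8192",
        "--async-scheduling",
        "--gpu-memory-utilization", "0.6",
        "--max-num-seqs", "128",
        # Preserved from the previous script
        "--kv-cache-dtype", "fp8",
        "--moe-backend", "triton_unfused",
        "--enforce-eager",
        "--no-enable-prefix-caching",
        "--tokenizer-mode", "deepseek_v4",
        "--reasoning-parser", "deepseek_v4",
    ]


def build_env(base: Optional[dict] = None) -> dict:
    """Copy the environment and set the env var introduced by the recipe."""
    env = dict(os.environ if base is None else base)
    env["VLLM_ROCM_USE_AITER_LINEAR"] = "1"
    return env


if __name__ == "__main__":
    # Print a shell-quoted form for inspection; nothing is launched here.
    print(shlex.join(build_serve_command()))
```

Keeping the command as a list rather than a single string makes it easy to check that the tool-call flags are absent before anything is launched.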

Sweep changes

dsv4-fp4-mi355x-vllm (formerly dsv4-fp8-mi355x-vllm) was previously pinned to conc=1 only. With --max-num-seqs=128 validated by the recipe, the sweep is expanded to conc 4-64 for both 1k1k and 8k1k, matching dsv4-fp4-mi355x-sglang so vLLM↔SGLang results are directly comparable on the same MI355X runner.

Validated locally:

$ python utils/matrix_logic/generate_sweep_configs.py full-sweep \
    --config-files .github/configs/amd-master.yaml \
    --framework vllm --runner-type mi355x
# 10 dsv4-fp4-mi355x-vllm configs generated (5 per ISL/OSL, tp=8, conc 4..64)
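
The count of 10 can be reproduced with a small sketch. It assumes the concurrency points double from 4 to 64 (which yields the five values per ISL/OSL noted above) and that 1k and 8k mean 1024 and 8192 tokens; neither detail is stated in this PR, and the dictionary layout is hypothetical rather than the format `generate_sweep_configs.py` emits.

```python
from itertools import product

# Assumed token counts for the two sequence-length scenarios.
SEQ_LENS = {"1k1k": (1024, 1024), "8k1k": (8192, 1024)}


def concurrency_points(lo: int, hi: int) -> list:
    """Doubling steps from lo to hi inclusive (assumed step rule)."""
    points = []
    conc = lo
    while conc <= hi:
        points.append(conc)
        conc *= 2
    return points


def sweep_configs() -> list:
    """One config per (sequence-length scenario, concurrency) pair at tp=8."""
    return [
        {
            "key": "dsv4-fp4-mi355x-vllm",
            "scenario": name,
            "isl": isl,
            "osl": osl,
            "tp": 8,
            "conc": conc,
        }
        for (name, (isl, osl)), conc in product(
            SEQ_LENS.items(), concurrency_points(4, 64)
        )
    ]


if __name__ == "__main__":
    configs = sweep_configs()
    print(len(configs), "configs")
```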

Notes

Test plan

  • Apply sweep-enabled label so run-sweep.yml exercises the renamed key + new flags end-to-end at trimmed concurrency.
  • Confirm the AITER MLA decode build (vLLM PR #40889 SHA b3a4a44) still installs cleanly with the new server flags.
  • Inspect agg_bmk.json for dsv4 mi355x-vllm entries; compare throughput vs. dsv4-fp4-mi355x-sglang and dsv4-fp4-mi355x-atom at matching concurrencies.

…#433

The deepseek-ai/DeepSeek-V4-Pro checkpoint is FP4+FP8 mixed (FP4 MoE
expert weights dominate the ~960 GB footprint, with FP8 only on
attention/norm/router and FP8 KV cache). Reclassify the vLLM MI355X
benchmark as fp4 — matching dsv4-fp4-mi355x-sglang and
dsv4-fp4-mi355x-atom, which use the same checkpoint.

Also apply the validated MI355X serving recipe from
vllm-project/recipes#433 (DeepSeek-V4-Pro, TP=8):

* Rename benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh ->
  dsv4_fp4_mi355x_vllm.sh; remove dsv4-fp8-mi355x-vllm from
  amd-master.yaml; add dsv4-fp4-mi355x-vllm next to its fp4 siblings
* Add VLLM_ROCM_USE_AITER_LINEAR=1 env var
* Add --distributed-executor-backend mp, --max-num-batched-tokens 8192,
  --async-scheduling server flags
* Tune --gpu-memory-utilization 0.90 -> 0.6 and --max-num-seqs 32 -> 128
* Drop --tool-call-parser / --enable-auto-tool-choice (not in recipe,
  not exercised by throughput benchmarks)
* Expand sweep from conc=1 to conc 4-64 to match dsv4-fp4-mi355x-sglang
  for vLLM<->SGLang comparability now that max-num-seqs=128 allows it
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir changed the title from "Rename dsv4-fp8-mi355x-vllm to dsv4-fp4-mi355x-vllm and adopt recipes#433" to "dsv4-fp4-mi355x-vllm and adopt recipes#433" May 13, 2026

With the image switched to vllm/vllm-openai-rocm:nightly (which already
includes vllm-project/vllm#40871 DSv4 base ROCm support), the rebuild
overlay and the workarounds that propped up the rocm/atom base image
are no longer needed:

* Remove the vLLM PR #40889 clone + editable rebuild block
* Remove sanitize_stale_triton_test_metadata() (/triton-test was an
  atom-image metadata quirk; the new /install/torch...whl bug exposed
  in run 25833728949 stems entirely from this rebuild path, so dropping
  the rebuild removes both)
* Remove ensure_amdsmi_python() — nightly ships the amdsmi Python wheel
* Remove install_tilelang_runtime_deps() — only the rebuilt vLLM needed it
* Remove patch_vllm_rocm_platform_detection() — nightly detects ROCm
  correctly without the amdsmi/torch fallback patches
* Remove triton_kernels install — only needed by PR #40889's MoE path
* Drop VLLM_TARGET_DEVICE / VLLM_PLUGINS env vars (atom-specific)

Keep env vars (VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_LINEAR,
VLLM_ENGINE_READY_TIMEOUT_S for the slow cold-cache load), the recipe
vllm serve invocation, and the benchmark/eval driver calls.

Also refresh the amd-master.yaml comment block above the entry to drop
the rebuild references.

Script: 539 -> 94 lines.

The recipe (vllm-project/recipes#433) specifies --moe-backend
triton_unfused, but that choice was never accepted into vLLM main —
likely it lived on the #40871 PR branch and was renamed/removed before
merge. In vllm/vllm-openai-rocm:nightly (which the recipe itself uses),
the legal choices are: aiter, auto, cutlass, deep_gemm, emulation,
flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, marlin,
triton.

Drop the flag entirely and let vLLM's `auto` selector pick the backend.
With VLLM_ROCM_USE_AITER=1 set, that resolves to the AITER MoE path on
ROCm — the same kernel family the recipe was steering toward.

All other remaining flags and env vars verified valid in vLLM 0.20.2.

The previous run errored with:

  Model architectures ['DeepseekV4ForCausalLM'] are not supported for now.
  Supported architectures: dict_keys([..., 'DeepseekV32ForCausalLM', ...])

even though vllm-project/vllm#40871 (which registers DeepseekV4ForCausalLM)
merged on 2026-05-05 and vllm/vllm-openai-rocm:nightly has been bumped
multiple times since.

Root cause: runners/launch_mi355x-amds.sh caches enroot squashfs files
keyed on the image string and short-circuits re-import if the squash
already exists. The runner's cached squash for ':nightly' predates the
#40871 merge (the container reported vllm 0.19.2rc1.dev212 ~ Apr 25),
so docker hub updates never reached the runner.

Switch to an immutable digest-suffixed tag — the squash cache key now
changes whenever we bump, forcing a fresh import. Picking
nightly-dcacdf9a8860a86401127d1c8f93ebf3cfbfd026 (2026-05-13, most
recent at time of pin), which is well past the #40871 merge.
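
The caching behaviour described above can be sketched as follows. This is an illustration of the idea only, not the logic in runners/launch_mi355x-amds.sh: the file-naming scheme and the helper names `squash_path`, `needs_import`, and `demo` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

MUTABLE = "vllm/vllm-openai-rocm:nightly"
PINNED = "vllm/vllm-openai-rocm:nightly-dcacdf9a8860a86401127d1c8f93ebf3cfbfd026"


def squash_path(cache_dir: Path, image: str) -> Path:
    """Map an image string to a cache file name (hypothetical naming scheme)."""
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", image)
    return cache_dir / (safe + ".sqsh")


def needs_import(cache_dir: Path, image: str) -> bool:
    """True when no squash file exists for this exact image string."""
    return not squash_path(cache_dir, image).exists()


def demo() -> tuple:
    """Return (mutable tag re-imported?, pinned tag imported?)."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        # Simulate a squash left behind by an older import of ':nightly'.
        squash_path(cache, MUTABLE).touch()
        return needs_import(cache, MUTABLE), needs_import(cache, PINNED)


if __name__ == "__main__":
    print(demo())
```

Because the key is the image string and not the image content, a mutable tag keeps hitting the old file, while a new digest-suffixed tag always misses the cache and triggers an import.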

Also update the script header and yaml comment block to document the
caching pitfall so the next bumper doesn't revert to ':nightly'.

I dropped --moe-backend triton_unfused based on a stale error message
("invalid choice ... choose from aiter, auto, ...") from the previous
run, but that error came from the cached squashfs of an April 25 build
that pre-dated #40871. The pinned nightly-dcacdf9a8860a8640 DOES have
triton_unfused in MoEBackend — verified by reading vllm/config/kernel.py
at that exact commit on GitHub.

Without --moe-backend triton_unfused, vLLM's auto selector picks a
backend that doesn't register w13_weight_scale / w2_weight_scale on the
FP4 expert layers, so safetensors loading throws:

  KeyError: 'layers.0.ffn.experts.w13_weight_scale'
  at vllm/model_executor/models/deepseek_v4.py:1492

This matches the recipe (vllm-project/recipes#433) line-for-line now,
with the only intentional deviations being InferenceX conventions:
* --max-model-len $MAX_MODEL_LEN (sized to ISL+OSL+256)
* --no-enable-prefix-caching (fair benchmark comparisons)
* VLLM_ENGINE_READY_TIMEOUT_S=3600 (cold HF-cache tolerance)
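
For reference, the `--max-model-len` sizing rule in the list above works out as in this small sketch. It assumes that 1k and 8k correspond to 1024 and 8192 tokens, which this PR does not state; the function name is hypothetical.

```python
HEADROOM = 256  # extra tokens on top of ISL + OSL, per the convention above


def max_model_len(isl: int, osl: int, headroom: int = HEADROOM) -> int:
    """Context length sized to input length + output length + headroom."""
    return isl + osl + headroom


if __name__ == "__main__":
    # Assumed token counts for the 1k1k and 8k1k scenarios.
    print(max_model_len(1024, 1024))  # 2304
    print(max_model_len(8192, 1024))  # 9472
```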

None of those interact with weight loading; they were not implicated
in either failure.

The previous run errored with:

  ValueError: moe_backend='triton_unfused' is not supported for FP8 MoE.
  Expected one of ['triton','deep_gemm','cutlass','flashinfer_trtllm',
                   'flashinfer_cutlass','marlin','aiter']

even though the DeepSeek-V4-Pro config explicitly declares
`expert_dtype: "fp4"`. The cause is vLLM's auto-detection of the
DSv4-aware quant config:

  DeepseekV4FP8Config.override_quantization_method returns
  "deepseek_v4_fp8" only when:
    hf_quant_cfg.quant_method in ("fp8","deepseek_v4_fp8")  AND
    (hf_config.model_type == "deepseek_v4" OR
     user_quant == "deepseek_v4_fp8")

The HF config has model_type=deepseek_v4, but the sister SGLang
script (dsv4_fp8_mi355x.sh) documents that the bundled transformers
in these container images does NOT recognize that model_type and the
cached config has to be patched. When the auto-detection silently
fails, vLLM falls back to plain Fp8Config, which:

  * Treats the FusedMoE layer as FP8 block-quantized (registers
    weight_scale_inv params instead of FP4 w13_weight_scale /
    w2_weight_scale → KeyError on load_weights — the prior failure)
  * Routes through select_fp8_moe_backend, which doesn't accept
    triton_unfused as a valid choice (the current failure)

Pass --quantization deepseek_v4_fp8 to take the user_quant branch
explicitly and bypass the model_type check entirely. This is the only
remaining recipe-vs-runtime deviation needed to make recipes#433 work
on this container; document the why in the script header.
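
The selection logic described in this commit message can be modelled in a few lines. This is a standalone sketch of the stated condition, not vLLM's source: the function names are hypothetical, an unrecognized model_type is represented as `None` by assumption, and the fallback to the plain `fp8` method mirrors the behaviour described above.

```python
from typing import Optional

# Backends the FP8 MoE path accepts, as listed in the error message above.
FP8_MOE_BACKENDS = frozenset({
    "triton", "deep_gemm", "cutlass", "flashinfer_trtllm",
    "flashinfer_cutlass", "marlin", "aiter",
})


def override_quantization_method(
    hf_quant_method: str,
    model_type: Optional[str],
    user_quant: Optional[str],
) -> Optional[str]:
    """Model of the condition quoted above; returns None when it does not hold."""
    if hf_quant_method not in ("fp8", "deepseek_v4_fp8"):
        return None
    if model_type == "deepseek_v4" or user_quant == "deepseek_v4_fp8":
        return "deepseek_v4_fp8"
    return None


def resolve_quant_method(
    hf_quant_method: str,
    model_type: Optional[str],
    user_quant: Optional[str] = None,
) -> str:
    """Fall back to the checkpoint's own method when no override applies."""
    override = override_quantization_method(hf_quant_method, model_type, user_quant)
    return override if override is not None else hf_quant_method


if __name__ == "__main__":
    # model_type not recognized by the bundled transformers (modelled as None)
    print(resolve_quant_method("fp8", None))
    # Same situation, but with --quantization deepseek_v4_fp8 passed explicitly
    print(resolve_quant_method("fp8", None, "deepseek_v4_fp8"))
```

In the first case the result is the plain `fp8` method, whose MoE backend list does not include `triton_unfused`; in the second the explicit user setting selects `deepseek_v4_fp8` regardless of model_type.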
@Oseltamivir
Collaborator Author

Blocked by vllm-project/vllm#41946, waiting for image

@Oseltamivir Oseltamivir reopened this May 14, 2026
@Oseltamivir Oseltamivir changed the title from "dsv4-fp4-mi355x-vllm and adopt recipes#433" to "[waiting for image update] dsv4-fp4-mi355x-vllm and adopt recipes#433" May 14, 2026
@Oseltamivir Oseltamivir changed the title from "[waiting for image update] dsv4-fp4-mi355x-vllm and adopt recipes#433" to "dsv4-fp4-mi355x-vllm and adopt recipes#433" May 18, 2026
@SemiAnalysisAI SemiAnalysisAI deleted 5 comments from github-actions Bot May 19, 2026

@Oseltamivir
Collaborator Author

/reuse-sweep-run

@Oseltamivir Oseltamivir merged commit 3091785 into main May 19, 2026
4 of 5 checks passed
@Oseltamivir Oseltamivir deleted the dsv4-fp4-mi355x-vllm-recipe-433 branch May 19, 2026 20:10

cquil11 added a commit that referenced this pull request May 27, 2026
…xups

Resolutions:
- perf-changelog.yaml: took main verbatim.
- runners/launch_b300-nv.sh: took main (drops --nodelist pin entirely;
  supersedes our narrower 017-019 fix).
- benchmarks/single_node/fixed_seq_len/dsv4_fp8_mi355x{,_vllm}.sh:
  accepted main's deletes (orphan recipes removed in #1374, #1501).
- .github/configs/amd-master.yaml: took main as the base, then re-applied
  our agentic-only additions on top:
    * qwen3.5-fp8-mi355x-sglang-agentic-hicache  (new entry)
    * dsv4-fp4-mi355x-vllm-agentic               (new entry)
    * dsv4-fp4-mi355x-sglang-agentic             (new entry)
    * kimik2.5-fp4-mi355x-vllm-agentic           (cpu -> lmcache)
  Dropped our comment-path edit for dsv4_fp8_mi355x_vllm.sh since main
  deleted that entry.

Fixed_seq_len reorg fixups for files added on main during our branch's
lifetime:
- git mv 14 stranded scripts from benchmarks/single_node/*.sh into
  benchmarks/single_node/fixed_seq_len/ (dsr1_fp4_b200_mtp,
  dsr1_fp4_mi355x_mtp, dsr1_fp8_h200_mtp, dsr1_fp8_mi325x_mtp,
  dsr1_fp8_mi355x_mtp, dsv4_fp4_mi355x_vllm, glm5_fp8_h200_mtp,
  glm5_fp8_mi325x, glm5_fp8_mi325x_mtp, qwen3.5_bf16_mi325x_mtp,
  qwen3.5_fp4_mi355x_mtp, qwen3.5_fp8_h100, qwen3.5_fp8_h100_mtp,
  qwen3.5_fp8_mi325x_mtp). Patched their source paths from
  ../benchmark_lib.sh to ../../benchmark_lib.sh.
- runners/launch_mi355x-amds.sh: multinode-non-disagg BENCHMARK_SUBDIR
  bumped from `single_node` to `single_node/fixed_seq_len`.
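
The source-path patch for the moved scripts could be done with a rewrite along these lines. This is a sketch under assumptions: the exact form of the `source` line in those scripts is not shown in this commit message, and the helper name `bump_lib_path` is hypothetical.

```python
import re

# Match '../benchmark_lib.sh' only when it is not already preceded by '../',
# so running the rewrite twice leaves the text unchanged.
_LIB_RE = re.compile(r"(?<!\.\./)\.\./benchmark_lib\.sh")


def bump_lib_path(script_text: str) -> str:
    """Rewrite ../benchmark_lib.sh to ../../benchmark_lib.sh."""
    return _LIB_RE.sub("../../benchmark_lib.sh", script_text)


if __name__ == "__main__":
    # Hypothetical source line; the real scripts may phrase it differently.
    line = 'source "$(dirname "$0")/../benchmark_lib.sh"'
    print(bump_lib_path(line))
```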

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>