
[Do not merge] Add the Deepseek-V4-Pro supported on MI355x #433

Open
wuhuikx wants to merge 10 commits into vllm-project:main from wuhuikx:hattiw/deepseek-v4

Conversation

@wuhuikx wuhuikx commented May 1, 2026

No description provided.

vercel Bot commented May 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vllm-recipes | Ready | Preview, Comment | May 7, 2026 2:51am |



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive usage guide for running DeepSeek-V4 on AMD ROCm hardware (specifically MI355X) and updates the DeepSeek-V4-Pro model configuration to include verified AMD support and optimized hardware overrides. The review feedback focuses on improving the portability and consistency of the documentation, including updating installation commands to modern standards, ensuring path consistency for profilers and caches, and correcting model name mismatches in example code.

Comment thread DeepSeek/DeepSeek-V4-AMD.md Outdated
cd vllm
git fetch origin pull/40871/head:pr_dsv4
git checkout pr_dsv4
python3 setup.py develop

medium

`python3 setup.py develop` is deprecated in favor of `pip install -e .`. Use the modern command to install the package in editable mode.

Suggested change
- python3 setup.py develop
+ pip install -e .

Comment thread DeepSeek/DeepSeek-V4-AMD.md Outdated
--max-num-batched-tokens ${max_num_batched_tokens} \
--distributed-executor-backend mp \
--trust-remote-code \
--profiler-config '{"profiler":"torch","torch_profiler_dir":"./vllm_profile"}' \

medium

The environment variable `VLLM_TORCH_PROFILER_DIR` (set to `/app/vllm_profile` on line 45) does not match the path passed in `--profiler-config` (`./vllm_profile`). Use one consistent path to avoid confusion and ensure profiles are stored in the expected location.

Suggested change
- --profiler-config '{"profiler":"torch","torch_profiler_dir":"./vllm_profile"}' \
+ --profiler-config '{"profiler":"torch","torch_profiler_dir":"/app/vllm_profile"}' \
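
One way to keep the two paths from drifting apart is to derive the `--profiler-config` value from the environment variable instead of typing the path twice. The following is a minimal sketch, not part of the recipe: the helper name `build_profiler_config` is hypothetical, and the fallback path is simply the value cited in the comment above.

```python
import json
import os


def build_profiler_config(env=None):
    """Build the JSON string for --profiler-config from the environment.

    Assumption: VLLM_TORCH_PROFILER_DIR holds the intended profile directory;
    /app/vllm_profile (the value cited in the review) is used as a fallback.
    """
    env = os.environ if env is None else env
    profile_dir = env.get("VLLM_TORCH_PROFILER_DIR", "/app/vllm_profile")
    # Compact separators reproduce the style of the flag in the recipe.
    return json.dumps(
        {"profiler": "torch", "torch_profiler_dir": profile_dir},
        separators=(",", ":"),
    )


if __name__ == "__main__":
    # Paste the printed value (single-quoted) after --profiler-config.
    print(build_profiler_config())
```

With this approach the directory is defined in exactly one place, so the flag and the environment variable cannot disagree.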

Comment thread DeepSeek/DeepSeek-V4-AMD.md Outdated
```bash
MODEL=/home/models/DeepSeek-V4-Flash
lm_eval --model local-completions \
--model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
```

medium

`0.0.0.0` is a bind address meaning "all interfaces"; using it as a connection address is non-standard and may not work as expected on all operating systems. It is safer to use `127.0.0.1` or `localhost` to connect to a service running on the same machine.

Suggested change
- --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
+ --model_args model=$MODEL,base_url=http://127.0.0.1:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
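
If client URLs are generated by a script from the server's bind address, the same substitution can be applied programmatically. This is a small sketch using only the Python standard library; the function name `to_loopback` is hypothetical and not part of the recipe or of lm_eval.

```python
from urllib.parse import urlsplit, urlunsplit


def to_loopback(url: str) -> str:
    """Rewrite a URL whose host is 0.0.0.0 so that it targets 127.0.0.1.

    Any other host is returned unchanged. Userinfo (user:pass@) is not
    preserved, which is acceptable for the local endpoints used here.
    """
    parts = urlsplit(url)
    if parts.hostname != "0.0.0.0":
        return url
    netloc = "127.0.0.1" if parts.port is None else f"127.0.0.1:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_loopback("http://0.0.0.0:8001/v1/completions"))
```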

Comment thread DeepSeek/DeepSeek-V4-AMD.md Outdated
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
rm -rf /root/.cache/vllm/torch_compile_cache

medium

Hardcoding the `/root/` directory is not portable and will fail if the user is not running as the root user. Using `~/.cache` or `$HOME/.cache` is a more robust approach.

Suggested change
- rm -rf /root/.cache/vllm/torch_compile_cache
+ rm -rf ~/.cache/vllm/torch_compile_cache
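
For launch scripts written in Python, the same cleanup can be done without assuming a particular user. The sketch below resolves the home directory at run time; the helper name `clear_compile_cache` is hypothetical, and the cache location is the one used in the recipe.

```python
import shutil
from pathlib import Path


def clear_compile_cache(home=None):
    """Remove <home>/.cache/vllm/torch_compile_cache if it exists.

    `home` defaults to the current user's home directory, so the call
    works for root and non-root users alike. Returns True if the cache
    directory was present before removal.
    """
    home = Path.home() if home is None else Path(home)
    cache_dir = home / ".cache" / "vllm" / "torch_compile_cache"
    existed = cache_dir.exists()
    # ignore_errors=True mirrors the forgiving behaviour of `rm -rf`.
    shutil.rmtree(cache_dir, ignore_errors=True)
    return existed
```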

Comment thread DeepSeek/DeepSeek-V4-AMD.md Outdated
```bash
MODEL=/home/models/DeepSeek-V4-Pro
lm_eval --model local-completions \
--model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
```

medium

Using `0.0.0.0` as a connection address is non-standard. Use `127.0.0.1` or `localhost` for local connections.

Suggested change
- --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
+ --model_args model=$MODEL,base_url=http://127.0.0.1:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \

Comment thread DeepSeek/DeepSeek-V4-AMD.md Outdated
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Pro"

medium

The model name used in the OpenAI client example (`deepseek-ai/DeepSeek-V4-Pro`) does not match the model path used to launch the server (`/home/models/DeepSeek-V4-Pro` on line 139). vLLM requires the model name in the request to match the name or path provided at startup unless `--served-model-name` is used.

Suggested change
- model = "deepseek-ai/DeepSeek-V4-Pro"
+ model = "/home/models/DeepSeek-V4-Pro"
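
An alternative to hardcoding either name is to ask the server which model IDs it serves and pick from that list. The sketch below assumes the server exposes the OpenAI-compatible `GET /v1/models` endpoint at the base URL used in the example; the helper names `list_served_models` and `pick_model`, and the single-model fallback policy, are illustrative choices rather than anything the recipe prescribes.

```python
import json
from urllib.request import urlopen


def list_served_models(base_url="http://localhost:8001/v1"):
    """Return the model IDs reported by the OpenAI-compatible /models endpoint."""
    with urlopen(f"{base_url}/models", timeout=10) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload["data"]]


def pick_model(served, requested):
    """Choose the model name to send in requests.

    Uses `requested` if the server reports it; otherwise falls back to the
    only served model, and raises if the choice would be ambiguous.
    """
    if requested in served:
        return requested
    if len(served) == 1:
        return served[0]
    raise ValueError(f"{requested!r} is not served; available: {served}")


if __name__ == "__main__":
    # Requires a running server at the default base URL.
    print(pick_model(list_served_models(), "deepseek-ai/DeepSeek-V4-Pro"))
```

The returned value can then be passed as `model` in the client calls, which avoids the mismatch regardless of how the server was launched.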

wuhuikx added 9 commits May 6, 2026 10:11
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 14, 2026
The recipe (vllm-project/recipes#433) specifies --moe-backend
triton_unfused, but that choice was never accepted into vLLM main —
likely it lived on the #40871 PR branch and was renamed/removed before
merge. In vllm/vllm-openai-rocm:nightly (which the recipe itself uses),
the legal choices are: aiter, auto, cutlass, deep_gemm, emulation,
flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, marlin,
triton.

Drop the flag entirely and let vLLM's `auto` selector pick the backend.
With VLLM_ROCM_USE_AITER=1 set, that resolves to the AITER MoE path on
ROCm — the same kernel family the recipe was steering toward.

All other remaining flags and env vars verified valid in vLLM 0.20.2.
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 14, 2026
I dropped --moe-backend triton_unfused based on a stale error message
("invalid choice ... choose from aiter, auto, ...") from the previous
run, but that error came from the cached squashfs of an April 25 build
that pre-dated #40871. The pinned nightly-dcacdf9a8860a8640 DOES have
triton_unfused in MoEBackend — verified by reading vllm/config/kernel.py
at that exact commit on GitHub.

Without --moe-backend triton_unfused, vLLM's auto selector picks a
backend that doesn't register w13_weight_scale / w2_weight_scale on the
FP4 expert layers, so safetensors loading throws:

  KeyError: 'layers.0.ffn.experts.w13_weight_scale'
  at vllm/model_executor/models/deepseek_v4.py:1492

This matches the recipe (vllm-project/recipes#433) line-for-line now,
with the only intentional deviations being InferenceX conventions:
* --max-model-len $MAX_MODEL_LEN (sized to ISL+OSL+256)
* --no-enable-prefix-caching (fair benchmark comparisons)
* VLLM_ENGINE_READY_TIMEOUT_S=3600 (cold HF-cache tolerance)

None of those interact with weight loading; they were not implicated
in either failure.
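
The `--max-model-len` sizing rule mentioned in the commit message above (ISL+OSL+256) can be written out as a short calculation. This is only an illustration: the function name and the sample sequence lengths are assumptions, not values taken from InferenceX.

```python
def max_model_len(isl: int, osl: int, margin: int = 256) -> int:
    """Context length sized to input length + output length + a fixed margin."""
    return isl + osl + margin


if __name__ == "__main__":
    # Hypothetical benchmark shapes: (input sequence length, output sequence length).
    for isl, osl in [(1024, 1024), (8192, 1024)]:
        print(f"ISL={isl} OSL={osl} -> --max-model-len {max_model_len(isl, osl)}")
```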