[Do not merge] Add DeepSeek-V4-Pro support on MI355X #433
wuhuikx wants to merge 10 commits into
Code Review
This pull request introduces a comprehensive usage guide for running DeepSeek-V4 on AMD ROCm hardware (specifically MI355X) and updates the DeepSeek-V4-Pro model configuration to include verified AMD support and optimized hardware overrides. The review feedback focuses on improving the portability and consistency of the documentation, including updating installation commands to modern standards, ensuring path consistency for profilers and caches, and correcting model name mismatches in example code.
    cd vllm
    git fetch origin pull/40871/head:pr_dsv4
    git checkout pr_dsv4
    python3 setup.py develop
    --max-num-batched-tokens ${max_num_batched_tokens} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --profiler-config '{"profiler":"torch","torch_profiler_dir":"./vllm_profile"}' \
There is a mismatch between the environment variable VLLM_TORCH_PROFILER_DIR (set to /app/vllm_profile on line 45) and the path provided in --profiler-config (./vllm_profile). It is better to use a consistent path to avoid confusion and ensure profiles are stored in the expected location.
- --profiler-config '{"profiler":"torch","torch_profiler_dir":"./vllm_profile"}' \
+ --profiler-config '{"profiler":"torch","torch_profiler_dir":"/app/vllm_profile"}' \
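The mismatch flagged here only bites when the server is started from a directory other than `/app`, because `./vllm_profile` resolves against the working directory. A minimal sketch of a pre-launch check, assuming the two settings are meant to name the same directory; the function name `profiler_dirs_match` and the `cwd` parameter are hypothetical and not part of vLLM:

```python
import json
import posixpath


def profiler_dirs_match(env_dir, profiler_config_json, cwd):
    """Return True when the env var and the --profiler-config path
    resolve to the same directory for the given working directory."""
    config = json.loads(profiler_config_json)
    config_dir = config.get("torch_profiler_dir", "")

    def resolve(path):
        # Relative paths are resolved against cwd; absolute paths win.
        return posixpath.normpath(posixpath.join(cwd, path))

    return resolve(env_dir) == resolve(config_dir)


RELATIVE = '{"profiler":"torch","torch_profiler_dir":"./vllm_profile"}'
ABSOLUTE = '{"profiler":"torch","torch_profiler_dir":"/app/vllm_profile"}'

# The relative form depends on where the server is launched from
# ("/workspace" is an assumed example directory).
relative_ok = profiler_dirs_match("/app/vllm_profile", RELATIVE, cwd="/workspace")
absolute_ok = profiler_dirs_match("/app/vllm_profile", ABSOLUTE, cwd="/workspace")
```

With the absolute path the check passes regardless of the launch directory, which is the practical argument for the suggested change.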
    ```bash
    MODEL=/home/models/DeepSeek-V4-Flash
    lm_eval --model local-completions \
    --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
0.0.0.0 is a wildcard address meant for binding a listener, not for connecting to one. Using it as a destination is non-standard and may not work on all operating systems. It is safer to use 127.0.0.1 or localhost to connect to a service running on the same machine.
- --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
+ --model_args model=$MODEL,base_url=http://127.0.0.1:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
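The same substitution can be applied programmatically when a client URL is derived from the server's bind address. A small sketch using only the standard library; the helper name `to_loopback` is hypothetical, and it deliberately ignores any userinfo in the URL:

```python
from urllib.parse import urlsplit, urlunsplit


def to_loopback(base_url):
    """Rewrite a URL whose host is the 0.0.0.0 wildcard to 127.0.0.1.

    Any other host is returned unchanged.
    """
    parts = urlsplit(base_url)
    if parts.hostname != "0.0.0.0":
        return base_url
    netloc = "127.0.0.1" if parts.port is None else f"127.0.0.1:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


client_url = to_loopback("http://0.0.0.0:8001/v1/completions")
```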
    export HF_HOME=/data/huggingface-cache
    export VLLM_ROCM_USE_AITER=1
    export VLLM_ROCM_USE_AITER_LINEAR=1
    rm -rf /root/.cache/vllm/torch_compile_cache
    ```bash
    MODEL=/home/models/DeepSeek-V4-Pro
    lm_eval --model local-completions \
    --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
Using 0.0.0.0 as a connection address is non-standard. It is recommended to use 127.0.0.1 or localhost for local connections.
- --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
+ --model_args model=$MODEL,base_url=http://127.0.0.1:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
    model = "deepseek-ai/DeepSeek-V4-Pro"
The model name used in the OpenAI client example (deepseek-ai/DeepSeek-V4-Pro) does not match the model path used to launch the server (/home/models/DeepSeek-V4-Pro on line 139). vLLM requires the model name in the request to match the name or path provided at startup unless --served-model-name is used.
- model = "deepseek-ai/DeepSeek-V4-Pro"
+ model = "/home/models/DeepSeek-V4-Pro"
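One way to avoid hard-coding either name is to ask the server which model IDs it serves, via the OpenAI-compatible `/v1/models` endpoint, before sending a request. A sketch under stated assumptions: the server listens on `http://localhost:8001/v1` as in the example above, and the helper names are hypothetical:

```python
import json
from urllib.request import urlopen


def fetch_models_payload(base_url="http://localhost:8001/v1"):
    """Fetch the model listing from a running OpenAI-compatible server."""
    with urlopen(f"{base_url}/models", timeout=10) as response:
        return json.load(response)


def served_model_ids(models_payload):
    """Extract the model IDs from a /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_model_name(requested, models_payload):
    """Return the requested name if the server serves it, else raise."""
    ids = served_model_ids(models_payload)
    if requested in ids:
        return requested
    raise ValueError(f"{requested!r} is not served; available: {ids}")


# Illustrative payload shaped like a /v1/models response (not real output).
sample_payload = {"object": "list", "data": [{"id": "/home/models/DeepSeek-V4-Pro"}]}
model = check_model_name("/home/models/DeepSeek-V4-Pro", sample_payload)
```

Against a live server, `check_model_name(name, fetch_models_payload())` would fail early with the list of available IDs instead of surfacing as a request-time error.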
df89987 to ff6cd47
Signed-off-by: wuhuikx <hattie.wu@amd.com>
cebf4c1 to 13db197
Signed-off-by: wuhuikx <hattie.wu@amd.com>
The recipe (vllm-project/recipes#433) specifies --moe-backend triton_unfused, but that choice was never accepted into vLLM main; it likely lived on the #40871 PR branch and was renamed or removed before merge. In vllm/vllm-openai-rocm:nightly (which the recipe itself uses), the legal choices are: aiter, auto, cutlass, deep_gemm, emulation, flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, marlin, triton.

Drop the flag entirely and let vLLM's `auto` selector pick the backend. With VLLM_ROCM_USE_AITER=1 set, that resolves to the AITER MoE path on ROCm, the same kernel family the recipe was steering toward.

All other flags and env vars were verified as valid in vLLM 0.20.2.
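The rejection described in this message is ordinary `choices` validation at argument-parsing time. A minimal sketch of that behaviour using Python's `argparse` (3.9 or later for `exit_on_error`), assuming the list of legal values quoted above; the parser and all names here are hypothetical stand-ins, not vLLM's own parser, and the list applies only to the image the message refers to:

```python
import argparse

# Choices as quoted in the commit message above; assumed, not read from vLLM.
LEGAL_MOE_BACKENDS = [
    "aiter", "auto", "cutlass", "deep_gemm", "emulation",
    "flashinfer_cutedsl", "flashinfer_cutlass", "flashinfer_trtllm",
    "marlin", "triton",
]


def build_sketch_parser():
    """Build a stand-in parser with a single validated flag."""
    parser = argparse.ArgumentParser(prog="serve-sketch", exit_on_error=False)
    parser.add_argument("--moe-backend", choices=LEGAL_MOE_BACKENDS, default="auto")
    return parser


def is_accepted(value):
    """Return True when the value passes the parser's choices check."""
    try:
        build_sketch_parser().parse_args(["--moe-backend", value])
    except (argparse.ArgumentError, SystemExit):
        return False
    return True


default_backend = build_sketch_parser().parse_args([]).moe_backend
```

Because the accepted values are fixed by the build that is actually running, a check like this is only as reliable as the list it is given.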
I dropped --moe-backend triton_unfused based on a stale error message
("invalid choice ... choose from aiter, auto, ...") from the previous
run, but that error came from the cached squashfs of an April 25 build
that pre-dated #40871. The pinned nightly-dcacdf9a8860a8640 DOES have
triton_unfused in MoEBackend — verified by reading vllm/config/kernel.py
at that exact commit on GitHub.
Without --moe-backend triton_unfused, vLLM's auto selector picks a
backend that doesn't register w13_weight_scale / w2_weight_scale on the
FP4 expert layers, so safetensors loading throws:
    KeyError: 'layers.0.ffn.experts.w13_weight_scale'
    at vllm/model_executor/models/deepseek_v4.py:1492
This matches the recipe (vllm-project/recipes#433) line-for-line now,
with the only intentional deviations being InferenceX conventions:
* --max-model-len $MAX_MODEL_LEN (sized to ISL+OSL+256)
* --no-enable-prefix-caching (fair benchmark comparisons)
* VLLM_ENGINE_READY_TIMEOUT_S=3600 (cold HF-cache tolerance)
None of those interact with weight loading; they were not implicated
in either failure.
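The three deviations listed above can be written down as a small launch-argument builder. This is a sketch only, assuming ISL and OSL denote input and output sequence lengths in tokens; the function names and the example lengths are hypothetical:

```python
def max_model_len(isl, osl, margin=256):
    """Context budget: input length + output length + fixed margin."""
    return isl + osl + margin


def deviation_flags(isl, osl):
    """CLI flags for the two flag-level deviations from the recipe."""
    return [
        "--max-model-len", str(max_model_len(isl, osl)),
        "--no-enable-prefix-caching",
    ]


# Environment-level deviation: tolerate a cold Hugging Face cache.
DEVIATION_ENV = {"VLLM_ENGINE_READY_TIMEOUT_S": "3600"}

# Example with assumed lengths of 8192 input and 1024 output tokens.
flags = deviation_flags(8192, 1024)
```

For these example lengths the budget is 8192 + 1024 + 256 = 9472 tokens.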