vllm-project · wuhuikx · May 1, 2026
diff --git a/DeepSeek/DeepSeek-V4-AMD.md b/DeepSeek/DeepSeek-V4-AMD.md
@@ -0,0 +1,159 @@
+# DeepSeek-V4 on AMD (ROCm) Usage Guide
+
+This page follows the DeepSeek-V4-Pro recipe layout on recipes.vllm.ai and records the
+settings validated on AMD MI355X in [vllm-project/vllm#40871](https://github.com/vllm-project/vllm/pull/40871).
+
+## Overview
+
+DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active
+Mixture-of-Experts model. It pairs a **hybrid attention stack** — Compressed Sparse
+Attention (CSA) + Heavily Compressed Attention (HCA) — with **Manifold-Constrained
+Hyper-Connections (mHC)** to reach 27% of V3.2's per-token inference FLOPs and 10% of
+V3.2's KV cache at 1M context. It was pre-trained on 32T+ tokens with the **Muon
+optimizer** for faster convergence; post-training is a two-stage pipeline
+(domain-specific expert cultivation followed by unified consolidation via on-policy distillation).
+
+The checkpoint is **FP4+FP8 mixed**: MoE expert weights are stored in FP4, while the
+remaining (attention / norm / router) parameters stay in FP8.
+
+## Docker image (AMD ROCm)
+
+```bash
+docker pull vllm/vllm-openai-rocm:nightly
+```
+
+## Recommended deployments
+
+- **MI355X (8× GPU)**: DeepSeek-V4-Pro is validated at TP=8 with ROCm + AITER
+  (`VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_LINEAR=1`), `--moe-backend triton_unfused`,
+  `--gpu-memory-utilization 0.6`, `--max-num-seqs 128`, `--max-num-batched-tokens 8192`,
+  and `--distributed-executor-backend mp`. DeepSeek-V4-Flash uses the TP=4 preset listed under "MI355X recommended presets".
+
+## Feature matrix
+
+The table below is a static equivalent of the interactive matrix shown on
+recipes.vllm.ai (hardware / variant / strategy / features).
+
+| Model | Hardware | Variant | Recommended strategies | Tool calling | Reasoning | Spec decoding |
+| --- | --- | --- | --- | --- | --- | --- |
+| [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | MI355X (8x288GB) | FP4+FP8 mixed (~960GB) | Tensor Parallel (TP) | Yes (`deepseek_v4`) | Yes (`deepseek_v4`) | No (`false`) |
+| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | MI355X (8x288GB) | FP8 (~170GB) | Tensor Parallel (TP) | Yes (`deepseek_v4`) | Yes (`deepseek_v4`) | No (`false`) |
+
+### MI355X recommended presets
+
+| Model | TP | Max num seqs | Max batched tokens | GPU memory utilization | Key ROCm env |
+| --- | --- | ---: | ---: | ---: | --- |
+| [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | 8 | 128 | 8192 | 0.6 | `VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_LINEAR=1` |
+| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | 4 | 16 | 1024 | 0.35 | `VLLM_ROCM_USE_AITER=1` |
+
+### Feature toggles
+
+| Feature | Server args |
+| --- | --- |
+| Tool Calling | `--tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice` |
+| Reasoning | `--reasoning-parser deepseek_v4` |
+| Spec Decoding | Disabled (`false`) |
+
+## DeepSeek-V4-Pro validation (MI355X, TP=8)
+
+### 1) Serve command
+
+```bash
+export HF_HOME=/data/huggingface-cache
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_LINEAR=1
+
+vllm serve /home/models/DeepSeek-V4-Pro \
+  --host localhost \
+  --port 8001 \
+  --dtype auto \
+  --kv-cache-dtype fp8 \
+  --tensor-parallel-size 8 \
+  --max-num-seqs 128 \
+  --max-num-batched-tokens 8192 \
+  --distributed-executor-backend mp \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.6 \
+  --moe-backend triton_unfused \
+  --tokenizer-mode deepseek_v4 \
+  --reasoning-parser deepseek_v4 \
+  --async-scheduling \
+  --enforce-eager
+```
+
+### 2) GSM8K validation
+
+```bash
+MODEL=/home/models/DeepSeek-V4-Pro
+lm_eval --model local-completions \
+  --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
+  --batch_size auto \
+  --tasks gsm8k \
+  --num_fewshot 8 \
+  --output_path .
+```
+
+Reported result from PR #40871:
+
+```text
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
+|     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|
+```
+
+## DeepSeek-V4-Flash validation (MI355X, TP=4)
+
+### 1) Serve command
+
+```bash
+export HF_HOME=/data/huggingface-cache
+export VLLM_ROCM_USE_AITER=1
+
+vllm serve /home/models/DeepSeek-V4-Flash \
+  --host localhost \
+  --port 8001 \
+  --dtype auto \
+  --tensor-parallel-size 4 \
+  --max-num-seqs 16 \
+  --max-num-batched-tokens 1024 \
+  --distributed-executor-backend mp \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.35 \
+  --moe-backend triton_unfused \
+  --tokenizer-mode deepseek_v4 \
+  --async-scheduling \
+  --enforce-eager
+```
+
+### 2) GSM8K validation
+
+```bash
+MODEL=/home/models/DeepSeek-V4-Flash
+lm_eval --model local-completions \
+  --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
+  --batch_size auto \
+  --tasks gsm8k \
+  --num_fewshot 8 \
+  --output_path .
+```
+
+Reported result from PR #40871:
+
+```text
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9439|±  |0.0063|
+|     |       |strict-match    |     8|exact_match|↑  |0.9431|±  |0.0064|
+```
+
+## Related PR links
+
+- [Functionality] The base PR is functionally complete and accuracy-validated on MI35x for
+  both DeepSeek-V4-Pro and DeepSeek-V4-Flash; lm_eval passed on the full GSM8K set:
+  [Ready to merge, #40871](https://github.com/vllm-project/vllm/pull/40871)
+- [Functionality] MI300 support PR:
+  [#41451](https://github.com/vllm-project/vllm/pull/41451)
+- [Performance] MLA Indexer optimization for DeepSeek-V4 and DeepSeek-V3.2 (ROCm):
+  [#41217](https://github.com/vllm-project/vllm/pull/41217)
+
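As a rough sanity check on the MI355X presets in the guide above, the aggregate memory that vLLM is allowed to claim can be compared with the approximate checkpoint sizes from the feature matrix. The sketch below is illustrative only. The `budget` helper is hypothetical; it assumes 288 GB per GPU as listed in the table, treats `--gpu-memory-utilization` as a plain per-GPU fraction, and ignores replicated parameters and GB/GiB differences. Whatever remains after the weights is shared by activations and KV cache.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of vLLM.
# Usage: budget TP PER_GPU_GB UTILIZATION CHECKPOINT_GB
budget() {
  awk -v tp="$1" -v gpu="$2" -v util="$3" -v ckpt="$4" 'BEGIN {
    total = tp * gpu * util            # aggregate memory vLLM may claim
    printf "budget=%.1fGB checkpoint=%.0fGB headroom=%.1fGB\n",
           total, ckpt, total - ckpt
  }'
}

# DeepSeek-V4-Pro preset: TP=8, utilization 0.6, ~960 GB checkpoint
budget 8 288 0.6 960

# DeepSeek-V4-Flash preset: TP=4, utilization 0.35, ~170 GB checkpoint
budget 4 288 0.35 170
```

Under these assumptions both presets leave a few hundred GB beyond the weights, which is consistent with the conservative utilization values used for validation; it says nothing about whether higher values would also work.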
diff --git a/models/deepseek-ai/DeepSeek-V4-Flash.yaml b/models/deepseek-ai/DeepSeek-V4-Flash.yaml
@@ -3,7 +3,7 @@ meta:
   slug: "deepseek-v4-flash"
   provider: "DeepSeek"
   description: "DeepSeek V4 MoE model with hybrid CSA+HCA attention, manifold-constrained hyper-connections, and three-tier reasoning (Non-think / Think High / Think Max)."
-  date_updated: 2026-04-24
+  date_updated: 2026-05-01
   difficulty: hard
   tasks:
     - text
@@ -17,11 +17,13 @@ meta:
     gb300: verified
     mi300x: unsupported
     mi325x: unsupported
-    mi355x: unsupported
+    mi355x: verified
 
 model:
   model_id: "deepseek-ai/DeepSeek-V4-Flash"
   min_vllm_version: "0.20.0"
+  docker_image:
+    amd: "vllm/vllm-openai-rocm:nightly"
   architecture: moe
   parameter_count: "284B"
   active_parameters: "13B"
@@ -91,6 +93,22 @@ hardware_overrides:
       - "--attention_config.use_fp4_indexer_cache=True"
       - "--moe-backend"
       - "deep_gemm_mega_moe"
+  amd:
+    extra_args:
+      - "--distributed-executor-backend"
+      - "mp"
+      - "--gpu-memory-utilization"
+      - "0.35"
+      - "--max-num-seqs"
+      - "16"
+      - "--max-num-batched-tokens"
+      - "1024"
+      - "--moe-backend"
+      - "triton_unfused"
+      - "--async-scheduling"
+      - "--enforce-eager"
+    extra_env:
+      VLLM_ROCM_USE_AITER: "1"
 
 strategy_overrides:
   single_node_tp:
@@ -228,6 +246,57 @@ guide: |
   replica on H200/B200/B300 (leaving headroom for throughput-vs-latency tuning).
   For disaggregated prefill/decode on GB200, use the PD Cluster tab.
 
+  On **MI355X (8×288GB)**, validation used ROCm + AITER (`VLLM_ROCM_USE_AITER=1`),
+  `--distributed-executor-backend mp`, `--gpu-memory-utilization 0.35`,
+  `--max-num-seqs 16`, `--max-num-batched-tokens 1024`,
+  `--moe-backend triton_unfused`, `--async-scheduling`, and `--enforce-eager`.
+
+  ## GSM8K validation (MI355X)
+
+  Launch command (TP=4):
+
+  ```bash
+  export HF_HOME=/data/huggingface-cache
+  export VLLM_ROCM_USE_AITER=1
+
+  vllm serve /home/models/DeepSeek-V4-Flash \
+    --host localhost \
+    --port 8001 \
+    --dtype auto \
+    --tensor-parallel-size 4 \
+    --max-num-seqs 16 \
+    --max-num-batched-tokens 1024 \
+    --distributed-executor-backend mp \
+    --trust-remote-code \
+    --gpu-memory-utilization 0.35 \
+    --moe-backend triton_unfused \
+    --tokenizer-mode deepseek_v4 \
+    --async-scheduling \
+    --enforce-eager
+  ```
+
+  GSM8K command:
+
+  ```bash
+  MODEL=/home/models/DeepSeek-V4-Flash
+  lm_eval --model local-completions \
+    --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
+    --batch_size auto \
+    --tasks gsm8k \
+    --num_fewshot 8 \
+    --output_path . 2>&1 | tee -a eval.log
+  ```
+
+  Reported result from PR #40871:
+
+  ```text
+  local-completions ({'model': '/home/models/DeepSeek-V4-Flash', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 4, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
+  |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+  |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+  |gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9439|±  |0.0063|
+  |     |       |strict-match    |     8|exact_match|↑  |0.9431|±  |0.0064|
+  ```
+
   ### H200 Single-Node PD (Mooncake)
 
   Single-host disaggregated serving: 4 prefill GPUs + 4 decode GPUs on one 8-GPU H200 node,

diff --git a/models/deepseek-ai/DeepSeek-V4-Pro.yaml b/models/deepseek-ai/DeepSeek-V4-Pro.yaml
@@ -3,7 +3,7 @@ meta:
   slug: "deepseek-v4-pro"
   provider: "DeepSeek"
   description: "DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning."
-  date_updated: 2026-04-24
+  date_updated: 2026-05-01
   difficulty: hard
   tasks:
     - text
@@ -17,11 +17,13 @@ meta:
     gb300: verified
     mi300x: unsupported
     mi325x: unsupported
-    mi355x: unsupported
+    mi355x: verified
 
 model:
   model_id: "deepseek-ai/DeepSeek-V4-Pro"
   min_vllm_version: "0.20.0"
+  docker_image:
+    amd: "vllm/vllm-openai-rocm:nightly"
   architecture: moe
   parameter_count: "1600B"
   active_parameters: "49B"
@@ -109,6 +111,23 @@ hardware_overrides:
       - "--attention_config.use_fp4_indexer_cache=True"
       - "--moe-backend"
       - "deep_gemm_mega_moe"
+  amd:
+    extra_args:
+      - "--distributed-executor-backend"
+      - "mp"
+      - "--gpu-memory-utilization"
+      - "0.6"
+      - "--max-num-seqs"
+      - "128"
+      - "--max-num-batched-tokens"
+      - "8192"
+      - "--moe-backend"
+      - "triton_unfused"
+      - "--async-scheduling"
+      - "--enforce-eager"
+    extra_env:
+      VLLM_ROCM_USE_AITER: "1"
+      VLLM_ROCM_USE_AITER_LINEAR: "1"
 
 strategy_overrides:
   single_node_tp:
@@ -252,6 +271,59 @@ guide: |
   - **H200 (8× GPU)**: DP + EP with `--data-parallel-size 8`. Context is capped at
     800K tokens (`--max-model-len 800000`) to leave KV headroom with dense params
     replicated across ranks — applies to both single-node and multi-node H200.
+  - **MI355X (8× GPU)**: validated with ROCm + AITER (`VLLM_ROCM_USE_AITER=1`,
+    `VLLM_ROCM_USE_AITER_LINEAR=1`), `--moe-backend triton_unfused`,
+    `--gpu-memory-utilization 0.6`, `--max-num-seqs 128`,
+    `--max-num-batched-tokens 8192`, and `--distributed-executor-backend mp`.
   - **GB200 NVL4 (4× GPU per tray)**: the ~960 GB mixed-precision checkpoint does not
     fit on one tray; run multi-node DP + EP across **2 trays** (8 GPUs total) with
     `--data-parallel-size 8`. Pick the "Multi-Node" tab and set nodes to 2.
+
+  ## GSM8K validation (MI355X)
+
+  Launch command (TP=8):
+
+  ```bash
+  export HF_HOME=/data/huggingface-cache
+  export VLLM_ROCM_USE_AITER=1
+  export VLLM_ROCM_USE_AITER_LINEAR=1
+
+  vllm serve /home/models/DeepSeek-V4-Pro \
+    --host localhost \
+    --port 8001 \
+    --dtype auto \
+    --kv-cache-dtype fp8 \
+    --tensor-parallel-size 8 \
+    --max-num-seqs 128 \
+    --max-num-batched-tokens 8192 \
+    --distributed-executor-backend mp \
+    --trust-remote-code \
+    --gpu-memory-utilization 0.6 \
+    --moe-backend triton_unfused \
+    --tokenizer-mode deepseek_v4 \
+    --reasoning-parser deepseek_v4 \
+    --async-scheduling \
+    --enforce-eager
+  ```
+
+  GSM8K command:
+
+  ```bash
+  MODEL=/home/models/DeepSeek-V4-Pro
+  lm_eval --model local-completions \
+    --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
+    --batch_size auto \
+    --tasks gsm8k \
+    --num_fewshot 8 \
+    --output_path . 2>&1 | tee -a eval.log
+  ```
+
+  Reported result from PR #40871:
+
+  ```text
+  local-completions ({'model': '/home/models/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 2, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
+  |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+  |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+  |gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
+  |     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|
+  ```