Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
159 changes: 159 additions & 0 deletions DeepSeek/DeepSeek-V4-AMD.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
# DeepSeek-V4 on AMD (ROCm) Usage Guide

This page is aligned with the DeepSeek-V4-Pro recipe layout on recipes.vllm.ai and
captures the AMD MI355X validated settings from [vllm-project/vllm#40871](https://github.com/vllm-project/vllm/pull/40871).

## Overview

DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active
Mixture-of-Experts model. It pairs a **hybrid attention stack** — Compressed Sparse
Attention (CSA) + Heavily Compressed Attention (HCA) — with **Manifold-Constrained
Hyper-Connections (mHC)** to reach 27% of V3.2's per-token inference FLOPs and 10% of
V3.2's KV cache at 1M context. Pre-trained on 32T+ tokens with the **Muon optimizer**
for faster convergence; post-training is a two-stage pipeline (domain-specific expert
cultivation + unified consolidation via on-policy distillation).

Checkpoint is **FP4+FP8 mixed**: MoE expert weights are stored in FP4 while the
remaining (attention / norm / router) params stay in FP8.

## Docker image (AMD ROCm)

```bash
docker pull vllm/vllm-openai-rocm:nightly
```

## Recommended deployments

- **MI355X (8× GPU)**: validated with ROCm + AITER
(`VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_LINEAR=1`), `--moe-backend triton_unfused`,
`--gpu-memory-utilization 0.6`, `--max-num-seqs 128`,
`--max-num-batched-tokens 8192`, and `--distributed-executor-backend mp`.

## Feature matrix

The table below is a static equivalent of the interactive matrix shown on
recipes.vllm.ai (hardware / variant / strategy / features).

| Model | Hardware | Variant | Recommended strategies | Tool calling | Reasoning | Spec decoding |
| --- | --- | --- | --- | --- | --- | --- |
| [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | MI355X (8x288GB) | FP8 (~960GB) | Tensor Parallel (TP) | Yes (`deepseek_v4`) | Yes (`deepseek_v4`) | No (`false`) |
| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | MI355X (8x288GB) | FP8 (~170GB) | Tensor Parallel (TP) | Yes (`deepseek_v4`) | Yes (`deepseek_v4`) | No (`false`) |

### MI355X recommended presets

| Model | TP | Max num seqs | Max batched tokens | GPU memory utilization | Key ROCm env |
| --- | --- | ---: | ---: | ---: | --- |
| [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | 8 | 128 | 8192 | 0.6 | `VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_LINEAR=1` |
| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | 4 | 16 | 1024 | 0.35 | `VLLM_ROCM_USE_AITER=1` |

### Feature toggles

| Feature | Server args |
| --- | --- |
| Tool Calling | `--tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice` |
| Reasoning | `--reasoning-parser deepseek_v4` |
| Spec Decoding | Disabled (`false`) |

## DeepSeek-V4-Pro validation (MI355X, TP=8)

### 1) Serve command

```bash
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1

vllm serve /home/models/DeepSeek-V4-Pro \
--host localhost \
--port 8001 \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--distributed-executor-backend mp \
--trust-remote-code \
--gpu-memory-utilization 0.6 \
--moe-backend triton_unfused \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--async-scheduling \
--enforce-eager
```

### 2) GSM8K validation

```bash
MODEL=/home/models/DeepSeek-V4-Pro
lm_eval --model local-completions \
--model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
--batch_size auto \
--tasks gsm8k \
--num_fewshot 8 \
--output_path .
```

Reported result from PR #40871:

```text
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.9538|± |0.0058|
| | |strict-match | 8|exact_match|↑ |0.9545|± |0.0057|
```

## DeepSeek-V4-Flash validation (MI355X, TP=4)

### 1) Serve command

```bash
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1

vllm serve /home/models/DeepSeek-V4-Flash \
--host localhost \
--port 8001 \
--dtype auto \
--tensor-parallel-size 4 \
--max-num-seqs 16 \
--max-num-batched-tokens 1024 \
--distributed-executor-backend mp \
--trust-remote-code \
--gpu-memory-utilization 0.35 \
--moe-backend triton_unfused \
--tokenizer-mode deepseek_v4 \
--async-scheduling \
--enforce-eager
```

### 2) GSM8K validation

```bash
MODEL=/home/models/DeepSeek-V4-Flash
lm_eval --model local-completions \
--model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
--batch_size auto \
--tasks gsm8k \
--num_fewshot 8 \
--output_path .
```

Reported result from PR #40871:

```text
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.9439|± |0.0063|
| | |strict-match | 8|exact_match|↑ |0.9431|± |0.0064|
```

## Related PR links

- [Functionality] Base PR is functionality/accuracy ready on MI35x for both
DeepSeek-V4-Pro and DeepSeek-V4-Flash; lm_eval passed on full GSM8K:
[Ready to merge, #40871](https://github.com/vllm-project/vllm/pull/40871)
- [Functionality] MI300 support PR:
[#41451](https://github.com/vllm-project/vllm/pull/41451)
- [Performance] MLA Indexer optimization for DeepSeek-V4 and DeepSeek-V3.2 (ROCm):
[#41217](https://github.com/vllm-project/vllm/pull/41217)

73 changes: 71 additions & 2 deletions models/deepseek-ai/DeepSeek-V4-Flash.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ meta:
slug: "deepseek-v4-flash"
provider: "DeepSeek"
description: "DeepSeek V4 MoE model with hybrid CSA+HCA attention, manifold-constrained hyper-connections, and three-tier reasoning (Non-think / Think High / Think Max)."
date_updated: 2026-04-24
date_updated: 2026-05-01
difficulty: hard
tasks:
- text
Expand All @@ -17,11 +17,13 @@ meta:
gb300: verified
mi300x: unsupported
mi325x: unsupported
mi355x: unsupported
mi355x: verified

model:
model_id: "deepseek-ai/DeepSeek-V4-Flash"
min_vllm_version: "0.20.0"
docker_image:
amd: "vllm/vllm-openai-rocm:nightly"
architecture: moe
parameter_count: "284B"
active_parameters: "13B"
Expand Down Expand Up @@ -91,6 +93,22 @@ hardware_overrides:
- "--attention_config.use_fp4_indexer_cache=True"
- "--moe-backend"
- "deep_gemm_mega_moe"
amd:
extra_args:
- "--distributed-executor-backend"
- "mp"
- "--gpu-memory-utilization"
- "0.35"
- "--max-num-seqs"
- "16"
- "--max-num-batched-tokens"
- "1024"
- "--moe-backend"
- "triton_unfused"
- "--async-scheduling"
- "--enforce-eager"
extra_env:
VLLM_ROCM_USE_AITER: "1"

strategy_overrides:
single_node_tp:
Expand Down Expand Up @@ -228,6 +246,57 @@ guide: |
replica on H200/B200/B300 (leaving headroom for throughput-vs-latency tuning).
For disaggregated prefill/decode on GB200, use the PD Cluster tab.

On **MI355X (8×288GB)**, validation used ROCm + AITER (`VLLM_ROCM_USE_AITER=1`),
`--distributed-executor-backend mp`, `--gpu-memory-utilization 0.35`,
`--max-num-seqs 16`, `--max-num-batched-tokens 1024`,
`--moe-backend triton_unfused`, `--async-scheduling`, and `--enforce-eager`.

## GSM8K validation (MI355X)

Launch command (TP=4):

```bash
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1

vllm serve /home/models/DeepSeek-V4-Flash \
--host localhost \
--port 8001 \
--dtype auto \
--tensor-parallel-size 4 \
--max-num-seqs 16 \
--max-num-batched-tokens 1024 \
--distributed-executor-backend mp \
--trust-remote-code \
--gpu-memory-utilization 0.35 \
--moe-backend triton_unfused \
--tokenizer-mode deepseek_v4 \
--async-scheduling \
--enforce-eager
```

GSM8K command:

```bash
MODEL=/home/models/DeepSeek-V4-Flash
lm_eval --model local-completions \
--model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
--batch_size auto \
--tasks gsm8k \
--num_fewshot 8 \
--output_path . 2>&1 | tee -a eval.log
```

Reported result from PR #40871:

```text
local-completions ({'model': '/home/models/DeepSeek-V4-Flash', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 4, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.9439|± |0.0063|
| | |strict-match | 8|exact_match|↑ |0.9431|± |0.0064|
```

### H200 Single-Node PD (Mooncake)

Single-host disaggregated serving: 4 prefill GPUs + 4 decode GPUs on one 8-GPU H200 node,
Expand Down
76 changes: 74 additions & 2 deletions models/deepseek-ai/DeepSeek-V4-Pro.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ meta:
slug: "deepseek-v4-pro"
provider: "DeepSeek"
description: "DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning."
date_updated: 2026-04-24
date_updated: 2026-05-01
difficulty: hard
tasks:
- text
Expand All @@ -17,11 +17,13 @@ meta:
gb300: verified
mi300x: unsupported
mi325x: unsupported
mi355x: unsupported
mi355x: verified

model:
model_id: "deepseek-ai/DeepSeek-V4-Pro"
min_vllm_version: "0.20.0"
docker_image:
amd: "vllm/vllm-openai-rocm:nightly"
architecture: moe
parameter_count: "1600B"
active_parameters: "49B"
Expand Down Expand Up @@ -109,6 +111,23 @@ hardware_overrides:
- "--attention_config.use_fp4_indexer_cache=True"
- "--moe-backend"
- "deep_gemm_mega_moe"
amd:
extra_args:
- "--distributed-executor-backend"
- "mp"
- "--gpu-memory-utilization"
- "0.6"
- "--max-num-seqs"
- "128"
- "--max-num-batched-tokens"
- "8192"
- "--moe-backend"
- "triton_unfused"
- "--async-scheduling"
- "--enforce-eager"
extra_env:
VLLM_ROCM_USE_AITER: "1"
VLLM_ROCM_USE_AITER_LINEAR: "1"

strategy_overrides:
single_node_tp:
Expand Down Expand Up @@ -252,6 +271,59 @@ guide: |
- **H200 (8× GPU)**: DP + EP with `--data-parallel-size 8`. Context is capped at
800K tokens (`--max-model-len 800000`) to leave KV headroom with dense params
replicated across ranks — applies to both single-node and multi-node H200.
- **MI355X (8× GPU)**: validated with ROCm + AITER (`VLLM_ROCM_USE_AITER=1`,
`VLLM_ROCM_USE_AITER_LINEAR=1`), `--moe-backend triton_unfused`,
`--gpu-memory-utilization 0.6`, `--max-num-seqs 128`,
`--max-num-batched-tokens 8192`, and `--distributed-executor-backend mp`.
- **GB200 NVL4 (4× GPU per tray)**: the ~960 GB mixed-precision checkpoint does not
fit on one tray; run multi-node DP + EP across **2 trays** (8 GPUs total) with
`--data-parallel-size 8`. Pick the "Multi-Node" tab and set nodes to 2.

## GSM8K validation (MI355X)

Launch command (TP=8):

```bash
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1

vllm serve /home/models/DeepSeek-V4-Pro \
--host localhost \
--port 8001 \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--distributed-executor-backend mp \
--trust-remote-code \
--gpu-memory-utilization 0.6 \
--moe-backend triton_unfused \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--async-scheduling \
--enforce-eager
```

GSM8K command:

```bash
MODEL=/home/models/DeepSeek-V4-Pro
lm_eval --model local-completions \
--model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
--batch_size auto \
--tasks gsm8k \
--num_fewshot 8 \
--output_path . 2>&1 | tee -a eval.log
```

Reported result from PR #40871:

```text
local-completions ({'model': '/home/models/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 2, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.9538|± |0.0058|
| | |strict-match | 8|exact_match|↑ |0.9545|± |0.0057|
```