diff --git a/DeepSeek/DeepSeek-V4-AMD.md b/DeepSeek/DeepSeek-V4-AMD.md
new file mode 100644
index 00000000..2830f09f
--- /dev/null
+++ b/DeepSeek/DeepSeek-V4-AMD.md
@@ -0,0 +1,159 @@
+# DeepSeek-V4 on AMD (ROCm) Usage Guide
+
+This page is aligned with the DeepSeek-V4-Pro recipe layout on recipes.vllm.ai and
+captures the AMD MI355X validated settings from [vllm-project/vllm#40871](https://github.com/vllm-project/vllm/pull/40871).
+
+## Overview
+
+DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active
+Mixture-of-Experts model. It pairs a **hybrid attention stack** — Compressed Sparse
+Attention (CSA) + Heavily Compressed Attention (HCA) — with **Manifold-Constrained
+Hyper-Connections (mHC)** to reach 27% of V3.2's per-token inference FLOPs and 10% of
+V3.2's KV cache at 1M context. Pre-trained on 32T+ tokens with the **Muon optimizer**
+for faster convergence; post-training is a two-stage pipeline (domain-specific expert
+cultivation + unified consolidation via on-policy distillation).
+
+Checkpoint is **FP4+FP8 mixed**: MoE expert weights are stored in FP4 while the
+remaining (attention / norm / router) params stay in FP8.
+
+## Docker image (AMD ROCm)
+
+```bash
+docker pull vllm/vllm-openai-rocm:nightly
+```
+
+## Recommended deployments
+
+- **MI355X (8× GPU)**: validated with ROCm + AITER
+  (`VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_LINEAR=1`), `--moe-backend triton_unfused`,
+  `--gpu-memory-utilization 0.6`, `--max-num-seqs 128`,
+  `--max-num-batched-tokens 8192`, and `--distributed-executor-backend mp`.
+
+## Feature matrix
+
+The table below is a static equivalent of the interactive matrix shown on
+recipes.vllm.ai (hardware / variant / strategy / features).
+
+| Model | Hardware | Variant | Recommended strategies | Tool calling | Reasoning | Spec decoding |
+| --- | --- | --- | --- | --- | --- | --- |
+| [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | MI355X (8x288GB) | FP8 (~960GB) | Tensor Parallel (TP) | Yes (`deepseek_v4`) | Yes (`deepseek_v4`) | No (`false`) |
+| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | MI355X (8x288GB) | FP8 (~170GB) | Tensor Parallel (TP) | Yes (`deepseek_v4`) | Yes (`deepseek_v4`) | No (`false`) |
+
+### MI355X recommended presets
+
+| Model | TP | Max num seqs | Max batched tokens | GPU memory utilization | Key ROCm env |
+| --- | --- | ---: | ---: | ---: | --- |
+| [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | 8 | 128 | 8192 | 0.6 | `VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_LINEAR=1` |
+| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | 4 | 16 | 1024 | 0.35 | `VLLM_ROCM_USE_AITER=1` |
+
+### Feature toggles
+
+| Feature | Server args |
+| --- | --- |
+| Tool Calling | `--tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice` |
+| Reasoning | `--reasoning-parser deepseek_v4` |
+| Spec Decoding | Disabled (`false`) |
+
+## DeepSeek-V4-Pro validation (MI355X, TP=8)
+
+### 1) Serve command
+
+```bash
+export HF_HOME=/data/huggingface-cache
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_LINEAR=1
+
+vllm serve /home/models/DeepSeek-V4-Pro \
+  --host localhost \
+  --port 8001 \
+  --dtype auto \
+  --kv-cache-dtype fp8 \
+  --tensor-parallel-size 8 \
+  --max-num-seqs 128 \
+  --max-num-batched-tokens 8192 \
+  --distributed-executor-backend mp \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.6 \
+  --moe-backend triton_unfused \
+  --tokenizer-mode deepseek_v4 \
+  --reasoning-parser deepseek_v4 \
+  --async-scheduling \
+  --enforce-eager
+```
+
+### 2) GSM8K validation
+
+```bash
+MODEL=/home/models/DeepSeek-V4-Pro
+lm_eval --model local-completions \
+  --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
+  --batch_size auto \
+  --tasks gsm8k \
+  --num_fewshot 8 \
+  --output_path .
+```
+
+Reported result from PR #40871:
+
+```text
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
+|     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|
+```
+
+## DeepSeek-V4-Flash validation (MI355X, TP=4)
+
+### 1) Serve command
+
+```bash
+export HF_HOME=/data/huggingface-cache
+export VLLM_ROCM_USE_AITER=1
+
+vllm serve /home/models/DeepSeek-V4-Flash \
+  --host localhost \
+  --port 8001 \
+  --dtype auto \
+  --tensor-parallel-size 4 \
+  --max-num-seqs 16 \
+  --max-num-batched-tokens 1024 \
+  --distributed-executor-backend mp \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.35 \
+  --moe-backend triton_unfused \
+  --tokenizer-mode deepseek_v4 \
+  --async-scheduling \
+  --enforce-eager
+```
+
+### 2) GSM8K validation
+
+```bash
+MODEL=/home/models/DeepSeek-V4-Flash
+lm_eval --model local-completions \
+  --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
+  --batch_size auto \
+  --tasks gsm8k \
+  --num_fewshot 8 \
+  --output_path .
+```
+
+Reported result from PR #40871:
+
+```text
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9439|±  |0.0063|
+|     |       |strict-match    |     8|exact_match|↑  |0.9431|±  |0.0064|
+```
+
+## Related PR links
+
+- [Functionality] Base PR is functionality/accuracy ready on MI35x for both
+  DeepSeek-V4-Pro and DeepSeek-V4-Flash; lm_eval passed on full GSM8K:
+  [Ready to merge, #40871](https://github.com/vllm-project/vllm/pull/40871)
+- [Functionality] MI300 support PR:
+  [#41451](https://github.com/vllm-project/vllm/pull/41451)
+- [Performance] MLA Indexer optimization for DeepSeek-V4 and DeepSeek-V3.2 (ROCm):
+  [#41217](https://github.com/vllm-project/vllm/pull/41217)
+
diff --git a/models/deepseek-ai/DeepSeek-V4-Flash.yaml b/models/deepseek-ai/DeepSeek-V4-Flash.yaml
index 6bf204ed..067fb1fe 100644
--- a/models/deepseek-ai/DeepSeek-V4-Flash.yaml
+++ b/models/deepseek-ai/DeepSeek-V4-Flash.yaml
@@ -3,7 +3,7 @@ meta:
   slug: "deepseek-v4-flash"
   provider: "DeepSeek"
   description: "DeepSeek V4 MoE model with hybrid CSA+HCA attention, manifold-constrained hyper-connections, and three-tier reasoning (Non-think / Think High / Think Max)."
-  date_updated: 2026-04-24
+  date_updated: 2026-05-01
   difficulty: hard
   tasks:
     - text
@@ -17,11 +17,13 @@ meta:
     gb300: verified
     mi300x: unsupported
     mi325x: unsupported
-    mi355x: unsupported
+    mi355x: verified
 
 model:
   model_id: "deepseek-ai/DeepSeek-V4-Flash"
   min_vllm_version: "0.20.0"
+  docker_image:
+    amd: "vllm/vllm-openai-rocm:nightly"
   architecture: moe
   parameter_count: "284B"
   active_parameters: "13B"
@@ -91,6 +93,22 @@ hardware_overrides:
       - "--attention_config.use_fp4_indexer_cache=True"
       - "--moe-backend"
       - "deep_gemm_mega_moe"
+  amd:
+    extra_args:
+      - "--distributed-executor-backend"
+      - "mp"
+      - "--gpu-memory-utilization"
+      - "0.35"
+      - "--max-num-seqs"
+      - "16"
+      - "--max-num-batched-tokens"
+      - "1024"
+      - "--moe-backend"
+      - "triton_unfused"
+      - "--async-scheduling"
+      - "--enforce-eager"
+    extra_env:
+      VLLM_ROCM_USE_AITER: "1"
 
 strategy_overrides:
   single_node_tp:
@@ -228,6 +246,57 @@ guide: |
   replica on H200/B200/B300 (leaving headroom for throughput-vs-latency tuning).
   For disaggregated prefill/decode on GB200, use the PD Cluster tab.
 
+  On **MI355X (8×288GB)**, validation used ROCm + AITER (`VLLM_ROCM_USE_AITER=1`),
+  `--distributed-executor-backend mp`, `--gpu-memory-utilization 0.35`,
+  `--max-num-seqs 16`, `--max-num-batched-tokens 1024`,
+  `--moe-backend triton_unfused`, `--async-scheduling`, and `--enforce-eager`.
+
+  ## GSM8K validation (MI355X)
+
+  Launch command (TP=4):
+
+  ```bash
+  export HF_HOME=/data/huggingface-cache
+  export VLLM_ROCM_USE_AITER=1
+
+  vllm serve /home/models/DeepSeek-V4-Flash \
+    --host localhost \
+    --port 8001 \
+    --dtype auto \
+    --tensor-parallel-size 4 \
+    --max-num-seqs 16 \
+    --max-num-batched-tokens 1024 \
+    --distributed-executor-backend mp \
+    --trust-remote-code \
+    --gpu-memory-utilization 0.35 \
+    --moe-backend triton_unfused \
+    --tokenizer-mode deepseek_v4 \
+    --async-scheduling \
+    --enforce-eager
+  ```
+
+  GSM8K command:
+
+  ```bash
+  MODEL=/home/models/DeepSeek-V4-Flash
+  lm_eval --model local-completions \
+    --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 \
+    --batch_size auto \
+    --tasks gsm8k \
+    --num_fewshot 8 \
+    --output_path . 2>&1 | tee -a eval.log
+  ```
+
+  Reported result from PR #40871:
+
+  ```text
+  local-completions ({'model': '/home/models/DeepSeek-V4-Flash', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 4, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
+  |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+  |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+  |gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9439|±  |0.0063|
+  |     |       |strict-match    |     8|exact_match|↑  |0.9431|±  |0.0064|
+  ```
+
   ### H200 Single-Node PD (Mooncake)
 
   Single-host disaggregated serving: 4 prefill GPUs + 4 decode GPUs on one 8-GPU H200 node,
diff --git a/models/deepseek-ai/DeepSeek-V4-Pro.yaml b/models/deepseek-ai/DeepSeek-V4-Pro.yaml
index 8312008d..1c6e6adc 100644
--- a/models/deepseek-ai/DeepSeek-V4-Pro.yaml
+++ b/models/deepseek-ai/DeepSeek-V4-Pro.yaml
@@ -3,7 +3,7 @@ meta:
   slug: "deepseek-v4-pro"
   provider: "DeepSeek"
   description: "DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning."
-  date_updated: 2026-04-24
+  date_updated: 2026-05-01
   difficulty: hard
   tasks:
     - text
@@ -17,11 +17,13 @@ meta:
     gb300: verified
     mi300x: unsupported
     mi325x: unsupported
-    mi355x: unsupported
+    mi355x: verified
 
 model:
   model_id: "deepseek-ai/DeepSeek-V4-Pro"
   min_vllm_version: "0.20.0"
+  docker_image:
+    amd: "vllm/vllm-openai-rocm:nightly"
   architecture: moe
   parameter_count: "1600B"
   active_parameters: "49B"
@@ -109,6 +111,23 @@ hardware_overrides:
       - "--attention_config.use_fp4_indexer_cache=True"
       - "--moe-backend"
       - "deep_gemm_mega_moe"
+  amd:
+    extra_args:
+      - "--distributed-executor-backend"
+      - "mp"
+      - "--gpu-memory-utilization"
+      - "0.6"
+      - "--max-num-seqs"
+      - "128"
+      - "--max-num-batched-tokens"
+      - "8192"
+      - "--moe-backend"
+      - "triton_unfused"
+      - "--async-scheduling"
+      - "--enforce-eager"
+    extra_env:
+      VLLM_ROCM_USE_AITER: "1"
+      VLLM_ROCM_USE_AITER_LINEAR: "1"
 
 strategy_overrides:
   single_node_tp:
@@ -252,6 +271,59 @@ guide: |
   - **H200 (8× GPU)**: DP + EP with `--data-parallel-size 8`. Context is capped at 800K tokens
     (`--max-model-len 800000`) to leave KV headroom with dense params replicated across ranks —
     applies to both single-node and multi-node H200.
+  - **MI355X (8× GPU)**: validated with ROCm + AITER (`VLLM_ROCM_USE_AITER=1`,
+    `VLLM_ROCM_USE_AITER_LINEAR=1`), `--moe-backend triton_unfused`,
+    `--gpu-memory-utilization 0.6`, `--max-num-seqs 128`,
+    `--max-num-batched-tokens 8192`, and `--distributed-executor-backend mp`.
   - **GB200 NVL4 (4× GPU per tray)**: the ~960 GB mixed-precision checkpoint does not fit
     on one tray; run multi-node DP + EP across **2 trays** (8 GPUs total) with
     `--data-parallel-size 8`. Pick the "Multi-Node" tab and set nodes to 2.
+
+  ## GSM8K validation (MI355X)
+
+  Launch command (TP=8):
+
+  ```bash
+  export HF_HOME=/data/huggingface-cache
+  export VLLM_ROCM_USE_AITER=1
+  export VLLM_ROCM_USE_AITER_LINEAR=1
+
+  vllm serve /home/models/DeepSeek-V4-Pro \
+    --host localhost \
+    --port 8001 \
+    --dtype auto \
+    --kv-cache-dtype fp8 \
+    --tensor-parallel-size 8 \
+    --max-num-seqs 128 \
+    --max-num-batched-tokens 8192 \
+    --distributed-executor-backend mp \
+    --trust-remote-code \
+    --gpu-memory-utilization 0.6 \
+    --moe-backend triton_unfused \
+    --tokenizer-mode deepseek_v4 \
+    --reasoning-parser deepseek_v4 \
+    --async-scheduling \
+    --enforce-eager
+  ```
+
+  GSM8K command:
+
+  ```bash
+  MODEL=/home/models/DeepSeek-V4-Pro
+  lm_eval --model local-completions \
+    --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
+    --batch_size auto \
+    --tasks gsm8k \
+    --num_fewshot 8 \
+    --output_path . 2>&1 | tee -a eval.log
+  ```
+
+  Reported result from PR #40871:
+
+  ```text
+  local-completions ({'model': '/home/models/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 2, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
+  |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+  |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+  |gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
+  |     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|
+  ```