[Bug]: v0.21.0 release notes claim 'DeepSeek V4: AMD/ROCm support' but stock vllm/vllm-openai-rocm:v0.21.0 fails to bring up DeepSeek-V4-Flash on MI350X (mhc_post_tilelang → PDL is not supported); the official recipes YAML marks every AMD SKU as unsupported #42876

## Your current environment

The reproduction pod has been reclaimed by other workloads, so a fresh `python collect_env.py` is not available; the relevant fields below were captured in-pod during the failing run:

<details>
<summary>Environment (captured in-pod during the failing run)</summary>

```text
Container image:   vllm/vllm-openai-rocm:v0.21.0   (official vllm-project release image)
vllm:              0.21.0   (/usr/local/lib/python3.12/dist-packages/vllm/__init__.py)
PyTorch:           2.10.0+git8514f05
HIP runtime:       7.2.53211
Platform:          ROCm 7.2.53
GPU:               8x AMD Instinct MI350X (gfx950)
                   288 GiB HBM each (294592 MiB)
                   256 CUs, 160 KiB shared_memory_per_block
current_platform:  is_rocm()=True, is_fp8_fnuz()=False,
                   fp8_dtype()=torch.float8_e4m3fn, capability=(9, 5)
Tokenizer:         deepseek_v4 (per --tokenizer-mode)
Model:             deepseek-ai/DeepSeek-V4-Flash (canonical HF repo,
                   46 shards / ~159 GB, FP4+FP8 mixed per the model card)
Cluster:           Kubernetes (DigitalOcean DOKS), MI350X partition.
```

</details>

## 🐛 Describe the bug

### Summary

The v0.21.0 release notes list **"DeepSeek V4: AMD/ROCm support (#40871)"** under Model Support, but on the official `vllm/vllm-openai-rocm:v0.21.0` image, attempting to serve `deepseek-ai/DeepSeek-V4-Flash` on MI350X (gfx950) fails during `profile_run` in the `mhc_post_tilelang` op with a hard TVM/TileLang `LogFatal`:

```
tvm.error.InternalError: Check failed: (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false: PDL is not supported
```

PDL is Programmatic Dependent Launch, a CUDA-only feature introduced with NVIDIA Hopper. The TileLang pass `MarkCudaSyncCalls` rejects IR that contains `trigger_launch` / `grid_sync` ops when the target has no PDL support, as on AMD. The path is reached from the DSv4 model's `forward` → `torch.ops.vllm.mhc_fused_post_pre` (`mhc.py:717`) → `mhc_post_tilelang` → TileLang JIT compile.

Additionally, the canonical recipe source backing [recipes.vllm.ai](https://recipes.vllm.ai) marks **every AMD SKU as `unsupported`** for both DSv4-Pro and DSv4-Flash (see [`DeepSeek-V4-Flash.yaml`](https://github.com/vllm-project/recipes/blob/main/models/deepseek-ai/DeepSeek-V4-Flash.yaml) and [`DeepSeek-V4-Pro.yaml`](https://github.com/vllm-project/recipes/blob/main/models/deepseek-ai/DeepSeek-V4-Pro.yaml)):

```yaml
hardware:
  h200: verified
  b200: verified
  gb200: verified
  b300: verified
  gb300: verified
  mi300x: unsupported
  mi325x: unsupported
  mi355x: unsupported
```

So the release notes claim AMD/ROCm support, while the official recipes from the same project mark it `unsupported`. For a user who reads the release notes first, this contradiction has a real cost: roughly half a day of triage on real hardware before discovering it.
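
The recipe side of the mismatch can be checked mechanically. The following is a minimal sketch, assuming the `hardware:` block keeps the flat `key: value` shape quoted above and that AMD SKUs are the keys starting with `mi` (a prefix rule chosen for illustration, not a documented recipes convention). `parse_hardware` is a hypothetical helper; it hand-parses the fragment so the example needs no YAML library.

```python
# Fragment copied from the recipes YAML quoted above.
RECIPE_HARDWARE = """\
hardware:
  h200: verified
  b200: verified
  gb200: verified
  b300: verified
  gb300: verified
  mi300x: unsupported
  mi325x: unsupported
  mi355x: unsupported
"""


def parse_hardware(fragment: str) -> dict[str, str]:
    """Parse the indented `sku: status` lines of a flat `hardware:` block."""
    statuses: dict[str, str] = {}
    for line in fragment.splitlines():
        # Only indented lines are entries; the top-level `hardware:` key is skipped.
        if line.startswith(" ") and ":" in line:
            sku, status = line.split(":", 1)
            statuses[sku.strip()] = status.strip()
    return statuses


statuses = parse_hardware(RECIPE_HARDWARE)
# Assumption: AMD Instinct SKUs are the keys that start with "mi".
amd_statuses = {sku: s for sku, s in statuses.items() if sku.startswith("mi")}
for sku, status in amd_statuses.items():
    print(f"{sku}: {status}")
```

For a real recipe file, a proper YAML parser would be the safer choice; the hand-rolled split only handles this flat fragment.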

There is already a merged main-branch PR that adds AMD MHC support (#41946 "[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support", merged 2026-05-13), but it **did not make the v0.21.0 cut** at commit `ad7125a` (`git log v0.21.0 --grep="(#41946)"` returns nothing). The remaining AMD fallbacks for the rest of the Flash variant are in **open PR #41136** (`lcskrishna/vllm:deepseekv4-rocm`, currently `state=open, mergeable_state=dirty`).
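
The tag check above can be scripted for several PRs at once. This is a sketch under one assumption: merge commit subjects carry the PR number as `(#NNNN)`, which is what the `--grep` pattern in the command above relies on. `git_grep_cmd` and `pr_in_tag` are hypothetical helper names, and `repo_dir` must point at a local clone of the vLLM repository with tags fetched.

```python
import subprocess


def git_grep_cmd(tag: str, pr_number: int) -> list[str]:
    """Build the `git log` invocation that searches a tag's history for a PR."""
    return ["git", "log", tag, "--oneline", f"--grep=(#{pr_number})"]


def pr_in_tag(repo_dir: str, tag: str, pr_number: int) -> bool:
    """Return True if any commit reachable from `tag` mentions the PR number."""
    result = subprocess.run(
        git_grep_cmd(tag, pr_number),
        cwd=repo_dir,
        check=True,          # raise if git itself fails (e.g. unknown tag)
        capture_output=True,
        text=True,
    )
    # Empty output means no matching commit is reachable from the tag.
    return bool(result.stdout.strip())
```

Calling `pr_in_tag("/path/to/vllm", "v0.21.0", 41946)` would reproduce the manual check for the MHC PR.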

### Reproduction

The launch command below is adapted from the only recipe-shaped doc that exists for AMD, [`wuhuikx/recipes#433`](https://github.com/vllm-project/recipes/pull/433), an unmerged PR self-labeled `[Do not merge]`; the official YAML provides no AMD command since AMD is `unsupported`. The `--kv-cache-dtype fp8` flag is required because `vllm/model_executor/models/deepseek_v4.py:1636` asserts on it:

```bash
export VLLM_ROCM_USE_AITER=1

vllm serve /path/to/DeepSeek-V4-Flash \
  --host 0.0.0.0 --port 8000 \
  --dtype auto --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --max-num-seqs 16 --max-num-batched-tokens 1024 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.35 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --async-scheduling --enforce-eager
```

### What works

Everything up to the first forward pass succeeds:

- Engine config wires up cleanly: `quantization=deepseek_v4_fp8`, `kv_cache_dtype=fp8`, `tensor_parallel_size=4`, `enforce_eager=True`, `moe_backend='triton_unfused'`, `tokenizer_mode='deepseek_v4'`.
- All 4 TP workers spawn, NCCL/RCCL collectives initialize, `aiter` imports.
- 46/46 safetensors shards load in 233 s (`default_loader.py:397 "Loading weights took 233.04 seconds"`).
- KV cache backend registers as `fp8_ds_mla` (`deepseek_v4_attention.py:714 "Using DeepSeek's fp8_ds_mla KV cache format."`).

### What fails

`gpu_worker.determine_available_memory` → `model_runner.profile_run` → dummy forward:

```text
File "vllm/model_executor/models/deepseek_v4.py", line 1235, in forward
    residual, post_mix, res_mix, x = torch.ops.vllm.mhc_fused_post_pre(...)
File "vllm/model_executor/layers/mhc.py", line 717, in mhc_fused_post_pre
    mhc_post_tilelang(...)
File "tilelang/jit/__init__.py", line 465, in __call__
    kernel = self.compile(*args, **kwargs)
File "tilelang/engine/lower.py", line 311, in lower
    mod = OptimizeForTarget(mod, target)
File "tilelang/engine/phase.py", line 272, in OptimizeForTarget
    mod = tilelang.transform.MarkCudaSyncCalls(have_pdl(target))(mod)
...
tvm.error.InternalError: Check failed:
    (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false:
    PDL is not supported
```

All 4 workers crash identically. EngineCore then raises:

```text
RuntimeError: Worker failed with error 'Check failed: (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false: PDL is not supported'
```

No tokens are ever generated.

### Why this is a bug worth reporting (not "your hardware is unsupported")

The hardware-unsupported signal exists in `recipes.vllm.ai`, but **not** in the v0.21.0 release notes, which read "DeepSeek V4: AMD/ROCm support (#40871)" — a phrase a downstream reader will reasonably interpret as "Flash on at least one MI3xx SKU runs in 0.21.0". The two sources contradict each other.

Additionally, the user-facing failure is a TVM C++ `LogFatal` deep in TileLang lowering, with no actionable up-front platform check. There is no message saying "DSv4 MHC has no AMD fallback in this build; see #41136" — the user has to dig into the stack to discover that.

### Suggested fixes (any one of these would help)

1. **Reconcile the docs.** Either narrow the release-notes claim (e.g., "initial AMD/ROCm enablement landed in #40871; functional Flash support on MI3xx requires fallbacks that are still in flight in #41136 and were not all included in 0.21.0"), or — if the intent was that Pro on MI355X is actually verified — update the recipes YAMLs to reflect a `preview` / `partial` status rather than `unsupported`.
2. **Add an explicit platform check at `mhc_post_tilelang` entry** so users on ROCm get a clear error pointing to #41136 and #41946, rather than a TVM C++ check failing inside TileLang IR lowering.
3. **Consider a v0.21.1 cut that picks up #41946** (which is already merged on main but missed the 0.21.0 tag), so the gap between "release notes says supported" and "Flash runs on AMD" shrinks.
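
To make suggestion 2 concrete, here is a rough sketch of what an up-front check could look like. It is not vLLM's actual code: `select_mhc_backend`, `UnsupportedPlatformError`, and the backend labels `"tilelang"` / `"aiter"` are hypothetical names, and the two boolean inputs stand in for whatever vLLM would really query (presumably `current_platform.is_rocm()` and a probe for the aiter MHC kernel added in #41946).

```python
class UnsupportedPlatformError(RuntimeError):
    """Raised before any kernel compilation when no MHC backend fits."""


def select_mhc_backend(is_rocm: bool, has_aiter_mhc: bool) -> str:
    """Pick an MHC backend label, or fail early with an actionable message.

    Both arguments are hypothetical inputs used for illustration only.
    """
    if not is_rocm:
        # CUDA targets keep using the existing TileLang kernel.
        return "tilelang"
    if has_aiter_mhc:
        # Assumed ROCm fallback; how #41946 actually exposes it is not shown here.
        return "aiter"
    raise UnsupportedPlatformError(
        "DeepSeek-V4 MHC has no ROCm fallback in this build: the TileLang "
        "kernel relies on PDL, which is CUDA-only. "
        "See vllm-project/vllm #41946 and #41136."
    )
```

With a check of this shape, the failing configuration in this report would stop at model setup with a message naming the relevant PRs, instead of surfacing as a TVM `InternalError` during `profile_run`.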

### Related

- Release notes for v0.21.0: https://github.com/vllm-project/vllm/releases/tag/v0.21.0
- Base AMD enablement PR cited in release notes: #40871 (merged, in 0.21.0)
- AMD MHC support: #41946 (merged on main 2026-05-13, **not in v0.21.0**)
- Remaining AMD fallbacks for DSv4-Flash: #41136 (open, dirty)
- Tracking issue for DSv4 ROCm: #41820
- Existing AMD-specific DSv4 bugs (separate failure modes, not duplicates): #41961, #41962, #41963, #41027
- NVIDIA SM12x equivalent of this MHC issue (closed via a different code path): #40991
- Recipes YAMLs marking AMD as unsupported:
  - https://github.com/vllm-project/recipes/blob/main/models/deepseek-ai/DeepSeek-V4-Flash.yaml
  - https://github.com/vllm-project/recipes/blob/main/models/deepseek-ai/DeepSeek-V4-Pro.yaml

cc @hongxiayang @tjtanaa @vllmellm @lcskrishna (per the auto-cc ROCm group).

