
[Bug]: v0.21.0 release notes claim 'DeepSeek V4: AMD/ROCm support' but stock vllm/vllm-openai-rocm:v0.21.0 fails to bring up DeepSeek-V4-Flash on MI350X (mhc_post_tilelang → PDL is not supported); the official recipes YAML marks every AMD SKU as unsupported #42876

@ziliangpeng


Your current environment

The reproduction pod has been reclaimed by other workloads, so a fresh python collect_env.py is not available; relevant fields gathered in-pod during the run:

Environment (captured in-pod during the failing run)
Container image:   vllm/vllm-openai-rocm:v0.21.0   (official vllm-project release image)
vllm:              0.21.0   (/usr/local/lib/python3.12/dist-packages/vllm/__init__.py)
PyTorch:           2.10.0+git8514f05
HIP runtime:       7.2.53211
Platform:          ROCm 7.2.53
GPU:               8x AMD Instinct MI350X (gfx950)
                   288 GiB HBM each (294592 MiB)
                   256 CUs, 160 KiB shared_memory_per_block
current_platform:  is_rocm()=True, is_fp8_fnuz()=False,
                   fp8_dtype()=torch.float8_e4m3fn, capability=(9, 5)
Tokenizer:         deepseek_v4 (per --tokenizer-mode)
Model:             deepseek-ai/DeepSeek-V4-Flash (canonical HF repo,
                   46 shards / ~159 GB, FP4+FP8 mixed per the model card)
Cluster:           Kubernetes (DigitalOcean DOKS), MI350X partition.

🐛 Describe the bug

Summary

The v0.21.0 release notes list "DeepSeek V4: AMD/ROCm support (#40871)" under Model Support, but on the official vllm/vllm-openai-rocm:v0.21.0 image, attempting to serve deepseek-ai/DeepSeek-V4-Flash on MI350X (gfx950) fails during profile_run in the mhc_post_tilelang op with a hard TVM/TileLang LogFatal:

tvm.error.InternalError: Check failed: (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false: PDL is not supported

PDL is NVIDIA Hopper's Programmatic Dependent Launch, a CUDA-only feature. The TileLang pass MarkCudaSyncCalls rejects IR that emits trigger_launch / grid_sync ops when the target is AMD. The path is reached from the DSv4 model's forward → torch.ops.vllm.mhc_fused_post_pre → mhc.py:717 mhc_post_tilelang → TileLang JIT compile.

Additionally, the canonical recipe source backing recipes.vllm.ai marks every AMD SKU as unsupported for both DSv4-Pro and DSv4-Flash (see DeepSeek-V4-Flash.yaml and DeepSeek-V4-Pro.yaml):

hardware:
  h200: verified
  b200: verified
  gb200: verified
  b300: verified
  gb300: verified
  mi300x: unsupported
  mi325x: unsupported
  mi355x: unsupported

So the release notes claim AMD/ROCm support, while the official recipes from the same project say it is unsupported. For a user who reads the release notes first, that contradiction has a real cost: about half a day of triage on real hardware before discovering it.

There is already a merged main-branch PR that adds AMD MHC support (#41946 "[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support", merged 2026-05-13), but it did not make the v0.21.0 cut at commit ad7125a (git log v0.21.0 --grep="(#41946)" returns nothing). The remaining AMD fallbacks for the rest of the Flash variant are in open PR #41136 (lcskrishna/vllm:deepseekv4-rocm, currently state=open, mergeable_state=dirty).
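The tag check quoted above can be scripted when several PRs need verifying. The following is a minimal sketch, not part of vLLM: the helper names are hypothetical, and it assumes merge commits carry the PR number in the form "(#NNNN)" in their message, as the grep above does.

```python
import subprocess


def build_git_log_cmd(tag: str, pr_number: int) -> list[str]:
    """Build the `git log` command that searches a tag's history for a PR number.

    --fixed-strings makes git match "(#41946)" literally instead of as a regex.
    """
    return [
        "git", "log", tag, "--oneline",
        "--fixed-strings", f"--grep=(#{pr_number})",
    ]


def tag_contains_pr(repo_dir: str, tag: str, pr_number: int) -> bool:
    """Return True if any commit reachable from `tag` mentions the PR number.

    Assumption: the PR number appears in the commit message as "(#NNNN)".
    """
    result = subprocess.run(
        build_git_log_cmd(tag, pr_number),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    # Empty output means no matching commit is reachable from the tag.
    return bool(result.stdout.strip())
```

Run against a local vLLM checkout, `tag_contains_pr(".", "v0.21.0", 41946)` should mirror the manual result reported above.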

Reproduction

Launch command (adapted from the only recipe-shaped doc that exists for AMD, wuhuikx/recipes#433 — unmerged PR self-labeled [Do not merge]; the official YAML provides no AMD command since AMD is unsupported). The --kv-cache-dtype fp8 is required because vllm/model_executor/models/deepseek_v4.py:1636 asserts on it:

export VLLM_ROCM_USE_AITER=1

vllm serve /path/to/DeepSeek-V4-Flash \
  --host 0.0.0.0 --port 8000 \
  --dtype auto --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --max-num-seqs 16 --max-num-batched-tokens 1024 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.35 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --async-scheduling --enforce-eager

What works

Everything up to the first forward pass succeeds:

  • Engine config wires up cleanly: quantization=deepseek_v4_fp8, kv_cache_dtype=fp8, tensor_parallel_size=4, enforce_eager=True, moe_backend='triton_unfused', tokenizer_mode='deepseek_v4'.
  • All 4 TP workers spawn, NCCL/RCCL collectives initialize, aiter imports.
  • 46/46 safetensors shards load in 233 s (default_loader.py:397 "Loading weights took 233.04 seconds").
  • KV cache backend registers as fp8_ds_mla (deepseek_v4_attention.py:714 "Using DeepSeek's fp8_ds_mla KV cache format.").

What fails

gpu_worker.determine_available_memory → model_runner.profile_run → dummy forward:

File "vllm/model_executor/models/deepseek_v4.py", line 1235, in forward
    residual, post_mix, res_mix, x = torch.ops.vllm.mhc_fused_post_pre(...)
File "vllm/model_executor/layers/mhc.py", line 717, in mhc_fused_post_pre
    mhc_post_tilelang(...)
File "tilelang/jit/__init__.py", line 465, in __call__
    kernel = self.compile(*args, **kwargs)
File "tilelang/engine/lower.py", line 311, in lower
    mod = OptimizeForTarget(mod, target)
File "tilelang/engine/phase.py", line 272, in OptimizeForTarget
    mod = tilelang.transform.MarkCudaSyncCalls(have_pdl(target))(mod)
...
tvm.error.InternalError: Check failed:
    (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false:
    PDL is not supported

All 4 workers crash identically. EngineCore then raises:

RuntimeError: Worker failed with error 'Check failed: (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false: PDL is not supported'

No tokens are ever generated.

Why this is a bug worth reporting (not "your hardware is unsupported")

The hardware-unsupported signal exists in recipes.vllm.ai, but not in the v0.21.0 release notes, which read "DeepSeek V4: AMD/ROCm support (#40871)" — a phrase a downstream reader will reasonably interpret as "Flash on at least one MI3xx SKU runs in 0.21.0". The two sources contradict each other.

Additionally, the user-facing failure is a TVM C++ LogFatal deep in TileLang lowering, with no actionable up-front platform check. There is no message saying "DSv4 MHC has no AMD fallback in this build; see #41136" — the user has to dig into the stack to discover that.

Suggested fixes (any one of these would help)

  1. Reconcile the docs. Either narrow the release-notes claim (e.g., "initial AMD/ROCm enablement landed in #40871 ([New Model][ROCm] Add AMD support for DeepSeek V4); functional Flash support on MI3xx requires fallbacks that are still in flight in #41136 ([ROCm] DeepSeekV4-Flash-Base model enablement on ROCm with triton & torchfallback) and were not all included in 0.21.0"), or, if the intent was that Pro on MI355X is actually verified, update the recipes YAMLs to reflect a preview / partial status rather than unsupported.
  2. Add an explicit platform check at mhc_post_tilelang entry so users on ROCm get a clear error pointing to #41136 and #41946, rather than a TVM C++ check failing inside TileLang IR lowering.
  3. Consider a v0.21.1 cut that picks up #41946 (already merged on main but missed the 0.21.0 tag), so the gap between "release notes say supported" and "Flash runs on AMD" shrinks.
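To illustrate suggestion 2, here is a rough sketch of what an up-front check might look like. It is not vLLM code: `MhcPlatformError` and `check_mhc_tilelang_supported` are hypothetical names, and the platform flag is passed in as a plain bool so the sketch runs without vLLM installed. In vLLM the value would presumably come from `current_platform.is_rocm()`.

```python
class MhcPlatformError(RuntimeError):
    """Hypothetical error raised before any TileLang compilation is attempted."""


def check_mhc_tilelang_supported(is_rocm: bool) -> None:
    """Fail fast with an actionable message instead of a TVM check failure.

    Assumption: the caller passes the platform flag; a real call site at the
    top of mhc_post_tilelang would likely use current_platform.is_rocm().
    """
    if is_rocm:
        raise MhcPlatformError(
            "DeepSeek V4 MHC (mhc_post_tilelang) has no ROCm path in this "
            "build: the TileLang kernel uses PDL, which is CUDA-only. "
            "See vLLM PRs #41946 and #41136."
        )


# Usage: on a CUDA platform the check is a no-op; on ROCm it raises early.
check_mhc_tilelang_supported(is_rocm=False)
```

Raising before the JIT compile would surface the problem once, at a single call site, rather than as a worker crash repeated across all TP ranks.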

Related

cc @hongxiayang @tjtanaa @vllmellm @lcskrishna (per the auto-cc ROCm group).
