[Bug]: v0.21.0 release notes claim 'DeepSeek V4: AMD/ROCm support' but stock vllm/vllm-openai-rocm:v0.21.0 fails to bring up DeepSeek-V4-Flash on MI350X (mhc_post_tilelang → PDL is not supported); the official recipes YAML marks every AMD SKU as unsupported #42876
Your current environment

The reproduction pod has been reclaimed by other workloads, so a fresh `python collect_env.py` is not available; relevant fields gathered in-pod during the run:

Environment (captured in-pod during the failing run)

🐛 Describe the bug

Summary

The v0.21.0 release notes list "DeepSeek V4: AMD/ROCm support (#40871)" under Model Support, but on the official `vllm/vllm-openai-rocm:v0.21.0` image, attempting to serve `deepseek-ai/DeepSeek-V4-Flash` on MI350X (gfx950) fails during `profile_run` in the `mhc_post_tilelang` op with a hard TVM/TileLang `LogFatal`:

```
tvm.error.InternalError: Check failed: (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false: PDL is not supported
```

PDL = NVIDIA Hopper's Programmatic Dependent Launch, a CUDA-only feature. The TileLang pass `MarkCudaSyncCalls` refuses an IR that emits `trigger_launch` / `grid_sync` ops when targeting AMD. The path is reached from the DSv4 model's `forward` → `torch.ops.vllm.mhc_fused_post_pre` → `mhc.py:717 mhc_post_tilelang` → TileLang JIT compile.

Additionally, the canonical recipe source backing recipes.vllm.ai marks every AMD SKU as `unsupported` for both DSv4-Pro and DSv4-Flash (see `DeepSeek-V4-Flash.yaml` and `DeepSeek-V4-Pro.yaml`).

So the release notes claim AMD/ROCm support, while the official recipes (same project) say it is `unsupported`. As a user reading the release notes first, this contradiction is the actual cost: ~half a day of triage on real hardware before discovering it.

There is already a merged main-branch PR that adds AMD MHC support (#41946 "[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support", merged 2026-05-13), but it did not make the v0.21.0 cut at commit `ad7125a` (`git log v0.21.0 --grep="(#41946)"` returns nothing). The remaining AMD fallbacks for the rest of the Flash variant are in open PR #41136 (`lcskrishna/vllm:deepseekv4-rocm`, currently `state=open, mergeable_state=dirty`).

Reproduction

Launch command (adapted from the only recipe-shaped doc that exists for AMD, `wuhuikx/recipes#433`, an unmerged PR self-labeled `[Do not merge]`; the official YAML provides no AMD command since AMD is `unsupported`). The `--kv-cache-dtype fp8` is required because `vllm/model_executor/models/deepseek_v4.py:1636` asserts on it:

```
export VLLM_ROCM_USE_AITER=1
vllm serve /path/to/DeepSeek-V4-Flash \
  --host 0.0.0.0 --port 8000 \
  --dtype auto --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --max-num-seqs 16 --max-num-batched-tokens 1024 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.35 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --async-scheduling --enforce-eager
```

What works

Everything up to the first forward pass succeeds:

- `quantization=deepseek_v4_fp8`, `kv_cache_dtype=fp8`, `tensor_parallel_size=4`, `enforce_eager=True`, `moe_backend='triton_unfused'`, `tokenizer_mode='deepseek_v4'`.
- `aiter` imports.
- […] (`default_loader.py:397` "Loading weights took 233.04 seconds").
- `fp8_ds_mla` (`deepseek_v4_attention.py:714` "Using DeepSeek's fp8_ds_mla KV cache format.").

What fails

`gpu_worker.determine_available_memory` → `model_runner.profile_run` → dummy forward:

```
File "vllm/model_executor/models/deepseek_v4.py", line 1235, in forward
    residual, post_mix, res_mix, x = torch.ops.vllm.mhc_fused_post_pre(...)
File "vllm/model_executor/layers/mhc.py", line 717, in mhc_fused_post_pre
    mhc_post_tilelang(...)
File "tilelang/jit/__init__.py", line 465, in __call__
    kernel = self.compile(*args, **kwargs)
File "tilelang/engine/lower.py", line 311, in lower
    mod = OptimizeForTarget(mod, target)
File "tilelang/engine/phase.py", line 272, in OptimizeForTarget
    mod = tilelang.transform.MarkCudaSyncCalls(have_pdl(target))(mod)
...
tvm.error.InternalError: Check failed:
  (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false:
  PDL is not supported
```

All 4 workers crash identically. EngineCore then raises:

```
RuntimeError: Worker failed with error 'Check failed: (!mutator.has_trigger_launch_ && !mutator.has_grid_sync_) is false: PDL is not supported'
```

No tokens are ever generated.

Why this is a bug worth reporting (not "your hardware is unsupported")

The hardware-unsupported signal exists in `recipes.vllm.ai`, but not in the v0.21.0 release notes, which read "DeepSeek V4: AMD/ROCm support (#40871)", a phrase a downstream reader will reasonably interpret as "Flash on at least one MI3xx SKU runs in 0.21.0". The two sources contradict each other.

Additionally, the user-facing failure is a TVM C++ `LogFatal` deep in TileLang lowering, with no actionable up-front platform check. There is no message saying "DSv4 MHC has no AMD fallback in this build; see #41136"; the user has to dig into the stack to discover that.

Suggested fixes (any one of these would help)

- Consider a v0.21.1 cut that picks up #41946 ("[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support"), which is already merged on main but missed the 0.21.0 tag, so the gap between "release notes say supported" and "Flash runs on AMD" shrinks.
- […] `preview`/`partial` status rather than `unsupported`.
- […] `mhc_post_tilelang` entry so users on ROCm get a clear error pointing to #41136 ("[ROCm] DeepSeekV4-Flash-Base model enablement on ROCm with triton & torch fallback") and #41946 ("[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support"), rather than a TVM C++ check failing inside TileLang IR lowering.

Related
cc @hongxiayang @tjtanaa @vllmellm @lcskrishna (per the auto-cc ROCm group).
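
For illustration, the up-front platform check asked for above could look roughly like the sketch below. It is not vLLM code: `PlatformInfo`, `UnsupportedPlatformError`, `check_mhc_platform`, and `mhc_post_guarded` are hypothetical names, and the `has_aiter_mhc` flag is an assumed stand-in for "this build contains an AMD MHC path such as the one from #41946". The only point is that the guard runs before any TileLang JIT compile, so a ROCm user sees an actionable message instead of the `PDL is not supported` check failure.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class UnsupportedPlatformError(RuntimeError):
    """Raised before kernel compilation when no MHC backend fits the platform."""


@dataclass(frozen=True)
class PlatformInfo:
    # Hypothetical stand-in for whatever platform query the real code would use.
    is_rocm: bool
    # Assumption: True once the build ships an AMD MHC implementation.
    has_aiter_mhc: bool = False


def check_mhc_platform(platform: PlatformInfo) -> None:
    """Fail fast with a readable message instead of a TVM check failure."""
    if platform.is_rocm and not platform.has_aiter_mhc:
        raise UnsupportedPlatformError(
            "DSv4 MHC has no AMD fallback in this build; "
            "see #41136 and #41946."
        )


def mhc_post_guarded(platform: PlatformInfo, compile_and_run: Callable[[], T]) -> T:
    """Run the guard first, then the (possibly expensive) kernel path."""
    check_mhc_platform(platform)
    return compile_and_run()


if __name__ == "__main__":
    try:
        mhc_post_guarded(PlatformInfo(is_rocm=True), lambda: "kernel output")
    except UnsupportedPlatformError as err:
        print(f"startup aborted: {err}")
```

Raising a dedicated exception type, rather than returning a flag, keeps the failure at engine start-up and lets the message carry the PR references directly.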