Fix ROCm build/runtime naming and MTP model mapping by adv0r · Pull Request #156 · antirez/ds4

adv0r · 2026-05-15T06:45:58Z

Summary

- add --rocm and --backend rocm as the canonical user-facing names for ROCm builds
- keep --cuda and --backend cuda as compatibility aliases because the shared GPU backend still uses the CUDA enum internally
- pass DS4_ROCM_BUILD to both C and HIP/CUDA compilation during make rocm
- make the low-level GPU runtime logs emitted by the shared CUDA/HIP file print ROCm/hipBLAS in ROCm builds instead of CUDA/cuBLAS
- fix two ROCm host-side rsqrtf() build failures by using 1.0f / sqrtf(...) for hipBLAS alpha constants
- fix the CUDA/HIP MTP support-model path so a secondary MTP GGUF does not replace the primary model mapping state

Implementation notes

This intentionally does not rename the internal DS4_BACKEND_CUDA enum or the shared CUDA/HIP implementation symbols. The ROCm branch still maps HIP onto the existing CUDA graph backend internally, so this patch keeps the internal backend identity stable and adds a separate compile-time DS4_GPU_BACKEND_CLI_NAME for the user-facing CLI spelling.
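A minimal sketch of that split between internal identity and CLI spelling might look like the following. Only DS4_ROCM_BUILD, DS4_GPU_BACKEND_CLI_NAME, and DS4_BACKEND_CUDA are names from this PR; the enum layout, the sketch_ function, and the choice to accept only the build's canonical name plus "cuda" are assumptions made for illustration.

```c
#include <string.h>

/* Compile-time user-facing spelling. DS4_ROCM_BUILD is assumed to be passed
 * by the build (for example by make rocm). */
#ifdef DS4_ROCM_BUILD
#define DS4_GPU_BACKEND_CLI_NAME "rocm"
#else
#define DS4_GPU_BACKEND_CLI_NAME "cuda"
#endif

/* Simplified stand-in for the real backend enum. */
typedef enum {
    SKETCH_BACKEND_NONE = 0, /* hypothetical "not a GPU backend name" value */
    DS4_BACKEND_CUDA         /* shared CUDA/HIP backend, identity unchanged */
} sketch_backend;

/* Hypothetical parser: returns 1 and stores the backend when name is either
 * the canonical CLI name for this build or the "cuda" compatibility alias. */
static int sketch_parse_gpu_backend(const char *name, sketch_backend *out)
{
    if (strcmp(name, DS4_GPU_BACKEND_CLI_NAME) == 0 ||
        strcmp(name, "cuda") == 0) {
        *out = DS4_BACKEND_CUDA;
        return 1;
    }
    return 0;
}
```

Both spellings resolve to the same internal value, so code that switches on the backend enum does not need to change.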

The MTP fix addresses a separate runtime issue found while testing the optional MTP GGUF on ROCm:

- the CUDA/HIP weight resolver has process-global state for the current full model map;
- with --mtp, startup registered the primary model map and then registered the MTP model map;
- the second registration made g_model_host_base point at the MTP file;
- the full-map shortcut then treated unrelated model maps as directly device-readable whenever g_model_registered was true;
- the primary model preload was also skipped when MTP was present.

The fix keeps the full-map shortcut scoped to the matching model_map, keeps caching checks scoped the same way, avoids promoting the MTP file to the CUDA/HIP global model map, and still prepares the primary model tensor cache when MTP is enabled. MTP weights can then be resolved through the existing per-range path instead of replacing the primary model state.
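The scoping idea can be sketched as follows. This is a simplified model under stated assumptions, not the code in the patch: g_model_host_base and g_model_registered are names from the description above, while g_model_size and the sketch_ functions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Process-global record of the one model map treated as fully device-readable. */
static const void *g_model_host_base = NULL;
static size_t      g_model_size = 0;        /* hypothetical field */
static int         g_model_registered = 0;

/* Register the primary model map. A later call (for example for the MTP
 * file) is ignored, so it cannot replace the primary state. */
static void sketch_register_model_map(const void *model_map, size_t size)
{
    if (g_model_registered) return;
    g_model_host_base = model_map;
    g_model_size = size;
    g_model_registered = 1;
}

/* Full-map shortcut: returns 1 only when the caller's map is the registered
 * one and the requested range lies inside it. Any other map returns 0 and
 * would fall back to a per-range path. */
static int sketch_full_map_hit(const void *model_map,
                               uint64_t offset, uint64_t bytes)
{
    if (!g_model_registered) return 0;
    if (model_map != g_model_host_base) return 0; /* scoped to matching map */
    return offset <= g_model_size && bytes <= g_model_size - offset;
}
```

In this model, registering a second map leaves g_model_host_base pointing at the primary, and a lookup against the second map never takes the shortcut.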

Validation

Built on AMD ROCm 7.2 / gfx1151 with:

make rocm ROCM_PATH=/opt/rocm-7.2.0 ROCM_ARCH=gfx1151 -j$(nproc)

Checked that ./ds4 --help, ./ds4-server --help, and ./ds4-bench --help expose --rocm, --backend rocm, and the compatibility cuda backend value in ROCm builds.

Runtime smoke on a Radeon 8060S / gfx1151:

- ./ds4 --inspect --backend rocm --ctx 32768 loads the 80.76 GiB GGUF and logs ROCm backend initialized on Radeon 8060S Graphics (gfx1151)
- ./ds4 --backend rocm --ctx 32768 --nothink -n 32 --temp 0 -p ... returns DS4_OK
- ./ds4-server --backend rocm --host 127.0.0.1 --port 18188 --ctx 32768 --kv-disk-dir /scratch/ds4-kv ... serves both /v1/chat/completions and /v1/messages
- ./ds4-bench --backend rocm --ctx-start 2048 --ctx-max 2048 --ctx-alloc 4096 --gen-tokens 32 reports 39.33 prefill tok/s and 8.99 generation tok/s

MTP regression check on the same machine:

Before the last commit, this command shape booted but crashed on the first prompt with a ROCm GPU page fault:

./ds4-server \
  --model DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
  --backend rocm \
  --ctx 65536 \
  --mtp DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
  --mtp-draft 2

Crash signature:

Memory access fault by GPU node-1 ... Reason: Page not present or supervisor privilege.

After the fix:

- ctx=16384 --mtp-draft 2 returned DS4 MTP FIX OK
- ctx=65536 --mtp-draft 2 returned DS4 MTP 65K FIX OK
- startup logs show the primary model tensor cache is prepared again with MTP enabled: ROCm startup model cache prepared 80.76 GiB of tensor spans

This only fixes the crash. On this APU, the MTP path was not a throughput win in my short benchmark: non-MTP 256-token decode was about 10.18 tok/s, while patched MTP was about 8.44 tok/s.

adv0r force-pushed the codex/rocm-backend-name branch from 3116fc8 to 1d2f608 on May 15, 2026 07:10

adv0r marked this pull request as ready for review May 15, 2026 07:11

adv0r force-pushed the codex/rocm-backend-name branch from 1d2f608 to 9960bd2 on May 15, 2026 07:51

Add ROCm backend CLI alias

31b7425

adv0r force-pushed the codex/rocm-backend-name branch from 9960bd2 to 31b7425 on May 15, 2026 07:53

Fix CUDA MTP model map state

a839e32

adv0r changed the title ~~Add ROCm backend CLI alias~~ Fix ROCm build/runtime naming and MTP model mapping May 15, 2026

adv0r wants to merge 2 commits into antirez:rocm from adv0r:codex/rocm-backend-name
