Eval bug: --split-mode tensor aborts in ggml_backend_meta_buffer_get_tensor with Qwen3 MoE Q8_K_XL on ROCm #22307

@cgarwood82

Name and Version

$ ./llama-server --version
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65248 MiB):
  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
  Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
version: 1 (8bc492e)
built with GNU 15.2.1 for Linux x86_64

Built from commit 8bc492e (current master at time of filing). Includes PR #19378 (backend-agnostic tensor parallelism) and follow-up fix PR #22129 (Gemma-4 MoE delayed AllReduce).

Operating systems

Linux

GGML backends

HIP

Hardware

CPU: AMD Ryzen 9 9950X3D (16-core)
GPUs: 2x AMD Radeon AI PRO R9700 (Navi 48, gfx1201, 32 GB each, 64 GB total VRAM)
ROCm 7.2.2, RCCL installed
PCIe: Cards on separate roots (0000:03:00.0 and 0000:07:00.0), PEER_MAX_BATCH_SIZE = 128

Models

unsloth/Qwen3.6-35B-A3B-GGUF — file Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf (~35 GB). MoE architecture with ~3B active params per token. Snapshot a483e9e6cbd595906af30beda3187c2663a1118c. The crash reproduces with the base model file alone (no --mmproj).

Problem description & steps to reproduce

--split-mode tensor loads the model successfully across both GPUs (the meta-backend reports Meta() model buffer size = 18177.72 MiB, with VRAM split roughly 53/47 across the two cards) and the server starts cleanly. However, both normal usage paths then abort inside the meta-backend at the same source location.

Assertion (both failure modes hit this exact line):

ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed

which is this check inside ggml_backend_meta_buffer_get_tensor:

if (split_state.n_segments != 1) {
    GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
    GGML_ASSERT(offset == 0);
    GGML_ASSERT(size == ggml_nbytes(tensor));   // <-- line 1299
    ...
}
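
To make the failing condition concrete, here is a minimal self-contained sketch of that guard. It is not the ggml code: `SplitStateSketch` and `read_passes_guard` are hypothetical names, and the single-segment branch is assumed to accept any in-bounds read because the snippet above only shows the multi-segment branch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the meta-backend's per-tensor split description.
struct SplitStateSketch {
    int n_segments; // 1 means the tensor is not split across devices
    int axis;       // split axis, only meaningful when n_segments != 1
};

constexpr int kMaxDims = 4; // mirrors GGML_MAX_DIMS, which is 4 in ggml

// Returns true when a get_tensor-style read of `size` bytes at `offset`
// would pass the three checks quoted above for a tensor of `tensor_nbytes`.
bool read_passes_guard(const SplitStateSketch & s, size_t offset, size_t size, size_t tensor_nbytes) {
    if (s.n_segments == 1) {
        // Assumption: the unsplit path accepts any in-bounds range.
        return offset <= tensor_nbytes && size <= tensor_nbytes - offset;
    }
    if (s.axis < 0 || s.axis >= kMaxDims) return false; // axis check
    if (offset != 0) return false;                      // offset check
    return size == tensor_nbytes;                       // the check at line 1299
}
```

Under these assumptions, a split tensor of 1024 bytes passes only for `offset == 0, size == 1024`; a request for the first 512 bytes fails at the third check, which matches the assertion seen in both failure modes.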

Failure mode 1 — idle-slot cache save

Launch command:

./llama-server \
    -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
    --ctx-size 16384 \
    --port 18001 \
    --split-mode tensor \
    --tensor-split 1,1 \
    -fit off \
    -ngl 99 \
    --no-webui

(-fit off is required — with the default -fit on, startup aborts earlier with llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort.)

Send any single chat completion request, however short:

curl -s http://127.0.0.1:18001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"x","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'

The request returns a response. A few seconds later, as the server tries to save the now-idle slot:

slot slot_save_an: id  3 | task -1 | saving idle slot to prompt cache
srv   prompt_save:  - saving prompt with length 47, total state size = 63.732 MiB
ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed

→ process aborts (SIGABRT, exit 134).
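
For readers unfamiliar with the exit code: shells conventionally report a process killed by signal N as status 128 + N, and SIGABRT is 6 on Linux, hence 134. A small sketch of that arithmetic (the helper name `signal_from_shell_status` is made up for illustration):

```cpp
#include <cassert>
#include <csignal>

// Maps a shell-style exit status to the terminating signal number,
// or 0 when the status does not encode a signal (status <= 128).
int signal_from_shell_status(int status) {
    return status > 128 ? status - 128 : 0;
}
```

On Linux, `signal_from_shell_status(134)` equals `SIGABRT`, consistent with the abort raised through `ggml_abort`.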

Failure mode 2 — end of prefill

Same launch command plus --no-cache-idle-slots to bypass failure mode 1. Send a larger prompt (~4500 tokens):

prompt='The quick brown fox jumps over the lazy dog. '
yes "$prompt" | head -n 450 | tr -d '\n' > /tmp/big
echo -n ' Summarize.' >> /tmp/big
curl -s http://127.0.0.1:18001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --arg p "$(cat /tmp/big)" '{model:"x",messages:[{role:"user",content:$p}],max_tokens:20}')"

Prefill progresses normally:

prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.453
prompt processing progress, n_tokens = 4003, batch.n_tokens = 1955, progress = 0.886
prompt processing progress, n_tokens = 4515, batch.n_tokens =  512, progress = 0.999

The server then aborts on the same assertion, at the same source line, just as generation is about to begin.

What works

The same build, model, and launch command with --split-mode layer (or --split-mode row) runs stably through both scenarios above: prefill and generation complete, and the idle-slot save works. So the problem is specific to the meta-backend's split path, not to the build or the model.

First Bad Commit

Not bisected, but ggml_backend_meta_buffer_get_tensor was introduced by PR #19378 (merged 2026-04-09 as commit d6f3030047), so the first bad commit cannot be older than that.

Relevant log output

Launch + crash log (failure mode 2)

$ HIP_VISIBLE_DEVICES=0,1 ./llama-server \
    -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
    --ctx-size 16384 --port 18001 --split-mode tensor --tensor-split 1,1 \
    -fit off --no-cache-idle-slots -ngl 99 --no-webui

# ... load_tensors shows model fits across both ROCm devices ...
load_tensors:   CPU_Mapped model buffer size =   515.31 MiB
load_tensors:       Meta() model buffer size = 18177.72 MiB
llama_context:  ROCm_Host  output buffer size =     3.79 MiB
sched_reserve:  ROCm_Host compute buffer size =    40.02 MiB
main: server is listening on http://127.0.0.1:18001

# ... prefill progresses cleanly ...
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.453198
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4003, batch.n_tokens = 1955, progress = 0.885815
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4515, batch.n_tokens =  512, progress = 0.999115

# ... assertion hits as prefill finishes / transition to decode ...
/home/.../llama.cpp/ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed

#4  0x00007f7e4c753efb in ggml_print_backtrace () from .../libggml-base.so.0
#5  0x00007f7e4c754060 in ggml_abort () from .../libggml-base.so.0
# ... SIGABRT, exit 134 ...

Build configuration

HIPCXX="$(hipconfig -l)/clang" HIP_PATH=/opt/rocm ROCM_PATH=/opt/rocm \
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DGPU_TARGETS=gfx1201 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm \
    -DCMAKE_HIP_FLAGS="-I/opt/rocm/include" \
    -DCMAKE_EXE_LINKER_FLAGS="-L/opt/rocm/lib" \
    -DCMAKE_SHARED_LINKER_FLAGS="-L/opt/rocm/lib" \
    -GNinja
cmake --build build --config Release -j16

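Both failure modes hit the same assertion at ggml-backend-meta.cpp:1299: size == ggml_nbytes(tensor) inside the multi-segment branch of ggml_backend_meta_buffer_get_tensor. Because the offset == 0 check on the preceding line passes, the caller must be requesting, from offset 0, a byte count that differs from the full tensor size. In other words, something is requesting a partial tensor read from a split meta buffer, which the current implementation doesn't support.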
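
One conceivable way to serve such a read is to gather the whole tensor into a staging buffer and copy out the requested range. The sketch below only illustrates that idea and makes strong assumptions: segments are modelled as plain byte vectors, and the split is assumed to be along the outermost axis so that concatenating segments reconstructs the tensor. Nothing in this report establishes that the meta-backend lays segments out this way, and `Segment`, `gather_full`, and `read_partial_via_staging` are hypothetical names, not ggml API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model: each device holds one contiguous slice of the tensor.
using Segment = std::vector<uint8_t>;

// Concatenates all segments; valid only under the outermost-axis assumption.
std::vector<uint8_t> gather_full(const std::vector<Segment> & segments) {
    std::vector<uint8_t> full;
    for (const Segment & seg : segments) {
        full.insert(full.end(), seg.begin(), seg.end());
    }
    return full;
}

// Serves a partial read by staging the full tensor, then copying the slice.
// Returns false instead of aborting when the range is out of bounds.
bool read_partial_via_staging(const std::vector<Segment> & segments,
                              void * dst, size_t offset, size_t size) {
    const std::vector<uint8_t> full = gather_full(segments);
    if (offset > full.size() || size > full.size() - offset) {
        return false;
    }
    if (size > 0) {
        std::memcpy(dst, full.data() + offset, size);
    }
    return true;
}
```

The trade-off is an extra full-tensor copy for every partial read, which may be acceptable for infrequent paths such as saving slot state but would need measuring before relying on it in a hot path.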

From what I have tested so far, this looks like an MoE/K-quant interaction on ROCm: layer-split works, and a Gemma-4 MoE fix already landed in #22129 for a different tensor-parallel MoE issue, so this may be the same class of problem.

Happy to test patches against this exact setup and file a PR if the fix is small and I can narrow it down.
