Name and Version
$ ./llama-server --version
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65248 MiB):
Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
version: 1 (8bc492e)
built with GNU 15.2.1 for Linux x86_64
Built from commit 8bc492e (current master at time of filing). Includes PR #19378 (backend-agnostic tensor parallelism) and follow-up fix PR #22129 (Gemma-4 MoE delayed AllReduce).
Operating systems
Linux
GGML backends
HIP
Hardware
CPU: AMD Ryzen 9 9950X3D (16-core)
GPUs: 2x AMD Radeon AI PRO R9700 (Navi 48, gfx1201, 32 GB each, 64 GB total VRAM)
ROCm 7.2.2, RCCL installed
PCIe: Cards on separate roots (0000:03:00.0 and 0000:07:00.0), PEER_MAX_BATCH_SIZE = 128
Models
unsloth/Qwen3.6-35B-A3B-GGUF — file Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf (~35 GB). MoE architecture with ~3B active params per token. Snapshot a483e9e6cbd595906af30beda3187c2663a1118c. The crash reproduces with the base model file alone (no --mmproj).
Problem description & steps to reproduce
--split-mode tensor loads the model successfully across both GPUs (meta-backend reports Meta() model buffer size = 18177.72 MiB, VRAM splits ~53/47 across the two cards) and the server comes up clean. But both normal tool-usage paths then abort inside the meta-backend at the same source location.
Assertion (both failure modes hit this exact line):
ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed
which is this check inside ggml_backend_meta_buffer_get_tensor:
if (split_state.n_segments != 1) {
    GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
    GGML_ASSERT(offset == 0);
    GGML_ASSERT(size == ggml_nbytes(tensor)); // <-- line 1299
    ...
}
Failure mode 1 — idle-slot cache save
Launch command:
./llama-server \
-m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
--ctx-size 16384 \
--port 18001 \
--split-mode tensor \
--tensor-split 1,1 \
-fit off \
-ngl 99 \
--no-webui
(-fit off is required — with the default -fit on, startup aborts earlier with llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort.)
Send any single chat completion request, however short:
curl -s http://127.0.0.1:18001/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"x","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'
The request returns a response. A few seconds later, as the server tries to save the now-idle slot:
slot slot_save_an: id 3 | task -1 | saving idle slot to prompt cache
srv prompt_save: - saving prompt with length 47, total state size = 63.732 MiB
ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed
→ process aborts (SIGABRT, exit 134).
Failure mode 2 — end of prefill
Same launch command plus --no-cache-idle-slots to bypass failure mode 1. Send a larger prompt (~4500 tokens):
prompt='The quick brown fox jumps over the lazy dog. '
yes "$prompt" | head -n 450 | tr -d '\n' > /tmp/big
echo -n ' Summarize.' >> /tmp/big
curl -s http://127.0.0.1:18001/v1/chat/completions \
-H 'Content-Type: application/json' \
-d "$(jq -n --arg p "$(cat /tmp/big)" '{model:"x",messages:[{role:"user",content:$p}],max_tokens:20}')"
Prefill progresses normally:
prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.453
prompt processing progress, n_tokens = 4003, batch.n_tokens = 1955, progress = 0.886
prompt processing progress, n_tokens = 4515, batch.n_tokens = 512, progress = 0.999
Then aborts on the same assertion, same source line, as generation is about to begin.
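For anyone re-running this, the prompt construction can be wrapped in a small helper so the size is easy to vary. This is only a sketch: make_prompt is a hypothetical name, not part of llama.cpp, and the byte count in the comment is plain arithmetic on the strings above (the ~4500-token figure still depends on the tokenizer).

```shell
# Build the repro prompt: COUNT copies of SENTENCE followed by SUFFIX.
# POSIX sh; note the variables are global because `local` is not POSIX.
make_prompt() {
    sentence=$1
    count=$2
    suffix=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s' "$sentence"
        i=$((i + 1))
    done
    printf '%s' "$suffix"
}

# Same content as the yes/head/tr pipeline above.
# The sentence is 45 bytes and the suffix 11, so 450 * 45 + 11 = 20261 bytes.
make_prompt 'The quick brown fox jumps over the lazy dog. ' 450 ' Summarize.' > /tmp/big
wc -c < /tmp/big
```

Lowering the count (for example to 50) should show whether failure mode 2 depends on prompt length at all; I have not established a threshold.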
What works
Same build, same model, same launch command with --split-mode layer (or --split-mode row) runs stably through both of the above scenarios. Prefill and generation both complete, idle-slot save works. So this is specific to the meta-backend's split path, not to the build or the model.
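The comparison above amounts to changing a single flag. A dry-run sketch that only prints the three launch commands (it does not start the server) is below; launch_cmd and the MODEL variable are illustrative names, and the flags are exactly the ones from the launch command in this report.

```shell
# Path to the GGUF file; override with MODEL=... if it lives elsewhere.
MODEL=${MODEL:-Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf}

# Print the launch command for one split mode. Only --split-mode changes.
launch_cmd() {
    printf './llama-server -m %s --ctx-size 16384 --port 18001 --split-mode %s --tensor-split 1,1 -fit off -ngl 99 --no-webui\n' \
        "$MODEL" "$1"
}

# layer and row run stably here; tensor hits the assertion.
for mode in layer row tensor; do
    launch_cmd "$mode"
done
```

Append --no-cache-idle-slots to the printed command when testing failure mode 2.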
First Bad Commit
Not bisected, but the ggml_backend_meta_buffer_get_tensor function was introduced by PR #19378 (merged 2026-04-09 as commit d6f3030047), so the first bad commit cannot be older than that.
Relevant log output
Launch + crash log (failure mode 2)
$ HIP_VISIBLE_DEVICES=0,1 ./llama-server \
-m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
--ctx-size 16384 --port 18001 --split-mode tensor --tensor-split 1,1 \
-fit off --no-cache-idle-slots -ngl 99 --no-webui
# ... load_tensors shows model fits across both ROCm devices ...
load_tensors: CPU_Mapped model buffer size = 515.31 MiB
load_tensors: Meta() model buffer size = 18177.72 MiB
llama_context: ROCm_Host output buffer size = 3.79 MiB
sched_reserve: ROCm_Host compute buffer size = 40.02 MiB
main: server is listening on http://127.0.0.1:18001
# ... prefill progresses cleanly ...
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.453198
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4003, batch.n_tokens = 1955, progress = 0.885815
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4515, batch.n_tokens = 512, progress = 0.999115
# ... assertion hits as prefill finishes / transition to decode ...
/home/.../llama.cpp/ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed
#4 0x00007f7e4c753efb in ggml_print_backtrace () from .../libggml-base.so.0
#5 0x00007f7e4c754060 in ggml_abort () from .../libggml-base.so.0
# ... SIGABRT, exit 134 ...
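When testing patches, a saved server log can be sorted into the two failure modes with a few grep checks. This is a rough heuristic keyed only to the log lines quoted in this report; classify_crash is a hypothetical helper, and other builds may word these messages differently.

```shell
# Classify a llama-server log by which failure mode (if any) it shows.
# Prints one of: idle-slot-save, end-of-prefill, unclassified, none.
classify_crash() {
    log=$1
    # No assertion at all: the run did not hit this bug.
    if ! grep -qF 'GGML_ASSERT(size == ggml_nbytes(tensor)) failed' "$log"; then
        echo none
    # Failure mode 1 logs the idle-slot save just before the abort.
    elif grep -qF 'saving idle slot to prompt cache' "$log"; then
        echo idle-slot-save
    # Failure mode 2 aborts after prompt processing progress was reported.
    elif grep -qF 'prompt processing progress' "$log"; then
        echo end-of-prefill
    else
        echo unclassified
    fi
}
```

Typical use would be to redirect the server's stderr and stdout to a file (for example with `> server.log 2>&1`) and then run `classify_crash server.log` after the process exits.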
Build configuration
HIPCXX="$(hipconfig -l)/clang" HIP_PATH=/opt/rocm ROCM_PATH=/opt/rocm \
cmake -S . -B build \
-DGGML_HIP=ON \
-DGPU_TARGETS=gfx1201 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm \
-DCMAKE_HIP_FLAGS="-I/opt/rocm/include" \
-DCMAKE_EXE_LINKER_FLAGS="-L/opt/rocm/lib" \
-DCMAKE_SHARED_LINKER_FLAGS="-L/opt/rocm/lib" \
-GNinja
cmake --build build --config Release -j16
Both failure modes hit the same assertion at ggml-backend-meta.cpp:1299: size == ggml_nbytes(tensor) inside the multi-segment branch of ggml_backend_meta_buffer_get_tensor. Something is requesting a partial tensor read from a split meta buffer, which the current implementation doesn't support.
This looks like an MoE/K-quant interaction on ROCm, though I have only tested this one model and backend. Layer-split works, and a Gemma-4 MoE fix already landed in #22129 for a different tensor-parallel MoE issue, so this may be the same class of problem.
Happy to test patches against this exact setup and file a PR if the fix is small and I can narrow it down.