Eval bug: --split-mode tensor aborts in ggml_backend_meta_buffer_get_tensor with Qwen3 MoE Q8_K_XL on ROCm #22307

### Name and Version



```
$ ./llama-server --version
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65248 MiB):
  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
  Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
version: 1 (8bc492e)
built with GNU 15.2.1 for Linux x86_64
```

Built from commit `8bc492e` (current master at time of filing). Includes PR #19378 (backend-agnostic tensor parallelism) and follow-up fix PR #22129 (Gemma-4 MoE delayed AllReduce).




### Operating systems

Linux

### GGML backends

HIP

### Hardware

CPU: AMD Ryzen 9 9950X3D (16-core)
GPUs: 2x AMD Radeon AI PRO R9700 (Navi 48, gfx1201, 32 GB each, 64 GB total VRAM)
ROCm 7.2.2, RCCL installed
PCIe: Cards on separate roots (0000:03:00.0 and 0000:07:00.0), PEER_MAX_BATCH_SIZE = 128

### Models

[unsloth/Qwen3.6-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) — file `Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf` (~35 GB). MoE architecture with ~3B active params per token. Snapshot `a483e9e6cbd595906af30beda3187c2663a1118c`. The crash reproduces with the base model file alone (no `--mmproj`).


### Problem description & steps to reproduce


`--split-mode tensor` loads the model successfully across both GPUs (the meta-backend reports `Meta() model buffer size = 18177.72 MiB`, and VRAM splits ~53/47 across the two cards), and the server starts cleanly. However, the two ordinary usage paths described below then abort inside the meta-backend at the same source location.

**Assertion** (both failure modes hit this exact line):
```
ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed
```

which is this check inside `ggml_backend_meta_buffer_get_tensor`:

```c++
if (split_state.n_segments != 1) {
    GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
    GGML_ASSERT(offset == 0);
    GGML_ASSERT(size == ggml_nbytes(tensor));   // <-- line 1299
    ...
}
```

### Failure mode 1 — idle-slot cache save

Launch command:
```
./llama-server \
    -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
    --ctx-size 16384 \
    --port 18001 \
    --split-mode tensor \
    --tensor-split 1,1 \
    -fit off \
    -ngl 99 \
    --no-webui
```
(`-fit off` is required — with the default `-fit on`, startup aborts earlier with `llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort`.)

Send any single chat completion request, however short:
```
curl -s http://127.0.0.1:18001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"x","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'
```

The request returns a response. A few seconds later, as the server tries to save the now-idle slot:
```
slot slot_save_an: id  3 | task -1 | saving idle slot to prompt cache
srv   prompt_save:  - saving prompt with length 47, total state size = 63.732 MiB
ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed
```
→ process aborts (SIGABRT, exit 134).

### Failure mode 2 — end of prefill

Same launch command plus `--no-cache-idle-slots` to bypass failure mode 1. Send a larger prompt (~4500 tokens):
```
prompt='The quick brown fox jumps over the lazy dog. '
yes "$prompt" | head -n 450 | tr -d '\n' > /tmp/big
echo -n ' Summarize.' >> /tmp/big
curl -s http://127.0.0.1:18001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --arg p "$(cat /tmp/big)" '{model:"x",messages:[{role:"user",content:$p}],max_tokens:20}')"
```

Prefill progresses normally:
```
prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.453
prompt processing progress, n_tokens = 4003, batch.n_tokens = 1955, progress = 0.886
prompt processing progress, n_tokens = 4515, batch.n_tokens =  512, progress = 0.999
```
The server then aborts on the same assertion, at the same source line, as generation is about to begin.

### What works

With the same build, model, and launch command, **`--split-mode layer`** (or `--split-mode row`) runs stably through both of the above scenarios: prefill and generation complete, and the idle-slot save works. So this is specific to the meta-backend's split path, not to the build or the model.

### First Bad Commit

Not bisected, but `ggml_backend_meta_buffer_get_tensor` was introduced by PR #19378 (merged 2026-04-09 as commit `d6f3030047`), so the first bad commit cannot be older than that.


### Relevant log output


<details>
<summary>Launch + crash log (failure mode 2)</summary>

```console
$ HIP_VISIBLE_DEVICES=0,1 ./llama-server \
    -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
    --ctx-size 16384 --port 18001 --split-mode tensor --tensor-split 1,1 \
    -fit off --no-cache-idle-slots -ngl 99 --no-webui

# ... load_tensors shows model fits across both ROCm devices ...
load_tensors:   CPU_Mapped model buffer size =   515.31 MiB
load_tensors:       Meta() model buffer size = 18177.72 MiB
llama_context:  ROCm_Host  output buffer size =     3.79 MiB
sched_reserve:  ROCm_Host compute buffer size =    40.02 MiB
main: server is listening on http://127.0.0.1:18001

# ... prefill progresses cleanly ...
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.453198
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4003, batch.n_tokens = 1955, progress = 0.885815
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4515, batch.n_tokens =  512, progress = 0.999115

# ... assertion hits as prefill finishes / transition to decode ...
/home/.../llama.cpp/ggml/src/ggml-backend-meta.cpp:1299: GGML_ASSERT(size == ggml_nbytes(tensor)) failed

#4  0x00007f7e4c753efb in ggml_print_backtrace () from .../libggml-base.so.0
#5  0x00007f7e4c754060 in ggml_abort () from .../libggml-base.so.0
# ... SIGABRT, exit 134 ...
```

</details>

<details>
<summary>Build configuration</summary>

```console
HIPCXX="$(hipconfig -l)/clang" HIP_PATH=/opt/rocm ROCM_PATH=/opt/rocm \
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DGPU_TARGETS=gfx1201 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_HIP_COMPILER_ROCM_ROOT=/opt/rocm \
    -DCMAKE_HIP_FLAGS="-I/opt/rocm/include" \
    -DCMAKE_EXE_LINKER_FLAGS="-L/opt/rocm/lib" \
    -DCMAKE_SHARED_LINKER_FLAGS="-L/opt/rocm/lib" \
    -GNinja
cmake --build build --config Release -j16
```

</details>

---

Both failure modes hit the same assertion at `ggml-backend-meta.cpp:1299`: `size == ggml_nbytes(tensor)` inside the multi-segment branch of `ggml_backend_meta_buffer_get_tensor`. Something is requesting a partial tensor read from a split meta buffer, which the current implementation doesn't support.
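For discussion, a partial read could in principle be broken into per-segment reads. The sketch below does this under a strong assumption that may not match the meta-backend's real layout: each segment holds one contiguous, equally sized byte range of the logical tensor, which would only be the case for a split along the outermost axis. `SegmentRead` and `plan_partial_read` are hypothetical names, not ggml API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One piece of a requested read that falls inside a single segment.
struct SegmentRead {
    std::size_t segment;      // index of the segment (device) holding the bytes
    std::size_t local_offset; // byte offset inside that segment
    std::size_t size;         // number of bytes to read from it
};

// ASSUMPTION: segment i holds the contiguous logical byte range
// [i * seg_bytes, (i + 1) * seg_bytes). Splits along an inner axis would
// interleave the segments and need a different mapping.
std::vector<SegmentRead> plan_partial_read(std::size_t offset, std::size_t size,
                                           std::size_t seg_bytes,
                                           std::size_t n_segments) {
    assert(seg_bytes > 0 && n_segments > 0);
    assert(offset + size <= seg_bytes * n_segments);

    std::vector<SegmentRead> plan;
    std::size_t pos = offset;
    const std::size_t end = offset + size;
    while (pos < end) {
        const std::size_t seg   = pos / seg_bytes; // segment containing pos
        const std::size_t local = pos % seg_bytes; // offset within that segment
        // Read up to the end of this segment or the end of the request.
        const std::size_t len = std::min(seg_bytes - local, end - pos);
        plan.push_back({seg, local, len});
        pos += len;
    }
    return plan;
}
```

For example, a 512-byte read at offset 768 over two 1024-byte segments becomes two reads: 256 bytes at local offset 768 of segment 0, then 256 bytes at local offset 0 of segment 1. Whether the fix belongs here or in the caller that issues the partial read is for the maintainers to decide.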

As far as I've tested, this looks like an MoE/K-quant interaction on ROCm: layer-split works, and a Gemma-4 MoE fix already landed in #22129 for a different tensor-parallel MoE issue. This may be the same class of problem.

Happy to test patches against this exact setup and file a PR if the fix is small and I can narrow it down.
