Eval bug: ROCm error: out of memory

### Name and Version

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
version: 8699 (11aced9b7)
built with GNU 15.2.1 for Linux x86_64

### Operating systems

Linux

### GGML backends

HIP

### Hardware

dev-build/rocm-cmake-7.1.0
dev-libs/rocm-comgr-7.1.0
dev-libs/rocm-core-7.1.0
dev-libs/rocm-device-libs-7.1.0
dev-util/rocm-smi-7.1.0
dev-util/rocminfo-7.1.0
dev-util/hip-7.1.0-r1
dev-util/hipcc-7.1.0
sci-libs/hipBLAS-7.1.0
sci-libs/hipBLAS-common-7.1.0

### Models

_No response_

### Problem description & steps to reproduce

On Vulkan I can run DJLougen/Qwen-3.5-28B-A3B-REAP-GGUF:Q3_K_M with a 256k context on q4.
Using this build with turbo3, I got `ROCm error: out of memory` at 256k,
so I tried a lower context size (131072, per the log below):
[43389] build_info: b8699-11aced9b7
[43389] system_info: n_threads = 1 (n_threads_batch = 1) / 4 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
[43389] Running without SSL
[43389] init: using 5 threads for HTTP server
[43389] start: binding port with default address family
[43389] main: loading model
[43389] srv    load_model: loading model '/root/.cache/huggingface/hub/models--DJLougen--Qwen-3.5-28B-A3B-REAP-GGUF/snapshots/650a7b1508e4ef3328e321d491e629f05bc0b9d1/Qwen-3.5-28B-A3B-REAP-Q3_K_M.gguf'
[43389] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[43389] llama_params_fit_impl: projected to use 13895 MiB of device memory vs. 15564 MiB of free device memory
[43389] llama_params_fit_impl: will leave 1668 >= 348 MiB of free device memory, no changes needed
[43389] llama_params_fit: successfully fit params to free device memory

[43389] llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[43389] llama_context:  ROCm_Host  output buffer size =     0.95 MiB
[43389] llama_kv_cache:      ROCm0 KV buffer size =   500.13 MiB
[43389] llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
[43389] llama_kv_cache: size =  500.00 MiB (131072 cells,  10 layers,  1/1 seqs), K (turbo3):  250.00 MiB, V (turbo3):  250.00 MiB
[43389] llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
[43389] llama_kv_cache: attn_rot_k = 0
[43389] llama_kv_cache: attn_rot_v = 0
[43389] llama_memory_recurrent:      ROCm0 RS buffer size =    62.81 MiB
[43389] llama_memory_recurrent: size =   62.81 MiB (     1 cells,  40 layers,  1 seqs), R (f32):    2.81 MiB, S (f32):   60.00 MiB
[43389] sched_reserve: reserving ...
[43389] sched_reserve: resolving fused Gated Delta Net support:
[43389] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[43389] sched_reserve: fused Gated Delta Net (chunked) enabled
[43389] sched_reserve:      ROCm0 compute buffer size =   493.00 MiB
[43389] sched_reserve:  ROCm_Host compute buffer size =   264.02 MiB
[43389] sched_reserve: graph nodes  = 3749
[43389] sched_reserve: graph splits = 2
[43389] sched_reserve: reserve took 273.34 ms, sched copies = 1
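
For reference, the buffer sizes reported in the startup log above can be tallied per device with a short Python sketch. This is only an illustration: the `sum_buffers` helper and its regex are hypothetical and not part of llama.cpp, and the totals cover only the buffers named in these log lines, not the model weights included in the 13895 MiB projection.

```python
import re

# Buffer lines copied from the startup log above (PID prefix dropped).
LOG = """\
llama_context:  ROCm_Host  output buffer size =     0.95 MiB
llama_kv_cache:      ROCm0 KV buffer size =   500.13 MiB
llama_memory_recurrent:      ROCm0 RS buffer size =    62.81 MiB
sched_reserve:      ROCm0 compute buffer size =   493.00 MiB
sched_reserve:  ROCm_Host compute buffer size =   264.02 MiB
"""

# Hypothetical pattern: device name, optional buffer kind, then the size in MiB.
PATTERN = re.compile(r"(ROCm\w*)\s+(?:\w+\s+)?buffer size\s*=\s*([\d.]+) MiB")


def sum_buffers(log: str) -> dict[str, float]:
    """Sum the reported buffer sizes (MiB) per device name."""
    totals: dict[str, float] = {}
    for device, size in PATTERN.findall(log):
        totals[device] = totals.get(device, 0.0) + float(size)
    return totals


totals = sum_buffers(LOG)
for device, mib in sorted(totals.items()):
    print(f"{device}: {mib:.2f} MiB")
```

By this tally the KV, recurrent-state, and compute buffers on `ROCm0` add up to roughly 1 GiB, so they are a small part of the projected device memory use.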

After about 5 chats, it fails with an out-of-memory error during prompt processing:

[43389] slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.970 (> 0.100 thold), f_keep = 0.985
[43389] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist 
[43389] slot launch_slot_: id  0 | task 6520 | processing task, is_child = 0
[43389] slot update_slots: id  0 | task 6520 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 54970
[43389] slot update_slots: id  0 | task 6520 | n_past = 53318, slot.prompt.tokens.size() = 54132, seq_id = 0, pos_min = 54131, n_swa = 0
[43389] slot update_slots: id  0 | task 6520 | Checking checkpoint with [53263, 53263] against 53318...
[43389] slot update_slots: id  0 | task 6520 | restored context checkpoint (pos_min = 53263, pos_max = 53263, n_tokens = 53264, n_past = 53264, size = 62.813 MiB)
[43389] slot update_slots: id  0 | task 6520 | n_tokens = 53264, memory_seq_rm [53264, end)
[43389] slot update_slots: id  0 | task 6520 | prompt processing progress, n_tokens = 54454, batch.n_tokens = 1190, progress = 0.990613
[43389] /opt/llama.cpp-turboquant-hip/ggml/src/ggml-cuda/ggml-cuda.cu:100: ROCm error
[43389] ROCm error: out of memory

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>
