Name and Version
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
version: 8699 (11aced9)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
GGML backends
HIP
Hardware
dev-build/rocm-cmake-7.1.0
dev-libs/rocm-comgr-7.1.0
dev-libs/rocm-core-7.1.0
dev-libs/rocm-device-libs-7.1.0
dev-util/rocm-smi-7.1.0
dev-util/rocminfo-7.1.0
dev-util/hip-7.1.0-r1
dev-util/hipcc-7.1.0
sci-libs/hipBLAS-7.1.0
sci-libs/hipBLAS-common-7.1.0
Models
No response
Problem description & steps to reproduce
On Vulkan I can run DJLougen/Qwen-3.5-28B-A3B-REAP-GGUF:Q3_K_M with a 256k context on q4.
Using this build with turbo3, I got "ROCm error: out of memory" at 256k,
so I tried a lower context size:
[43389] build_info: b8699-11aced9b7
[43389] system_info: n_threads = 1 (n_threads_batch = 1) / 4 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[43389] Running without SSL
[43389] init: using 5 threads for HTTP server
[43389] start: binding port with default address family
[43389] main: loading model
[43389] srv load_model: loading model '/root/.cache/huggingface/hub/models--DJLougen--Qwen-3.5-28B-A3B-REAP-GGUF/snapshots/650a7b1508e4ef3328e321d491e629f05bc0b9d1/Qwen-3.5-28B-A3B-REAP-Q3_K_M.gguf'
[43389] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[43389] llama_params_fit_impl: projected to use 13895 MiB of device memory vs. 15564 MiB of free device memory
[43389] llama_params_fit_impl: will leave 1668 >= 348 MiB of free device memory, no changes needed
[43389] llama_params_fit: successfully fit params to free device memory
[43389] llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[43389] llama_context: ROCm_Host output buffer size = 0.95 MiB
[43389] llama_kv_cache: ROCm0 KV buffer size = 500.13 MiB
[43389] llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
[43389] llama_kv_cache: size = 500.00 MiB (131072 cells, 10 layers, 1/1 seqs), K (turbo3): 250.00 MiB, V (turbo3): 250.00 MiB
[43389] llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
[43389] llama_kv_cache: attn_rot_k = 0
[43389] llama_kv_cache: attn_rot_v = 0
[43389] llama_memory_recurrent: ROCm0 RS buffer size = 62.81 MiB
[43389] llama_memory_recurrent: size = 62.81 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 2.81 MiB, S (f32): 60.00 MiB
[43389] sched_reserve: reserving ...
[43389] sched_reserve: resolving fused Gated Delta Net support:
[43389] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[43389] sched_reserve: fused Gated Delta Net (chunked) enabled
[43389] sched_reserve: ROCm0 compute buffer size = 493.00 MiB
[43389] sched_reserve: ROCm_Host compute buffer size = 264.02 MiB
[43389] sched_reserve: graph nodes = 3749
[43389] sched_reserve: graph splits = 2
[43389] sched_reserve: reserve took 273.34 ms, sched copies = 1
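For reference, the buffer sizes reported in the load log above can be tallied per device with a short script. This is only a sketch: the regular expression and the helper name `tally_buffers` are my own assumptions for illustration and are not part of llama.cpp, and the sample text contains only the "buffer size" lines copied from this log (model weights are not listed in these lines, so they are not counted).

```python
import re

# "buffer size" lines copied from the load log above (prefixes stripped).
LOG = """\
llama_context: ROCm_Host output buffer size = 0.95 MiB
llama_kv_cache: ROCm0 KV buffer size = 500.13 MiB
llama_memory_recurrent: ROCm0 RS buffer size = 62.81 MiB
sched_reserve: ROCm0 compute buffer size = 493.00 MiB
sched_reserve: ROCm_Host compute buffer size = 264.02 MiB
"""

# Hypothetical pattern: device name, buffer kind, size in MiB.
BUFFER_RE = re.compile(r"(ROCm\w+) (.+?) buffer size\s*=\s*([\d.]+) MiB")


def tally_buffers(log_text):
    """Sum the reported buffer sizes (MiB) per device name."""
    totals = {}
    for device, _kind, size in BUFFER_RE.findall(log_text):
        totals[device] = totals.get(device, 0.0) + float(size)
    return totals


totals = tally_buffers(LOG)
for device, mib in sorted(totals.items()):
    print(f"{device}: {mib:.2f} MiB")
```

By this tally the KV, RS, and compute buffers on ROCm0 come to about 1056 MiB, well inside the 1668 MiB of headroom the fit step reported, which is why the later out-of-memory error during prompt processing is unexpected.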
After about 5 chats, the server crashed while processing a prompt:
[43389] slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.970 (> 0.100 thold), f_keep = 0.985
[43389] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[43389] slot launch_slot_: id 0 | task 6520 | processing task, is_child = 0
[43389] slot update_slots: id 0 | task 6520 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 54970
[43389] slot update_slots: id 0 | task 6520 | n_past = 53318, slot.prompt.tokens.size() = 54132, seq_id = 0, pos_min = 54131, n_swa = 0
[43389] slot update_slots: id 0 | task 6520 | Checking checkpoint with [53263, 53263] against 53318...
[43389] slot update_slots: id 0 | task 6520 | restored context checkpoint (pos_min = 53263, pos_max = 53263, n_tokens = 53264, n_past = 53264, size = 62.813 MiB)
[43389] slot update_slots: id 0 | task 6520 | n_tokens = 53264, memory_seq_rm [53264, end)
[43389] slot update_slots: id 0 | task 6520 | prompt processing progress, n_tokens = 54454, batch.n_tokens = 1190, progress = 0.990613
[43389] /opt/llama.cpp-turboquant-hip/ggml/src/ggml-cuda/ggml-cuda.cu:100: ROCm error
[43389] ROCm error: out of memory
First Bad Commit
No response
Relevant log output
Logs