Eval bug: ROCm error: out of memory #1

@dpblnt

Name and Version

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
version: 8699 (11aced9)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

HIP

Hardware

dev-build/rocm-cmake-7.1.0
dev-libs/rocm-comgr-7.1.0
dev-libs/rocm-core-7.1.0
dev-libs/rocm-device-libs-7.1.0
dev-util/rocm-smi-7.1.0
dev-util/rocminfo-7.1.0
dev-util/hip-7.1.0-r1
dev-util/hipcc-7.1.0
sci-libs/hipBLAS-7.1.0
sci-libs/hipBLAS-common-7.1.0

Models

No response

Problem description & steps to reproduce

On Vulkan I can run DJLougen/Qwen-3.5-28B-A3B-REAP-GGUF:Q3_K_M with a 256k context on q4.
Using this build with turbo3, I got "ROCm error: out of memory" at 256k, so I tried a lower context size (131072 in the log below):
[43389] build_info: b8699-11aced9b7
[43389] system_info: n_threads = 1 (n_threads_batch = 1) / 4 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[43389] Running without SSL
[43389] init: using 5 threads for HTTP server
[43389] start: binding port with default address family
[43389] main: loading model
[43389] srv load_model: loading model '/root/.cache/huggingface/hub/models--DJLougen--Qwen-3.5-28B-A3B-REAP-GGUF/snapshots/650a7b1508e4ef3328e321d491e629f05bc0b9d1/Qwen-3.5-28B-A3B-REAP-Q3_K_M.gguf'
[43389] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[43389] llama_params_fit_impl: projected to use 13895 MiB of device memory vs. 15564 MiB of free device memory
[43389] llama_params_fit_impl: will leave 1668 >= 348 MiB of free device memory, no changes needed
[43389] llama_params_fit: successfully fit params to free device memory

[43389] llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[43389] llama_context: ROCm_Host output buffer size = 0.95 MiB
[43389] llama_kv_cache: ROCm0 KV buffer size = 500.13 MiB
[43389] llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
[43389] llama_kv_cache: size = 500.00 MiB (131072 cells, 10 layers, 1/1 seqs), K (turbo3): 250.00 MiB, V (turbo3): 250.00 MiB
[43389] llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
[43389] llama_kv_cache: attn_rot_k = 0
[43389] llama_kv_cache: attn_rot_v = 0
[43389] llama_memory_recurrent: ROCm0 RS buffer size = 62.81 MiB
[43389] llama_memory_recurrent: size = 62.81 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 2.81 MiB, S (f32): 60.00 MiB
[43389] sched_reserve: reserving ...
[43389] sched_reserve: resolving fused Gated Delta Net support:
[43389] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[43389] sched_reserve: fused Gated Delta Net (chunked) enabled
[43389] sched_reserve: ROCm0 compute buffer size = 493.00 MiB
[43389] sched_reserve: ROCm_Host compute buffer size = 264.02 MiB
[43389] sched_reserve: graph nodes = 3749
[43389] sched_reserve: graph splits = 2
[43389] sched_reserve: reserve took 273.34 ms, sched copies = 1
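
For triage, the memory figures in the log above can be cross-checked with a little arithmetic. The sketch below only reuses numbers printed in the log; the variable names are illustrative, and the 256k estimate assumes the KV cache grows linearly with the number of cells. It does not account for any allocations made later at runtime, which is where the failure occurs.

```python
MIB = 1024 * 1024

# Values copied from the log above.
free_mib = 15564        # free device memory reported by llama_params_fit_impl
projected_mib = 13895   # projected device memory use
kv_mib = 500.00         # llama_kv_cache size (K + V, turbo3)
k_mib = 250.00          # K part only
n_cells = 131072
n_kv_layers = 10
rs_mib = 62.81          # recurrent state buffer
compute_mib = 493.00    # ROCm0 compute buffer

# Headroom left after the projected load-time allocations.
headroom_mib = free_mib - projected_mib

# Bytes of K cache per cell per layer implied by the reported sizes.
k_bytes_per_cell_layer = k_mib * MIB / (n_cells * n_kv_layers)

# Sum of the device buffers that the log reports explicitly.
reported_buffers_mib = kv_mib + rs_mib + compute_mib

# Assumption: KV size scales linearly with cells, so 262144 cells doubles it.
kv_at_256k_mib = kv_mib * (262144 / n_cells)

print(f"headroom at load time: {headroom_mib} MiB")
print(f"K bytes per cell per layer: {k_bytes_per_cell_layer}")
print(f"reported device buffers: {reported_buffers_mib:.2f} MiB")
print(f"estimated KV cache at 262144 cells: {kv_at_256k_mib:.2f} MiB")
```

The headroom computed here is 1669 MiB against the 1668 MiB in the log; the difference is presumably rounding of the reported values.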

After about 5 chats, the server aborted with this output:

[43389] slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.970 (> 0.100 thold), f_keep = 0.985
[43389] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[43389] slot launch_slot_: id 0 | task 6520 | processing task, is_child = 0
[43389] slot update_slots: id 0 | task 6520 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 54970
[43389] slot update_slots: id 0 | task 6520 | n_past = 53318, slot.prompt.tokens.size() = 54132, seq_id = 0, pos_min = 54131, n_swa = 0
[43389] slot update_slots: id 0 | task 6520 | Checking checkpoint with [53263, 53263] against 53318...
[43389] slot update_slots: id 0 | task 6520 | restored context checkpoint (pos_min = 53263, pos_max = 53263, n_tokens = 53264, n_past = 53264, size = 62.813 MiB)
[43389] slot update_slots: id 0 | task 6520 | n_tokens = 53264, memory_seq_rm [53264, end)
[43389] slot update_slots: id 0 | task 6520 | prompt processing progress, n_tokens = 54454, batch.n_tokens = 1190, progress = 0.990613
[43389] /opt/llama.cpp-turboquant-hip/ggml/src/ggml-cuda/ggml-cuda.cu:100: ROCm error
[43389] ROCm error: out of memory
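
To pin down where in the request the error hit, the token counts from the last log lines can be combined as follows. This is a small sketch that uses only the numbers printed above; the variable names are illustrative.

```python
# Values copied from the slot update_slots lines above.
task_n_tokens = 54970      # total prompt tokens for task 6520
restored_n_tokens = 53264  # tokens restored from the context checkpoint
batch_n_tokens = 1190      # tokens in the batch being processed

# Tokens accounted for once this batch is counted.
processed = restored_n_tokens + batch_n_tokens

# Fraction of the prompt covered; matches "progress" in the log.
progress = processed / task_n_tokens

# Prompt tokens still outstanding when the error was raised.
remaining = task_n_tokens - processed

print(processed, round(progress, 6), remaining)
```

So the out-of-memory error was raised while processing a 1190-token batch at a context depth of roughly 54k tokens, with 516 prompt tokens still to go, well below the configured n_ctx_slot of 131072.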

First Bad Commit

No response

Relevant log output

Logs
