fix(hip): force VEC FA path for quantized KV (fixes RDNA2 max_blocks_per_sm assert) by TheTom · Pull Request #13 · domvox/llama.cpp-turboquant-hip

TheTom · 2026-05-11T13:10:00Z

Summary

On HIP/ROCm, the TILE/MMA/WMMA FA paths in ggml-cuda/fattn.cu allocate f16 temp buffers for any quantized KV type. Their size is proportional to the full KV cache length and has no upper bound. The pool retains its peak allocation size, so the temp-buffer VRAM ends up larger than the savings from KV compression.
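
To make the scaling concrete, here is a rough sketch of how large such a buffer gets for a single layer. The helper name and the model dimensions (8 KV heads, head_dim 128) are illustrative assumptions, not values from this PR, and the sizing rule (one 2-byte f16 value per cached element) is an assumption about how the temp buffers are laid out.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for an f16 copy of one layer's K (or V),
// assuming one 2-byte f16 value per cached element.
constexpr int64_t f16_temp_bytes(int64_t n_kv, int64_t n_head_kv, int64_t head_dim) {
    return n_kv * n_head_kv * head_dim * 2;
}

// Illustrative dimensions (assumptions, not taken from the PR).
constexpr int64_t N_HEAD_KV = 8;
constexpr int64_t HEAD_DIM  = 128;

// 131072 cached tokens -> 256 MiB for K alone, and the same again for V.
static_assert(f16_temp_bytes(131072, N_HEAD_KV, HEAD_DIM) == 256LL * 1024 * 1024,
              "K temp buffer at 128k context");

// Doubling the context doubles the buffer: nothing caps the growth.
static_assert(f16_temp_bytes(262144, N_HEAD_KV, HEAD_DIM) ==
              2 * f16_temp_bytes(131072, N_HEAD_KV, HEAD_DIM),
              "linear growth with KV length");
```

Because the pool keeps its peak size, a buffer of this order would stay resident after the first long-context call under these assumptions.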

The same combination (quantized KV routed to the TILE/MMA/WMMA paths) also triggers a hard crash on RDNA2 (gfx103x):

fattn-common.cuh:1405: GGML_ASSERT(max_blocks_per_sm > 0) failed

fattn-mma-f16.cuh has no Wave32 path for RDNA2: amd_wmma_available() returns true only for RDNA3+, and amd_mfma_available() only for CDNA. RDNA2 therefore falls through to the #else branch, and cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0 at the launch site, tripping the assert.
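
The fall-through can be pictured with a small mock. This is only a sketch of the availability rules stated above: the enum, the function names, and their signatures are hypothetical and do not mirror the real amd_wmma_available() / amd_mfma_available() helpers in ggml.

```cpp
// Hypothetical architecture tags; the real code keys off compute-capability values.
enum class AmdArch { RDNA2, RDNA3, RDNA4, CDNA };

// Mock of "WMMA only on RDNA3+".
constexpr bool mock_wmma_available(AmdArch a) {
    return a == AmdArch::RDNA3 || a == AmdArch::RDNA4;
}

// Mock of "MFMA only on CDNA".
constexpr bool mock_mfma_available(AmdArch a) {
    return a == AmdArch::CDNA;
}

// An architecture has a usable MMA-family path only if one of the two is available.
constexpr bool mock_has_mma_path(AmdArch a) {
    return mock_wmma_available(a) || mock_mfma_available(a);
}

// RDNA2 matches neither check, so it lands in the fallback branch.
static_assert(!mock_has_mma_path(AmdArch::RDNA2), "RDNA2 has no MMA-family path");
static_assert(mock_has_mma_path(AmdArch::RDNA3),  "RDNA3 uses WMMA");
static_assert(mock_has_mma_path(AmdArch::CDNA),   "CDNA uses MFMA");
```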

This was reported by @datore990 on RX 6600 (gfx1032 with HSA_OVERRIDE_GFX_VERSION=10.3.0) at https://github.com/domvox/llama.cpp-turboquant-hip and traced cleanly to the missing dispatch fallback.

Fix

On HIP, force the VEC kernel for quantized KV when head_dim ≤ 256 && head_dim % 64 == 0 && K_len % FATTN_KQ_STRIDE == 0:

#ifdef GGML_USE_HIP
    if ((ggml_is_quantized(K->type) || ggml_is_quantized(V->type)) && can_use_vector_kernel) {
        return BEST_FATTN_KERNEL_VEC;
    }
#endif

The VEC kernel inlines dequant in-register with zero temp-buffer overhead and works on RDNA2/3/3.5/4 + CDNA. It already understands turbo3/turbo4/turbo2 natively (see fattn-vec.cuh, K_is_turbo / V_is_turbo specialisations) and q8_0/q4_0 via the existing mixed-KV dispatch.
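
For readers who want to check which shapes the new branch catches, the condition can be sketched as a standalone predicate. Everything here is a simplified stand-in: the struct, the enum, and the function names are hypothetical, the value 256 for FATTN_KQ_STRIDE is an assumption, and the TILE return merely represents "whatever the existing selection logic would have picked".

```cpp
#include <cstdint>

// Hypothetical stand-ins for the real ggml types; names are illustrative only.
enum class FattnKernel { VEC, TILE };

// Assumed value; the real constant lives in the ggml CUDA sources.
constexpr int64_t FATTN_KQ_STRIDE = 256;

struct FattnShape {
    int64_t head_dim;
    int64_t k_len;
    bool    k_quantized;
    bool    v_quantized;
};

// Mirrors the condition quoted in the PR text:
// head_dim <= 256 && head_dim % 64 == 0 && K_len % FATTN_KQ_STRIDE == 0
constexpr bool can_use_vector_kernel(const FattnShape & s) {
    return s.head_dim <= 256 && s.head_dim % 64 == 0 && s.k_len % FATTN_KQ_STRIDE == 0;
}

// Sketch of the HIP-only early return for quantized KV.
constexpr FattnKernel pick_kernel_hip(const FattnShape & s) {
    if ((s.k_quantized || s.v_quantized) && can_use_vector_kernel(s)) {
        return FattnKernel::VEC;
    }
    return FattnKernel::TILE; // placeholder for the pre-existing TILE/MMA/WMMA selection
}

// head_dim 256 with quantized KV is routed to VEC.
static_assert(pick_kernel_hip({256, 4096, true, true}) == FattnKernel::VEC, "quantized, d=256");
// head_dim 512 cannot use VEC and keeps the old path.
static_assert(pick_kernel_hip({512, 4096, true, true}) == FattnKernel::TILE, "d=512 falls through");
```

Note that a quantized K or V alone is enough to take the branch, which matches the `||` in the snippet above.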

This is a cherry-pick of 8993d4fd7 from TheTom/llama.cpp mainline (April 2026, where it was developed against the same OOM symptom on gfx1100 + gfx1200 before the RDNA2 assert was found).

Trade-offs

Prefill throughput may drop on the path that previously selected TILE, because VEC processes queries sequentially. This is acceptable for the use case: choosing quantized KV in the first place implies a bandwidth-bound long-context workload, not a prefill-heavy one.
Decode is unaffected, since VEC was already selected for Q->ne[1] == 1.
Limitation: head_dim > 256 (e.g. Gemma 4 full-attention d=512) cannot use VEC and still routes through TILE. The proper fix for that case is a bounded temp buffer in a separate compilation unit, which is out of scope here.

Test plan

Build on RDNA2 (gfx1030 / gfx1032): cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
Run llama-cli -m <model> -ngl 99 -fa on -ctk turbo4 -ctv turbo4 -p "Hello" -n 100; this should no longer hit the max_blocks_per_sm > 0 assert
Also try -ctk q8_0 -ctv q8_0 for the upstream-quant path
Sanity check on RDNA3 (gfx1100): build + run, decode tps should be unchanged vs unpatched baseline

Diff is small (13 lines, one #ifdef GGML_USE_HIP block). No source code change in any kernel.

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers proportional to the full KV cache length for any quantized KV type. On ROCm/HIP these pool allocations persist at peak size, so the temp buffer VRAM exceeds the savings from KV compression.

The combination also triggers `fattn-common.cuh:1405 GGML_ASSERT(max_blocks_per_sm > 0)` on RDNA2 (gfx103x) because the MMA-family kernels have no Wave32 SIMD path for that arch, so cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0 at the launch site.

Fix: on HIP, force the VEC kernel for quantized KV when head_dim <= 256 and head_dim % 64 == 0 and K_len % FATTN_KQ_STRIDE == 0. VEC inlines the dequant in-register with zero temp-buffer overhead and works on RDNA2/3/3.5/4 + CDNA. The VEC kernel already understands turbo3/turbo4/turbo2 natively (fattn-vec.cuh, K_is_turbo / V_is_turbo specialisations).

Trade-off: prefill throughput may drop on the path that previously selected TILE (VEC processes queries sequentially). Decode is unaffected since VEC was already selected for Q->ne[1] == 1.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512) cannot use VEC and still routes through TILE. That case needs a bounded temp buffer in a separate compilation unit.

Cherry-picked from upstream TheTom/llama.cpp commit 8993d4fd7 to domvox/llama.cpp-turboquant-hip @ 6a8df6c (clean HIP port).

Reported by datore990 on RX 6600 (gfx1032 spoofed as gfx1030).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Do-nny · 2026-05-11T14:21:21Z

Tested on RX 6700 (gfx1031), ROCm 7.2, Fedora 44.
Was hitting the exact max_blocks_per_sm > 0 assert with --cache-type-k turbo4 --cache-type-v turbo3 on a Qwen3.6-35B-A3B (head_dim=256).
Cherry-picked f84713f onto feature/turboquant-hip-port-clean, rebuilt for gfx1031, assert is gone and inference runs correctly.

TheTom wants to merge 1 commit into domvox:main from TheTom:fix/hip-vec-quantized-kv
