Skip to content

fix(hip): force VEC FA path for quantized KV (fixes RDNA2 max_blocks_per_sm assert)#13

Open
TheTom wants to merge 1 commit into
domvox:mainfrom
TheTom:fix/hip-vec-quantized-kv
Open

fix(hip): force VEC FA path for quantized KV (fixes RDNA2 max_blocks_per_sm assert)#13
TheTom wants to merge 1 commit into
domvox:mainfrom
TheTom:fix/hip-vec-quantized-kv

Conversation

@TheTom
Copy link
Copy Markdown

@TheTom TheTom commented May 11, 2026

Summary

On HIP/ROCm, the TILE/MMA/WMMA FA paths in ggml-cuda/fattn.cu allocate unbounded f16 temp buffers proportional to the full KV cache length for any quantized KV type. The pool retains peak allocation size, so the temp buffer VRAM ends up larger than the savings from KV compression.

The same combination triggers a hard crash on RDNA2 (gfx103x):

fattn-common.cuh:1405: GGML_ASSERT(max_blocks_per_sm > 0) failed

fattn-mma-f16.cuh has no Wave32 path for RDNA2 — amd_wmma_available() returns true only for RDNA3+, amd_mfma_available() only for CDNA. RDNA2 falls through the #else branch and cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0 at the launch site, tripping the assert.

This was reported by @datore990 on RX 6600 (gfx1032 with HSA_OVERRIDE_GFX_VERSION=10.3.0) at https://github.com/domvox/llama.cpp-turboquant-hip and traced cleanly to the missing dispatch fallback.

Fix

On HIP, force the VEC kernel for quantized KV when head_dim ≤ 256 && head_dim % 64 == 0 && K_len % FATTN_KQ_STRIDE == 0:

#ifdef GGML_USE_HIP
    if ((ggml_is_quantized(K->type) || ggml_is_quantized(V->type)) && can_use_vector_kernel) {
        return BEST_FATTN_KERNEL_VEC;
    }
#endif

The VEC kernel inlines dequant in-register with zero temp-buffer overhead and works on RDNA2/3/3.5/4 + CDNA. It already understands turbo3/turbo4/turbo2 natively (see fattn-vec.cuh, K_is_turbo / V_is_turbo specialisations) and q8_0/q4_0 via the existing mixed-KV dispatch.

This is a cherry-pick of 8993d4fd7 from TheTom/llama.cpp mainline (April 2026, where it was developed against the same OOM symptom on gfx1100 + gfx1200 before the RDNA2 assert was found).

Trade-offs

  • Prefill throughput may drop on the path that previously selected TILE. VEC processes queries sequentially. Acceptable for the use case (quant KV in the first place implies bandwidth-bound long-context workloads, not prefill-heavy ones).
  • Decode is unaffected — VEC was already selected for Q->ne[1] == 1.
  • Limitation: head_dim > 256 (e.g. Gemma 4 full-attention d=512) cannot use VEC and still routes through TILE. Bounded temp buffer in a separate compilation unit is the proper fix for that case; not in scope here.

Test plan

  • Build on RDNA2 (gfx1030 / gfx1032): cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
  • Run llama-cli -m <model> -ngl 99 -fa on -ctk turbo4 -ctv turbo4 -p "Hello" -n 100 — should no longer hit max_blocks_per_sm > 0 assert
  • Also try -ctk q8_0 -ctv q8_0 for the upstream-quant path
  • Sanity check on RDNA3 (gfx1100): build + run, decode tps should be unchanged vs unpatched baseline

Diff is small (13 lines, one #ifdef GGML_USE_HIP block). No source code change in any kernel.

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to the full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, so the
temp buffer VRAM exceeds the savings from KV compression. The
combination also triggers `fattn-common.cuh:1405 GGML_ASSERT(
max_blocks_per_sm > 0)` on RDNA2 (gfx103x) because the MMA-family
kernels have no Wave32 SIMD path for that arch, so
cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0 at the
launch site.

Fix: on HIP, force the VEC kernel for quantized KV when
head_dim <= 256 and head_dim % 64 == 0 and K_len % FATTN_KQ_STRIDE
== 0. VEC inlines the dequant in-register with zero temp-buffer
overhead and works on RDNA2/3/3.5/4 + CDNA. The VEC kernel
already understands turbo3/turbo4/turbo2 natively
(fattn-vec.cuh, K_is_turbo / V_is_turbo specialisations).

Trade-off: prefill throughput may drop on the path that previously
selected TILE (VEC processes queries sequentially). Decode is
unaffected since VEC was already selected for Q->ne[1] == 1.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. That case needs a
bounded temp buffer in a separate compilation unit.

Cherry-picked from upstream TheTom/llama.cpp commit 8993d4fd7 to
domvox/llama.cpp-turboquant-hip @ 6a8df6c (clean HIP port).

Reported by datore990 on RX 6600 (gfx1032 spoofed as gfx1030).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Do-nny
Copy link
Copy Markdown

Do-nny commented May 11, 2026

Tested on RX 6700 (gfx1031), ROCm 7.2, Fedora 44.
Was hitting the exact max_blocks_per_sm > 0 assert with --cache-type-k turbo4 --cache-type-v turbo3 on a Qwen3.6-35B-A3B(head_dim=256).
Cherry-picked f84713f onto feature/turboquant-hip-port-clean, rebuilt for gfx1031, assert is gone and inference runs correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants