fix(hip): force VEC FA path for quantized KV (fixes RDNA2 max_blocks_per_sm assert)#13
Open
TheTom wants to merge 1 commit into
Open
fix(hip): force VEC FA path for quantized KV (fixes RDNA2 max_blocks_per_sm assert)#13TheTom wants to merge 1 commit into
TheTom wants to merge 1 commit into
Conversation
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers proportional to the full KV cache length for any quantized KV type. On ROCm/HIP these pool allocations persist at peak size, so the temp buffer VRAM exceeds the savings from KV compression. The combination also triggers `fattn-common.cuh:1405 GGML_ASSERT( max_blocks_per_sm > 0)` on RDNA2 (gfx103x) because the MMA-family kernels have no Wave32 SIMD path for that arch, so cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0 at the launch site. Fix: on HIP, force the VEC kernel for quantized KV when head_dim <= 256 and head_dim % 64 == 0 and K_len % FATTN_KQ_STRIDE == 0. VEC inlines the dequant in-register with zero temp-buffer overhead and works on RDNA2/3/3.5/4 + CDNA. The VEC kernel already understands turbo3/turbo4/turbo2 natively (fattn-vec.cuh, K_is_turbo / V_is_turbo specialisations). Trade-off: prefill throughput may drop on the path that previously selected TILE (VEC processes queries sequentially). Decode is unaffected since VEC was already selected for Q->ne[1] == 1. Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512) cannot use VEC and still routes through TILE. That case needs a bounded temp buffer in a separate compilation unit. Cherry-picked from upstream TheTom/llama.cpp commit 8993d4fd7 to domvox/llama.cpp-turboquant-hip @ 6a8df6c (clean HIP port). Reported by datore990 on RX 6600 (gfx1032 spoofed as gfx1030). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Tested on RX 6700 (gfx1031), ROCm 7.2, Fedora 44. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
On HIP/ROCm, the TILE/MMA/WMMA FA paths in
ggml-cuda/fattn.cuallocate unbounded f16 temp buffers proportional to the full KV cache length for any quantized KV type. The pool retains peak allocation size, so the temp buffer VRAM ends up larger than the savings from KV compression.The same combination triggers a hard crash on RDNA2 (gfx103x):
fattn-mma-f16.cuhhas no Wave32 path for RDNA2 —amd_wmma_available()returns true only for RDNA3+,amd_mfma_available()only for CDNA. RDNA2 falls through the#elsebranch andcudaOccupancyMaxActiveBlocksPerMultiprocessorreturns 0 at the launch site, tripping the assert.This was reported by @datore990 on RX 6600 (gfx1032 with
HSA_OVERRIDE_GFX_VERSION=10.3.0) at https://github.com/domvox/llama.cpp-turboquant-hip and traced cleanly to the missing dispatch fallback.Fix
On HIP, force the VEC kernel for quantized KV when
head_dim ≤ 256 && head_dim % 64 == 0 && K_len % FATTN_KQ_STRIDE == 0:The VEC kernel inlines dequant in-register with zero temp-buffer overhead and works on RDNA2/3/3.5/4 + CDNA. It already understands
turbo3/turbo4/turbo2natively (seefattn-vec.cuh,K_is_turbo/V_is_turbospecialisations) andq8_0/q4_0via the existing mixed-KV dispatch.This is a cherry-pick of
8993d4fd7fromTheTom/llama.cppmainline (April 2026, where it was developed against the same OOM symptom on gfx1100 + gfx1200 before the RDNA2 assert was found).Trade-offs
Q->ne[1] == 1.head_dim > 256(e.g. Gemma 4 full-attentiond=512) cannot use VEC and still routes through TILE. Bounded temp buffer in a separate compilation unit is the proper fix for that case; not in scope here.Test plan
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030llama-cli -m <model> -ngl 99 -fa on -ctk turbo4 -ctv turbo4 -p "Hello" -n 100— should no longer hitmax_blocks_per_sm > 0assert-ctk q8_0 -ctv q8_0for the upstream-quant pathDiff is small (13 lines, one
#ifdef GGML_USE_HIPblock). No source code change in any kernel.