Summary
The Lucebox vendored ggml-cuda produces numerically incorrect output on AMD RDNA3 GPUs with native wave32 (gfx1102 / RX 7600 XT). The model loads correctly and runs at full speed (~24 tok/s with Gemma 4 12B Q4_K_M), but produces degenerate logits — always outputting the same token regardless of input.
Hardware
- GPU: AMD Radeon RX 7600 XT (16 GB VRAM, gfx1102, Navi 33, wave32)
- OS: Windows 11
- ROCm: 7.1, HIP 7.2
- Compiler: Ninja + clang (ROCm bundled)
Reproduction
- Build test_dflash with DFLASH27B_GPU_BACKEND=hip for gfx1102
- Run: test_dflash gemma-4-12b-it-Q4_K_M.gguf --daemon --max-ctx=512
- Send: generate <prompt.bin> 20 <out.bin> samp=1.0,0.9,50,1.0,42
- Output: 20× same token (e.g., '11111111111111111111')
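The symptom can also be confirmed by counting distinct token IDs in the output file instead of reading the decoded text. The sketch below assumes that out.bin is a flat array of little-endian int32 token IDs; that layout is an assumption and is not stated in this report. The helper names read_token_ids and is_degenerate, and the 0.9 threshold, are hypothetical.

```python
import struct
from collections import Counter


def read_token_ids(raw: bytes) -> list[int]:
    """Decode a flat buffer of little-endian int32 token IDs (assumed layout)."""
    usable = len(raw) - (len(raw) % 4)  # ignore a trailing partial record
    return [t[0] for t in struct.iter_unpack("<i", raw[:usable])]


def is_degenerate(token_ids: list[int], threshold: float = 0.9) -> bool:
    """Return True if one token makes up at least `threshold` of the output."""
    if not token_ids:
        return False
    _, top_count = Counter(token_ids).most_common(1)[0]
    return top_count / len(token_ids) >= threshold
```

To use it, read the output file in binary mode, pass the bytes to read_token_ids, and pass the result to is_degenerate. A run that shows the bug described here would be flagged, since all 20 generated tokens are identical.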
Root cause analysis
The official llama.cpp ggml-cuda works correctly on this hardware (produces coherent output at 16.1 tok/s). The Lucebox vendored ggml-cuda is based on an older llama.cpp version (~late May 2026, around PR #23483) that predates proper gfx1102/wave32 support.
The Lucebox fork adds custom kernels (fattn-chunked, fattn-sparse, moe-fused, tq3-quant, turbo-wht, gated_delta_net tree variants) on top of this older base. Replacing the entire ggml-cuda with the current official version while preserving these custom additions proved complex due to accumulated API changes (GGML_HINT_SRC0_IS_HADAMARD, GGML_TYPE_TQ3_0, ggml-backend API evolution, vendors/hip.h shim changes).
Tested
- ✅ llama.cpp official (latest): correct output on gfx1102
- ❌ Lucebox vendored ggml-cuda: degenerate logits on gfx1102
- ✅ Lucebox vendored ggml-cuda: reported working on gfx1100 (RX 7900 XTX) and gfx1151 (Strix Halo) — both wave64
Suggested fix
The vendored llama.cpp needs a rebase onto current upstream llama.cpp to get the gfx1102/wave32 fixes. This requires:
- Merging upstream ggml-cuda changes into the Lucebox fork
- Adjusting Lucebox custom kernels for any API changes
- Re-testing on all supported GPU architectures
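For the re-testing step, one possible check is to compare the logits of the rebased build against the official llama.cpp build for the same prompt on each architecture. The sketch below assumes both builds can dump the final-position logits as a list of floats; no such dump option is described in this report, so that part is hypothetical, as are the function names and the tolerance values.

```python
import math


def argmax(values: list[float]) -> int:
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference: list[float], candidate: list[float],
                   atol: float = 1e-2, rtol: float = 1e-2) -> dict:
    """Compare candidate logits against a trusted reference run.

    The tolerances are placeholders; quantized kernels can differ slightly
    between backends, so they would need tuning per model.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors differ in length")
    if not reference:
        raise ValueError("empty logit vectors")
    # NaN or inf in the candidate is an immediate failure.
    finite = all(math.isfinite(c) for c in candidate)
    close = finite and all(
        math.isclose(r, c, rel_tol=rtol, abs_tol=atol)
        for r, c in zip(reference, candidate)
    )
    same_top = finite and argmax(reference) == argmax(candidate)
    return {"finite": finite, "close": close, "same_top_token": same_top}
```

A build with degenerate logits would be expected to fail the close check, and usually the same_top_token check as well, while a correct build should pass both within tolerance.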
Related
🤖 Generated with Claude Code