Numerical bug: ggml-cuda fork produces degenerate logits on gfx1102/wave32 (RX 7600 XT) #367

@Sprize1

Summary

The Lucebox vendored ggml-cuda produces numerically incorrect output on AMD RDNA3 GPUs with native wave32 (gfx1102 / RX 7600 XT). The model loads correctly and runs at full speed (~24 tok/s with Gemma 4 12B Q4_K_M), but produces degenerate logits — always outputting the same token regardless of input.

Hardware

  • GPU: AMD Radeon RX 7600 XT (16 GB VRAM, gfx1102, Navi 33, wave32)
  • OS: Windows 11
  • ROCm: 7.1, HIP 7.2
  • Compiler: Ninja + clang (ROCm bundled)

Reproduction

  1. Build test_dflash with DFLASH27B_GPU_BACKEND=hip for gfx1102
  2. Run: test_dflash gemma-4-12b-it-Q4_K_M.gguf --daemon --max-ctx=512
  3. Send: generate <prompt.bin> 20 <out.bin> samp=1.0,0.9,50,1.0,42
  4. Output: 20× same token (e.g., '11111111111111111111')
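The "same token repeated" symptom in step 4 can be checked mechanically instead of by eye. The sketch below assumes that the output file is a flat array of little-endian int32 token IDs. That layout is a guess, not something the issue confirms, and the helper names `read_token_ids` and `is_degenerate` are hypothetical.

```python
import struct
from collections import Counter


def read_token_ids(raw: bytes) -> list[int]:
    """Decode a byte buffer as little-endian int32 token IDs (assumed layout)."""
    usable = len(raw) - (len(raw) % 4)  # ignore any trailing partial word
    return [t[0] for t in struct.iter_unpack("<i", raw[:usable])]


def is_degenerate(tokens: list[int], threshold: float = 0.9) -> bool:
    """True when a single token ID makes up at least `threshold` of the output."""
    if not tokens:
        return False
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens) >= threshold


# Synthetic stand-ins for <out.bin>; a real file may use a different layout.
stuck = struct.pack("<20i", *([7] * 20))
varied = struct.pack("<20i", *range(100, 120))

print(is_degenerate(read_token_ids(stuck)))   # True
print(is_degenerate(read_token_ids(varied)))  # False
```

To use it on a real run, replace the synthetic buffers with the bytes read from the output file, after confirming the actual file format.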

Root cause analysis

The official llama.cpp ggml-cuda works correctly on this hardware (produces coherent output at 16.1 tok/s). The Lucebox vendored ggml-cuda is based on an older llama.cpp version (~late May 2026, around PR #23483) that predates proper gfx1102/wave32 support.

The Lucebox fork adds custom kernels (fattn-chunked, fattn-sparse, moe-fused, tq3-quant, turbo-wht, gated_delta_net tree variants) on top of this older base. Replacing the entire ggml-cuda with the current official version while preserving these custom additions proved complex due to accumulated API changes (GGML_HINT_SRC0_IS_HADAMARD, GGML_TYPE_TQ3_0, ggml-backend API evolution, vendors/hip.h shim changes).
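For readers unfamiliar with why wave width matters, here is a small pure-Python model of one class of bug that a wave-size mismatch can cause. It is an illustration only: the issue does not identify which kernel is at fault, and nothing here is taken from the Lucebox or llama.cpp sources. The model assumes that an XOR shuffle whose partner lane lies outside the hardware wave returns the calling lane's own value.

```python
def shfl_xor(lanes, lane, offset):
    """Model of an XOR shuffle: read from lane ^ offset, or keep the own value
    when that partner lies outside the hardware wave (modelling assumption)."""
    partner = lane ^ offset
    return lanes[partner] if partner < len(lanes) else lanes[lane]


def wave_reduce_sum(values, assumed_width):
    """Butterfly sum across one wave, with the loop bound taken from
    `assumed_width` rather than from the real wave size len(values)."""
    lanes = list(values)
    offset = assumed_width // 2
    while offset > 0:
        lanes = [lanes[i] + shfl_xor(lanes, i, offset) for i in range(len(lanes))]
        offset //= 2
    return lanes[0]


hardware_wave = [1.0] * 32  # 32 lanes, each holding 1.0

print(wave_reduce_sum(hardware_wave, assumed_width=32))  # 32.0
print(wave_reduce_sum(hardware_wave, assumed_width=64))  # 64.0
```

In this model a reduction written for 64 lanes runs one extra step on a 32-lane wave and doubles the sum without raising any error. Silent errors of this kind in softmax or normalization reductions would be consistent with a model that runs at full speed but emits wrong logits, though that link is speculative here.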

Tested

  • ✅ llama.cpp official (latest): correct output on gfx1102
  • ❌ Lucebox vendored ggml-cuda: degenerate logits on gfx1102
  • ✅ Lucebox vendored ggml-cuda: reported working on gfx1100 (RX 7900 XTX) and gfx1151 (Strix Halo) — both wave64
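The pass/fail matrix above could be made more precise by comparing logits for the same prompt across the two builds. The following is a minimal sketch under the assumption that both builds can dump a logit vector for one decoding step. No such dump option is described in this issue, and the function name `compare_logits` and the tolerance `atol` are hypothetical.

```python
import math


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


def compare_logits(reference, candidate, atol=1e-2):
    """Summarise how a candidate logit vector differs from a reference one."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("logit vectors must be non-empty and of equal length")
    finite = all(math.isfinite(x) for x in candidate)
    max_abs_diff = (
        max(abs(r - c) for r, c in zip(reference, candidate)) if finite else math.inf
    )
    return {
        "finite": finite,
        "same_argmax": finite and argmax(reference) == argmax(candidate),
        "max_abs_diff": max_abs_diff,
        "ok": finite and max_abs_diff <= atol,
    }


# Made-up vectors standing in for real dumps from the two builds.
reference = [0.1, 2.5, -1.0, 0.3]
close = [0.1, 2.501, -1.0, 0.3]
broken = [9.0, 0.0, 0.0, 0.0]

print(compare_logits(reference, close)["ok"])            # True
print(compare_logits(reference, broken)["same_argmax"])  # False
```

A suitable tolerance depends on the quantization and kernels involved, so `atol` would need tuning against two builds that are known to agree.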

Suggested fix

The vendored ggml-cuda needs a rebase onto current upstream llama.cpp to pick up the gfx1102/wave32 fixes. This requires:

  1. Merging upstream ggml-cuda changes into the Lucebox fork
  2. Adjusting Lucebox custom kernels for any API changes
  3. Re-testing on all supported GPU architectures

🤖 Generated with Claude Code