Numerical bug: ggml-cuda fork produces degenerate logits on gfx1102/wave32 (RX 7600 XT)

## Summary
The Lucebox vendored ggml-cuda produces numerically incorrect output on the AMD RX 7600 XT (gfx1102, RDNA3, native wave32). The model loads correctly and runs at full speed (~24 tok/s with Gemma 4 12B Q4_K_M), but the logits are degenerate: the same token is emitted at every step regardless of input.

## Hardware
- GPU: AMD Radeon RX 7600 XT (16 GB VRAM, gfx1102, Navi 33, wave32)
- OS: Windows 11
- ROCm: 7.1, HIP 7.2
- Compiler: Ninja + clang (ROCm bundled)

## Reproduction
1. Build `test_dflash` with `DFLASH27B_GPU_BACKEND=hip` for gfx1102
2. Run: `test_dflash gemma-4-12b-it-Q4_K_M.gguf --daemon --max-ctx=512`
3. Send: `generate <prompt.bin> 20 <out.bin> samp=1.0,0.9,50,1.0,42`
4. Observed output: the same token repeated 20 times (e.g., `11111111111111111111`)
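
The check in step 4 can be made mechanical. The sketch below is a minimal example, assuming `out.bin` holds the generated token IDs as little-endian 32-bit integers. The actual layout written by `test_dflash` is not stated in this report, so the format string and the helper names `read_token_ids` and `is_degenerate` are hypothetical and may need adjusting.

```python
import struct
from collections import Counter


def read_token_ids(raw: bytes) -> list[int]:
    """Decode a byte buffer as little-endian int32 token IDs (assumed layout)."""
    if len(raw) % 4 != 0:
        raise ValueError("buffer length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw))


def is_degenerate(token_ids: list[int], threshold: float = 0.9) -> bool:
    """Return True when one token makes up at least `threshold` of the output."""
    if not token_ids:
        return False
    _, top_count = Counter(token_ids).most_common(1)[0]
    return top_count / len(token_ids) >= threshold
```

To apply it, read the output file in binary mode and pass the bytes to `read_token_ids`, then hand the result to `is_degenerate`. The 0.9 threshold is an arbitrary choice; the failure described here would trigger it at any threshold up to 1.0.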

## Root cause analysis
The official `llama.cpp` ggml-cuda works correctly on this hardware (coherent output at 16.1 tok/s), so the fault is specific to the vendored copy. The Lucebox vendored ggml-cuda is based on an older llama.cpp version (~late May 2026, around PR #23483) that predates proper gfx1102/wave32 support. This is the likely cause; the specific faulty kernel is not identified in this report.

The Lucebox fork adds custom kernels (`fattn-chunked`, `fattn-sparse`, `moe-fused`, `tq3-quant`, `turbo-wht`, `gated_delta_net` tree variants) on top of this older base. Replacing the entire ggml-cuda with the current official version while preserving these custom additions proved complex due to accumulated API changes (`GGML_HINT_SRC0_IS_HADAMARD`, `GGML_TYPE_TQ3_0`, ggml-backend API evolution, `vendors/hip.h` shim changes).

## Tested
- ✅ `llama.cpp` official (latest): correct output on gfx1102
- ❌ Lucebox vendored ggml-cuda: degenerate logits on gfx1102
- ✅ Lucebox vendored ggml-cuda: reported working on gfx1100 (RX 7900 XTX) and gfx1151 (Strix Halo) — both wave64
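
To go beyond pass/fail, the logits for one prompt position could be dumped from both builds and compared numerically. The following is a rough sketch under the assumption that each build can export its logits as a flat list of floats over the vocabulary; neither build is known to provide such a dump out of the box, and `compare_logits` is a hypothetical helper, not part of llama.cpp or Lucebox.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a list of finite floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def compare_logits(reference: list[float], candidate: list[float]) -> dict:
    """Compare a candidate logit vector against a known-good reference."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("logit vectors must be non-empty and equally long")
    if not all(math.isfinite(x) for x in reference):
        raise ValueError("reference logits must be finite")
    # NaN or Inf in the candidate is already a failure; skip further math.
    if not all(math.isfinite(x) for x in candidate):
        return {"finite": False, "argmax_match": False,
                "max_abs_diff": math.inf, "kl": math.inf}
    n = len(reference)
    ref_top = max(range(n), key=reference.__getitem__)
    cand_top = max(range(n), key=candidate.__getitem__)
    lp = log_softmax(reference)
    lq = log_softmax(candidate)
    # KL(reference || candidate) over the softmax distributions.
    kl = sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return {
        "finite": True,
        "argmax_match": ref_top == cand_top,
        "max_abs_diff": max(abs(a - b) for a, b in zip(reference, candidate)),
        "kl": kl,
    }
```

A correct backend should agree on the argmax and show a KL divergence near zero, with small differences attributable to floating-point ordering. A large divergence at the very first decoded position would suggest the fault is in the forward pass itself rather than in sampling.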

## Suggested fix
The vendored ggml-cuda needs to be rebased onto current upstream llama.cpp to pick up the gfx1102/wave32 fixes. This requires:
1. Merging upstream ggml-cuda changes into the Lucebox fork
2. Adjusting Lucebox custom kernels for any API changes
3. Re-testing on all supported GPU architectures

## Related
- PR #366: Windows HIP build port (independent of this numerical bug)
- Official llama.cpp commit range: ~#23483 → #24357 (~900 PRs of divergence)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
