UPSTREAM PR #21629: cuda: Q1_0 initial backend #1343

Open
loci-dev wants to merge 4 commits into main from loci/pr-21629-q1-cuda

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21629

Overview

Follow-up to the merged Q1_0 CPU PR. This PR adds the corresponding CUDA backend.
It also seems to work on AMD in some cases, which was a nice surprise :)

See a live demo of Bonsai 8B using these CUDA kernels and llama-server on the Hugging Face Space prism-ml/Bonsai-demo, running on an L40S GPU at decent speeds. Each request runs on one GPU behind a naive load balancer (for demo purposes only).

Models:

Questions:

  • I could not get DP4A working for these kernels; I kept getting wrong results. Is DP4A support required, or is it okay to fall back to cuBLAS for that path? It seems to matter only for GPUs from a few generations ago.
  • I tried tuning the kernel a bit but am not sure it is fully optimized. Surprisingly, I get similar speeds on the 4090 and the 5090.
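For readers unfamiliar with the DP4A question above, here is a minimal Python emulation of what the signed CUDA `__dp4a(a, b, c)` intrinsic computes: each 32-bit operand is treated as four packed signed 8-bit integers, the four products are summed, and the sum is added to the accumulator. The helper names `pack_int8x4` and `dp4a` are hypothetical and the input values are arbitrary; this sketch says nothing about how the Q1_0 kernels in this PR lay out their data.

```python
import struct

def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def dp4a(a, b, c):
    """Emulate signed __dp4a: c plus the dot product of four int8 lanes.

    Unlike the hardware instruction, Python integers do not wrap at 32 bits.
    """
    a_lanes = struct.unpack("<4b", struct.pack("<I", a))
    b_lanes = struct.unpack("<4b", struct.pack("<I", b))
    return c + sum(x * y for x, y in zip(a_lanes, b_lanes))

# Arbitrary illustrative values, not taken from the PR.
lhs = pack_int8x4([1, -1, 1, -1])
rhs = pack_int8x4([10, 20, -30, 40])
print(dp4a(lhs, rhs, 0))  # 10 - 20 - 30 - 40 = -80
```

A reference like this can be handy for checking a kernel's integer accumulation against a known-good result when debugging wrong outputs.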

llama-bench (-fa 1)

Device: NVIDIA RTX 5090 (32 GB), CUDA backend

Bonsai-1.7B (231.13 MiB, 1.72B params)

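| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 1.7B | 231.13 MiB | 1.72 B | CUDA | 99 | 1 | pp512 | 29249.58 ± 4403.05 |
| qwen3 1.7B | 231.13 MiB | 1.72 B | CUDA | 99 | 1 | tg128 | 626.18 ± 7.55 |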

Bonsai-4B (540.09 MiB, 4.02B params)

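| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B | 540.09 MiB | 4.02 B | CUDA | 99 | 1 | pp512 | 18621.21 ± 1839.94 |
| qwen3 4B | 540.09 MiB | 4.02 B | CUDA | 99 | 1 | tg128 | 485.21 ± 2.35 |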

Bonsai-8B (1.07 GiB, 8.19B params)

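| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 8B | 1.07 GiB | 8.19 B | CUDA | 99 | 1 | pp512 | 12287.47 ± 719.62 |
| qwen3 8B | 1.07 GiB | 8.19 B | CUDA | 99 | 1 | tg128 | 373.77 ± 2.01 |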
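As a quick sanity check on the sizes reported above, the effective bits per weight can be estimated from file size and parameter count. This is only a back-of-the-envelope sketch: the function name `bits_per_weight` is hypothetical, the inputs are the rounded figures from the llama-bench output, and the result covers the whole file (including any tensors that are not stored as Q1_0), so it is an average rather than the Q1_0 block size.

```python
MIB = 2**20
GIB = 2**30

def bits_per_weight(size_bytes, n_params):
    """Average number of stored bits per parameter for a model file."""
    return size_bytes * 8 / n_params

# Sizes and parameter counts as reported by llama-bench above (rounded).
models = {
    "Bonsai-1.7B": (231.13 * MIB, 1.72e9),
    "Bonsai-4B": (540.09 * MIB, 4.02e9),
    "Bonsai-8B": (1.07 * GIB, 8.19e9),
}

for name, (size_bytes, n_params) in models.items():
    print(f"{name}: {bits_per_weight(size_bytes, n_params):.2f} bits/weight")
```

With these rounded inputs, all three models come out at a little over 1.1 bits per weight.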

End-to-end testing: KL Divergence (Q1_0 vs unpacked into FP16)

To test the accuracy of the CUDA backend, we compare the KL divergence of the Q1_0 model against the unpacked FP16 model. The weights are equivalent, so comparing the logits gives a good indication of the accuracy of the CUDA backend. Run on 20 chunks of wikitext-2-raw with a context size of 512.

Each model is tested against its unpacked version, available here: https://huggingface.co/collections/prism-ml/bonsai-auxiliary

| Model | Mean KLD | RMS Δp | Same top p | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- | --- | --- |
| 8B | 0.000514 | 0.635% | 98.706% | 12287 | 374 |
| 4B | 0.000429 | 0.593% | 98.510% | 18621 | 485 |
| 1.7B | 0.000419 | 0.555% | 98.941% | 29250 | 626 |
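To make the columns above concrete, the following sketch shows one way such metrics could be computed from two sets of per-token logits. It is not the tooling used for this PR. It assumes that "Mean KLD" is the per-token KL divergence from the reference (FP16) distribution to the test (Q1_0) distribution, that "RMS Δp" is the root mean square change in the probability assigned to the actual next token, and that "Same top p" is the fraction of positions where both models pick the same most likely token. The function names and toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def compare_logits(ref_logits, test_logits, target_ids):
    """Compare two models position by position.

    ref_logits, test_logits: one logit vector per token position.
    target_ids: index of the actual next token at each position.
    """
    kld_sum, dp_sq_sum, same_top = 0.0, 0.0, 0
    for ref, test, target in zip(ref_logits, test_logits, target_ids):
        lp, lq = log_softmax(ref), log_softmax(test)
        # KL(P_ref || P_test) = sum_i p_i * (log p_i - log q_i)
        kld_sum += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
        dp_sq_sum += (math.exp(lq[target]) - math.exp(lp[target])) ** 2
        same_top += int(lp.index(max(lp)) == lq.index(max(lq)))
    n = len(target_ids)
    return {
        "mean_kld": kld_sum / n,
        "rms_dp": math.sqrt(dp_sq_sum / n),
        "same_top": same_top / n,
    }

# Toy data: two positions over a three-token vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
test = [[2.1, 0.9, 0.1], [0.4, 2.6, 0.3]]
print(compare_logits(ref, test, target_ids=[0, 1]))
```

When both models produce identical logits, the mean KLD and RMS Δp are zero and the top-token agreement is 100%, which is the baseline the near-zero values in the table are measured against.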

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, AI was used to help debug the initial kernels, add debugging prints, etc. That code is not included in this PR. I ran the PR with llama-bench and the KL divergence tests above to ensure correctness.

@loci-review

loci-review Bot commented Apr 11, 2026

No meaningful performance changes were detected across 125311 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from d101579 to 63ab8d1 Compare April 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19