UPSTREAM PR #21629: cuda: Q1_0 initial backend by loci-dev · Pull Request #1343 · auroralabs-loci/llama.cpp

loci-dev · 2026-04-11T02:17:43Z

Note

Source pull request: ggml-org/llama.cpp#21629

Overview

Follow-up to the merged Q1_0 CPU PR: this PR adds the corresponding CUDA backend.
It also seems to work on AMD in some cases, which was a nice surprise :)

See a live demo of Bonsai 8B using these CUDA kernels and llama-server on the Hugging Face Space prism-ml/Bonsai-demo, using an L40S GPU and getting decent speeds. Each request runs on one GPU behind a naive load balancer (for demo purposes only).

Models:

prism-ml/Bonsai-8B-gguf
prism-ml/Bonsai-4B-gguf
prism-ml/Bonsai-1.7B-gguf
See more details here: 1-bit-bonsai-8b-whitepaper.pdf
See a working demo from our Bonsai-demo repo: Bonsai-demo

Questions:

I could not get DP4A working for these kernels; I kept getting wrong results. Is DP4A support required, or is it okay to fall back to cuBLAS for that path? It seems to matter only for GPUs from a few generations ago.
I tried tuning the kernel a bit but am not sure it is fully optimized. Surprisingly, I get similar speeds on the 4090 and the 5090.
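For readers unfamiliar with the DP4A question above, here is a minimal Python sketch of what a DP4A instruction computes: a dot product over four signed 8-bit lanes packed into 32-bit words, added to an accumulator. It is only an emulation for checking results on the host, not the kernel code from this PR. The helper names (`pack_int8x4`, `dp4a`, `signs_from_bits`) and the bit convention (1 maps to +1, 0 maps to -1) are assumptions made for illustration; the actual Q1_0 bit layout is not described here.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one signed 32-bit word (little-endian)."""
    return struct.unpack("<i", struct.pack("<4b", *values))[0]


def dp4a(a, b, c):
    """Emulate a signed DP4A: c + sum(a_i * b_i) over the four int8 lanes of a and b.

    Unlike the hardware instruction, the result is a Python int and does not
    wrap around at 32 bits.
    """
    a_lanes = struct.unpack("<4b", struct.pack("<i", a))
    b_lanes = struct.unpack("<4b", struct.pack("<i", b))
    return c + sum(x * y for x, y in zip(a_lanes, b_lanes))


def signs_from_bits(bits4):
    """Expand 4 weight bits into +1/-1 int8 lanes.

    Assumed convention (hypothetical): bit value 1 -> +1, bit value 0 -> -1.
    """
    return pack_int8x4([1 if (bits4 >> i) & 1 else -1 for i in range(4)])


# Four 1-bit weights against four int8 activations.
weights = signs_from_bits(0b0101)          # lanes: +1, -1, +1, -1
activations = pack_int8x4([10, 20, -30, 40])
acc = dp4a(weights, activations, 0)

# Scalar reference for the same dot product.
reference = 10 * 1 + 20 * -1 + -30 * 1 + 40 * -1
assert acc == reference
```

Comparing a kernel's integer accumulators against a scalar reference like this, before any scale is applied, is one way to narrow down where wrong results come from.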

llama-bench (-fa 1)

Device: NVIDIA RTX 5090 (32 GB), CUDA backend

Bonsai-1.7B (231.13 MiB, 1.72B params)

model	size	params	backend	ngl	fa	test	t/s
qwen3 1.7B	231.13 MiB	1.72 B	CUDA	99	1	pp512	29249.58 ± 4403.05
qwen3 1.7B	231.13 MiB	1.72 B	CUDA	99	1	tg128	626.18 ± 7.55

Bonsai-4B (540.09 MiB, 4.02B params)

model	size	params	backend	ngl	fa	test	t/s
qwen3 4B	540.09 MiB	4.02 B	CUDA	99	1	pp512	18621.21 ± 1839.94
qwen3 4B	540.09 MiB	4.02 B	CUDA	99	1	tg128	485.21 ± 2.35

Bonsai-8B (1.07 GiB, 8.19B params)

model	size	params	backend	ngl	fa	test	t/s
qwen3 8B	1.07 GiB	8.19 B	CUDA	99	1	pp512	12287.47 ± 719.62
qwen3 8B	1.07 GiB	8.19 B	CUDA	99	1	tg128	373.77 ± 2.01
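As a quick sanity check on the sizes reported above, the average storage cost per weight can be derived from the size and params columns. The sketch below uses only the rounded figures printed by llama-bench, so the results are approximate, and they average over every tensor in the file, not only the Q1_0 ones.

```python
MIB = 2**20
GIB = 2**30

# (file size in bytes, parameter count), taken from the llama-bench tables above.
models = {
    "Bonsai-1.7B": (231.13 * MIB, 1.72e9),
    "Bonsai-4B": (540.09 * MIB, 4.02e9),
    "Bonsai-8B": (1.07 * GIB, 8.19e9),
}


def bits_per_weight(size_bytes, n_params):
    """Average number of stored bits per model parameter."""
    return size_bytes * 8 / n_params


for name, (size_bytes, n_params) in models.items():
    print(f"{name}: {bits_per_weight(size_bytes, n_params):.2f} bits/weight")
```

All three models come out at a little over 1 bit per weight, which is what one would expect from 1-bit weights plus a small overhead for scales and any tensors stored at higher precision.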

End-to-end testing: KL Divergence (Q1_0 vs unpacked into FP16)

To test the accuracy of the CUDA backend, we measure the KL divergence between the Q1_0 model and the same model unpacked into FP16. The weights are equivalent, so comparing the logits gives a good indication of the accuracy of the CUDA kernels. Run on 20 chunks of wikitext-2-raw with a context size of 512.

Each model is tested against its unpacked version, available here: https://huggingface.co/collections/prism-ml/bonsai-auxiliary

Model	Mean KLD	RMS Δp	Same top p	pp512 (t/s)	tg128 (t/s)
8B	0.000514	0.635%	98.706%	12287	374
4B	0.000429	0.593%	98.510%	18621	485
1.7B	0.000419	0.555%	98.941%	29250	626
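The columns in the table above can be illustrated with a small pure-Python sketch. This is not the tool used to produce the numbers; it is a simplified version under stated assumptions: `ref_logits` are per-token logits from the FP16 model, `test_logits` are from the Q1_0 model, Δp is taken as the change in probability of the correct next token, and "Same top p" is the fraction of positions where both models rank the same token highest. The function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def compare_logits(ref_logits, test_logits, targets):
    """Return mean KLD, RMS of the change in correct-token probability, and top-1 agreement."""
    klds, deltas, same_top = [], [], 0
    for ref, test, target in zip(ref_logits, test_logits, targets):
        p, q = softmax(ref), softmax(test)
        klds.append(kl_divergence(p, q))
        deltas.append(q[target] - p[target])
        same_top += p.index(max(p)) == q.index(max(q))
    n = len(klds)
    return {
        "mean_kld": sum(klds) / n,
        "rms_dp": math.sqrt(sum(d * d for d in deltas) / n),
        "same_top": same_top / n,
    }


# Toy example: two token positions, vocabulary of three tokens.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
test = [[2.1, 0.9, 0.1], [0.4, 2.6, 0.3]]
stats = compare_logits(ref, test, targets=[0, 1])
print(stats)
```

With identical logits on both sides, the mean KLD and RMS Δp are zero and the top-1 agreement is 100%, so small values in the first two columns and a high value in the third indicate that the quantized kernels track the FP16 reference closely.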

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, AI was used to help debug the initial kernels, add debugging prints, etc. That code is not included in this PR. I ran the PR with llama-bench and the KL divergence tests above to ensure correctness.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

loci-review · 2026-04-11T03:15:55Z

No meaningful performance changes were detected across 125311 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

💬 Questions? Tag @loci-dev

khosravipasha and others added 4 commits April 8, 2026 00:55

[cuda] initial Q1_0 backend

7c3501a

remove unused code, fix AMD MMA guard

84ab75f

attempt to support dp4a

bca0c0b

Apply suggestions from code review

05b0c84

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

loci-dev temporarily deployed to PROD__AL_DEMO April 11, 2026 02:17 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 9 times, most recently from d101579 to 63ab8d1 Compare April 18, 2026 02:17

loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19

loci-dev wants to merge 4 commits into main from loci/pr-21629-q1-cuda
