onnx: run MatMulNBits int4 through the fused Q4_0 block-quant matmul by czoli1976 · Pull Request #2340 · sonos/tract

czoli1976 · 2026-06-04T07:44:19Z

int4-quantized LLM weights now stay int4 in memory instead of being expanded to f32 at load:
linear layers are ~7.1× smaller than f32 and ~3.6× smaller than f16 (Mistral-7B linears 28 → 4 GB),
near-lossless. For block_size=32 + symmetric + K % 32 == 0, MatMulNBits keeps its weight
as a Q4_0 block-quant constant (dequantized inside the matmul packer), reusing the existing
block-quant path. Anything else falls back to the f32 weight — nothing that worked before changes.
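The size ratios and the gate above can be checked with a little arithmetic. The sketch below assumes the GGML Q4_0 layout (one f16 scale plus 32 four-bit values, 18 bytes per block); `eligible_for_q4_0` and `q4_0_bytes` are hypothetical helper names for illustration, not functions from tract.

```rust
/// Number of weights per Q4_0 block.
const BLOCK_LEN: usize = 32;
/// Assumed bytes per block: a 2-byte f16 scale plus 32 four-bit values (16 bytes).
const BLOCK_BYTES: usize = 2 + BLOCK_LEN / 2;

/// Hypothetical gate mirroring the three conditions in the description:
/// block_size == 32, symmetric quantization, and K a multiple of 32.
fn eligible_for_q4_0(block_size: usize, symmetric: bool, k: usize) -> bool {
    block_size == BLOCK_LEN && symmetric && k % BLOCK_LEN == 0
}

/// Weight bytes for a K x N linear layer stored as Q4_0 blocks along K.
fn q4_0_bytes(k: usize, n: usize) -> usize {
    (k / BLOCK_LEN) * n * BLOCK_BYTES
}

fn main() {
    let (k, n) = (4096, 4096);
    let q4 = q4_0_bytes(k, n) as f64;
    let f32_bytes = (k * n * 4) as f64;
    let f16_bytes = (k * n * 2) as f64;
    // 18 bytes per 32 weights is 4.5 bits per weight.
    println!("vs f32: {:.2}x", f32_bytes / q4); // 32 / 4.5
    println!("vs f16: {:.2}x", f16_bytes / q4); // 16 / 4.5
    println!("gate(32, symmetric, K=4096): {}", eligible_for_q4_0(32, true, k));
    println!("gate(64, symmetric, K=4096): {}", eligible_for_q4_0(64, true, k));
}
```

At 4.5 bits per weight the ratios come out to about 7.11 against f32 and 3.56 against f16, consistent with the figures quoted above.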

Decode is modestly faster on CPU (1.1× at 0.5B → 1.33× at Phi-3-mini, M=1): the block-quant
packer dequantizes to f32 and runs the f32 kernel, so the win there is weight bandwidth, not int4
compute. On the Metal backend the same Q4_0 weight is what the existing kernel_mul_mv_q4_0_f32
int4 kernel consumes (not benched here).

What changed

Q4_0::pack_prequantized (linalg): builds Q4_0 storage from the existing 4-bit values without
re-quantizing — only the f16 scale is rounded, so the model's own int4 weights are preserved.
mat_mul_nbits.rs: the gated case emits the Q4_0 constant + a group-axis EinSum (mirroring
de_block_quant); the general case is untouched.
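As a rough illustration of packing without re-quantizing, the sketch below moves 32 existing 4-bit values into a Q4_0-style block by reordering nibbles only. It is not the actual `Q4_0::pack_prequantized` code. It assumes the source stores element 2i in the low nibble of byte i and element 2i+1 in the high nibble, and that the target uses the GGML order (element j in the low nibble of byte j, element j+16 in the high nibble, implicit offset of 8). `Q4Block`, `repack_block`, and `dequantize_block` are hypothetical names, and the scale is kept as f32 to stay dependency-free, whereas the real format stores an f16.

```rust
const BLOCK_LEN: usize = 32;

/// Hypothetical view of one Q4_0 block (scale is f16 in the real format).
#[derive(Debug, Clone, PartialEq)]
struct Q4Block {
    scale: f32,
    qs: [u8; BLOCK_LEN / 2],
}

/// Reorder nibbles from the assumed source order into the assumed GGML order.
/// The 4-bit values themselves are copied unchanged.
fn repack_block(src: &[u8; BLOCK_LEN / 2], scale: f32) -> Q4Block {
    // Element i of the source block: low nibble for even i, high nibble for odd i.
    let nibble = |i: usize| (src[i / 2] >> ((i % 2) * 4)) & 0x0F;
    let mut qs = [0u8; BLOCK_LEN / 2];
    for j in 0..BLOCK_LEN / 2 {
        qs[j] = nibble(j) | (nibble(j + BLOCK_LEN / 2) << 4);
    }
    Q4Block { scale, qs }
}

/// Dequantize with the symmetric rule w = (q - 8) * scale.
fn dequantize_block(b: &Q4Block) -> [f32; BLOCK_LEN] {
    let mut out = [0f32; BLOCK_LEN];
    for j in 0..BLOCK_LEN / 2 {
        out[j] = ((b.qs[j] & 0x0F) as f32 - 8.0) * b.scale;
        out[j + BLOCK_LEN / 2] = ((b.qs[j] >> 4) as f32 - 8.0) * b.scale;
    }
    out
}

fn main() {
    // Source block holding the values 0..=15 twice, in the assumed source order.
    let mut src = [0u8; BLOCK_LEN / 2];
    for i in 0..BLOCK_LEN {
        src[i / 2] |= ((i % 16) as u8) << ((i % 2) * 4);
    }
    let block = repack_block(&src, 0.5);
    let w = dequantize_block(&block);
    for i in 0..BLOCK_LEN {
        assert_eq!(w[i], ((i % 16) as f32 - 8.0) * 0.5);
    }
    println!("first weights: {:?}", &w[..4]);
}
```

Because the nibbles are copied rather than recomputed from floats, the only place precision can be lost in a scheme like this is the conversion of the scale to f16.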

Validation

pack_prequantized round-trip + linalg block_quant suite green. ONNX round-trips: runtime weight
is Q4_0(...) not f32; eligible shapes near-lossless, ineligible bit-exact via fallback. Full
workspace build + blast-radius suite (linalg 3830 proptests) green; fmt + clippy clean.

Sources

Apple, On-Device Llama 3.1 with Core ML — data-free linear_symmetric / int4 / per_block / block_size=32 recipe.
llama.cpp / GGML — the Q4_0 block format.
Microsoft ONNX Runtime — com.microsoft.MatMulNBits (the imported int4 format).
tract's existing block_quant machinery (Q4_0, PackedBlockQuantFormat, einsum_matmul,
de_block_quant).

🤖 Generated with Claude Code

For block_size 32 + symmetric + K a multiple of 32, MatMulNBits now keeps the weight as a Q4_0 block-quant constant (dequantized inside the matmul packer) instead of materializing a full f32 weight, realizing the runtime memory saving. Q4_0::pack_prequantized builds the storage from the original int4 nibbles without re-quantizing, so only the f16 scale is rounded. Other block sizes / asymmetric / partial last block fall back to f32.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

czoli1976 · 2026-06-04T07:52:38Z

The CPU decode speedup grows with model size (tiny Qwen 1.10× → Phi-3 1.33×), because larger weight matrices are more memory-bound, so int4's reduced bandwidth helps more. The ~2× from the Apple research would still need a true int4-compute kernel or more bandwidth-bound hardware.

Metal already has an int4 kernel, and that's where the bigger win is. Two confirmations:

Weight stays Q4_0 on the device: input fact Nr 1: 512,512,F32 🔍 FromHost(... 🔍 Q4_0([512,512])) (non-plain storage) — the int4 weight is uploaded as Q4_0 and consumed by MetalGgmlGemm, never dequantized to f32.
Metal int4 = 0.488 ms vs f32 = 1.146 ms → 2.35× faster at K=N=4096.

This was referenced Jun 4, 2026

MatMulNBits / Q4_0: int4-compute CPU kernel — W4A8 (int8-dot) or W4A16 (f16-unpack)? #2341

Open

ONNX import: strict symbolic-shape unification rejects common LLM decode exports #2343

Open

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:feature/int4-matmulnbits-fused
