
onnx: run MatMulNBits int4 through the fused Q4_0 block-quant matmul #2340

Open
czoli1976 wants to merge 1 commit into sonos:main from czoli1976:feature/int4-matmulnbits-fused


@czoli1976

int4-quantized LLM weights now stay int4 in memory instead of being expanded to f32 at load:
linear layers are ~7.1× smaller than f32 and ~3.6× smaller than f16 (Mistral-7B linears 28 → 4 GB),
with near-lossless output. For block_size=32 + symmetric + K % 32 == 0, MatMulNBits keeps its weight
as a Q4_0 block-quant constant (dequantized inside the matmul packer), reusing the existing
block-quant path. Anything else falls back to the f32 weight, so nothing that worked before changes.
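To make the gate and the size claim concrete, here is a minimal sketch. The function names are hypothetical and are not taken from this PR. It assumes "symmetric" means the node has no zero_points input, and that a Q4_0 block stores a 2-byte f16 scale plus 32 four-bit values (18 bytes per 32 weights).

```rust
/// Number of weights per Q4_0 block.
const Q4_0_BLOCK_LEN: usize = 32;
/// Bytes per Q4_0 block: a 2-byte f16 scale plus 32 four-bit values (16 bytes).
const Q4_0_BLOCK_BYTES: usize = 2 + Q4_0_BLOCK_LEN / 2;

/// Hypothetical gate: true when the weight can stay as a Q4_0 constant.
/// Assumption: "symmetric" is modelled as "no zero_points input".
fn is_q4_0_eligible(bits: usize, block_size: usize, k: usize, has_zero_points: bool) -> bool {
    bits == 4 && block_size == Q4_0_BLOCK_LEN && !has_zero_points && k % Q4_0_BLOCK_LEN == 0
}

/// Storage in bytes for a K x N weight held as Q4_0 (K must be a multiple of 32).
fn q4_0_bytes(k: usize, n: usize) -> usize {
    (k / Q4_0_BLOCK_LEN) * n * Q4_0_BLOCK_BYTES
}
```

Under these assumptions a 4096 × 4096 weight takes 4096 · 4096 · 4 bytes as f32 and `q4_0_bytes(4096, 4096)` bytes as Q4_0, a ratio of 128 / 18 ≈ 7.1, in line with the figure quoted above; against f16 the ratio is 64 / 18 ≈ 3.6.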

Decode is modestly faster on CPU (1.1× at 0.5B → 1.33× at Phi-3-mini, M=1): the block-quant
packer dequantizes to f32 and runs the f32 kernel, so the win there is weight bandwidth, not int4
compute. On the Metal backend the same Q4_0 weight is what the existing kernel_mul_mv_q4_0_f32
int4 kernel consumes (not benched here).
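As an illustration of what "dequantizes to f32" means for a single block, the sketch below expands one Q4_0 block. It assumes the GGML layout (byte j holds element j in its low nibble and element j + 16 in its high nibble, with an implicit offset of 8) and takes the scale already converted from f16 to f32. The function name is hypothetical; this is not tract's packer.

```rust
/// Dequantize one Q4_0 block to 32 f32 values.
/// Assumed layout (GGML): low nibble of byte j is element j, high nibble is
/// element j + 16, and each 4-bit value is stored with an offset of 8.
fn dequant_q4_0_block(scale: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (j, &byte) in qs.iter().enumerate() {
        out[j] = ((byte & 0x0F) as i32 - 8) as f32 * scale;
        out[j + 16] = ((byte >> 4) as i32 - 8) as f32 * scale;
    }
    out
}
```

Because the arithmetic after this step is plain f32, the CPU gain comes from reading 18 bytes per block instead of 128, which matches the bandwidth explanation above.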

What changed

  • Q4_0::pack_prequantized (linalg): builds Q4_0 storage from the existing 4-bit values without
    re-quantizing — only the f16 scale is rounded, so the model's own int4 weights are preserved.
  • mat_mul_nbits.rs: the gated case emits the Q4_0 constant + a group-axis EinSum (mirroring
    de_block_quant); the general case is untouched.
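The "without re-quantizing" point can be sketched as a pure nibble shuffle. The example assumes the MatMulNBits blob stores values sequentially (element 2i in the low nibble of byte i, element 2i + 1 in the high nibble) and that the target is the GGML Q4_0 order (element j and element j + 16 share byte j). Both layouts and the function name are assumptions for illustration; the actual pack_prequantized may differ.

```rust
/// Repack 32 four-bit values from sequential nibble order into GGML Q4_0 order.
/// The 4-bit values themselves are copied unchanged; only their position moves.
fn repack_sequential_to_q4_0(src: &[u8; 16]) -> [u8; 16] {
    // Element i of the source block: byte i / 2, low nibble if i is even.
    let nibble = |i: usize| (src[i / 2] >> ((i % 2) * 4)) & 0x0F;
    let mut dst = [0u8; 16];
    for (j, d) in dst.iter_mut().enumerate() {
        // Low nibble carries element j, high nibble carries element j + 16.
        *d = nibble(j) | (nibble(j + 16) << 4);
    }
    dst
}
```

Since no value is rounded here, the only lossy step left in such a scheme would be converting the scale to f16, consistent with the description above.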

Validation

pack_prequantized round-trip + linalg block_quant suite green. ONNX round-trips: runtime weight
is Q4_0(...) not f32; eligible shapes near-lossless, ineligible bit-exact via fallback. Full
workspace build + blast-radius suite (linalg 3830 proptests) green; fmt + clippy clean.

Sources

  • Apple, On-Device Llama 3.1 with Core ML — data-free linear_symmetric / int4 / per_block / block_size=32 recipe.
  • llama.cpp / GGML — the Q4_0 block format.
  • Microsoft ONNX Runtime — com.microsoft.MatMulNBits (the imported int4 format).
  • tract's existing block_quant machinery (Q4_0, PackedBlockQuantFormat, einsum_matmul,
    de_block_quant).

🤖 Generated with Claude Code

For block_size 32 + symmetric + K a multiple of 32, MatMulNBits now keeps the weight
as a Q4_0 block-quant constant (dequantized inside the matmul packer) instead of
materializing a full f32 weight, realizing the runtime memory saving. Q4_0::pack_prequantized
builds the storage from the original int4 nibbles without re-quantizing, so only the f16
scale is rounded. Other block sizes / asymmetric / partial last block fall back to f32.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@czoli1976

czoli1976 commented Jun 4, 2026

The CPU decode speedup grows with model size (tiny Qwen 1.10× → Phi-3 1.33×), because larger weight matrices are more memory-bound, so int4's reduced bandwidth helps more. The ~2× from the Apple research would still need a true int4-compute kernel or more bandwidth-bound hardware.

Metal already has an int4 kernel, and that is where the bigger win is. Two confirmations:

  1. Weight stays Q4_0 on the device: input fact Nr 1: 512,512,F32 🔍 FromHost(... 🔍 Q4_0([512,512])) (non-plain storage) — the int4 weight is uploaded as Q4_0 and consumed by MetalGgmlGemm, never dequantized to f32.
  2. Metal int4 = 0.488 ms vs f32 = 1.146 ms → 2.35× faster at K=N=4096.
