onnx: run MatMulNBits int4 through the fused Q4_0 block-quant matmul#2340
Open
czoli1976 wants to merge 1 commit into
For block_size 32 + symmetric + K a multiple of 32, MatMulNBits now keeps the weight as a Q4_0 block-quant constant (dequantized inside the matmul packer) instead of materializing a full f32 weight, realizing the runtime memory saving. Q4_0.pack_prequantized builds the storage from the original int4 nibbles without re-quantizing, so only the f16 scale is rounded. Other block sizes / asymmetric / partial last block fall back to f32.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


int4-quantized LLM weights now stay int4 in memory instead of being expanded to f32 at load:
linear layers are ~7.1× smaller than f32 and ~3.6× smaller than f16 (Mistral-7B linears 28 → 4 GB),
near-lossless. For block_size=32 + symmetric + K % 32 == 0, MatMulNBits keeps its weight as a
Q4_0 block-quant constant (dequantized inside the matmul packer), reusing the existing
block-quant path. Anything else falls back to the f32 weight, so nothing that worked before changes.
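The gate and the size ratios above can be checked with a small sketch. It assumes the usual Q4_0 layout of one f16 scale plus 32 four-bit values per block (18 bytes per 32 weights). The function names and the `has_zero_points` flag (standing in for "asymmetric") are hypothetical and are not taken from mat_mul_nbits.rs.

```rust
/// Weights per Q4_0 block.
const Q4_0_BLOCK_LEN: usize = 32;
/// Assumed block size in bytes: one f16 scale (2 bytes) + 32 nibbles (16 bytes).
const Q4_0_BLOCK_BYTES: usize = 2 + Q4_0_BLOCK_LEN / 2;

/// Hypothetical gate mirroring the three conditions in the description.
/// `has_zero_points` stands in for "asymmetric" (an assumption of this sketch).
fn q4_0_eligible(bits: usize, block_size: usize, k: usize, has_zero_points: bool) -> bool {
    bits == 4 && block_size == Q4_0_BLOCK_LEN && !has_zero_points && k % Q4_0_BLOCK_LEN == 0
}

/// Storage in bytes for a K x N weight kept as Q4_0 blocks along K.
fn q4_0_bytes(k: usize, n: usize) -> usize {
    (k / Q4_0_BLOCK_LEN) * n * Q4_0_BLOCK_BYTES
}

fn main() {
    let (k, n) = (4096, 4096);
    assert!(q4_0_eligible(4, 32, k, false));
    assert!(!q4_0_eligible(4, 64, k, false)); // other block size: f32 fallback
    assert!(!q4_0_eligible(4, 32, 4100, false)); // partial last block: f32 fallback
    assert!(!q4_0_eligible(4, 32, k, true)); // asymmetric: f32 fallback

    let q4 = q4_0_bytes(k, n) as f64;
    let f32_bytes = (k * n * 4) as f64;
    let f16_bytes = (k * n * 2) as f64;
    // 4 bytes vs 18/32 bytes per weight gives about 7.11x; 2 bytes gives about 3.56x.
    println!("f32/Q4_0 = {:.2}, f16/Q4_0 = {:.2}", f32_bytes / q4, f16_bytes / q4);
}
```

The ratios do not depend on the shape: each weight costs 4.5 bits in Q4_0 against 32 bits in f32 or 16 bits in f16.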
Decode is modestly faster on CPU (1.1× at 0.5B → 1.33× at Phi-3-mini, M=1): the block-quant
packer dequantizes to f32 and runs the f32 kernel, so the win there is weight bandwidth, not int4
compute. On the Metal backend the same Q4_0 weight is what the existing
kernel_mul_mv_q4_0_f32 int4 kernel consumes (not benched here).
What changed
Q4_0::pack_prequantized (linalg): builds Q4_0 storage from the existing 4-bit values without re-quantizing. Only the f16 scale is rounded, so the model's own int4 weights are preserved.
mat_mul_nbits.rs: the gated case emits the Q4_0 constant + a group-axis EinSum (mirroring de_block_quant); the general case is untouched.
Validation
pack_prequantized round-trip + linalg block_quant suite green. ONNX round-trips: runtime weight is
Q4_0(...) not f32; eligible shapes near-lossless, ineligible bit-exact via fallback. Full
workspace build + blast-radius suite (linalg 3830 proptests) green; fmt + clippy clean.
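As an illustration of why repacking preserves the model's own int4 values, here is a minimal round-trip sketch. It is not the PR's pack_prequantized: the `Block` type, `repack_codes`, `unpack_codes` and `dequantize` are hypothetical names. The nibble order (byte j holds code j in the low nibble and code j + 16 in the high nibble) and the implicit zero point of 8 are assumptions based on the common Q4_0 convention. The scale is kept as f32 to stay within the standard library, whereas the real format stores it as f16, which is the only place rounding would occur.

```rust
const BLOCK: usize = 32;

/// One Q4_0-style block for this sketch (real Q4_0 stores `scale` as f16).
struct Block {
    scale: f32,
    qs: [u8; BLOCK / 2],
}

/// Repack 32 existing 4-bit codes (0..=15) without re-quantizing them.
/// Assumed order: byte j = code j (low nibble) | code j + 16 (high nibble).
fn repack_codes(codes: &[u8; BLOCK], scale: f32) -> Block {
    let mut qs = [0u8; BLOCK / 2];
    for j in 0..BLOCK / 2 {
        qs[j] = (codes[j] & 0x0F) | ((codes[j + BLOCK / 2] & 0x0F) << 4);
    }
    Block { scale, qs }
}

/// Inverse of `repack_codes`: recover the 32 codes in their original order.
fn unpack_codes(b: &Block) -> [u8; BLOCK] {
    let mut codes = [0u8; BLOCK];
    for j in 0..BLOCK / 2 {
        codes[j] = b.qs[j] & 0x0F;
        codes[j + BLOCK / 2] = b.qs[j] >> 4;
    }
    codes
}

/// Symmetric dequantization with an implicit zero point of 8.
fn dequantize(b: &Block) -> [f32; BLOCK] {
    let codes = unpack_codes(b);
    let mut out = [0f32; BLOCK];
    for i in 0..BLOCK {
        out[i] = (codes[i] as f32 - 8.0) * b.scale;
    }
    out
}

fn main() {
    let mut codes = [0u8; BLOCK];
    for (i, c) in codes.iter_mut().enumerate() {
        *c = (i % 16) as u8;
    }
    let block = repack_codes(&codes, 0.5);
    // The 4-bit codes survive the repack bit-for-bit.
    assert_eq!(unpack_codes(&block), codes);
    // Code 15 dequantizes to (15 - 8) * 0.5.
    assert_eq!(dequantize(&block)[15], 3.5);
}
```

Because the codes are copied rather than recomputed from dequantized values, any difference from the original model would come only from the scale's precision, which matches the "near-lossless" wording above.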
Sources
linear_symmetric / int4 / per_block / block_size=32 recipe.
Q4_0 block format.
com.microsoft.MatMulNBits (the imported int4 format).
block_quant machinery (Q4_0, PackedBlockQuantFormat, einsum_matmul, de_block_quant).

🤖 Generated with Claude Code