[OpBuilder] Support HTP MatMulNBits. by minfhong-qti · Pull Request #288 · onnxruntime/onnxruntime-qnn

minfhong-qti · 2026-04-22T09:48:15Z

Description

Extend the MatMulNBits op builder from GPU-only to HTP. On HTP, support is restricted to 2-bit/4-bit weights and block sizes 32/64. Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with the bitwidth set and quint8 as the tensor data type. QnnQuantParamsWrapper is therefore extended to accept encodings of type QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK. Per HTP request, MatMulNBits is transformed into Conv2d with the necessary Reshape ops around it, because the HTP implementation of MatMul could not be completed in time.
Test: unit tests, with the hardcoded utility functions extended for 2-bit support.

Motivation and Context

Enable the MSFT 2-bit Phi3 model.

qti-chuteng · 2026-05-08T13:58:51Z

[M-1] onnxruntime/test/providers/qnn/matmulnbits_test.cc:190-386
All 18 new QnnHTPBackendTests testcases call GTEST_SKIP() << "Skip this testcase before QAIRT 2.47." — the entire HTP path has zero CI coverage. The kill-test fence is broken: even with the entire HTP branch commented out, this PR would still pass 100% of tests. The v2 force-push rewrote the quant-data transform with an XOR-mask trick — precisely the kind of high-risk refactor that needs runnable tests.

Suggested fix:

Add at least 1–2 sanity tests that can actually run (e.g. bits=4 + block_size=32), or
Replace GTEST_SKIP with the gtest DISABLED_ prefix so they are visible to lint / dashboard tooling. Track "QAIRT 2.47 uplevel" as a TODO / ticket.

[M-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc:305-314
ProcessInputs / ProcessAttributesAndOutputs read bits / block_size with default 0 and immediately compute kByteBits / bits and (N * K) / block_size. Today QNN EP guarantees IsOpSupported runs first, so no actual SIGFPE; but this drops a previously-safe default (4 / 32) and any future "validate-only" path or refactor that exposes these methods will divide by zero.

Suggested fix:

RETURN_IF_NOT(bits > 0 && block_size > 0,
              "Internal error: bits/block_size must be set before ProcessInputs.");

or restore defaults to legal values (bits=4, block_size=32).
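
The guard suggested in [M-2] can be sketched in isolation as follows. This is a minimal illustration, not the builder's actual code: PackedLayout and ComputePackedLayout are hypothetical names, kByteBits is assumed to be 8, and a bool return plus an error string stands in for ORT's Status and RETURN_IF_NOT.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the values derived from bits / block_size.
struct PackedLayout {
  int64_t elems_per_byte;  // kByteBits / bits
  int64_t num_blocks;      // (N * K) / block_size
};

constexpr int64_t kByteBits = 8;  // Assumed value of the builder's constant.

// Returns false and fills `error` instead of dividing by zero when bits or
// block_size was left at its default of 0.
bool ComputePackedLayout(int64_t bits, int64_t block_size, int64_t N, int64_t K,
                         PackedLayout& layout, std::string& error) {
  if (bits <= 0 || block_size <= 0) {
    error = "Internal error: bits/block_size must be set before ProcessInputs.";
    return false;
  }
  layout.elems_per_byte = kByteBits / bits;
  layout.num_blocks = (N * K) / block_size;
  return true;
}
```

With bits=4, block_size=32, N=8, K=64 this yields 2 elements per byte and 16 blocks; with either attribute left at 0 it reports an error rather than trapping.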

[M-3] onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc:580
HTP branch reads output_info.shape[0..2] without verifying rank=3 in ProcessAttributesAndOutputs. IsOpSupported only validates input A's rank, not the output's. Same pattern at line 327 for reshape_output_shape. ONNX spec implies output rank == input rank, but this cross-function inference is implicit.

Suggested fix:

RETURN_IF(is_htp_backend && output_info.shape.size() != 3,
          "Unsupported output rank, expecting 3D for HTP backend.");

And add a static-shape guard in IsOpSupported so dynamic-shape input A doesn't reach the HTP path.
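
A rough sketch of the two checks in [M-3] follows. It assumes a hypothetical helper ValidateHtpShape, and it assumes that a dynamic dimension shows up as a non-positive value in the shape vector, which may not match how the QNN EP represents it. An empty string means the shape is acceptable.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: returns an empty string when `shape` is usable on the
// given backend, otherwise an error message.
std::string ValidateHtpShape(bool is_htp_backend,
                             const std::vector<int64_t>& shape) {
  if (!is_htp_backend) {
    return "";  // This sketch only constrains the HTP path.
  }
  if (shape.size() != 3) {
    return "Unsupported output rank, expecting 3D for HTP backend.";
  }
  for (int64_t dim : shape) {
    // ASSUMPTION: dynamic dimensions are encoded as values <= 0.
    if (dim <= 0) {
      return "Dynamic shape is not supported for HTP backend.";
    }
  }
  return "";
}
```

Running the same check on input A in IsOpSupported and on the output in ProcessAttributesAndOutputs would make the rank requirement explicit in both places instead of relying on cross-function inference.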

[M-4] onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc:17-81
Doc-comment header is inconsistent with the implementation:

bits : 2(HTP), 4(GPU/HTP) — missing bits=8 which kHtpSupportedBitsAndBlockSizeMultipliers clearly supports.
block_size : 32(GPU.HTP), 64(HTP) — period should be slash; per-bits multiplier rule (16 / 8 / 4) is not captured.
scales : ... fp/16fp32 ... — token order mangled (should be fp16/fp32).
GPU section (lines 40, 41): [N * K / block_size)] has an unbalanced bracket.
HTP section omits the bias input even though IsOpSupported (line 286) still rejects bias.

Minor

[N-1] matmulnbits_op_builder.cc:162,301
bool is_htp_backend = IsNpuBackend(...) — variable name conflates HTP with NPU. NPU == HTP today is a coincidence; if a future SDK adds another NPU backend, this code will silently mis-treat it as HTP. Either rename to is_npu_backend (matching conv_op_builder.cc) or add a // QNN currently only has HTP as NPU backend comment.

[N-2] matmulnbits_op_builder.cc:583

QNN_DATATYPE_FLOAT_16,  // Explicitly override to float16.

Comment doesn't explain why — the rationale (HTP Conv2d FP kernel only supports fp16, hence the pre-Cast(fp16) and post-Cast(fp32) workaround) should be stated here so a future reader knows when to remove the override.

Suggested fix:

QNN_DATATYPE_FLOAT_16, // HTP Conv2d FP kernel only supports fp16; cf. pre-Cast(fp16) above and post-Cast(fp32) below for fp32 graphs.

[N-3] matmulnbits_op_builder.cc:114
kHtpSupportedBitsAndBlockSizeMultipliers{{2, 16}, {4, 8}, {8, 4}} — value semantics ("multiplier") are unclear without reading IsOpSupported. Add a doc comment, or split into kHtpSupportedBits + kHtpMinBlockBytes.

[N-4] matmulnbits_op_builder.cc:194
// TODO: Float16 DLC serialization failing. — outdated. This PR does enable fp16 input support (lines 205-206). Either delete the TODO or rephrase as "Validate fp16 DLC serialization end-to-end after QAIRT 2.47 uplevel".

[N-5] qnn_utils.h:503-534
New TwoDimensionTranspose<T>(const std::vector<T>&, ...) overload shares the name with the existing TwoDimensionTranspose(const QnnModelWrapper&, ...) despite very different APIs. Rename to TwoDimensionTransposeBuffer (or similar) and use gsl::span<const T> / gsl::span<const uint32_t> for view parameters per checklist [B2].

[N-6] qnn_utils.h:528
std::memcpy(&transposed_data[dst_index], &data[src_index], sizeof(T)); is unnecessary for T = uint8_t and would be UB if the template were ever instantiated with a non-trivial T. Use transposed_data[dst_index] = data[src_index]; or add static_assert(std::is_trivially_copyable_v<T>).
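
For illustration, a buffer transpose using plain element assignment, as suggested in [N-6], might look like the sketch below. The name TwoDimensionTransposeBuffer follows the rename proposed in [N-5]; the parameter list (rows, cols) and the row-major layout are assumptions, since the real signature is not shown here, and std::vector is used in place of gsl::span to keep the example free of external dependencies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: transposes a row-major rows x cols buffer into a
// row-major cols x rows buffer. Returns an empty vector on size mismatch.
template <typename T>
std::vector<T> TwoDimensionTransposeBuffer(const std::vector<T>& data,
                                           size_t rows, size_t cols) {
  if (data.size() != rows * cols) {
    return {};
  }
  std::vector<T> transposed_data(data.size());
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c) {
      const size_t src_index = r * cols + c;
      const size_t dst_index = c * rows + r;
      // Plain assignment is well-defined for any copy-assignable T,
      // so no memcpy and no trivially-copyable requirement is needed.
      transposed_data[dst_index] = data[src_index];
    }
  }
  return transposed_data;
}
```

If the memcpy form is kept instead, a static_assert on std::is_trivially_copyable_v<T> would serve the same purpose of ruling out undefined behavior.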

[N-7] matmulnbits_op_builder.cc:184-286
Trailing-period style is inconsistent across RETURN_IF / RETURN_IF_NOT messages — e.g. line 184 has a period, lines 187 / 224 don't, line 206 does. Standardize on trailing periods.

Extend the MatMulNBits op builder from GPU-only to HTP. On HTP, support is restricted to 2-bit/4-bit weights and block sizes 32/64. Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with the bitwidth set and quint8 as the tensor data type. QnnQuantParamsWrapper is therefore extended to accept encodings of type QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK. Per HTP request, MatMulNBits is transformed into Conv2d with the necessary Reshape and Cast ops around it, because the HTP implementation of MatMul could not be completed in time.
Test: unit tests, with the hardcoded utility functions extended for 2/4/8-bit.
TODO: Re-enable the test cases once QAIRT is upleveled to 2.47.

minfhong-qti · 2026-05-12T07:38:57Z

Most comments are applied.

huaychou

LGTM. Could you please check whether #307 will impact MatMulNBits support?

minfhong-qti · 2026-05-13T05:30:14Z

I don't think it will have an impact: that PR handles separate Q/DQ nodes, whereas MatMulNBits inherently takes scales/offsets as direct inputs.

minfhong-qti force-pushed the dev/minfhong/matmulnbits-htp branch 4 times, most recently from 9544c74 to ad8326e Compare April 29, 2026 10:32

minfhong-qti force-pushed the dev/minfhong/matmulnbits-htp branch from ad8326e to 28c530b Compare April 30, 2026 02:19

minfhong-qti marked this pull request as ready for review April 30, 2026 08:10

minfhong-qti requested review from qti-ashwshan, qti-chuteng, qti-jkilpatrick, qti-kromero, qti-yuduo, tirupath-qti and yath1 as code owners April 30, 2026 08:10

minfhong-qti requested review from huaychou and qti-hungjuiw April 30, 2026 08:11