[OpBuilder] Support HTP MatMulNBits. #288

Merged
qti-chuteng merged 1 commit into main from dev/minfhong/matmulnbits-htp on May 14, 2026

Conversation

@minfhong-qti

Collaborator

Description

Extend the MatMulNBits op builder from GPU-only to HTP. HTP support is restricted to 2-bit/4-bit weights and block sizes 32/64. Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with the bitwidth set and quint8 as the tensor type. QnnQuantParamsWrapper is therefore extended to accept encodings of type QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK. Per the HTP team's request, MatMulNBits is transformed into Conv2d with the necessary Reshape nodes around it, because the HTP implementation for MatMul could not be completed in time.
Test: unit tests, with the hard-coded test utility functions extended for 2-bit support.
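
To illustrate why a MatMul can be expressed as Conv2d with Reshape around it, here is a minimal sketch in plain C++ (it is not QNN API code and does not reproduce the op builder). It assumes an NHWC activation layout and an HWIO filter layout, which may differ from what the op builder actually emits; the names `MatMulRef`, `Conv1x1Nhwc`, and `MatMulViaConv` are hypothetical. Viewing the activations [M, K] as [1, 1, M, K] and the weights [K, N] as a 1x1 filter [1, 1, K, N] yields the same result as the MatMul.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference MatMul: a is [M, K], b is [K, N], both row-major. Returns [M, N].
std::vector<float> MatMulRef(const std::vector<float>& a, const std::vector<float>& b,
                             std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> out(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t p = 0; p < k; ++p)
        out[i * n + j] += a[i * k + p] * b[p * n + j];
  return out;
}

// 1x1 convolution with stride 1 and no padding.
// input: NHWC [1, h, w, c_in]; filter: HWIO [1, 1, c_in, c_out]; output: NHWC [1, h, w, c_out].
std::vector<float> Conv1x1Nhwc(const std::vector<float>& input, const std::vector<float>& filter,
                               std::size_t h, std::size_t w, std::size_t c_in, std::size_t c_out) {
  assert(input.size() == h * w * c_in && filter.size() == c_in * c_out);
  std::vector<float> out(h * w * c_out, 0.0f);
  for (std::size_t y = 0; y < h; ++y)
    for (std::size_t x = 0; x < w; ++x)
      for (std::size_t o = 0; o < c_out; ++o)
        for (std::size_t c = 0; c < c_in; ++c)
          out[(y * w + x) * c_out + o] += input[(y * w + x) * c_in + c] * filter[c * c_out + o];
  return out;
}

// Reshaping [M, K] to [1, 1, M, K] does not move any data in a row-major buffer,
// so the same buffers are passed to the convolution with h = 1, w = M, c_in = K.
std::vector<float> MatMulViaConv(const std::vector<float>& a, const std::vector<float>& b,
                                 std::size_t m, std::size_t k, std::size_t n) {
  return Conv1x1Nhwc(a, b, /*h=*/1, /*w=*/m, /*c_in=*/k, /*c_out=*/n);
}
```

In this sketch the Reshape steps are free because they only reinterpret the buffer; the quantized weight handling (BwFloatBlockEncoding, scales, offsets) is deliberately left out.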

Motivation and Context

Enable the MSFT 2-bit Phi3 model.

@minfhong-qti minfhong-qti force-pushed the dev/minfhong/matmulnbits-htp branch 4 times, most recently from 9544c74 to ad8326e Compare April 29, 2026 10:32
@minfhong-qti minfhong-qti force-pushed the dev/minfhong/matmulnbits-htp branch from ad8326e to 28c530b Compare April 30, 2026 02:19
@minfhong-qti minfhong-qti marked this pull request as ready for review April 30, 2026 08:10
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
Comment thread onnxruntime/test/providers/qnn/matmulnbits_test.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
Comment thread onnxruntime/test/unittest_util/qdq_test_utils.h Outdated
@minfhong-qti minfhong-qti force-pushed the dev/minfhong/matmulnbits-htp branch from e5c53fa to 58d46e3 Compare May 8, 2026 05:34
@minfhong-qti minfhong-qti requested a review from qti-hungjuiw May 8, 2026 06:04
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc Outdated
@qti-chuteng

Collaborator

[M-1] onnxruntime/test/providers/qnn/matmulnbits_test.cc:190-386
All 18 new QnnHTPBackendTests testcases call GTEST_SKIP() << "Skip this testcase before QAIRT 2.47." — the entire HTP path has zero CI coverage. The kill-test fence is broken: even with the entire HTP branch commented out, this PR would still pass 100% of tests. The v2 force-push rewrote the quant-data transform with an XOR-mask trick — precisely the kind of high-risk refactor that needs runnable tests.

Suggested fix:

  • Add at least 1–2 sanity tests that can actually run (e.g. bits=4 + block_size=32), or
  • Replace GTEST_SKIP with the gtest DISABLED_ prefix so they are visible to lint / dashboard tooling. Track "QAIRT 2.47 uplevel" as a TODO / ticket.
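
As one possible shape for a runnable sanity check, the sketch below exhaustively compares an XOR-mask conversion of packed 4-bit values against a straightforward per-nibble reference. The actual transform in the PR is not shown in this conversation, so everything here is an assumption: the helper names are hypothetical, and the transform is assumed to map unsigned nibbles with an implicit zero point of 8 to two's-complement signed nibbles via the mask 0x88.

```cpp
#include <cassert>
#include <cstdint>

// Assumed transform: flip the top bit of each nibble in a packed byte.
// For a 4-bit value v in [0, 15], (v ^ 0x8) is the two's-complement
// 4-bit encoding of (v - 8).
inline std::uint8_t XorMaskTransform(std::uint8_t packed) {
  return static_cast<std::uint8_t>(packed ^ 0x88u);
}

// Reference: subtract 8 from each nibble independently, then re-pack.
inline std::uint8_t ReferenceTransform(std::uint8_t packed) {
  const int lo = (packed & 0x0F) - 8;         // in [-8, 7]
  const int hi = ((packed >> 4) & 0x0F) - 8;  // in [-8, 7]
  // Conversion to unsigned is modulo 2^N, so the low 4 bits are the
  // two's-complement encoding of the signed value.
  const unsigned lo_bits = static_cast<unsigned>(lo) & 0x0Fu;
  const unsigned hi_bits = static_cast<unsigned>(hi) & 0x0Fu;
  return static_cast<std::uint8_t>((hi_bits << 4) | lo_bits);
}

// Returns true if both transforms agree on every possible packed byte.
inline bool TransformsAgreeOnAllBytes() {
  for (int b = 0; b < 256; ++b) {
    const auto byte = static_cast<std::uint8_t>(b);
    if (XorMaskTransform(byte) != ReferenceTransform(byte)) return false;
  }
  return true;
}
```

A check of this kind exercises only host-side data preparation, so it could plausibly run in CI without waiting for the QAIRT uplevel.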

[M-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc:305-314
ProcessInputs / ProcessAttributesAndOutputs read bits / block_size with default 0 and immediately compute kByteBits / bits and (N * K) / block_size. Today QNN EP guarantees IsOpSupported runs first, so no actual SIGFPE; but this drops a previously-safe default (4 / 32) and any future "validate-only" path or refactor that exposes these methods will divide by zero.

Suggested fix:

RETURN_IF_NOT(bits > 0 && block_size > 0,
              "Internal error: bits/block_size must be set before ProcessInputs.");

or restore defaults to legal values (bits=4, block_size=32).
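
The guard can also be sketched outside the ORT macros. The following stand-alone example validates the attributes before any division; `PackedLayout` and `ComputePackedLayout` are hypothetical names for illustration, and the upper-bound check on bits is an extra assumption not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr std::uint32_t kByteBits = 8;

struct PackedLayout {
  std::uint32_t elements_per_byte;  // kByteBits / bits
  std::uint64_t num_blocks;         // (N * K) / block_size
};

// Returns std::nullopt instead of dividing by zero when the attributes are
// unset (0) or otherwise unusable.
inline std::optional<PackedLayout> ComputePackedLayout(std::uint32_t bits, std::uint32_t block_size,
                                                       std::uint64_t n, std::uint64_t k) {
  // Assumption for this sketch: bits must be in [1, 8] so that at least one
  // element fits in a byte.
  if (bits == 0 || bits > kByteBits) return std::nullopt;
  if (block_size == 0) return std::nullopt;
  return PackedLayout{kByteBits / bits, (n * k) / block_size};
}
```

With bits=4, block_size=32, N=8, and K=64 this yields 2 elements per byte and 16 blocks, while the all-zero defaults are rejected rather than triggering a division by zero.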


[M-3] onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc:580
HTP branch reads output_info.shape[0..2] without verifying rank=3 in ProcessAttributesAndOutputs. IsOpSupported only validates input A's rank, not the output's. Same pattern at line 327 for reshape_output_shape. ONNX spec implies output rank == input rank, but this cross-function inference is implicit.

Suggested fix:

RETURN_IF(is_htp_backend && output_info.shape.size() != 3,
          "Unsupported output rank, expecting 3D for HTP backend.");

And add a static-shape guard in IsOpSupported so dynamic-shape input A doesn't reach the HTP path.
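
A combined rank and static-shape check might look like the sketch below. It is independent of the ORT types: `IsStaticShape` and `ValidateHtpShape` are hypothetical helpers, and the convention that a negative dimension marks a dynamic extent is an assumption made for this example only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumed convention for this sketch: a negative dimension marks a dynamic extent.
inline bool IsStaticShape(const std::vector<std::int64_t>& shape) {
  for (std::int64_t dim : shape) {
    if (dim < 0) return false;
  }
  return true;
}

// Returns an empty string on success, otherwise an error message.
inline std::string ValidateHtpShape(const std::vector<std::int64_t>& shape, const std::string& name) {
  if (shape.size() != 3) {
    return "Unsupported " + name + " rank, expecting 3D for HTP backend.";
  }
  if (!IsStaticShape(shape)) {
    return "Dynamic " + name + " shape is not supported for HTP backend.";
  }
  return {};
}
```

Applying the same helper to both input A and the output makes the rank assumption explicit in each function instead of relying on inference across functions.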


[M-4] onnxruntime/core/providers/qnn/builder/opbuilder/matmulnbits_op_builder.cc:17-81
Doc-comment header is inconsistent with the implementation:

  • bits : 2(HTP), 4(GPU/HTP) — missing bits=8 which kHtpSupportedBitsAndBlockSizeMultipliers clearly supports.
  • block_size : 32(GPU.HTP), 64(HTP) — period should be slash; per-bits multiplier rule (16 / 8 / 4) is not captured.
  • scales : ... fp/16fp32 ... — token order mangled (should be fp16/fp32).
  • GPU section (lines 40, 41): [N * K / block_size)] has an unbalanced bracket.
  • HTP section omits the bias input even though IsOpSupported (line 286) still rejects bias.

Minor

[N-1] matmulnbits_op_builder.cc:162,301
bool is_htp_backend = IsNpuBackend(...) — variable name conflates HTP with NPU. NPU == HTP today is a coincidence; if a future SDK adds another NPU backend, this code will silently mis-treat it as HTP. Either rename to is_npu_backend (matching conv_op_builder.cc) or add a // QNN currently only has HTP as NPU backend comment.


[N-2] matmulnbits_op_builder.cc:583

QNN_DATATYPE_FLOAT_16,  // Explicitly override to float16.

Comment doesn't explain why — the rationale (HTP Conv2d FP kernel only supports fp16, hence the pre-Cast(fp16) and post-Cast(fp32) workaround) should be stated here so a future reader knows when to remove the override.

Suggested fix:

QNN_DATATYPE_FLOAT_16, // HTP Conv2d FP kernel only supports fp16; cf. pre-Cast(fp16) above and post-Cast(fp32) below for fp32 graphs.

[N-3] matmulnbits_op_builder.cc:114
kHtpSupportedBitsAndBlockSizeMultipliers{{2, 16}, {4, 8}, {8, 4}} — value semantics ("multiplier") are unclear without reading IsOpSupported. Add a doc comment, or split into kHtpSupportedBits + kHtpMinBlockBytes.
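
One way to make the table self-describing is to give each entry named fields and a doc comment, as in the sketch below. The struct `HtpBitsEntry` and the lookup `FindBlockSizeMultiplier` are hypothetical. The only property asserted is one that can be read off the quoted values (bits times multiplier equals 32 in every entry); how IsOpSupported actually uses the multiplier is not shown in this conversation and is not modeled here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

struct HtpBitsEntry {
  std::uint32_t bits;                   // Quantized weight bit width.
  std::uint32_t block_size_multiplier;  // Value paired with this bit width in the table.
};

// Values copied from the review comment above.
constexpr std::array<HtpBitsEntry, 3> kHtpSupportedBitsAndBlockSizeMultipliers{{{2, 16}, {4, 8}, {8, 4}}};

// Observation from the quoted values, not a documented rule: every entry
// satisfies bits * multiplier == 32.
constexpr bool AllEntriesSpan32Bits() {
  for (const auto& entry : kHtpSupportedBitsAndBlockSizeMultipliers) {
    if (entry.bits * entry.block_size_multiplier != 32) return false;
  }
  return true;
}
static_assert(AllEntriesSpan32Bits(), "Each entry is expected to cover 32 bits.");

// Returns the multiplier for a supported bit width, or std::nullopt otherwise.
inline std::optional<std::uint32_t> FindBlockSizeMultiplier(std::uint32_t bits) {
  for (const auto& entry : kHtpSupportedBitsAndBlockSizeMultipliers) {
    if (entry.bits == bits) return entry.block_size_multiplier;
  }
  return std::nullopt;
}
```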


[N-4] matmulnbits_op_builder.cc:194
// TODO: Float16 DLC serialization failing. — outdated. This PR does enable fp16 input support (line 205-206). Either delete the TODO or rephrase as "Validate fp16 DLC serialization end-to-end after QAIRT 2.47 uplevel".


[N-5] qnn_utils.h:503-534
New TwoDimensionTranspose<T>(const std::vector<T>&, ...) overload shares the name with the existing TwoDimensionTranspose(const QnnModelWrapper&, ...) despite very different APIs. Rename to TwoDimensionTransposeBuffer (or similar) and use gsl::span<const T> / gsl::span<const uint32_t> for view parameters per checklist [B2].


[N-6] qnn_utils.h:528
std::memcpy(&transposed_data[dst_index], &data[src_index], sizeof(T)); is unnecessary for T = uint8_t and would be UB if the template were ever instantiated with a non-trivial T. Use transposed_data[dst_index] = data[src_index]; or add static_assert(std::is_trivially_copyable_v<T>).
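
A sketch of the buffer overload with both suggestions applied (plain assignment plus a static_assert) is shown below. The name `TwoDimensionTransposeBuffer` comes from the rename proposed in [N-5]; the signature is an assumption, and `std::vector` is used instead of `gsl::span` only to keep the example limited to the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Transposes a row-major [rows, cols] buffer into a row-major [cols, rows] buffer.
template <typename T>
std::vector<T> TwoDimensionTransposeBuffer(const std::vector<T>& data, std::size_t rows, std::size_t cols) {
  static_assert(std::is_trivially_copyable_v<T>,
                "TwoDimensionTransposeBuffer is intended for plain element types.");
  assert(data.size() == rows * cols);
  std::vector<T> transposed(data.size());
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      // Plain assignment; no memcpy is needed to copy a single element.
      transposed[c * rows + r] = data[r * cols + c];
    }
  }
  return transposed;
}
```

Plain assignment is already correct for any copy-assignable T, so the static_assert here mainly documents the intended element types.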


[N-7] matmulnbits_op_builder.cc:184-286
Trailing-period style is inconsistent across RETURN_IF / RETURN_IF_NOT messages — e.g. line 184 has a period, line 187 / 224 don't, line 206 does. Standardize on trailing periods.

Extend MatMulNBits op builder from GPU only to HTP. Restrict the support
in 2bits/4bits and block size 32/64 for HTP. Unlike GPU using
BlockEncoding, HTP requires using BwFloatBlockEncoding with bitwidth
set and quint8 for tensor. Thus, extend QnnQuantParamsWrapper to accept
setting encodings for QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK type.
Per HTP request, MatMulNBits is transformed into Conv2d with necessary
Reshape and Cast around due to HTP timeline not able to complete the
implementation for MatMul in time.
Test: UT while extending hardcoding utility functions for 2/4/8bit.
TODO: Re-enable testcases once QAIRT is upleveled to 2.47.
@minfhong-qti minfhong-qti force-pushed the dev/minfhong/matmulnbits-htp branch from 58d46e3 to 6abd0e1 Compare May 12, 2026 07:38
@minfhong-qti

Collaborator Author

Most comments are applied.

@huaychou huaychou left a comment

Collaborator

LGTM. Could you please check whether #307 will impact MatMulNBits support?

@minfhong-qti

Collaborator Author

LGTM. Could you please check if #307 will impact MatmulNBits support?

I don't think it will have an impact: that PR handles separate Q/DQ nodes, whereas MatMulNBits inherently takes scales/offsets as direct inputs.

@qti-chuteng qti-chuteng merged commit 4623658 into main May 14, 2026
42 checks passed
@qti-chuteng qti-chuteng deleted the dev/minfhong/matmulnbits-htp branch May 14, 2026 10:02
qti-mbadnara pushed a commit that referenced this pull request May 14, 2026
Extend MatMulNBits op builder from GPU only to HTP. Restrict the support
in 2bits/4bits and block size 32/64 for HTP. Unlike GPU using
BlockEncoding, HTP requires using BwFloatBlockEncoding with bitwidth
set and quint8 for tensor. Thus, extend QnnQuantParamsWrapper to accept
setting encodings for QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK type.
Per HTP request, MatMulNBits is transformed into Conv2d with necessary
Reshape and Cast around due to HTP timeline not able to complete the
implementation for MatMul in time.
Test: UT while extending hardcoding utility functions for 2/4/8bit.
TODO: Re-enable testcases once QAIRT is upleveled to v2.47.
qti-mbadnara pushed a commit that referenced this pull request May 20, 2026
qti-mbadnara pushed a commit that referenced this pull request May 20, 2026
4 participants