[OpBuilder] Support HTP MatMulNBits. #288
[M-1] Suggested fix:
[M-2] Suggested fix:
RETURN_IF_NOT(bits > 0 && block_size > 0,
              "Internal error: bits/block_size must be set before ProcessInputs.");
or restore defaults to legal values (
[M-3] Suggested fix:
RETURN_IF(is_htp_backend && output_info.shape.size() != 3,
          "Unsupported output rank, expecting 3D for HTP backend.");
And add a static-shape guard in
[M-4]
Minor
[N-1]
[N-2] QNN_DATATYPE_FLOAT_16, // Explicitly override to float16.
The comment doesn't explain why. The rationale (HTP Conv2d FP kernel only supports fp16, hence the pre-Cast(fp16) and post-Cast(fp32) workaround) should be stated here so a future reader knows when to remove the override. Suggested fix:
QNN_DATATYPE_FLOAT_16, // HTP Conv2d FP kernel only supports fp16; cf. pre-Cast(fp16) above and post-Cast(fp32) below for fp32 graphs.
[N-3] [N-4] [N-5] [N-6] [N-7]
Extend the MatMulNBits op builder from GPU-only to HTP. On HTP, support is restricted to 2-bit and 4-bit weights with block size 32 or 64. Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with the bitwidth set and quint8 as the tensor type. QnnQuantParamsWrapper is therefore extended to accept encodings of type QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK. Per HTP request, MatMulNBits is transformed into Conv2d with the necessary Reshape and Cast ops around it, because the HTP implementation for MatMul could not be completed in time. Test: unit tests, with the hardcoded utility functions extended for 2/4/8-bit. TODO: Re-enable the test cases once QAIRT is upgraded to 2.47.
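For readers unfamiliar with why a MatMul can be expressed as Conv2d plus Reshape, the following is a minimal sketch of the underlying equivalence, not the code from this PR. It assumes a 1x1 convolution with stride 1 and no padding, an NHWC input of shape [1, 1, M, K] and HWIO weights of shape [1, 1, K, N]; the function names `MatMul` and `Conv2d1x1` are hypothetical, and the layout the op builder actually uses may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference row-major MatMul: A is [M, K], B is [K, N], result is [M, N].
std::vector<float> MatMul(const std::vector<float>& a, const std::vector<float>& b,
                          size_t m, size_t k, size_t n) {
  std::vector<float> out(m * n, 0.0f);
  for (size_t i = 0; i < m; ++i)
    for (size_t p = 0; p < k; ++p)
      for (size_t j = 0; j < n; ++j)
        out[i * n + j] += a[i * k + p] * b[p * n + j];
  return out;
}

// 1x1 Conv2d with stride 1 and no padding (assumed layout, see lead-in).
// Input is NHWC [1, H, W, C_in]; weights are HWIO [1, 1, C_in, C_out].
std::vector<float> Conv2d1x1(const std::vector<float>& input,
                             const std::vector<float>& weight,
                             size_t h, size_t w, size_t c_in, size_t c_out) {
  std::vector<float> out(h * w * c_out, 0.0f);
  for (size_t y = 0; y < h; ++y)
    for (size_t x = 0; x < w; ++x)
      for (size_t ci = 0; ci < c_in; ++ci)
        for (size_t co = 0; co < c_out; ++co)
          out[(y * w + x) * c_out + co] +=
              input[(y * w + x) * c_in + ci] * weight[ci * c_out + co];
  return out;
}
```

Reshaping A from [M, K] to [1, 1, M, K] and B from [K, N] to [1, 1, K, N] changes no bytes in row-major storage, so calling `Conv2d1x1(a, b, 1, m, k, n)` yields the same buffer as `MatMul(a, b, m, k, n)`; only the shape metadata differs, which is what the surrounding Reshape ops provide.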
Most comments are applied.
I don't think it will have an impact: that PR handles separate Q/DQ nodes, whereas MatMulNBits inherently takes scales/offsets as direct inputs.
Description
Extend the MatMulNBits op builder from GPU-only to HTP. On HTP, support is restricted to 2-bit and 4-bit weights with block size 32 or 64. Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with the bitwidth set and quint8 as the tensor type. QnnQuantParamsWrapper is therefore extended to accept encodings of type QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK. Per HTP request, MatMulNBits is transformed into Conv2d with the necessary Reshape ops around it, because the HTP implementation for MatMul could not be completed in time.
Test: unit tests, with the hardcoded utility functions extended for 2-bit support.
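As an illustration of the kind of helper a 2/4/8-bit test utility needs, here is a minimal sketch of packing and unpacking sub-byte values. The names `PackNBits` and `UnpackNBits` are hypothetical and are not taken from this PR, and the lowest-bits-first ordering within each byte is an assumption; the packing convention used by the real test utilities should be checked against the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs unsigned `bits`-wide values (bits must be 2, 4 or 8) into bytes.
// Assumption: the first value occupies the lowest bits of each byte.
std::vector<uint8_t> PackNBits(const std::vector<uint8_t>& values, int bits) {
  assert(bits == 2 || bits == 4 || bits == 8);
  const size_t per_byte = 8 / static_cast<size_t>(bits);
  const unsigned mask = (1u << bits) - 1u;
  std::vector<uint8_t> packed((values.size() + per_byte - 1) / per_byte, 0);
  for (size_t i = 0; i < values.size(); ++i) {
    const unsigned shift = static_cast<unsigned>(i % per_byte) * static_cast<unsigned>(bits);
    packed[i / per_byte] |= static_cast<uint8_t>((values[i] & mask) << shift);
  }
  return packed;
}

// Inverse of PackNBits: extracts `count` values of width `bits` from `packed`.
std::vector<uint8_t> UnpackNBits(const std::vector<uint8_t>& packed, int bits, size_t count) {
  assert(bits == 2 || bits == 4 || bits == 8);
  const size_t per_byte = 8 / static_cast<size_t>(bits);
  assert(count <= packed.size() * per_byte);
  const unsigned mask = (1u << bits) - 1u;
  std::vector<uint8_t> values(count, 0);
  for (size_t i = 0; i < count; ++i) {
    const unsigned shift = static_cast<unsigned>(i % per_byte) * static_cast<unsigned>(bits);
    values[i] = static_cast<uint8_t>((packed[i / per_byte] >> shift) & mask);
  }
  return values;
}
```

With this ordering, four 2-bit values fit in one byte and two 4-bit values fit in one byte; a trailing partial byte is zero-padded in its unused high bits.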
Motivation and Context
Enable the MSFT 2-bit Phi3 model.