[QNN EP] 2.3.0 RC2 Changes #386
Closed
qti-mbadnara wants to merge 13 commits into
* Re-enable the failed UT
…ub.com/onnxruntime/onnxruntime-qnn into dev/qti-mbadnara/rel-2.3.0-rc2-changes
* [QNN EP] Fuse Dynamic MatMulInteger pattern into Float QNN MatMul
Adds DQMatMulIntegerFusion, a new IQnnNodeGroup that recognizes the
ONNX dynamic-quantization MatMul pattern emitted by tooling like
onnxruntime quantization (QDQ for activations, integer weights) and
folds it into a single QNN float MatMul, avoiding the int8/uint8
MatMulInteger op which QNN does not natively support.
Pattern matched (ONNX, starting at MatMulInteger):
x --> DynamicQuantizeLinear --> (a_q, a_scale, a_zp)
a_q, B, a_zp, B_zp --> MatMulInteger --> Cast(FLOAT)
a_scale, B_scale_init --> parallel Mul
Cast.out, parallel_Mul.out --> requant Mul
requant_Mul.out, bias_init --> Add (optional)
Rewrite (QNN):
x ---------------------------------+
| (input[0] of MatMul)
v
B --> [Dequantize(B_scale,B_zp)] --> MatMul --> [Add(bias)] --> out
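To make the equivalence behind this rewrite concrete, here is a minimal pure-Python sketch that runs both paths on a tiny example. It is not the EP's implementation: the helper names, the sample values, and the per-tensor `B_scale`/`B_zp` are assumptions chosen for illustration, and the quantizer follows the ONNX DynamicQuantizeLinear (uint8) definition. Because the fused path skips activation quantization, the two results agree only to within quantization error.

```python
def dynamic_quantize_linear(values):
    """uint8 dynamic quantization of a flat list (assumes not all zeros)."""
    lo = min(0.0, min(values))
    hi = max(0.0, max(values))
    scale = (hi - lo) / 255.0
    zero_point = int(min(255, max(0, round(-lo / scale))))
    quantized = [int(min(255, max(0, round(v / scale) + zero_point))) for v in values]
    return quantized, scale, zero_point

def matmul(a, b):
    """Plain matrix product on lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Illustrative inputs (assumed, not taken from the PR).
x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
B_q = [[10, -3], [-7, 20], [4, 5]]   # int8 weight
B_scale, B_zp = 0.02, 0              # per-tensor weight quantization

# ONNX path: DynamicQuantizeLinear -> MatMulInteger -> Cast -> requant Mul.
flat = [v for row in x for v in row]
q, a_scale, a_zp = dynamic_quantize_linear(flat)
a_q = [q[0:3], q[3:6]]
acc = matmul([[v - a_zp for v in row] for row in a_q],
             [[w - B_zp for w in row] for row in B_q])
int_path = [[float(v) * (a_scale * B_scale) for v in row] for row in acc]

# Fused path: float x times dequantized B in a single float MatMul.
B_f = [[(w - B_zp) * B_scale for w in row] for row in B_q]
fused = matmul(x, B_f)

max_err = max(abs(p - f) for rp, rf in zip(int_path, fused) for p, f in zip(rp, rf))
```

Here `max_err` stays on the order of the activation quantization step, which is the accuracy difference one would expect between the original subgraph and the fused float MatMul.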
Extend the MatMulNBits op builder from GPU-only to also cover HTP. On HTP, support is restricted to 2-bit and 4-bit weights with block size 32 or 64. Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with the bitwidth set and quint8 as the tensor data type, so QnnQuantParamsWrapper is extended to accept encodings of type QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK. Per HTP request, MatMulNBits is transformed into Conv2d with the necessary Reshape and Cast ops around it, because the HTP implementation for MatMul could not be completed in time.
Test: unit tests, with the hard-coded utility functions extended for 2/4/8-bit.
TODO: Re-enable the test cases once QAIRT is upleveled to v2.47.
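The MatMul-to-Conv2d lowering relies on a 1x1 convolution computing the same thing as a matrix product. The sketch below shows that identity in pure Python. The NHWC layout, the `[1, 1, M, K]` reshape, and the function names are assumptions for illustration; the commit message does not state which shapes the op builder actually uses, and the block-quantized weights and Cast ops are left out.

```python
def conv2d_1x1_nhwc(inp, weight):
    """1x1 convolution. inp: [N][H][W][Cin]; weight: [Cin][Cout] (spatial dims dropped)."""
    cin, cout = len(weight), len(weight[0])
    return [[[[sum(pixel[c] * weight[c][o] for c in range(cin)) for o in range(cout)]
              for pixel in row]
             for row in image]
            for image in inp]

def matmul_via_conv(x, w):
    """Compute x[M][K] @ w[K][N] through a 1x1 convolution."""
    inp = [[x]]                    # Reshape [M, K] -> [1, 1, M, K]: each row is one pixel
    out = conv2d_1x1_nhwc(inp, w)  # result has shape [1, 1, M, N]
    return out[0][0]               # Reshape [1, 1, M, N] -> [M, N]

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
result = matmul_via_conv(x, w)
```

Each row of `x` becomes one pixel with `K` channels, so the convolution's channel reduction is exactly the inner product of the matrix multiply.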
* [QNN EP] Fix ReshapeGemmFusion rank-5 input regression
QNN HTP FullyConnected rejects input tensors with rank > 4. PR #232 added ReshapeGemmFusionGroup, which bypasses the input Reshape and passes the original (pre-reshape) tensor directly to QNN FC. This makes MatMul receive a rank-5 input and causes it to fall back to CPU with error 3110 "incorrect Rank 5".
Fix: add a rank guard in CheckShape so the fusion is skipped when the pre-reshape tensor has rank > 4. The standalone GemmOpBuilder then handles the Gemm with the already-flattened rank-2 input as before.
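The guard described above amounts to a rank check on the pre-reshape shape. A minimal sketch follows; `can_fuse_reshape_gemm` and `choose_path` are hypothetical names, not the actual CheckShape code, and only the rank limit of 4 comes from the commit message.

```python
# Limit taken from the commit message: HTP FullyConnected rejects rank > 4.
MAX_FC_INPUT_RANK = 4

def can_fuse_reshape_gemm(pre_reshape_shape):
    """Hypothetical stand-in for the rank check added to CheckShape."""
    return len(pre_reshape_shape) <= MAX_FC_INPUT_RANK

def choose_path(pre_reshape_shape):
    if can_fuse_reshape_gemm(pre_reshape_shape):
        return "fused"       # pre-reshape tensor goes straight to FullyConnected
    return "standalone"      # Reshape stays; Gemm sees the flattened rank-2 input

paths = [choose_path(s) for s in ([8, 16], [2, 3, 4, 16], [1, 2, 3, 4, 16])]
```

Only the rank-5 shape takes the standalone route, which matches the behavior from before the fusion was introduced.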
When beta=0.0, Y = alpha*(A@B) + 0*C simplifies to alpha*(A@B), so the bias input can be dropped entirely. Map these Gemm nodes to QNN FullyConnected without bias instead of falling back to CPU. Add CPU and HTP QDQ unit tests covering this case.
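The simplification can be checked numerically with a small reference Gemm. This is a sketch under assumptions: `gemm` is a hypothetical pure-Python helper, `C` is taken to be a per-column bias vector, and the values are illustrative. Note that the identity relies on `C` being finite, since `0 * inf` and `0 * nan` are not zero in floating point.

```python
def gemm(A, B, C=None, alpha=1.0, beta=1.0):
    """Reference Y = alpha * (A @ B) + beta * C, with C an optional per-column bias."""
    rows, inner, cols = len(A), len(B), len(B[0])
    Y = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = alpha * sum(A[i][k] * B[k][j] for k in range(inner))
            if C is not None:
                acc += beta * C[j]
            out_row.append(acc)
        Y.append(out_row)
    return Y

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.25]]
C = [10.0, -20.0]

with_bias = gemm(A, B, C, alpha=0.5, beta=0.0)   # bias present but scaled by zero
without_bias = gemm(A, B, None, alpha=0.5)       # bias dropped entirely
```

Both calls return the same matrix, which is why the bias input can be omitted when mapping such a node to a bias-free FullyConnected.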
* [QNN EP] Fuse DynamicQuantizeLinear + ConvInteger pattern into float QNN Conv2d
Introduces DQConvIntegerFusion, a new IQnnNodeGroup fusion that rewrites the
dynamic-quantize ConvInteger subgraph into a floating-point QNN Conv2d node,
as QNN HTP doesn't support Dynamic Quantization.
Pattern matched (ONNX):
x --> DynamicQuantizeLinear --> (a_q, a_scale, a_zp)
a_q, B_int8, a_zp, B_zp_int8 --> ConvInteger --> Cast(FLOAT)
a_scale, B_scale_init --> parallel Mul
Cast.out, parallel_Mul.out --> requant Mul
requant_Mul.out, bias_init --> Add [optional]
Rewrite (QNN):
x --> Transpose(NCHW->NHWC) -------+
| (activation input)
v
B --> [Dequantize(B_scale,B_zp)] --> Conv2d --> Transpose(NHWC->NCHW) --> [Add(bias)] --> out
* Per-channel B_scale: HTP Dequantize does not accept per-channel quant inputs,
so the int8 weight is pre-dequantized to float32 offline and emitted as a
STATIC float tensor without a Dequantize op.
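Pre-dequantizing a per-channel weight offline is a per-output-channel affine map. The sketch below illustrates it in pure Python; the function name, the flattened `[out_channels][elements]` layout, and the sample values are assumptions, not the EP's actual code.

```python
def dequantize_per_channel(w_q, scales, zero_points):
    """Dequantize int8 weights with one (scale, zero_point) pair per output channel.

    w_q is laid out as [out_channels][flattened kernel elements].
    """
    return [[(q - zp) * scale for q in channel]
            for channel, scale, zp in zip(w_q, scales, zero_points)]

w_q = [[-128, 0, 127], [10, -10, 4]]   # two output channels
scales = [0.5, 0.25]
zero_points = [0, 2]
w_float = dequantize_per_channel(w_q, scales, zero_points)
```

The resulting float values would then be emitted directly as the static weight tensor, so no Dequantize op is needed at run time.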
- Remove IsNpuBackend guard so DQConvIntegerFusion works on all QNN backends.
- Drop unused #include "core/providers/qnn/builder/qnn_def.h".
- Reject sibling absorption when any sibling ConvInteger is not
structurally fusible (IsConvIntegerStructurallyFusible + DQL
consumer walk in TryFusion).
- Reject ConvInteger with non-static rank-4 output shape early in
TryFusion to avoid claiming the DQL and failing later in
CreateOrValidateOnQnn.
- Replace silent return-on-missing-HTP-JSON with GTEST_SKIP in
fusion tests.
- Promote kFusionType to DQConvIntegerFusion::kType to remove the
literal duplication.
- Add depthwise coverage (per-tensor and per-channel B_scale) for
the QNN_OP_DEPTH_WISE_CONV_2D path.
- Add negative sibling-rejection test (one sibling has runtime
B_zp; assert neither sibling fuses).
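The sibling-rejection rule above is all-or-nothing: if any ConvInteger sharing the DynamicQuantizeLinear fails the structural check, none of them is fused. A minimal sketch of that decision follows. The node dictionaries, `siblings_to_fuse`, and `has_static_b_zp` are hypothetical stand-ins for the checks done in TryFusion, with a runtime `B_zp` used as the disqualifying condition as in the negative test.

```python
def siblings_to_fuse(siblings, is_structurally_fusible):
    """Return every sibling if all pass the check, otherwise none of them."""
    if all(is_structurally_fusible(node) for node in siblings):
        return list(siblings)
    return []

def has_static_b_zp(node):
    """Hypothetical check: B_zp must be a constant initializer, not a runtime input."""
    return node["b_zp_is_initializer"]

good = {"name": "conv_a", "b_zp_is_initializer": True}
bad = {"name": "conv_b", "b_zp_is_initializer": False}

mixed = siblings_to_fuse([good, bad], has_static_b_zp)      # one sibling disqualifies all
all_good = siblings_to_fuse([good, dict(good, name="conv_c")], has_static_b_zp)
```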
Cherry-picked the following from the mainline to rel-2.3.0 as part of ORT QNN EP 2.3.0 RC2.
ORT QNN EP 2.3.0 RC1 changes were added as part of this PR