
[QNN EP] 2.3.0 RC2 Changes #423

Merged
qti-mbadnara merged 12 commits into rel-2.3.0 from dev/qti-mbadnara/rel-2.3.0-rc2-changes
May 21, 2026

qti-mbadnara and others added 11 commits May 20, 2026 15:33
* [QNN EP] Fuse Dynamic MatMulInteger pattern into Float QNN MatMul

Adds DQMatMulIntegerFusion, a new IQnnNodeGroup that recognizes the
ONNX dynamic-quantization MatMul pattern emitted by tooling like
onnxruntime quantization (QDQ for activations, integer weights) and
folds it into a single QNN float MatMul, avoiding the int8/uint8
MatMulInteger op which QNN does not natively support.

Pattern matched (ONNX, starting at MatMulInteger):

    x --> DynamicQuantizeLinear --> (a_q, a_scale, a_zp)
    a_q, B, a_zp, B_zp        --> MatMulInteger --> Cast(FLOAT)
    a_scale, B_scale_init     --> parallel Mul
    Cast.out, parallel_Mul.out--> requant Mul
    requant_Mul.out, bias_init--> Add  (optional)

Rewrite (QNN):

    x ---------------------------------+
                                       |  (input[0] of MatMul)
                                       v
    B --> [Dequantize(B_scale,B_zp)] --> MatMul --> [Add(bias)] --> out
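
The rewrite above can be checked numerically with a small reference sketch. This is illustrative only: the helper names are hypothetical and not taken from the PR, and it assumes a per-tensor B_scale and 2-D inputs. Because the fused form feeds the float x straight into MatMul, it skips the activation quantization step, so the two results agree only to within the activation quantization error.

```python
# Illustrative reference only; helper names are hypothetical, not from the PR.

def dynamic_quantize_linear(values):
    """ONNX-style DynamicQuantizeLinear over a flat list of floats (uint8)."""
    lo, hi = min(0.0, min(values)), max(0.0, max(values))
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zp = int(max(0, min(255, round(-lo / scale))))
    q = [int(max(0, min(255, round(v / scale) + zp))) for v in values]
    return q, scale, zp

def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_bias(y, bias):
    """Optional trailing Add with a 1-D bias."""
    if bias is None:
        return y
    return [[v + bias[j] for j, v in enumerate(row)] for row in y]

def pattern_reference(x, b_q, b_scale, b_zp, bias=None):
    """Original subgraph: DQL -> MatMulInteger -> Cast -> Mul -> Mul -> Add."""
    rows, cols = len(x), len(x[0])
    a_q, a_scale, a_zp = dynamic_quantize_linear([v for r in x for v in r])
    a_int = [[a_q[r * cols + c] - a_zp for c in range(cols)] for r in range(rows)]
    b_int = [[v - b_zp for v in row] for row in b_q]
    acc = matmul(a_int, b_int)                          # MatMulInteger
    s = a_scale * b_scale                               # parallel Mul
    out = [[float(v) * s for v in row] for row in acc]  # Cast + requant Mul
    return add_bias(out, bias)

def fused_reference(x, b_q, b_scale, b_zp, bias=None):
    """Rewritten subgraph: float MatMul against dequantized B, then Add."""
    b_f = [[(v - b_zp) * b_scale for v in row] for row in b_q]
    return add_bias(matmul(x, b_f), bias)

x = [[0.5, -1.25, 2.0], [1.0, 0.0, -0.75]]
b_q = [[10, -3], [4, 7], [-8, 2]]
ref = pattern_reference(x, b_q, b_scale=0.02, b_zp=0, bias=[0.1, -0.1])
fused = fused_reference(x, b_q, b_scale=0.02, b_zp=0, bias=[0.1, -0.1])
max_diff = max(abs(r - f) for rr, ff in zip(ref, fused) for r, f in zip(rr, ff))
print(max_diff)  # small, bounded by the activation quantization error
```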
* [QNN EP] Fix ReshapeGemmFusion rank-5 input regression

QNN HTP FullyConnected rejects input tensors with rank > 4. PR #232 added
ReshapeGemmFusionGroup which bypasses the input Reshape and passes the
original (pre-reshape) tensor directly to QNN FC.
This makes MatMul receive a rank-5 input and causes it to fall back to CPU
with error 3110 "incorrect Rank 5".

Fix: add a rank guard in CheckShape so the fusion is skipped when the
pre-reshape tensor has rank > 4. The standalone GemmOpBuilder then handles
the Gemm with the already-flattened rank-2 input as before.
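
A minimal sketch of the guard's logic, assuming the check only needs the rank of the pre-reshape tensor. The function and constant names are hypothetical; the actual check lives in CheckShape in the C++ sources.

```python
# Hypothetical names; mirrors the rule "skip the fusion when rank > 4".
MAX_QNN_FC_INPUT_RANK = 4  # per the commit message: HTP FC rejects rank > 4

def can_fuse_reshape_gemm(pre_reshape_shape):
    """Return True only if the pre-reshape tensor may be passed to QNN FC."""
    return len(pre_reshape_shape) <= MAX_QNN_FC_INPUT_RANK

print(can_fuse_reshape_gemm([1, 2, 3, 4, 8]))  # rank 5: fusion is skipped
print(can_fuse_reshape_gemm([1, 2, 3, 8]))     # rank 4: fusion may proceed
```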

When beta=0.0, Y = alpha*(A@B) + 0*C simplifies to alpha*(A@B), so
the bias input can be dropped entirely. Map these Gemm nodes to QNN
FullyConnected without bias instead of falling back to CPU.

Add CPU and HTP QDQ unit tests covering this case.
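
The beta=0.0 simplification can be illustrated with a small reference Gemm. This is a sketch with hypothetical helper names, ignoring transA/transB, and assuming C is a finite 1-D bias (a NaN or infinity in C would not cancel under 0*C).

```python
# Illustrative reference only; not the GemmOpBuilder implementation.

def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gemm_literal(a, b, c, alpha, beta):
    """Y = alpha*(A@B) + beta*C, always evaluating the C term."""
    return [[alpha * v + beta * c[j] for j, v in enumerate(row)]
            for row in matmul(a, b)]

def gemm_no_bias(a, b, alpha):
    """Y = alpha*(A@B), the form used once the bias input is dropped."""
    return [[alpha * v for v in row] for row in matmul(a, b)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [2.0, 0.25]]
c = [10.0, -20.0]

with_c = gemm_literal(a, b, c, alpha=0.5, beta=0.0)
no_bias = gemm_no_bias(a, b, alpha=0.5)
print(with_c == no_bias)
```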
* [QNN EP] Fuse DynamicQuantizeLinear + ConvInteger pattern into float QNN Conv2d

Introduces DQConvIntegerFusion, a new IQnnNodeGroup fusion that rewrites the
dynamic-quantize ConvInteger subgraph into a floating-point QNN Conv2d node,
because QNN HTP does not support dynamic quantization.

  x --> DynamicQuantizeLinear --> (a_q, a_scale, a_zp)
        a_q, B_int8, a_zp, B_zp_int8 --> ConvInteger --> Cast(FLOAT)
        a_scale, B_scale_init        --> parallel Mul
        Cast.out, parallel_Mul.out   --> requant Mul
        requant_Mul.out, bias_init   --> Add [optional]

  x --> Transpose(NCHW->NHWC) -------+
                                     |   (activation input)
                                     v
  B --> [Dequantize(B_scale,B_zp)] --> Conv2d --> Transpose(NHWC->NCHW) --> [Add(bias)] --> out

  * Per-channel B_scale: HTP Dequantize does not accept per-channel quant inputs,
    so the int8 weight is pre-dequantized to float32 offline and emitted as a
    STATIC float tensor without a Dequantize op.
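
The offline pre-dequantization for the per-channel case can be sketched as follows. The helper name is hypothetical, and the sketch assumes one scale and zero point per output channel (axis 0 of the weight), with each channel's weights flattened into a list.

```python
# Illustrative sketch; assumes per-channel parameters along axis 0.

def dequantize_per_channel(w_q, scales, zero_points):
    """Return float weights: (w_q - zero_point) * scale, per output channel."""
    return [[(v - zp) * s for v in channel]
            for channel, s, zp in zip(w_q, scales, zero_points)]

w_q = [[-128, 0, 127], [10, -10, 5]]  # two output channels of int8 weights
w_f = dequantize_per_channel(w_q, scales=[0.5, 0.25], zero_points=[0, 2])
print(w_f)
```

The resulting float tensor is what would be emitted as a static input, so no Dequantize op is needed at runtime.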

- Remove IsNpuBackend guard so DQConvIntegerFusion works on all QNN backends.
- Drop unused #include "core/providers/qnn/builder/qnn_def.h".
- Reject sibling absorption when any sibling ConvInteger is not
  structurally fusible (IsConvIntegerStructurallyFusible + DQL
  consumer walk in TryFusion).
- Reject ConvInteger with non-static rank-4 output shape early in
  TryFusion to avoid claiming the DQL and failing later in
  CreateOrValidateOnQnn.
- Replace silent return-on-missing-HTP-JSON with GTEST_SKIP in
  fusion tests.
- Promote kFusionType to DQConvIntegerFusion::kType to remove the
  literal duplication.
- Add depthwise coverage (per-tensor and per-channel B_scale) for
  the QNN_OP_DEPTH_WISE_CONV_2D path.
- Add negative sibling-rejection test (one sibling has runtime
  B_zp; assert neither sibling fuses).
…ng_ratio=0 in RoiAlign (#389)

* [QNN EP] Fix RoiAlign Support

Previously the RoiAlign op builder rejected:
- coordinate_transformation_mode=half_pixel (ONNX opset-16 default)
- sampling_ratio=0 (adaptive mode, ONNX default)

This change:
- Allows both output_half_pixel and half_pixel in IsOpSupported
- Passes QNN_OP_ROI_ALIGN_PARAM_ALIGNED=true when half_pixel
- Passes QNN_OP_ROI_ALIGN_PARAM_NUM_SAMPLES_Y/X only when
  sampling_ratio > 0; omits the param for sampling_ratio=0 so QNN
  uses its adaptive default (-1), which is compatible with all backends
- Fixes the previously disabled RoiAlign unit test
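
The parameter mapping described in the list above can be sketched as a small function. The function name and the dictionary keys are hypothetical stand-ins for the QNN_OP_ROI_ALIGN_PARAM_* constants; the real logic is in the C++ RoiAlign op builder.

```python
# Hypothetical sketch of the attribute-to-parameter mapping.
SUPPORTED_MODES = ("half_pixel", "output_half_pixel")

def roi_align_qnn_params(coordinate_transformation_mode, sampling_ratio):
    """Build the QNN parameter set for an ONNX RoiAlign node."""
    if coordinate_transformation_mode not in SUPPORTED_MODES:
        raise ValueError("unsupported coordinate_transformation_mode")
    params = {}
    if coordinate_transformation_mode == "half_pixel":
        params["aligned"] = True  # stands in for ..._PARAM_ALIGNED
    if sampling_ratio > 0:
        # stands in for ..._PARAM_NUM_SAMPLES_Y / ..._PARAM_NUM_SAMPLES_X
        params["num_samples_y"] = sampling_ratio
        params["num_samples_x"] = sampling_ratio
    # sampling_ratio == 0: omit the params so QNN uses its adaptive default
    return params

print(roi_align_qnn_params("half_pixel", 0))
print(roi_align_qnn_params("output_half_pixel", 2))
```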

Extend the MatMulNBits op builder from GPU-only to also cover HTP. On HTP,
support is restricted to 2-bit and 4-bit weights with block size 32 or 64.
Unlike GPU, which uses BlockEncoding, HTP requires BwFloatBlockEncoding with
the bitwidth set and quint8 as the tensor type. QnnQuantParamsWrapper is
therefore extended to accept encodings of type
QNN_QUANTIZATION_ENCODING_BW_FLOAT_BLOCK.
Per the HTP team's request, MatMulNBits is transformed into Conv2d with the
necessary surrounding Reshape and Cast ops, because the HTP implementation
for MatMul could not be completed in time.
Test: unit tests, with the hardcoded utility functions extended for 2/4/8-bit.
TODO: Re-enable test cases once QAIRT is upleveled to v2.47.
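
The HTP restriction stated above amounts to a simple support predicate. A minimal sketch with hypothetical names, covering only the bit-width and block-size conditions from this commit message:

```python
# Hypothetical names; encodes only the limits stated in the commit message.
HTP_SUPPORTED_BITS = {2, 4}
HTP_SUPPORTED_BLOCK_SIZES = {32, 64}

def is_matmul_nbits_supported_on_htp(bits, block_size):
    """Return True if the MatMulNBits attributes fall within the HTP limits."""
    return bits in HTP_SUPPORTED_BITS and block_size in HTP_SUPPORTED_BLOCK_SIZES

print(is_matmul_nbits_supported_on_htp(4, 32))   # within the HTP limits
print(is_matmul_nbits_supported_on_htp(8, 32))   # 8-bit is outside them
print(is_matmul_nbits_supported_on_htp(4, 128))  # block size 128 is outside them
```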

qti-mbadnara merged commit e80e501 into rel-2.3.0 on May 21, 2026
59 checks passed
qti-mbadnara deleted the dev/qti-mbadnara/rel-2.3.0-rc2-changes branch on May 21, 2026 05:12

6 participants