[QNN EP]: Fusion of multiply and reciprocal to divide #302
ankipand-qti wants to merge 23 commits into
…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion
b95789d to 089d459
[C-1]
if (QnnHTPBackendTests::ShouldSkipIfHtpArchIsLessThanOrEqualTo(QNN_HTP_DEVICE_ARCH_V68)) {
  GTEST_SKIP() << "FP16 fusion requires HTP arch > V68";
}
Major
[M-1] The PR's stated goal is "fuse Reciprocal+Mul → Div", but it also (a) makes Recommend splitting into a follow-up PR with a minimal HTP-failure reproducer, an explanation of why the QDQ 1.0 case is accepted but the fp32/fp16 case is not (with the exact SDK version), and a "behavior change" section listing the model patterns that move from All to None/Some. If kept here, at least flip the affected tests to
[M-2] The Q-output → DQ-output traversal for QDQGroup Reciprocal is implemented twice (in
[M-3] In the build path, three
Minor
[N-1] The new
[N-2] The
[N-3] Alphabetical reorder of
[N-4]
[N-5] CWD-relative
Signed-off-by: ankipand-qti <ankipand@qti.qualcomm.com>
tirupath-qti
left a comment
Please note: keep important information short and clear in comments.
///
/// Quantized (QDQGroup):
///
/// [denominator] --> DQ --> Reciprocal --> Q --+
We should avoid fusing this pattern, because after fusion 1/b is no longer separately quantized.
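The concern above can be illustrated with a small numeric sketch. It simulates a uint8 quantize-then-dequantize step in plain C++ and compares the QDQ graph as written (Reciprocal output quantized, then Mul output quantized) against a fused Div that quantizes only the final output. The function names, scales, and zero points are hypothetical and chosen only for illustration; this is not QNN or ONNX Runtime code.

```cpp
#include <algorithm>
#include <cmath>

// Simulated uint8 affine quantize followed by dequantize ("fake quant").
// scale and zero_point are illustrative values, not taken from any real model.
inline float FakeQuantU8(float x, float scale, int zero_point) {
  const long q = std::lround(x / scale) + zero_point;
  const long clamped = std::min(255L, std::max(0L, q));
  return static_cast<float>(clamped - zero_point) * scale;
}

// QDQ graph as written: 1/b is quantized on its own before the Mul.
inline float UnfusedQdq(float a, float b, float recip_scale, float out_scale) {
  const float recip = FakeQuantU8(1.0f / b, recip_scale, 0);
  return FakeQuantU8(a * recip, out_scale, 0);
}

// Fused Div: the intermediate quantization of 1/b disappears.
inline float FusedQdq(float a, float b, float out_scale) {
  return FakeQuantU8(a / b, out_scale, 0);
}
```

With a = 100, b = 3, a reciprocal scale of 0.01 and an output scale of 0.5, the unfused path gives 33.0 while the fused path gives 33.5, so the two graphs are not numerically interchangeable once the intermediate Q/DQ pair is dropped.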
RETURN_IF_NOT(outputs.size() == 1, "Reciprocal operator must have exactly 1 output.");

// Check input type is float for CPU.
// On the QNN CPU backend only float32 is accepted; other backends (HTP, GPU)
this is not adding anything. Please avoid this change.
#include "core/providers/qnn/builder/op_builder_factory.h"
#include "core/providers/qnn/builder/opbuilder/base_op_builder.h"
#include "core/providers/qnn/builder/qnn_def.h"
this is not needed as there are no changes in this file.
  return nullptr;
}
if (recip_is_mul_input0 && recip_is_mul_input1) {
  // Defence-in-depth: same reasoning as the SingleNode branch above.
Control can still reach here in one edge case: Mul(reciprocal(b), reciprocal(b)).
Keep a simple comment.
I think earlier check about single consumer prevents this unless I'm misunderstanding
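Whether this branch is reachable depends on what the earlier single-consumer check counts, which is not visible in this thread. The toy model below is a sketch with hypothetical types (it does not use the ONNX Runtime graph API) and shows why the distinction matters: when one Reciprocal output feeds both inputs of the same Mul, there is one consumer node but two consumer edges.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical description of one use of a tensor: which node consumes it
// and at which input slot of that node.
struct ConsumerEdge {
  int node_id;
  int input_index;
};

// Counts every use of the tensor, including repeated uses by the same node.
inline std::size_t CountConsumerEdges(const std::vector<ConsumerEdge>& edges) {
  return edges.size();
}

// Counts distinct consuming nodes, ignoring how many inputs each one uses.
inline std::size_t CountConsumerNodes(const std::vector<ConsumerEdge>& edges) {
  std::set<int> ids;
  for (const ConsumerEdge& e : edges) {
    ids.insert(e.node_id);
  }
  return ids.size();
}
```

For Mul(reciprocal(b), reciprocal(b)), modelled as edges {7, 0} and {7, 1}, a node-based count returns 1 and would let the pattern through, while an edge-based count returns 2 and would reject it before the defensive branch is reached.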
// (HardSigmoid) as its target. That fusion shares a single root tensor x
// for both branches:
//
// [x] --> HardSigmoid --+
How does HardSigmoid relate to this fusion?
// Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
// SPDX-License-Identifier: MIT

// =============================================================================
nit: the comments throughout seem a bit verbose; not sure that is aligned with the comments in other fusions
//
// Note: explicit input/output count guards for Reciprocal (unary) and Mul
// (binary) are intentionally absent — ONNX spec compliance is assumed per
// the QNN EP review checklist [T06]. GetChildNodeUnitAllowQdq (Step 2) and
This appears to reference a document that is not publicly available.
#include "core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.h"

#include <array>
#include <gsl/gsl>
I think this should be after the standard library includes
…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion
Description
The QNN HTP/DSP backend has no native Reciprocal operator. When a standalone Reciprocal feeds directly into a Mul, the two-node sub-graph is mathematically equivalent to a single Div:
Mul(a, Reciprocal(b)) == Div(a, b)
I have added ReciprocalMulFusion, an IQnnNodeGroup that detects this pattern and lowers it to a single QNN_OP_ELEMENT_WISE_DIVIDE node, keeping the entire computation on the NPU accelerator and avoiding a CPU fallback.
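As a quick sanity check of the identity, here is a minimal scalar sketch in plain C++. The helper names are hypothetical and unrelated to the ReciprocalMulFusion code. Note that in floating point the two forms are equal only up to rounding, so the comparison uses a relative tolerance rather than exact equality.

```cpp
#include <cmath>

// Unfused form: Reciprocal followed by Mul.
inline float MulReciprocal(float a, float b) {
  const float recip = 1.0f / b;  // Reciprocal(b)
  return a * recip;              // Mul(a, Reciprocal(b))
}

// Fused form: a single Div.
inline float Div(float a, float b) { return a / b; }

// True when both forms agree within a relative tolerance.
inline bool FormsAgree(float a, float b, float rel_tol = 1e-6f) {
  const float fused = Div(a, b);
  const float unfused = MulReciprocal(a, b);
  return std::fabs(fused - unfused) <= rel_tol * std::fmax(1.0f, std::fabs(fused));
}
```

A test for the fusion would typically compare outputs with a tolerance in the same spirit, since the fused Div rounds once where the unfused pair rounds twice.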
Motivation and Context
Why the change is required: Without the fusion, FP16 and FP32 models with the Reciprocal -> Mul pattern would either fall back to CPU or use an unnecessarily inefficient two-node implementation on the accelerator.
Problem the change solves: Without the fusion, Reciprocal -> Mul falls back to CPU.