[QNN EP]: Fusion of multiply and reciprocal to divide by ankipand-qti · Pull Request #302 · onnxruntime/onnxruntime-qnn

ankipand-qti · 2026-04-27T13:52:48Z

Description

The QNN HTP/DSP backend has no native Reciprocal operator. When a standalone Reciprocal feeds directly into a Mul, the two-node sub-graph is mathematically equivalent to a single Div:
Mul(a, Reciprocal(b)) == Div(a, b)

I have added ReciprocalMulFusion, an IQnnNodeGroup that detects this pattern and lowers it to a single QNN_OP_ELEMENT_WISE_DIVIDE node, keeping the entire computation on the NPU accelerator and avoiding a CPU fallback.

Motivation and Context

Why change is required: Without the fusion, FP16 and FP32 models with the Reciprocal -> Mul pattern would either fall back to CPU or use an unnecessarily inefficient two-node implementation on the accelerator.
Problem the change solves: Reciprocal -> Mul falls back to CPU without the fusion

CLAassistant · 2026-04-27T13:52:56Z

All committers have signed the CLA.

…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion

qti-chuteng · 2026-05-08T10:14:07Z

[C-1] onnxruntime/test/providers/qnn/qnn_node_group/reciprocal_mul_fusion_test.cc:630

QnnHTPBackendTests::ShouldSkipIfHtpFp16Unsupported() does not exist in the test framework — onnxruntime_test_all will fail to link on every platform. This is the only occurrence of that name in the entire repo. Use the same helper that line 588 of this file uses for the U16 case:

if (QnnHTPBackendTests::ShouldSkipIfHtpArchIsLessThanOrEqualTo(QNN_HTP_DEVICE_ARCH_V68)) {
  GTEST_SKIP() << "FP16 fusion requires HTP arch > V68";
}

Major

[M-1] onnxruntime/core/providers/qnn/builder/opbuilder/reciprocal_op_builder.cc:41-54, onnxruntime/test/providers/qnn/simple_op_test.cc:1098-1122

The PR's stated goal is "fuse Reciprocal+Mul → Div", but it also (a) makes ReciprocalOpBuilder::IsOpSupported reject unquantized Reciprocal on HTP and (b) flips the existing Reciprocal_Basic_FLOAT from ExpectedEPNodeAssignment::All to ::None. This is scope creep and a behavior regression: any standalone fp32/fp16 Reciprocal without a downstream Mul (some LayerNorm variants, alternative softmax forms) will now silently fall back to CPU EP, undetectable from the PR description. Also, the inline rationale ("HTP cannot execute ElementWiseDivide(static_1.0, dynamic_x)") is inconsistent with Reciprocal_QU8 continuing to pass via the same Div(static_1.0, dynamic_x) lowering on the QDQ path.

Recommend splitting into a follow-up PR with a minimal HTP-failure reproducer, an explanation of why the QDQ 1.0 case is accepted but the fp32/fp16 case is not (with the exact SDK version), and a "behavior change" section listing the model patterns that move from All to None/Some. If kept here, at least flip the affected tests to DISABLED_ with a TODO instead of changing the expected EP assignment.

[M-2] onnxruntime/core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.cc:316-348, 411-431

The Q-output → DQ-output traversal for QDQGroup Reciprocal is implemented twice (in TryFusion Step 3 and again in CreateOrValidateOnQnn). Same logic, slightly different RETURN_IF_NOT strings. Resolve numerator_def / denominator_def once in TryFusion, store them as members on the fusion object, and have CreateOrValidateOnQnn consume the cached values. This is the pattern that hardsigmoid_mul_fusion.cc follows.

[M-3] onnxruntime/core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.cc:466-501

In the build path, three MakeTensorWrapper calls are unconditional, but the resulting wrappers are dropped (via std::move) only inside if (!IsQnnTensorWrapperExist(...)) blocks. When the tensor is already registered (common: graph inputs touched by upstream ops, denominator shared across LayerNorm-like patterns), the just-built wrapper is silently discarded, wasting a GetTensorInfo + shape resolution + quant-param extraction + vector allocation. Wrap each MakeTensorWrapper behind the existence check. The same pattern in gelu_fusion.cc is also wrong, but as a brand-new file this is the right place to fix it.

Minor

[N-1] onnxruntime/test/providers/qnn/simple_op_test.cc:1119-1135

The new Reciprocal_FP16 test pins ExpectedEPNodeAssignment::None. As a "lock current unsupported state" canary it has limited value — best treated together with M-1 (move to follow-up PR or use DISABLED_ prefix with a TODO).

[N-2] onnxruntime/core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.cc:301-305, 346-348

The recip_is_input0 && recip_is_input1 "degenerate case" branch is effectively dead code — GetChildNodeUnitAllowQdq's single-consumer guard already prevents the producer-feeds-both-Mul-slots scenario. Keep the branch as defence-in-depth, but reword the comment so future readers don't mistake it for a critical guard.

[N-3] onnxruntime/core/providers/qnn/builder/qnn_node_group/qnn_node_group.cc:24-27, 81-86

Alphabetical reorder of spacetodepth_fusion.h include and the Erf op-type case is bundled with the feature commit. Prefer a separate [NFC] commit; if force-pushing again is impractical, just call it out in the PR description.

[N-4] onnxruntime/test/providers/qnn/qnn_node_group/reciprocal_mul_fusion_test.cc (whole file)

json_dir everywhere — existing fusion tests (gelu, channel_shuffle, gather_transpose_reshape) use json_qnn_graph_dir. Pure consistency.

[N-5] onnxruntime/test/providers/qnn/qnn_node_group/reciprocal_mul_fusion_test.cc (every TEST_F)

CWD-relative json_dir paths are fragile under --gtest_repeat / --gtest_shuffle and across CI runners with different CWDs. Tolerated by other fusion tests today; consider a shared RAII helper or temp_directory_path() in a future cleanup PR.

Signed-off-by: ankipand-qti <ankipand@qti.qualcomm.com>

tirupath-qti

Please note: keep important information short and clear in comments.

tirupath-qti · 2026-05-27T18:21:33Z

+///
+/// Quantized (QDQGroup):
+///
+///   [denominator] --> DQ --> Reciprocal --> Q --+


we should avoid fusing this pattern as 1/b is no longer separately quantized.

tirupath-qti · 2026-05-27T18:33:11Z

  RETURN_IF_NOT(outputs.size() == 1, "Reciprocal operator must have exactly 1 output.");

-  // Check input type is float for CPU.
+  // On the QNN CPU backend only float32 is accepted; other backends (HTP, GPU)


this is not adding anything. Please avoid this change.

tirupath-qti · 2026-05-27T18:33:25Z


 #include "core/providers/qnn/builder/op_builder_factory.h"
 #include "core/providers/qnn/builder/opbuilder/base_op_builder.h"
+#include "core/providers/qnn/builder/qnn_def.h"


this is not needed as there are no changes in this file.

tirupath-qti · 2026-05-27T18:36:19Z

+      return nullptr;
+    }
+    if (recip_is_mul_input0 && recip_is_mul_input1) {
+      // Defence-in-depth: same reasoning as the SingleNode branch above.


The control can still reach here for edge case: Mul(reciprocal(b), reciprocal(b));
keep a simple comment.

I think earlier check about single consumer prevents this unless I'm misunderstanding

tirupath-qti · 2026-05-27T18:37:31Z

+// (HardSigmoid) as its target.  That fusion shares a single root tensor x
+// for both branches:
+//
+//   [x] --> HardSigmoid --+


what is the relation of HardSigmoid in this fusion?

qti-kromero · 2026-05-27T21:21:01Z

+// Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
+// SPDX-License-Identifier: MIT
+
+// =============================================================================


nit: throughout these comments seem a bit verbose, not sure if that is aligned with other fusion comments

qti-kromero · 2026-05-27T21:32:29Z

+//
+// Note: explicit input/output count guards for Reciprocal (unary) and Mul
+// (binary) are intentionally absent — ONNX spec compliance is assumed per
+// the QNN EP review checklist [T06].  GetChildNodeUnitAllowQdq (Step 2) and


appears to reference some document not publicly available

qti-kromero · 2026-05-27T21:45:50Z

+#include "core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.h"
+
+#include <array>
+#include <gsl/gsl>


I think this should be after the standard library includes

qti-kromero · 2026-05-27T21:59:43Z

+      return nullptr;
+    }
+    if (recip_is_mul_input0 && recip_is_mul_input1) {
+      // Defence-in-depth: same reasoning as the SingleNode branch above.


I think earlier check about single consumer prevents this unless I'm misunderstanding

…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion

[QNN EP]: Fusion of multiply and reciprocal with divide

ed6617a

ankipand-qti requested review from qti-ashwshan, qti-chuteng, qti-jkilpatrick, qti-kromero, qti-yuduo, quic-ykota and tirupath-qti as code owners April 27, 2026 13:52

ankipand-qti added 13 commits April 27, 2026 19:31

Adding a missing include header file

be5ccad

Removing extra header file

0182c12

Merge branch 'main' into dev/ankipand_qcom/reciprocal_multiply_fusion

0d380df

Using ORT datatype instead of MLFloat16

eafe732

Lint runner fixed

4b77854

Merge branch 'main' into dev/ankipand_qcom/reciprocal_multiply_fusion

1d7ce45

Addressing review comments

d4bc03f

Merge branch 'dev/ankipand_qcom/reciprocal_multiply_fusion' of github…

8c7232e

…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion

Addressing failing test cases

7a3fbb8

Addressing the failing test cases

f130384

Found a bug for reciprocal output during testing

e9ee222

Fixing the bug in test code

541d9c1

Fixing some bugs in test code

089d459

ankipand-qti force-pushed the dev/ankipand_qcom/reciprocal_multiply_fusion branch from b95789d to 089d459 Compare April 29, 2026 16:44

Merge branch 'main' into dev/ankipand_qcom/reciprocal_multiply_fusion

4a9e694

ankipand-qti requested a review from yath1 as a code owner April 30, 2026 03:20

ankipand-qti added 3 commits May 6, 2026 19:49

Addressing the review comments

7b6f9f0

Lint issues fix

d811e37

Fixed test code problems

0e5c53d

Addressing the review comments

b76b5a7

qti-chuteng approved these changes May 13, 2026

View reviewed changes

ankipand-qti added 2 commits May 13, 2026 10:21

Adding V68 skips for Linux ARM64

d3e7ef0

Merge branch 'main' into dev/ankipand_qcom/reciprocal_multiply_fusion

cce27f4

Signed-off-by: ankipand-qti <ankipand@qti.qualcomm.com>

tirupath-qti reviewed May 27, 2026

View reviewed changes

qti-kromero reviewed May 27, 2026

View reviewed changes

ankipand-qti added 2 commits June 19, 2026 09:43

Handling the review comments

bf743cc

Merge branch 'dev/ankipand_qcom/reciprocal_multiply_fusion' of github…

10418f7

…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion

Conversation

ankipand-qti commented Apr 27, 2026

Description

Motivation and Context

Uh oh!

CLAassistant commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qti-chuteng commented May 8, 2026

Major

Minor

Uh oh!

tirupath-qti left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

CLAassistant commented Apr 27, 2026 •

edited

Loading