Skip to content

[QNN EP]: Fusion of multiply and reciprocal to divide#302

Open
ankipand-qti wants to merge 23 commits into
mainfrom
dev/ankipand_qcom/reciprocal_multiply_fusion
Open

[QNN EP]: Fusion of multiply and reciprocal to divide#302
ankipand-qti wants to merge 23 commits into
mainfrom
dev/ankipand_qcom/reciprocal_multiply_fusion

Conversation

@ankipand-qti

Copy link
Copy Markdown
Collaborator

Description

The QNN HTP/DSP backend has no native Reciprocal operator. When a standalone Reciprocal feeds directly into a Mul, the two-node sub-graph is mathematically equivalent to a single Div:
Mul(a, Reciprocal(b)) == Div(a, b)

I have added ReciprocalMulFusion, an IQnnNodeGroup that detects this pattern and lowers it to a single QNN_OP_ELEMENT_WISE_DIVIDE node, keeping the entire computation on the NPU accelerator and avoiding a CPU fallback.

Motivation and Context

Why change is required: Without the fusion, FP16 and FP32 models with the Reciprocal -> Mul pattern would either fall back to CPU or use an unnecessarily inefficient two-node implementation on the accelerator.
Problem the change solves: Reciprocal -> Mul falls back to CPU without the fusion

@CLAassistant

CLAassistant commented Apr 27, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@ankipand-qti ankipand-qti force-pushed the dev/ankipand_qcom/reciprocal_multiply_fusion branch from b95789d to 089d459 Compare April 29, 2026 16:44
@ankipand-qti ankipand-qti requested a review from yath1 as a code owner April 30, 2026 03:20
@qti-chuteng

Copy link
Copy Markdown
Collaborator

[C-1] onnxruntime/test/providers/qnn/qnn_node_group/reciprocal_mul_fusion_test.cc:630

QnnHTPBackendTests::ShouldSkipIfHtpFp16Unsupported() does not exist in the test framework — onnxruntime_test_all will fail to link on every platform. This is the only occurrence of that name in the entire repo. Use the same helper that line 588 of this file uses for the U16 case:

if (QnnHTPBackendTests::ShouldSkipIfHtpArchIsLessThanOrEqualTo(QNN_HTP_DEVICE_ARCH_V68)) {
  GTEST_SKIP() << "FP16 fusion requires HTP arch > V68";
}

Major

[M-1] onnxruntime/core/providers/qnn/builder/opbuilder/reciprocal_op_builder.cc:41-54, onnxruntime/test/providers/qnn/simple_op_test.cc:1098-1122

The PR's stated goal is "fuse Reciprocal+Mul → Div", but it also (a) makes ReciprocalOpBuilder::IsOpSupported reject unquantized Reciprocal on HTP and (b) flips the existing Reciprocal_Basic_FLOAT from ExpectedEPNodeAssignment::All to ::None. This is scope creep and a behavior regression: any standalone fp32/fp16 Reciprocal without a downstream Mul (some LayerNorm variants, alternative softmax forms) will now silently fall back to CPU EP, undetectable from the PR description. Also, the inline rationale ("HTP cannot execute ElementWiseDivide(static_1.0, dynamic_x)") is inconsistent with Reciprocal_QU8 continuing to pass via the same Div(static_1.0, dynamic_x) lowering on the QDQ path.

Recommend splitting into a follow-up PR with a minimal HTP-failure reproducer, an explanation of why the QDQ 1.0 case is accepted but the fp32/fp16 case is not (with the exact SDK version), and a "behavior change" section listing the model patterns that move from All to None/Some. If kept here, at least flip the affected tests to DISABLED_ with a TODO instead of changing the expected EP assignment.

[M-2] onnxruntime/core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.cc:316-348, 411-431

The Q-output → DQ-output traversal for QDQGroup Reciprocal is implemented twice (in TryFusion Step 3 and again in CreateOrValidateOnQnn). Same logic, slightly different RETURN_IF_NOT strings. Resolve numerator_def / denominator_def once in TryFusion, store them as members on the fusion object, and have CreateOrValidateOnQnn consume the cached values. This is the pattern that hardsigmoid_mul_fusion.cc follows.

[M-3] onnxruntime/core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.cc:466-501

In the build path, three MakeTensorWrapper calls are unconditional, but the resulting wrappers are dropped (via std::move) only inside if (!IsQnnTensorWrapperExist(...)) blocks. When the tensor is already registered (common: graph inputs touched by upstream ops, denominator shared across LayerNorm-like patterns), the just-built wrapper is silently discarded, wasting a GetTensorInfo + shape resolution + quant-param extraction + vector allocation. Wrap each MakeTensorWrapper behind the existence check. The same pattern in gelu_fusion.cc is also wrong, but as a brand-new file this is the right place to fix it.

Minor

[N-1] onnxruntime/test/providers/qnn/simple_op_test.cc:1119-1135

The new Reciprocal_FP16 test pins ExpectedEPNodeAssignment::None. As a "lock current unsupported state" canary it has limited value — best treated together with M-1 (move to follow-up PR or use DISABLED_ prefix with a TODO).

[N-2] onnxruntime/core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.cc:301-305, 346-348

The recip_is_input0 && recip_is_input1 "degenerate case" branch is effectively dead code — GetChildNodeUnitAllowQdq's single-consumer guard already prevents the producer-feeds-both-Mul-slots scenario. Keep the branch as defence-in-depth, but reword the comment so future readers don't mistake it for a critical guard.

[N-3] onnxruntime/core/providers/qnn/builder/qnn_node_group/qnn_node_group.cc:24-27, 81-86

Alphabetical reorder of spacetodepth_fusion.h include and the Erf op-type case is bundled with the feature commit. Prefer a separate [NFC] commit; if force-pushing again is impractical, just call it out in the PR description.

[N-4] onnxruntime/test/providers/qnn/qnn_node_group/reciprocal_mul_fusion_test.cc (whole file)

json_dir everywhere — existing fusion tests (gelu, channel_shuffle, gather_transpose_reshape) use json_qnn_graph_dir. Pure consistency.

[N-5] onnxruntime/test/providers/qnn/qnn_node_group/reciprocal_mul_fusion_test.cc (every TEST_F)

CWD-relative json_dir paths are fragile under --gtest_repeat / --gtest_shuffle and across CI runners with different CWDs. Tolerated by other fusion tests today; consider a shared RAII helper or temp_directory_path() in a future cleanup PR.

@tirupath-qti tirupath-qti left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note: keep important information short and clear in comments.

///
/// Quantized (QDQGroup):
///
/// [denominator] --> DQ --> Reciprocal --> Q --+

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should avoid fusing this pattern as 1/b is no longer separately quantized.

RETURN_IF_NOT(outputs.size() == 1, "Reciprocal operator must have exactly 1 output.");

// Check input type is float for CPU.
// On the QNN CPU backend only float32 is accepted; other backends (HTP, GPU)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not adding anything. Please avoid this change.


#include "core/providers/qnn/builder/op_builder_factory.h"
#include "core/providers/qnn/builder/opbuilder/base_op_builder.h"
#include "core/providers/qnn/builder/qnn_def.h"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed as there are no changes in this file.

return nullptr;
}
if (recip_is_mul_input0 && recip_is_mul_input1) {
// Defence-in-depth: same reasoning as the SingleNode branch above.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The control can still reach here for edge case: Mul(reciprocal(b), reciprocal(b));
keep a simple comment.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think earlier check about single consumer prevents this unless I'm misunderstanding

// (HardSigmoid) as its target. That fusion shares a single root tensor x
// for both branches:
//
// [x] --> HardSigmoid --+

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the relation of HardSigmoid in this fusion?

// Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
// SPDX-License-Identifier: MIT

// =============================================================================

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: throughout these comments seem a bit verbose, not sure if that is aligned with other fusion comments

//
// Note: explicit input/output count guards for Reciprocal (unary) and Mul
// (binary) are intentionally absent — ONNX spec compliance is assumed per
// the QNN EP review checklist [T06]. GetChildNodeUnitAllowQdq (Step 2) and

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appears to reference some document not publicly available

#include "core/providers/qnn/builder/qnn_node_group/reciprocal_mul_fusion.h"

#include <array>
#include <gsl/gsl>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be after the standard library includes

return nullptr;
}
if (recip_is_mul_input0 && recip_is_mul_input1) {
// Defence-in-depth: same reasoning as the SingleNode branch above.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think earlier check about single consumer prevents this unless I'm misunderstanding

…-work:onnxruntime/onnxruntime-qnn into dev/ankipand_qcom/reciprocal_multiply_fusion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants