[QNN EP] Make graph partitioner aware of multi-NodeUnit fusions #464

Open. qti-ashwshan wants to merge 6 commits into main from dev/ashwshan/node-group-aware-partitioner

@qti-ashwshan

Collaborator

Description

Make the QNN EP graph partitioner aware of multi-NodeUnit fusions (IQnnNodeGroup).

The partitioner today is OrtNodeUnit-aware but blind to IQnnNodeGroup. Its BFS schedules per NodeUnit, so members of a multi-NodeUnit fusion (e.g. LPBQ MatMul/Gemm, LayerNorm, Gelu, ChannelShuffle, SpaceToDepth, ReshapeGemm) can be scheduled into different partitions when an unsupported op falls between them. At Compile time, QnnModel::ComposeGraph re-runs GetQnnNodeGroups on each partition subgraph; with members missing from the subgraph, TryFusion returns nullptr and the fusion is silently dropped. Depending on the fusion, this either degrades to a slower per-op path or aborts compilation entirely (e.g. when the per-op path can't handle the encoding the fusion was producing).
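To make the failure mode concrete, here is a minimal sketch of a per-unit BFS partitioner splitting a two-member fusion. It is a toy stand-in written for illustration, not the ORT or QNN EP code: `PartitionPerUnit`, `IsSplit`, and the integer node ids are all hypothetical, and the scheduling rule (close the current partition whenever only unsupported nodes remain ready) is an assumed simplification of the real partitioner.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // producer node id -> consumer node id

// Toy per-unit partitioner (hypothetical). Supported nodes join the current
// partition as soon as all their producers are done. Unsupported nodes are
// deferred; once nothing else is ready, the current partition is closed and
// the deferred nodes release their consumers into the next partition.
std::vector<std::vector<int>> PartitionPerUnit(const std::vector<bool>& supported,
                                               const std::vector<Edge>& edges) {
  const int n = static_cast<int>(supported.size());
  std::vector<int> in_degree(n, 0);
  std::vector<std::vector<int>> consumers(n);
  for (const Edge& e : edges) {
    ++in_degree[e.second];
    consumers[e.first].push_back(e.second);
  }

  std::queue<int> ready;
  for (int i = 0; i < n; ++i) {
    if (in_degree[i] == 0) ready.push(i);
  }

  std::vector<int> deferred;  // unsupported nodes waiting for a partition cut
  std::vector<int> current;
  std::vector<std::vector<int>> partitions;
  auto release = [&](int node) {
    for (int c : consumers[node]) {
      if (--in_degree[c] == 0) ready.push(c);
    }
  };

  while (!ready.empty() || !deferred.empty()) {
    if (ready.empty()) {  // only unsupported work left: cut the partition here
      if (!current.empty()) {
        partitions.push_back(current);
        current.clear();
      }
      for (int node : deferred) release(node);
      deferred.clear();
      continue;
    }
    const int node = ready.front();
    ready.pop();
    if (!supported[node]) {
      deferred.push_back(node);
      continue;
    }
    current.push_back(node);
    release(node);
  }
  if (!current.empty()) partitions.push_back(current);
  return partitions;
}

// True when the fusion members are spread over more than one partition.
bool IsSplit(const std::vector<std::vector<int>>& partitions,
             const std::vector<int>& members) {
  int partitions_hit = 0;
  for (const auto& part : partitions) {
    const bool hit = std::any_of(members.begin(), members.end(), [&](int m) {
      return std::find(part.begin(), part.end(), m) != part.end();
    });
    if (hit) ++partitions_hit;
  }
  return partitions_hit > 1;
}
```

With nodes 0 (fusion member), 1 (unsupported op), and 2 (fusion target), and edges 0→2 and 1→2, this toy produces the partitions {0} and {2}: the unsupported op upstream of the target forces a cut, and the two fusion members end up in different partitions.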

This PR rewrites the partitioner to schedule on IQnnNodeGroup as the atomic unit. The members of a multi-NodeUnit fusion are guaranteed to land either all in the same partition or in none.

Approach

- Add QnnNodeGroupInfo describing each IQnnNodeGroup (members, target, supported flag, external in-degree). Every OrtNodeUnit ends up in exactly one group, either a real fusion or a 1-member QnnNodeUnitWrapper, so the BFS becomes branch-free.
- Compute external_in_degree per group via a per-edge producer walk that excludes intra-group edges and initializers.
- Replace the NodeUnit-level BFS in CreateSupportedPartitionNodeGroups with a group-level BFS: when a group is admitted, all of its member OrtNodes are emitted atomically into the current partition.
- Add a cycle-detection + demotion loop in GetSupportedNodes for the rare case where a multi-NodeUnit group cannot be scheduled atomically without forming a cyclic dependency through an unsupported op. Such groups are demoted to 1-member groups and re-checked via the standalone op builder; each member then either runs independently on QNN or falls back to CPU.
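The external in-degree and group-level BFS steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `ToyGroup` only mirrors some of the fields listed for QnnNodeGroupInfo, node ids are plain integers, initializers are assumed to be absent from the edge list, and `ComputeExternalInDegree` and `PartitionByGroup` are hypothetical names.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // producer node id -> consumer node id

// Hypothetical stand-in for a group record; every node is assumed to belong
// to exactly one group (a fusion or a 1-member wrapper).
struct ToyGroup {
  std::vector<int> members;
  bool supported;
  int external_in_degree;
};

// Count, per group, the edges whose producer lives in a different group.
// Intra-group edges are skipped; initializers never appear as edges here.
void ComputeExternalInDegree(const std::vector<Edge>& edges,
                             const std::vector<int>& group_of,
                             std::vector<ToyGroup>& groups) {
  for (ToyGroup& g : groups) g.external_in_degree = 0;
  for (const Edge& e : edges) {
    if (group_of[e.first] != group_of[e.second]) {
      ++groups[group_of[e.second]].external_in_degree;
    }
  }
}

// Group-level BFS: a supported group is admitted only when all of its
// external producers are done, and then all members are emitted together.
std::vector<std::vector<int>> PartitionByGroup(int num_nodes,
                                               const std::vector<Edge>& edges,
                                               std::vector<ToyGroup> groups) {
  std::vector<int> group_of(num_nodes, -1);
  for (std::size_t g = 0; g < groups.size(); ++g) {
    for (int node : groups[g].members) group_of[node] = static_cast<int>(g);
  }
  ComputeExternalInDegree(edges, group_of, groups);

  // One consumer entry per cross-group edge, matching the in-degree count.
  std::vector<std::vector<int>> consumers(groups.size());
  for (const Edge& e : edges) {
    if (group_of[e.first] != group_of[e.second]) {
      consumers[group_of[e.first]].push_back(group_of[e.second]);
    }
  }

  std::queue<int> ready;
  for (std::size_t g = 0; g < groups.size(); ++g) {
    if (groups[g].external_in_degree == 0) ready.push(static_cast<int>(g));
  }

  std::vector<int> deferred;  // unsupported groups waiting for a partition cut
  std::vector<int> current;
  std::vector<std::vector<int>> partitions;
  auto release = [&](int g) {
    for (int c : consumers[g]) {
      if (--groups[c].external_in_degree == 0) ready.push(c);
    }
  };

  while (!ready.empty() || !deferred.empty()) {
    if (ready.empty()) {
      if (!current.empty()) {
        partitions.push_back(current);
        current.clear();
      }
      for (int g : deferred) release(g);
      deferred.clear();
      continue;
    }
    const int g = ready.front();
    ready.pop();
    if (!groups[g].supported) {
      deferred.push_back(g);
      continue;
    }
    current.insert(current.end(), groups[g].members.begin(), groups[g].members.end());
    release(g);
  }
  if (!current.empty()) partitions.push_back(current);
  return partitions;
}
```

For a fusion group {0, 2} and an unsupported 1-member group {1}, with edges 0→2 and 1→2, the sketch yields a single partition {0, 2}. A group caught in a cyclic dependency never reaches external in-degree zero in this toy and is simply never emitted, which is why a separate cycle-detection and demotion step is needed.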

Files

- qnn_ep_utils.h: QnnNodeGroupInfo struct, ComputeGroupExternalInDegree, extended CreateSupportedPartitionNodeGroups signature.
- qnn_ep_utils.cc: group-level BFS, edge-walker helper.
- qnn_execution_provider.h/.cc: extended private GetSupportedNodes to build the group set, cycle detection + demotion loop, updated single call site in GetCapabilityImpl.
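One way to picture the cycle detection + demotion loop is the sketch below. It is an assumption-laden illustration rather than the logic added to GetSupportedNodes: `OnCycle` and `DemoteCyclicGroups` are hypothetical names, groups are plain lists of integer node ids, the supported flag is ignored (any group-level cycle triggers demotion), and the re-check through the standalone op builder is not modeled.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;              // producer node id -> consumer node id
using Groups = std::vector<std::vector<int>>;  // each inner vector lists one group's members

// True if group `start` can reach itself through at least one other group
// in the group-level successor graph (intra-group edges already removed).
bool OnCycle(int start, const std::vector<std::vector<int>>& succ) {
  std::vector<bool> seen(succ.size(), false);
  std::vector<int> stack(succ[start].begin(), succ[start].end());
  while (!stack.empty()) {
    const int g = stack.back();
    stack.pop_back();
    if (g == start) return true;
    if (seen[g]) continue;
    seen[g] = true;
    for (int next : succ[g]) stack.push_back(next);
  }
  return false;
}

// Repeatedly find a multi-member group that sits on a group-level cycle and
// demote it to 1-member groups, until no such group remains. Assumes every
// node belongs to exactly one group and the node-level graph is acyclic.
Groups DemoteCyclicGroups(int num_nodes, const std::vector<Edge>& edges, Groups groups) {
  for (;;) {
    std::vector<int> group_of(num_nodes, -1);
    for (std::size_t g = 0; g < groups.size(); ++g) {
      for (int node : groups[g]) group_of[node] = static_cast<int>(g);
    }
    std::vector<std::vector<int>> succ(groups.size());
    for (const Edge& e : edges) {
      if (group_of[e.first] != group_of[e.second]) {
        succ[group_of[e.first]].push_back(group_of[e.second]);
      }
    }

    int victim = -1;
    for (std::size_t g = 0; g < groups.size(); ++g) {
      if (groups[g].size() > 1 && OnCycle(static_cast<int>(g), succ)) {
        victim = static_cast<int>(g);
        break;
      }
    }
    if (victim < 0) return groups;  // every remaining group can be scheduled atomically

    // Demote: replace the fused group with one singleton group per member.
    std::vector<int> members = std::move(groups[victim]);
    groups.erase(groups.begin() + victim);
    for (int node : members) groups.push_back({node});
  }
}
```

With a fusion {0, 2}, an op 1 sitting between its members (edges 0→1, 1→2, 0→2), the group graph has the cycle {0, 2} → {1} → {0, 2}, so the fusion is demoted to the singletons {0} and {2}. Demoting one group per iteration keeps the toy from breaking up more fusions than needed when two fused groups share a cycle.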

Compatibility

- No public API change. GetSupportedNodes and CreateSupportedPartitionNodeGroups are internal with single in-tree callers.
- No IQnnNodeGroup interface change. Authors of new fusions get atomic partitioning automatically, with no extra plumbing.
- No op builder change. All existing IsOpSupported / AddToModelBuilder paths are untouched.
- The GetCapabilityImpl ABI to ORT core is unchanged; only the function body changes.
- The EPContext path is unaffected (it doesn't go through the partitioner).

Motivation and Context

Problem

A model can have a perfectly valid multi-NodeUnit fusion pattern that QNN can compile to a single optimized op, but a single unsupported op anywhere upstream of the fusion's target node is enough for BFS to split the fusion's members across partitions. The fusion then silently fails to fire at Compile time. Symptoms vary:

- For fusions with a usable per-op fallback (Gelu, LayerNorm with constant gamma, ChannelShuffle, etc.): performance regression. Output is correct but the fused QNN op never gets built.
- For fusions where the per-op fallback can't handle the encoding produced by the fusion (LPBQ MatMul/Gemm with per-channel block-quantized weights, etc.): outright Compile failure. The standalone DQ/Q/MatMul builders reject the per-channel block-quantized weight tensor that only the LPBQ fusion knows how to consume.

This is an entire class of bug, not a single-fusion or single-model issue. Any future multi-NodeUnit fusion is silently exposed to the same hazard until this fix lands.

@minfhong-qti

Collaborator

Do you have an example that fails with the current partitioner? Since this change significantly complicates the partitioning logic, I would prefer to look for alternative solutions before modifying it. Thanks!

qti-ashwshan self-assigned this Jun 1, 2026
@qti-ashwshan

Collaborator Author

> Do you have an example that fails with the current partitioner? Since this change significantly complicates the partitioning logic, I would prefer to look for alternative solutions before modifying it. Thanks!

This should help: microsoft/onnxruntime#26325
