[QNN EP] Make graph partitioner aware of multi-NodeUnit fusions #464
Open
qti-ashwshan wants to merge 6 commits into
Collaborator
Do you have an example that fails with the current partitioner? Since this change significantly complicates the partition logic, I would prefer to look for alternative solutions before modifying it. Thanks!
…v/ashwshan/node-group-aware-partitioner
Collaborator
Author
This will help - microsoft/onnxruntime#26325
Description
Make the QNN EP graph partitioner aware of multi-NodeUnit fusions (IQnnNodeGroup).
The partitioner today is OrtNodeUnit-aware but blind to IQnnNodeGroup. Its BFS schedules per-NodeUnit, so members of a multi-NodeUnit fusion (e.g. LPBQ MatMul/Gemm, LayerNorm, Gelu, ChannelShuffle, SpaceToDepth, ReshapeGemm) can be scheduled into different partitions when an unsupported op falls between them. At Compile time, QnnModel::ComposeGraph re-runs GetQnnNodeGroups on each partition subgraph; with members missing from the subgraph, TryFusion returns nullptr and the fusion silently drops. Depending on the fusion this either degrades to a slower per-op path or aborts compile entirely (e.g. when the per-op path can't handle the encoding the fusion was producing).
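The splitting described above can be reproduced with a minimal sketch. This is not the QNN EP code: `Edge` and `PartitionPerNode` are hypothetical names, nodes are plain integer ids, and the scheduler is a simplified two-queue topological walk that closes a partition whenever no supported node is ready.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical edge type: producer node id -> consumer node id.
struct Edge { int src; int dst; };

// Simplified per-node scheduler (an assumption, not the real partitioner).
// Nodes are admitted one at a time; the current partition is closed as soon
// as no supported node is ready.
std::vector<std::vector<int>> PartitionPerNode(const std::vector<bool>& supported,
                                               const std::vector<Edge>& edges) {
  const std::size_t n = supported.size();
  std::vector<std::size_t> in_degree(n, 0);
  std::vector<std::vector<int>> succ(n);
  for (const Edge& e : edges) {
    ++in_degree[e.dst];
    succ[e.src].push_back(e.dst);
  }

  std::queue<int> ready_supported, ready_unsupported;
  auto enqueue = [&](int v) {
    (supported[v] ? ready_supported : ready_unsupported).push(v);
  };
  for (std::size_t v = 0; v < n; ++v)
    if (in_degree[v] == 0) enqueue(static_cast<int>(v));

  // Mark v as scheduled and enqueue consumers that became ready.
  auto release = [&](int v) {
    for (int s : succ[v])
      if (--in_degree[s] == 0) enqueue(s);
  };

  std::vector<std::vector<int>> partitions;
  while (!ready_supported.empty() || !ready_unsupported.empty()) {
    std::vector<int> current;
    while (!ready_supported.empty()) {
      int v = ready_supported.front();
      ready_supported.pop();
      current.push_back(v);
      release(v);
    }
    if (!current.empty()) partitions.push_back(std::move(current));
    while (!ready_unsupported.empty()) {
      int v = ready_unsupported.front();
      ready_unsupported.pop();
      release(v);
    }
  }
  return partitions;
}
```

With nodes 0 and 2 forming one fusion and node 1 an unsupported producer that also feeds node 2, this scheduler places node 0 and node 2 in different partitions, so a later fusion pass over either partition sees only one member.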
This PR rewrites the partitioner to schedule on IQnnNodeGroup as the atomic unit. Every member of a multi-NodeUnit fusion is guaranteed to land in the same partition, or none.
Approach
Add QnnNodeGroupInfo describing each IQnnNodeGroup (members, target, supported flag, external in-degree). Every OrtNodeUnit ends up in exactly one group — real fusion or 1-member QnnNodeUnitWrapper — so BFS becomes branch-free.
Compute external_in_degree per group via a per-edge producer walk that excludes intra-group edges and initializers.
Replace the NodeUnit-level BFS in CreateSupportedPartitionNodeGroups with a group-level BFS: when admitted, all member OrtNodes are emitted atomically into the current partition.
Add a cycle-detection + demotion loop in GetSupportedNodes for the rare case where a multi-NodeUnit group cannot be scheduled atomically without forming a cyclic dependency through an unsupported op. Such groups get demoted to 1-member groups and re-checked via the standalone op builder, then either run independently on QNN or fall back to CPU.
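The first three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `GroupInfo` is a simplified stand-in for `QnnNodeGroupInfo`, `Edge`, `ComputeExternalInDegree` and `PartitionByGroup` are hypothetical names, nodes are integer ids, and initializers are assumed not to appear as edges at all.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for QnnNodeGroupInfo.
struct GroupInfo {
  std::vector<int> members;            // node ids in this group
  bool supported = false;              // whether the whole group can run on QNN
  std::size_t external_in_degree = 0;  // edges arriving from other groups
};

struct Edge { int src; int dst; };  // producer node -> consumer node

// Count, per group, the edges whose producer lives in a different group.
// Intra-group edges are skipped.
void ComputeExternalInDegree(std::vector<GroupInfo>& groups,
                             const std::vector<int>& node_to_group,
                             const std::vector<Edge>& edges) {
  for (GroupInfo& g : groups) g.external_in_degree = 0;
  for (const Edge& e : edges) {
    int gs = node_to_group[e.src], gd = node_to_group[e.dst];
    if (gs != gd) ++groups[gd].external_in_degree;
  }
}

// Group-level topological walk: a group is admitted only when all of its
// external producers are scheduled, and its members are emitted together.
// Groups that sit on a group-level cycle never become ready and are not emitted.
std::vector<std::vector<int>> PartitionByGroup(std::vector<GroupInfo> groups,
                                               const std::vector<int>& node_to_group,
                                               const std::vector<Edge>& edges) {
  ComputeExternalInDegree(groups, node_to_group, edges);
  std::vector<std::vector<int>> succ(groups.size());
  for (const Edge& e : edges) {
    int gs = node_to_group[e.src], gd = node_to_group[e.dst];
    if (gs != gd) succ[gs].push_back(gd);
  }

  std::queue<int> ready_supported, ready_unsupported;
  auto enqueue = [&](int g) {
    (groups[g].supported ? ready_supported : ready_unsupported).push(g);
  };
  for (int g = 0; g < static_cast<int>(groups.size()); ++g)
    if (groups[g].external_in_degree == 0) enqueue(g);

  auto release = [&](int g) {
    for (int s : succ[g])
      if (--groups[s].external_in_degree == 0) enqueue(s);
  };

  std::vector<std::vector<int>> partitions;
  while (!ready_supported.empty() || !ready_unsupported.empty()) {
    std::vector<int> current;
    while (!ready_supported.empty()) {
      int g = ready_supported.front();
      ready_supported.pop();
      // Emit every member of the group atomically into the current partition.
      current.insert(current.end(), groups[g].members.begin(), groups[g].members.end());
      release(g);
    }
    if (!current.empty()) partitions.push_back(std::move(current));
    while (!ready_unsupported.empty()) {
      int g = ready_unsupported.front();
      ready_unsupported.pop();
      release(g);
    }
  }
  return partitions;
}
```

In this sketch a fusion group `{0, 2}` whose member 2 also consumes an unsupported node 1 waits until node 1 has been scheduled, then lands in a single partition as `{0, 2}`.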
Files
qnn_ep_utils.h — QnnNodeGroupInfo struct, ComputeGroupExternalInDegree, extended CreateSupportedPartitionNodeGroups signature.
qnn_ep_utils.cc — group-level BFS, edge-walker helper.
qnn_execution_provider.h/.cc — extended private GetSupportedNodes to build the group set, cycle detection + demotion loop, updated single call site in GetCapabilityImpl.
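The cycle detection and demotion step could look something like the sketch below. It is an assumption about the general idea, not the code in `GetSupportedNodes`: `Edge`, `ReachesItself` and `DemoteCyclicGroups` are hypothetical names, groups are plain lists of integer node ids, every node is assumed to belong to exactly one group, and the re-check of demoted members against the standalone op builders is left out. Cycles are found with a simple reachability search per group rather than whatever method the PR uses.

```cpp
#include <cstddef>
#include <vector>

struct Edge { int src; int dst; };  // producer node -> consumer node

// True if group `start` can reach itself by passing through other groups.
static bool ReachesItself(int start, const std::vector<std::vector<int>>& succ) {
  std::vector<bool> seen(succ.size(), false);
  std::vector<int> stack(succ[start].begin(), succ[start].end());
  while (!stack.empty()) {
    int g = stack.back();
    stack.pop_back();
    if (g == start) return true;
    if (seen[g]) continue;
    seen[g] = true;
    stack.insert(stack.end(), succ[g].begin(), succ[g].end());
  }
  return false;
}

// Split every multi-member group that lies on a group-level cycle into
// 1-member groups. groups[i] lists the node ids of group i.
std::vector<std::vector<int>> DemoteCyclicGroups(
    const std::vector<std::vector<int>>& groups, std::size_t num_nodes,
    const std::vector<Edge>& edges) {
  std::vector<int> node_to_group(num_nodes, -1);
  for (std::size_t g = 0; g < groups.size(); ++g)
    for (int n : groups[g]) node_to_group[n] = static_cast<int>(g);

  // Group-level successor lists; intra-group edges are ignored.
  std::vector<std::vector<int>> succ(groups.size());
  for (const Edge& e : edges) {
    int gs = node_to_group[e.src], gd = node_to_group[e.dst];
    if (gs != gd) succ[gs].push_back(gd);
  }

  std::vector<std::vector<int>> result;
  for (std::size_t g = 0; g < groups.size(); ++g) {
    if (groups[g].size() > 1 && ReachesItself(static_cast<int>(g), succ)) {
      // Demote: each member becomes its own 1-member group.
      for (int n : groups[g]) result.push_back(std::vector<int>{n});
    } else {
      result.push_back(groups[g]);
    }
  }
  return result;
}
```

For a chain 0 -> 1 -> 2 where nodes 0 and 2 are fused and node 1 is a separate group, the fused group depends on itself through node 1, so the sketch splits it into `{0}` and `{2}`. Because the node-level graph is acyclic, any group-level cycle must pass through a multi-member group, which is why demoting those groups is enough to remove it.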
Compatibility
No public API change. GetSupportedNodes and CreateSupportedPartitionNodeGroups are internal with single in-tree callers.
No IQnnNodeGroup interface change. Authors of new fusions get atomic partitioning automatically — no extra plumbing.
No op builder change. All existing IsOpSupported / AddToModelBuilder paths are untouched.
The GetCapabilityImpl ABI to ORT core is unchanged; only the function body changes.
EPContext path is unaffected (doesn't go through the partitioner).
Motivation and Context
Problem
A model can have a perfectly valid multi-NodeUnit fusion pattern that QNN can compile to a single optimized op, but a single unsupported op anywhere upstream of the fusion's target node is enough for BFS to split the fusion's members across partitions. The fusion then silently fails to fire at Compile time. Symptoms vary:
For fusions with a usable per-op fallback (Gelu, LayerNorm with constant gamma, ChannelShuffle, etc.): performance regression. Output is correct but the fused QNN op never gets built.
For fusions where the per-op fallback can't handle the encoding produced by the fusion (LPBQ MatMul/Gemm with per-channel block-quantized weights, etc.): outright Compile failure. The standalone DQ/Q/MatMul builders reject the per-channel block-quantized weight tensor that only the LPBQ fusion knows how to consume.
This is an entire class of bugs, not a single-fusion or single-model issue. Any future multi-NodeUnit fusion is silently exposed to the same hazard until this fix lands.