Use CapabilityBasedPartitioner in AotiPartitioner (#20384)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20384
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 389 Pending
As of commit a67fd35 with merge base c9ef423.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109040727. |
This PR needs a
Pull request overview
Switches the AOTInductor-based partitioning used by the CUDA/Metal backends (via AotiPartitioner) from a single hand-rolled “tag everything” partition to PyTorch’s shared CapabilityBasedPartitioner, so delegated regions are emitted as valid convex partitions when prior lowered regions or unsupported nodes split the graph.
Changes:
- Refactors `AotiPartitioner.partition()` to use `CapabilityBasedPartitioner` over non-lowered nodes, producing one or more convex partitions instead of a single global tag.
- Ensures control-flow branch `get_attr` nodes (for `cond`/`map`/`while_loop`/`scan`) inherit the control-flow op's partition tag so they lower into the same submodule.
- Extends CUDA partitioner tests to validate multi-partition behavior, control-flow tagging, and shared-constant handling across partitions.
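The control-flow tagging rule in the list above can be illustrated with a minimal, torch-free sketch. Everything here is an assumption made for illustration: the `Node` dataclass is a hypothetical stand-in for `torch.fx.Node`, targets are plain strings rather than real operator objects, and `propagate_branch_tags` is an invented helper name, not the function used in `aoti_partitioner.py`.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins: real graphs use torch.fx.Node and operator objects.
CONTROL_FLOW_OPS = {"cond", "map", "while_loop", "scan"}


@dataclass
class Node:
    name: str
    op: str
    target: str = ""
    args: tuple = ()
    meta: dict = field(default_factory=dict)


def propagate_branch_tags(nodes):
    """Copy a tagged control-flow op's delegation tag onto its get_attr operands."""
    for node in nodes:
        tag = node.meta.get("delegation_tag")
        is_control_flow = node.op == "call_function" and node.target in CONTROL_FLOW_OPS
        if not is_control_flow or tag is None:
            continue
        for arg in node.args:
            # Branch submodules reach the op as get_attr nodes.
            if isinstance(arg, Node) and arg.op == "get_attr":
                arg.meta["delegation_tag"] = tag


pred = Node("pred", "placeholder")
true_fn = Node("true_graph_0", "get_attr")
false_fn = Node("false_graph_0", "get_attr")
cond = Node(
    "cond",
    "call_function",
    target="cond",
    args=(pred, true_fn, false_fn),
    meta={"delegation_tag": "tag0"},
)

propagate_branch_tags([pred, true_fn, false_fn, cond])
print(true_fn.meta, false_fn.meta, pred.meta)
```

In this sketch only the `get_attr` operands pick up the tag; the `pred` placeholder is left untagged, on the assumption that inputs are handled by the regular partitioning path.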
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| backends/aoti/aoti_partitioner.py | Replaces single-tag partitioning with CapabilityBasedPartitioner-driven convex partitions; preserves constant/mutated-buffer tagging and adds control-flow get_attr tag propagation. |
| backends/cuda/tests/test_cuda_partitioner.py | Updates assumptions about partition tags and adds targeted tests for split-graph multi-partitions, control-flow get_attr tagging, and shared constants across partitions. |
Summary:
AotiPartitioner (the base for the CUDA and Metal backends) groups the ops it delegates into one partition, by hand. Every other ExecuTorch backend (XNNPACK, Vulkan, CoreML) uses the shared CapabilityBasedPartitioner helper instead. This switches AotiPartitioner to that helper too.

Why:
1. Consistency -- same partitioning path as the other backends, and a real OperatorSupport hook instead of a hand-rolled tagging loop.
2. It can break. A delegate has to be one connected chunk of the graph. If the ops being delegated aren't all next to each other (some other node sits in between), putting them all in one partition is invalid and lowering crashes with "AssertionError: Invalid partition, found dependency cycles". CapabilityBasedPartitioner returns several maximal convex partitions instead, each of which fuses cleanly.

No change for the common case: if every op can be delegated, you still get exactly one partition (no extra delegate boundaries). When a non-delegated node splits the delegated ops, this emits one partition (and one delegate boundary) per island, which is the cost of producing a valid program. Control-flow ops (cond/map/while_loop/scan) keep their branch get_attr operands in the same partition, and constant/buffer tagging is unchanged.

Differential Revision: D109040727
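To make the "dependency cycles" failure concrete, here is a minimal sketch of a convexity check on a toy graph. It is not the check that `CapabilityBasedPartitioner` or the ExecuTorch lowering code performs; `is_convex`, the edge-list representation, and the node names are all assumptions chosen for illustration. A group is treated as convex when no path leaves it and later re-enters it.

```python
from collections import defaultdict


def is_convex(edges, group):
    """Return True if no path leaves `group` and later re-enters it."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    group = set(group)
    # Start from every outside node that is fed directly by the group.
    stack = [dst for src in group for dst in succ[src] if dst not in group]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for nxt in succ[node]:
            if nxt in group:
                # The path came back in: fusing the group would create a cycle.
                return False
            stack.append(nxt)
    return True


# Toy graph: x -> add_1 -> sin -> add_2, where only the adds are delegable.
edges = [("x", "add_1"), ("add_1", "sin"), ("sin", "add_2")]

print(is_convex(edges, {"add_1", "add_2"}))  # False: sin sits between the adds
print(is_convex(edges, {"add_1"}))           # True
print(is_convex(edges, {"add_2"}))           # True
```

Tagging both adds with one tag corresponds to the non-convex group; splitting them into two single-node islands gives two groups that each pass the check, which matches the "one partition per island" behavior described above.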
2e0db98 to
4ccff6d
Compare
4ccff6d to
e3b2b11
Compare
    tag_constant_data(exported_program)
    tag_mutated_buffer(exported_program)

    # A constant that still has users feeds only a prior delegate; tagging it
    # would fail backend lowering's same-tag check (its user keeps the prior
    # tag). tag_constant_data already claimed the ones this partition uses, so
    # tag only the genuinely unused constants here.
    for node in exported_program.graph.nodes:
        if (
            node.op == "placeholder"
            and not node.users
            and "delegation_tag" not in node.meta
            and (
                is_param(exported_program, node)
                or is_buffer(exported_program, node)
                or is_lifted_tensor_constant(exported_program, node)
            )
        ):
            node.meta["delegation_tag"] = tag
    # tag_constant_data only tags constants that have users; tag the
    # genuinely unused ones too so none are left dangling.
e3b2b11 to
a67fd35
Compare
Summary:
AotiPartitioner (the base for the CUDA and Metal backends) groups the ops it
delegates into one partition, by hand. Every other ExecuTorch backend (XNNPACK,
Vulkan, CoreML) uses the shared CapabilityBasedPartitioner helper instead. This
switches AotiPartitioner to that helper too.
Why:
1. Consistency -- same partitioning path as the other backends, and a real
   OperatorSupport hook instead of a hand-rolled tagging loop.
2. It can break. A delegate has to be one connected chunk of the graph. If the
   ops being delegated aren't all next to each other (some other node sits in
   between), putting them all in one partition is invalid and lowering crashes
   with "AssertionError: Invalid partition, found dependency cycles".
   CapabilityBasedPartitioner returns several maximal convex partitions instead,
   each of which fuses cleanly.
No change for the common case: if every op can be delegated, you still get
exactly one partition (no extra delegate boundaries). When a non-delegated node
splits the delegated ops, this emits one partition (and one delegate boundary)
per island, which is the cost of producing a valid program. Control-flow ops
(cond/map/while_loop/scan) keep their branch get_attr operands in the same
partition, and constant/buffer tagging is unchanged.
Reviewed By: Gasoonjia
Differential Revision: D109040727