
Use CapabilityBasedPartitioner in AotiPartitioner (#20384)

Merged
shoumikhin merged 1 commit into pytorch:main from shoumikhin:export-D109040727
Jun 18, 2026

Conversation

@shoumikhin

@shoumikhin shoumikhin commented Jun 18, 2026

Contributor

Summary:

AotiPartitioner (the base for the CUDA and Metal backends) groups the ops it
delegates into one partition, by hand. Every other ExecuTorch backend (XNNPACK,
Vulkan, CoreML) uses the shared CapabilityBasedPartitioner helper instead. This
switches AotiPartitioner to that helper too.

Why:

  1. Consistency -- same partitioning path as the other backends, and a real
    OperatorSupport hook instead of a hand-rolled tagging loop.
  2. The hand-rolled grouping can break. A delegate has to be a convex region
    of the graph: no path from one delegated op to another may pass through a
    non-delegated node. If some other node sits between the ops being
    delegated, putting them all in one partition is invalid and lowering
    crashes with "AssertionError: Invalid partition, found dependency cycles".
    CapabilityBasedPartitioner returns several maximal convex partitions instead,
    each of which fuses cleanly.
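
To make the failure mode in point 2 concrete, here is a minimal pure-Python sketch of the convexity condition on a toy three-node graph. The helper names (`reachable`, `is_convex`) and the op names are illustrative assumptions, not code from this PR or from PyTorch.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(edges, partition):
    """True if no outside node lies on a path between two partition nodes."""
    partition = set(partition)
    for node in partition:
        for outside in reachable(edges, node) - partition:
            # `outside` consumes a partition output. If it also feeds a
            # partition node, fusing the partition creates a dependency cycle.
            if reachable(edges, outside) & partition:
                return False
    return True


# Toy graph (hypothetical): add -> sin -> mul, where sin is not delegable.
EDGES = [("add", "sin"), ("sin", "mul")]

single_tag_valid = is_convex(EDGES, {"add", "mul"})
islands_valid = [is_convex(EDGES, {"add"}), is_convex(EDGES, {"mul"})]
all_delegated_valid = is_convex(EDGES, {"add", "sin", "mul"})
```

Tagging `add` and `mul` together fails the check because `sin` sits between them, while each island on its own passes, as does the whole graph when every op is delegable.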

No change for the common case: if every op can be delegated, you still get
exactly one partition (no extra delegate boundaries). When a non-delegated node
splits the delegated ops, this emits one partition (and one delegate boundary)
per island, which is the cost of producing a valid program. Control-flow ops
(cond/map/while_loop/scan) keep their branch get_attr operands in the same
partition, and constant/buffer tagging is unchanged.
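
The one-partition versus one-partition-per-island behavior can be reproduced with the shared helper on a small traced module. This is a sketch under assumptions, not the code in `aoti_partitioner.py`: `ToySupport`, `Split`, `Fused`, and `partition_names` are made up for illustration, `torch.sin` stands in for a non-delegated op, and `allows_single_node_partition=True` is chosen here only so that the one-node islands are kept.

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch.fx.passes.operator_support import OperatorSupportBase


class ToySupport(OperatorSupportBase):
    """Hypothetical support check: delegate every call_function except torch.sin."""

    def is_node_supported(self, submodules, node: torch.fx.Node) -> bool:
        return node.op == "call_function" and node.target is not torch.sin


class Split(torch.nn.Module):
    """add -> sin -> mul: the unsupported sin splits the delegable ops."""

    def forward(self, x):
        a = torch.add(x, 1)
        b = torch.sin(a)
        return torch.mul(b, 2)


class Fused(torch.nn.Module):
    """add -> mul: every op is delegable."""

    def forward(self, x):
        return torch.mul(torch.add(x, 1), 2)


def partition_names(module):
    """Return the proposed partitions as sorted lists of node names."""
    gm = symbolic_trace(module)
    partitioner = CapabilityBasedPartitioner(
        gm, ToySupport(), allows_single_node_partition=True
    )
    return sorted(
        sorted(node.name for node in partition.nodes)
        for partition in partitioner.propose_partitions()
    )


split_partitions = partition_names(Split())
fused_partitions = partition_names(Fused())
```

With these toy modules, `Split` should yield two single-node partitions (one per island) and `Fused` a single partition containing both ops.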

Reviewed By: Gasoonjia

Differential Revision: D109040727

Copilot AI review requested due to automatic review settings June 18, 2026 17:29
@pytorch-bot

pytorch-bot Bot commented Jun 18, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20384

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 389 Pending

As of commit a67fd35 with merge base c9ef423 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 18, 2026
@meta-codesync

meta-codesync Bot commented Jun 18, 2026

Contributor

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109040727.

@github-actions


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Contributor


Pull request overview

Switches the AOTInductor-based partitioning used by the CUDA/Metal backends (via AotiPartitioner) from a single hand-rolled “tag everything” partition to PyTorch’s shared CapabilityBasedPartitioner, so delegated regions are emitted as valid convex partitions when prior lowered regions or unsupported nodes split the graph.

Changes:

  • Refactors AotiPartitioner.partition() to use CapabilityBasedPartitioner over non-lowered nodes, producing one or more convex partitions instead of a single global tag.
  • Ensures control-flow branch get_attr nodes (for cond/map/while_loop/scan) inherit the control-flow op’s partition tag so they lower into the same submodule.
  • Extends CUDA partitioner tests to validate multi-partition behavior, control-flow tagging, and shared-constant handling across partitions.
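
The control-flow tag propagation described in the second bullet can be sketched without any PyTorch dependency. This is an illustrative assumption about the shape of the logic, not the PR's implementation: `ToyNode`, `CONTROL_FLOW_OPS`, and `propagate_control_flow_tags` are hypothetical stand-ins, and only the `delegation_tag` meta key is taken from the diff.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the control-flow higher-order ops named above.
CONTROL_FLOW_OPS = {"cond", "map", "while_loop", "scan"}


@dataclass
class ToyNode:
    """Minimal stand-in for a graph node (not torch.fx.Node)."""

    name: str
    op: str
    target: str = ""
    args: tuple = ()
    meta: dict = field(default_factory=dict)


def propagate_control_flow_tags(nodes):
    """Copy each tagged control-flow op's tag onto its get_attr operands."""
    for node in nodes:
        tag = node.meta.get("delegation_tag")
        is_control_flow = (
            node.op == "call_function" and node.target in CONTROL_FLOW_OPS
        )
        if not is_control_flow or tag is None:
            continue
        for arg in node.args:
            # Branch submodules appear as get_attr operands; they must carry
            # the same tag so they lower into the same submodule.
            if isinstance(arg, ToyNode) and arg.op == "get_attr":
                arg.meta["delegation_tag"] = tag
    return nodes


pred = ToyNode("pred", "placeholder")
true_branch = ToyNode("true_graph_0", "get_attr")
false_branch = ToyNode("false_graph_0", "get_attr")
cond = ToyNode(
    "cond",
    "call_function",
    target="cond",
    args=(pred, true_branch, false_branch),
    meta={"delegation_tag": "tag0"},
)
propagate_control_flow_tags([pred, true_branch, false_branch, cond])
```

After the call, both branch nodes carry `tag0`, while the `pred` placeholder is left untagged because it is not a `get_attr` operand.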

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| backends/aoti/aoti_partitioner.py | Replaces single-tag partitioning with CapabilityBasedPartitioner-driven convex partitions; preserves constant/mutated-buffer tagging and adds control-flow get_attr tag propagation. |
| backends/cuda/tests/test_cuda_partitioner.py | Updates assumptions about partition tags and adds targeted tests for split-graph multi-partitions, control-flow get_attr tagging, and shared constants across partitions. |


@meta-codesync meta-codesync Bot changed the title from "Use CapabilityBasedPartitioner in AotiPartitioner" to "Use CapabilityBasedPartitioner in AotiPartitioner (#20384)" Jun 18, 2026
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 18, 2026
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 18, 2026
Copilot AI review requested due to automatic review settings June 18, 2026 19:39

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines 103 to +107
tag_constant_data(exported_program)
tag_mutated_buffer(exported_program)

# A constant that still has users feeds only a prior delegate; tagging it
# would fail backend lowering's same-tag check (its user keeps the prior
# tag). tag_constant_data already claimed the ones this partition uses, so
# tag only the genuinely unused constants here.
for node in exported_program.graph.nodes:
    if (
        node.op == "placeholder"
        and not node.users
        and "delegation_tag" not in node.meta
        and (
            is_param(exported_program, node)
            or is_buffer(exported_program, node)
            or is_lifted_tensor_constant(exported_program, node)
        )
    ):
        node.meta["delegation_tag"] = tag
# tag_constant_data only tags constants that have users; tag the
# genuinely unused ones too so none are left dangling.
@shoumikhin shoumikhin merged commit 5241b4e into pytorch:main Jun 18, 2026
546 of 572 checks passed

Labels

ciflow/cuda, ciflow/metal, ciflow/mlx, ciflow/trunk, CLA Signed, meta-exported


3 participants