Minor speedup for model lowering: Skip redundant run_decompositions when no ops match decomp table (#18496) by apullin · Pull Request #18496 · pytorch/executorch

apullin · 2026-03-25T16:03:40Z

Summary:

Adds an early-exit check to _gen_edge_manager_for_partitioners: before
calling program.run_decompositions(table), scan the graph for ops that
appear in the decomposition table. If none are found, skip the call
entirely.

Each run_decompositions call performs a full re-export of the program
via make_fx(), re-tracing every node through FakeTensor dispatch.
On the EDGE_DO_NOT_DECOMP path this function is called up to 3 times;
the early-exit eliminates at least one redundant call where the previous
pass already decomposed all matching ops.

The check recursively walks control flow submodules (cond/map/scan) to
avoid incorrectly skipping when decomposable ops are nested.
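
The early-exit check described above can be illustrated with a minimal sketch. This is not the code from the PR: `Node`, `GraphModule`, `has_decomposable_ops`, and `maybe_run_decompositions` are hypothetical stand-ins written in plain Python so the example runs without torch, and it assumes that control flow submodules are reachable through `get_attr` nodes on the parent graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Node:
    op: str      # e.g. "call_function", "get_attr", "placeholder", "output"
    target: str  # operator name for call_function, attribute name for get_attr


@dataclass
class GraphModule:
    # Hypothetical stand-in for a traced graph module, not torch.fx.GraphModule.
    nodes: List[Node] = field(default_factory=list)
    submodules: Dict[str, "GraphModule"] = field(default_factory=dict)


def has_decomposable_ops(gm: GraphModule, table: Set[str]) -> bool:
    """Return True if any op in gm, or in a nested submodule, is in the table."""
    for node in gm.nodes:
        if node.op == "call_function" and node.target in table:
            return True
        if node.op == "get_attr":
            # Assumption: cond/map/scan bodies are attached as submodules
            # and referenced by get_attr nodes, so recurse into them.
            sub = gm.submodules.get(node.target)
            if sub is not None and has_decomposable_ops(sub, table):
                return True
    return False


def maybe_run_decompositions(
    gm: GraphModule,
    table: Set[str],
    run_decompositions: Callable[[GraphModule, Set[str]], GraphModule],
) -> GraphModule:
    """Skip the expensive re-trace when no node matches the table."""
    if not has_decomposable_ops(gm, table):
        return gm
    return run_decompositions(gm, table)


# Usage: a decomposable op nested inside a control flow branch must not be missed.
table = {"aten.layer_norm"}
branch = GraphModule(nodes=[Node("call_function", "aten.layer_norm")])
top = GraphModule(
    nodes=[
        Node("call_function", "aten.conv2d"),
        Node("get_attr", "true_graph_0"),
        Node("call_function", "cond"),
    ],
    submodules={"true_graph_0": branch},
)
flat = GraphModule(nodes=[Node("call_function", "aten.conv2d")])

calls = []


def fake_run(gm: GraphModule, table: Set[str]) -> GraphModule:
    calls.append(gm)  # record that the expensive path was taken
    return gm


maybe_run_decompositions(top, table, fake_run)   # nested match: runs
maybe_run_decompositions(flat, table, fake_run)  # no match: skipped
print(len(calls))
```

The recursion matters because a scan of only the top-level nodes would report no match for `top` and skip a decomposition that is still needed.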

Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes.

lower() before: 82 s
lower() after: 71 s
Delta: -11 s (-13 %)
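
As a quick check, the reported delta follows directly from the two timings. The helper name `speedup` is only for illustration.

```python
def speedup(before_s: float, after_s: float):
    """Return (absolute delta in seconds, relative delta in percent)."""
    delta = after_s - before_s
    pct = 100.0 * delta / before_s
    return delta, pct


delta, pct = speedup(82, 71)
print(f"Delta: {delta:+.0f} s ({pct:+.1f} %)")  # Delta: -11 s (-13.4 %)
```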

Differential Revision: D96489903

pytorch-bot · 2026-03-25T16:03:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18496

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 6 Unrelated Failures, 5 Unclassified Failures

As of commit 33061a2 with merge base 55c54c7:

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t ce431c068c650d8b8021175e04839c79617b1513b6450e41420a5632c562ca74 /exec failed with exit code 1
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t b542fb8183da035957495463ed19691f4908e382a4447f0f571e94a34320b091 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 48a71f7bbe51e9401e9db0d9995581c4148107106b2cde160932a966c42a9ec6 /exec failed with exit code 1
trunk / test-arm-backend-ethos-u (test_pytest_ops_ethos_u85) / linux-job (gh)
RuntimeError: Command docker exec -t 7f9f2f78755dc4364c7c9b58f97030c7fc089d26607599cef667aa6e5908291a /exec failed with exit code 1
trunk / unittest-release / linux / linux-job (gh)
RuntimeError: Command docker exec -t 789c7ee217f9a11c86c77a0af12114058411b5aa1b044736e40cc14369d9fcba /exec failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

trunk / test-models-macos-cpu (ic4, xnnpack-quantization-delegation) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: operator torchvision::nms does not exist
trunk / test-models-macos-cpu (llama3_2_vision_encoder, portable) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-cpu (mobilebert, xnnpack-quantization-delegation) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-cpu (mv3, portable) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-cpu (vit, xnnpack-quantization-delegation) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: operator torchvision::nms does not exist

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / test-models-macos-mps / macos-job (gh) (similar failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-03-25T16:03:48Z

@apullin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96489903.

github-actions · 2026-03-25T16:05:43Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Summary: Adds stock model (non-sleep) profiling tests to the modai lowering profiling suite. These serve as a baseline/validation for the ExportPass speedup work (D97528110) without requiring sleep/FBLearner dependencies.

## New profiling functions (sleepmodels_lowering_profile.py)

- _profile_arm_model_lowering(): generic helper using the same modai pipeline (Input → recipe → PTQ → Manager → export → lower) so timings are directly comparable to the sleep model profiling
- profile_resnet8_lowering(): ResNet8 (MLPerf Tiny CIFAR-10), ~77K params, 32x32 input; small residual CNN with skip connections
- profile_mobilenet_v1_025_lowering(): MobileNetV1-0.25 (MLPerf Tiny VWW), ~217K params, 96x96 input; depthwise-separable CNN

## New test methods

- test_profile_resnet8_lowering()
- test_profile_mobilenet_v1_025_lowering()

Both confirmed passing:
ResNet8: https://www.internalfb.com/intern/testinfra/testrun/20266198338067913
MobileNetV1-0.25: https://www.internalfb.com/intern/testinfra/testrun/32088147347033640

## Buck changes

- fbcode/executorch/examples/models/TARGETS + BUCK: add mlperf_tiny target (wraps xplat/executorch/examples/models/mlperf_tiny/*.py)
- fbcode/healthtech/common/tests/BUCK: add //executorch/examples/models:mlperf_tiny dep

Differential Revision: D101254299

…ions, (uncommitted/untracked changes) (pytorch#18497)

Summary: Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass

Subclasses that declare a `targeted_ops` class attribute (a set of op overloads) can be skipped entirely when the graph contains none of their target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes

When a pass declares `targeted_ops`, nodes whose ops are NOT in the set are copied into the new graph via `graph.node_copy()` instead of full FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x). Includes a safety guard: nodes without `val` metadata (e.g. nodes inserted by `call()` overrides before `super().call()`) fall back to full dispatch instead of propagating None.

### 3. FakeTensor cache extension

Context manager `_extend_faketensor_cache_builtins()` temporarily extends the FakeTensor dispatch cache to cover ExecuTorch op namespaces (quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass

Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or `_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated automatically at class definition time; no manual annotation is needed.

### 5. targeted_ops annotations on ~60 ARM passes

Each annotation is a one-liner declaring the ops the pass checks in `call_operator()`. Combined with should_run() and fast-copy, this achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

lower() before: 186 s
lower() after: 100 s
Passes skipped: 53 of 146
Delta: -86 s (-46 %)

Adds should_run() hook to ExportPass that subclasses can override to skip execution when a pass has no work to do. ArmPass implements a default that checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:

- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy instead of full FakeTensor dispatch for cold nodes in passes that declare targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering): lower(): 185s -> 96s = -89s (-48%)

Differential Revision: D97528110
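
The should_run() hook and the __init_subclass__ auto-discovery described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the ExecuTorch implementation: `ExportPassSketch`, `ArmPassSketch`, `select_passes`, and the three example pass classes are hypothetical names, ops are represented as plain strings rather than op overloads, and the rule that an explicit `targeted_ops` takes precedence over legacy attributes is an assumption.

```python
from typing import FrozenSet, Iterable, List


class ExportPassSketch:
    """Hypothetical stand-in for a pass base class (not ExecuTorch's ExportPass)."""

    targeted_ops: FrozenSet[str] = frozenset()

    def should_run(self, graph_ops: Iterable[str]) -> bool:
        # A pass that declares no targets is always run (conservative default).
        if not self.targeted_ops:
            return True
        return any(op in self.targeted_ops for op in graph_ops)


class ArmPassSketch(ExportPassSketch):
    # Legacy attribute names taken from the commit message.
    _LEGACY_ATTRS = ("_TARGET_OPS", "_supported_ops", "_EDGE_OPS", "_ATEN_OPS")

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if "targeted_ops" in cls.__dict__:
            return  # assumption: an explicit annotation wins
        discovered = set()
        for name in cls._LEGACY_ATTRS:
            # Only attributes defined directly on this subclass are collected.
            discovered.update(cls.__dict__.get(name, ()))
        if discovered:
            cls.targeted_ops = frozenset(discovered)


class DecomposeLayerNorm(ArmPassSketch):  # hypothetical pass
    _TARGET_OPS = {"aten.layer_norm"}  # picked up automatically


class FuseConvBn(ArmPassSketch):  # hypothetical pass
    targeted_ops = frozenset({"aten.conv2d", "aten.batch_norm"})


class AlwaysRun(ArmPassSketch):  # hypothetical pass with no declared targets
    pass


def select_passes(passes: List[ExportPassSketch], graph_ops: Iterable[str]):
    """Keep only the passes whose target ops occur in the graph."""
    ops = set(graph_ops)
    return [p for p in passes if p.should_run(ops)]


graph_ops = ["aten.conv2d", "aten.relu"]
selected = select_passes([DecomposeLayerNorm(), FuseConvBn(), AlwaysRun()], graph_ops)
print([type(p).__name__ for p in selected])  # ['FuseConvBn', 'AlwaysRun']
```

Defaulting to "run" when no targets are declared keeps unannotated passes behaving as before, so only passes that opt in can be skipped.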

apullin requested review from JacobSzwejbka and larryliu0820 as code owners March 25, 2026 16:03

meta-cla Bot added the CLA Signed label Mar 25, 2026

meta-codesync Bot added fb-exported meta-exported labels Mar 25, 2026

meta-codesync Bot changed the title ~~Skip redundant run_decompositions when no ops match decomp table~~ Skip redundant run_decompositions when no ops match decomp table (#18496) Mar 25, 2026

apullin force-pushed the export-D96489903 branch from 931a377 to b1a74f8 Compare March 25, 2026 22:29

meta-codesync Bot changed the title ~~Skip redundant run_decompositions when no ops match decomp table (#18496)~~ Skip redundant run_decompositions when no ops match decomp table Mar 26, 2026

apullin force-pushed the export-D96489903 branch from b1a74f8 to 4f1a818 Compare March 26, 2026 15:07

meta-codesync Bot changed the title ~~Skip redundant run_decompositions when no ops match decomp table~~ Skip redundant run_decompositions when no ops match decomp table (#18496) Mar 30, 2026

apullin force-pushed the export-D96489903 branch from 4f1a818 to 93d0a70 Compare March 30, 2026 16:07

apullin force-pushed the export-D96489903 branch 2 times, most recently from 559036a to 77d036d Compare March 30, 2026 21:27

apullin requested a review from digantdesai as a code owner March 30, 2026 21:27

apullin changed the title ~~Skip redundant run_decompositions when no ops match decomp table (#18496)~~ Minor speedup for model lowering: Skip redundant run_decompositions when no ops match decomp table (#18496) Apr 2, 2026

apullin force-pushed the export-D96489903 branch from 77d036d to 00408ce Compare April 13, 2026 19:30

apullin force-pushed the export-D96489903 branch from 00408ce to a9c9746 Compare April 17, 2026 16:03

apullin force-pushed the export-D96489903 branch from a9c9746 to e10f77c Compare April 20, 2026 18:32

apullin force-pushed the export-D96489903 branch from e10f77c to c4a7945 Compare May 12, 2026 20:48

apullin force-pushed the export-D96489903 branch from 17be2e1 to c055892 Compare June 2, 2026 18:02

apullin force-pushed the export-D96489903 branch from c055892 to bff4940 Compare June 2, 2026 18:07

apullin force-pushed the export-D96489903 branch from bff4940 to 406df1f Compare June 4, 2026 23:54

apullin force-pushed the export-D96489903 branch from 406df1f to 391f321 Compare June 4, 2026 23:58

apullin force-pushed the export-D96489903 branch from 391f321 to 3718908 Compare June 16, 2026 17:08

apullin had a problem deploying to cadence June 16, 2026 17:08 — with GitHub Actions Error

apullin force-pushed the export-D96489903 branch from 3718908 to 8af8c68 Compare June 16, 2026 17:15

apullin had a problem deploying to cadence June 16, 2026 17:16 — with GitHub Actions Failure

apullin force-pushed the export-D96489903 branch from 8af8c68 to 25e74e8 Compare June 17, 2026 00:00

apullin had a problem deploying to cadence June 17, 2026 00:00 — with GitHub Actions Error

apullin force-pushed the export-D96489903 branch from 25e74e8 to 3a31c14 Compare June 17, 2026 00:05

apullin had a problem deploying to cadence June 17, 2026 00:07 — with GitHub Actions Failure

Andrew Pullin added 2 commits June 18, 2026 21:37

apullin force-pushed the export-D96489903 branch from 3a31c14 to 6817eda Compare June 19, 2026 04:42

apullin force-pushed the export-D96489903 branch from 6817eda to 33061a2 Compare June 19, 2026 04:48

apullin wants to merge 3 commits into pytorch:main from apullin:export-D96489903
