Minor speedup for model lowering: Skip redundant run_decompositions when no ops match decomp table (#18496) #18496

Open
apullin wants to merge 3 commits into pytorch:main from apullin:export-D96489903
Conversation


@apullin apullin commented Mar 25, 2026

Contributor

Summary:

Adds an early-exit check to _gen_edge_manager_for_partitioners: before
calling program.run_decompositions(table), scan the graph for ops that
appear in the decomposition table. If none are found, skip the call
entirely.

Each run_decompositions call performs a full re-export of the program
via make_fx(), re-tracing every node through FakeTensor dispatch.
On the EDGE_DO_NOT_DECOMP path this function is called up to 3 times;
the early-exit eliminates at least one redundant call where the previous
pass already decomposed all matching ops.

The check recursively walks control flow submodules (cond/map/scan) to
avoid incorrectly skipping when decomposable ops are nested.
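
The check described above can be sketched roughly as follows. This is a minimal, torch-free illustration under stated assumptions, not the code in this PR: `Node`, `Graph`, and `Module` are hypothetical stand-ins that mimic only the `op`, `target`, `graph.nodes`, and `children()` attributes of `torch.fx`, and the names `has_decomposable_ops` and `run_decompositions_if_needed` are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for torch.fx.Node (only the fields the check reads)."""
    op: str
    target: Any


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Module:
    """Hypothetical stand-in for a GraphModule that may own nested submodules."""
    graph: Graph
    submodules: List["Module"] = field(default_factory=list)

    def children(self):
        return iter(self.submodules)


def has_decomposable_ops(module: Any, decomp_table: Dict[Any, Callable]) -> bool:
    """Return True if any call_function node, at any nesting depth, is in the table."""
    graph = getattr(module, "graph", None)
    if graph is not None:
        for node in graph.nodes:
            if node.op == "call_function" and node.target in decomp_table:
                return True
    # Control-flow bodies (cond/map/scan) live in submodules, so recurse into them.
    return any(has_decomposable_ops(child, decomp_table) for child in module.children())


def run_decompositions_if_needed(module, decomp_table, run_decompositions):
    """Call the expensive re-export only when it could change the graph."""
    if not has_decomposable_ops(module, decomp_table):
        return module  # nothing to decompose: skip the re-trace
    return run_decompositions(module, decomp_table)


# Illustrative op names only; real tables are keyed by operator overload objects.
table = {"aten.layer_norm": lambda *args: None}
flat = Module(Graph([Node("call_function", "aten.convolution")]))
nested = Module(
    Graph([Node("call_function", "cond")]),
    submodules=[Module(Graph([Node("call_function", "aten.layer_norm")]))],
)

calls = []
run_decompositions_if_needed(flat, table, lambda m, t: calls.append(m) or m)
run_decompositions_if_needed(nested, table, lambda m, t: calls.append(m) or m)
print(len(calls))  # only the nested module triggers the expensive call
```

A top-level-only scan would report `nested` as clean and skip a call that is actually needed, which is the failure mode the recursive walk guards against.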

Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes.

lower() before: 82 s
lower() after: 71 s
Delta: -11 s (-13 %)
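
For reference, the delta line follows directly from the two timings. The helper below is a small illustrative sketch (the name `lowering_delta` is hypothetical) that reproduces the arithmetic:

```python
def lowering_delta(before_s: float, after_s: float) -> tuple:
    """Return (absolute change in seconds, relative change in percent)."""
    delta_s = after_s - before_s
    return delta_s, 100.0 * delta_s / before_s


delta_s, delta_pct = lowering_delta(82.0, 71.0)
print(f"Delta: {delta_s:+.0f} s ({delta_pct:+.0f} %)")
```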

Differential Revision: D96489903


pytorch-bot Bot commented Mar 25, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18496

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 6 Unrelated Failures, 5 Unclassified Failures

As of commit 33061a2 with merge base 55c54c7 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Mar 25, 2026

meta-codesync Bot commented Mar 25, 2026

Contributor

@apullin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96489903.

@github-actions


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot changed the title from "Skip redundant run_decompositions when no ops match decomp table" to "Skip redundant run_decompositions when no ops match decomp table (#18496)" Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
@meta-codesync meta-codesync Bot changed the title from "Skip redundant run_decompositions when no ops match decomp table (#18496)" to "Skip redundant run_decompositions when no ops match decomp table" Mar 26, 2026
@meta-codesync meta-codesync Bot changed the title from "Skip redundant run_decompositions when no ops match decomp table" to "Skip redundant run_decompositions when no ops match decomp table (#18496)" Mar 30, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
@apullin apullin force-pushed the export-D96489903 branch 2 times, most recently from 559036a to 77d036d on March 30, 2026 21:27
@apullin apullin requested a review from digantdesai as a code owner March 30, 2026 21:27
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
@apullin apullin changed the title from "Skip redundant run_decompositions when no ops match decomp table (#18496)" to "Minor speedup for model lowering: Skip redundant run_decompositions when no ops match decomp table (#18496)" Apr 2, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 13, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 17, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 20, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
@apullin apullin force-pushed the export-D96489903 branch from e10f77c to c4a7945 on May 12, 2026 20:48
@apullin apullin force-pushed the export-D96489903 branch from 17be2e1 to c055892 on June 2, 2026 18:02
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 2, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 2, 2026
@apullin apullin force-pushed the export-D96489903 branch from c055892 to bff4940 on June 2, 2026 18:07
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 4, 2026
@apullin apullin force-pushed the export-D96489903 branch from bff4940 to 406df1f on June 4, 2026 23:54
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 4, 2026
@apullin apullin force-pushed the export-D96489903 branch from 406df1f to 391f321 on June 4, 2026 23:58
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 16, 2026
@apullin apullin force-pushed the export-D96489903 branch from 391f321 to 3718908 on June 16, 2026 17:08
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 16, 2026
@apullin apullin force-pushed the export-D96489903 branch from 3718908 to 8af8c68 on June 16, 2026 17:15
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 17, 2026
@apullin apullin force-pushed the export-D96489903 branch from 8af8c68 to 25e74e8 on June 17, 2026 00:00
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 17, 2026
@apullin apullin force-pushed the export-D96489903 branch from 25e74e8 to 3a31c14 on June 17, 2026 00:05
Andrew Pullin added 2 commits June 18, 2026 21:37
Summary:
Adds stock model (non-sleep) profiling tests to the modai lowering
profiling suite. These serve as a baseline/validation for the ExportPass
speedup work (D97528110) without requiring sleep/FBLearner dependencies.

## New profiling functions (sleepmodels_lowering_profile.py)

- _profile_arm_model_lowering(): generic helper using the same modai
  pipeline (Input → recipe → PTQ → Manager → export → lower) so timings
  are directly comparable to the sleep model profiling
- profile_resnet8_lowering(): ResNet8 (MLPerf Tiny CIFAR-10), ~77K params,
  32x32 input — small residual CNN with skip connections
- profile_mobilenet_v1_025_lowering(): MobileNetV1-0.25 (MLPerf Tiny VWW),
  ~217K params, 96x96 input — depthwise-separable CNN

## New test methods

- test_profile_resnet8_lowering()
- test_profile_mobilenet_v1_025_lowering()

Both confirmed passing:
  ResNet8: https://www.internalfb.com/intern/testinfra/testrun/20266198338067913
  MobileNetV1-0.25: https://www.internalfb.com/intern/testinfra/testrun/32088147347033640

## Buck changes

- fbcode/executorch/examples/models/TARGETS + BUCK: add mlperf_tiny target
  (wraps xplat/executorch/examples/models/mlperf_tiny/*.py)
- fbcode/healthtech/common/tests/BUCK: add //executorch/examples/models:mlperf_tiny dep

Differential Revision: D101254299
…ions, (uncommitted/untracked changes) (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.
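
As an illustration of the hook described in this section, here is a minimal sketch. The class names carry a `Sketch` suffix because they are hypothetical stand-ins, not the real `ExportPass`/`ArmPass`; nodes are modelled as plain `(op, target)` tuples, and treating an empty `targeted_ops` as "always run" is an assumption of the example.

```python
from typing import Any, FrozenSet, Iterable, Tuple

Node = Tuple[str, Any]  # hypothetical (op, target) pair standing in for an fx node


class ExportPassSketch:
    """Hypothetical base pass: runs unless a subclass says otherwise."""

    def should_run(self, nodes: Iterable[Node]) -> bool:
        return True

    def call(self, nodes):
        raise NotImplementedError

    def __call__(self, nodes):
        nodes = list(nodes)
        if not self.should_run(nodes):
            return nodes  # pass skipped, graph returned untouched
        return self.call(nodes)


class ArmPassSketch(ExportPassSketch):
    """Default should_run(): skip when none of targeted_ops appear in the graph."""

    targeted_ops: FrozenSet[Any] = frozenset()  # assumption: empty means "always run"

    def should_run(self, nodes):
        if not self.targeted_ops:
            return True
        return any(op == "call_function" and target in self.targeted_ops
                   for op, target in nodes)


class DecomposeLayerNormSketch(ArmPassSketch):
    targeted_ops = frozenset({"aten.layer_norm"})

    def call(self, nodes):
        # Toy rewrite: tag targeted ops so the effect of running is visible.
        return [(op, "decomposed." + t) if t in self.targeted_ops else (op, t)
                for op, t in nodes]


conv_only = [("call_function", "aten.convolution")]
with_ln = conv_only + [("call_function", "aten.layer_norm")]
layer_norm_pass = DecomposeLayerNormSketch()
print(layer_norm_pass(conv_only) == conv_only)  # pass skipped
print(layer_norm_pass(with_ln)[-1])             # pass ran and rewrote the node
```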

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.
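
The per-node decision, including the safety guard, might look like the sketch below. It models only the choice between the two paths; `Node` and `choose_path` are hypothetical names, and the real fast path would copy the node via `graph.node_copy()` rather than return a string.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass
class Node:
    """Hypothetical stand-in for an fx node with a metadata dict."""
    target: Any
    meta: Dict[str, Any] = field(default_factory=dict)


def choose_path(node: Node, targeted_ops: FrozenSet[Any]) -> str:
    """Decide how a node is carried into the new graph."""
    if node.target in targeted_ops:
        return "dispatch"   # hot node: the pass may rewrite it
    if node.meta.get("val") is None:
        return "dispatch"   # safety guard: no cached value to copy, so recompute it
    return "fast_copy"      # cold node with metadata: copy without re-dispatch


targeted = frozenset({"aten.layer_norm"})
nodes = [
    Node("aten.layer_norm", {"val": "fake"}),   # targeted op
    Node("aten.convolution", {"val": "fake"}),  # cold, has val metadata
    Node("aten.add"),                           # cold, but inserted without val
]
paths = [choose_path(n, targeted) for n in nodes]
print(paths)
```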

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.
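
The temporary-extension pattern can be sketched as below. This is an assumption-heavy illustration: `CACHEABLE_NAMESPACES` is a hypothetical module-level set standing in for whatever the FakeTensor cache consults internally, and `extend_cacheable_namespaces` is not the real `_extend_faketensor_cache_builtins()`.

```python
from contextlib import contextmanager
from typing import Iterable, Set

# Hypothetical stand-in for the set of op namespaces a dispatch cache accepts.
CACHEABLE_NAMESPACES: Set[str] = {"aten", "prims"}


@contextmanager
def extend_cacheable_namespaces(extra: Iterable[str]):
    """Temporarily widen the cacheable set, restoring it on exit (even on error)."""
    added = set(extra) - CACHEABLE_NAMESPACES  # only undo what this call added
    CACHEABLE_NAMESPACES.update(added)
    try:
        yield CACHEABLE_NAMESPACES
    finally:
        CACHEABLE_NAMESPACES.difference_update(added)


with extend_cacheable_namespaces(
    ["quantized_decomposed", "tosa", "dim_order_ops", "cortex_m"]
):
    inside = "tosa" in CACHEABLE_NAMESPACES
after = "tosa" in CACHEABLE_NAMESPACES
print(inside, after)
```

Tracking only the names this call added keeps nested or repeated uses from removing entries that were already present beforehand.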

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.
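
A minimal sketch of this kind of auto-discovery is shown below. The attribute names come from the description above; everything else (the `Sketch` class names, merging all found attributes into one set, looking only at attributes defined directly on the subclass) is an assumption made for illustration.

```python
from typing import Any, FrozenSet

# Attribute names listed in the description, in the order they are probed here.
_LEGACY_ATTRS = ("_TARGET_OPS", "_supported_ops", "_EDGE_OPS", "_ATEN_OPS")


class ArmPassSketch:
    targeted_ops: FrozenSet[Any] = frozenset()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if "targeted_ops" in cls.__dict__:
            return  # an explicit annotation wins
        found = set()
        for name in _LEGACY_ATTRS:
            # Assumption: only attributes defined on this class are considered.
            found.update(cls.__dict__.get(name, ()))
        if found:
            cls.targeted_ops = frozenset(found)


class ConvertClampSketch(ArmPassSketch):
    _TARGET_OPS = {"aten.clamp"}


class DecomposeDivSketch(ArmPassSketch):
    _EDGE_OPS = ("edge.div",)
    _ATEN_OPS = ("aten.div",)


class UnannotatedSketch(ArmPassSketch):
    pass


print(sorted(ConvertClampSketch.targeted_ops))
print(sorted(DecomposeDivSketch.targeted_ops))
print(UnannotatedSketch.targeted_ops == frozenset())
```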

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)

Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 19, 2026
@apullin apullin force-pushed the export-D96489903 branch from 3a31c14 to 6817eda on June 19, 2026 04:42
…hen no ops match decomp table (pytorch#18496)

@apullin apullin force-pushed the export-D96489903 branch from 6817eda to 33061a2 on June 19, 2026 04:48
Labels

ciflow/trunk, CLA Signed, fb-exported, meta-exported, module: arm
