Skip to content

Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497)#18497

Open
apullin wants to merge 2 commits into
pytorch:mainfrom
apullin:export-D97528110
Open

Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497)#18497
apullin wants to merge 2 commits into
pytorch:mainfrom
apullin:export-D97528110

Conversation

@apullin

@apullin apullin commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Summary:

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

Changes

1. should_run() hook on ExportPass / ArmPass

Subclasses that declare a targeted_ops class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

2. Fast-copy for cold nodes

When a pass declares targeted_ops, nodes whose ops are NOT in the set
are copied into the new graph via graph.node_copy() instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without val metadata (e.g. nodes
inserted by call() overrides before super().call()) fall back to
full dispatch instead of propagating None.

3. FakeTensor cache extension

Context manager _extend_faketensor_cache_builtins() temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

4. init_subclass auto-discovery on ArmPass

Subclasses with existing _TARGET_OPS, _supported_ops, or
_EDGE_OPS/_ATEN_OPS attributes get targeted_ops populated
automatically at class definition time — no manual annotation needed.

5. targeted_ops annotations on ~60 ARM passes

Each annotation is a one-liner declaring the ops the pass checks in
call_operator(). Combined with should_run() and fast-copy, this
achieves the measured speedup below.

Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

lower() before: 186 s
lower() after: 100 s
Passes skipped: 53 of 146
Delta: -86 s (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:

  • _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
    instead of full FakeTensor dispatch for cold nodes in passes that declare
    targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
  • _extend_faketensor_cache_builtins context manager that extends FakeTensor
    dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
  • init_subclass on ArmPass for auto-discovery of targeted_ops from
    existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
  • targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
lower(): 185s -> 96s = -89s (-48%)

Differential Revision: D97528110

@pytorch-bot

pytorch-bot Bot commented Mar 25, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18497

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 5 Unrelated Failures

As of commit 0f7ab19 with merge base 55c54c7 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 25, 2026
@meta-codesync

meta-codesync Bot commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

@apullin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97528110.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations Mar 25, 2026
@apullin apullin changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations Add should_run() + fast-copy infrastructure with targeted_ops annotations Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
…ions, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@meta-codesync meta-codesync Bot changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@meta-codesync meta-codesync Bot changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch 3 times, most recently from 485f99d to 417b280 Compare March 25, 2026 23:40
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch 2 times, most recently from ad7b73c to f401907 Compare March 26, 2026 06:34
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 26, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 26, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from 5c00029 to 7432007 Compare May 12, 2026 20:45
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from 7432007 to 350a0a4 Compare May 12, 2026 20:52
@apullin apullin force-pushed the export-D97528110 branch from 350a0a4 to a7db52b Compare May 22, 2026 16:04
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 22, 2026

Copy link
Copy Markdown

CLA Missing ID

apullin pushed a commit to apullin/executorch that referenced this pull request May 22, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request May 22, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request May 22, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from a7db52b to e440e89 Compare May 22, 2026 16:09
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 2, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 2, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from e440e89 to 81d6764 Compare June 2, 2026 18:03
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 2, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from 81d6764 to efb6b27 Compare June 2, 2026 18:08
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 4, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 4, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from efb6b27 to e867fe5 Compare June 4, 2026 23:54
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 4, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from e867fe5 to 9640089 Compare June 4, 2026 23:59
@apullin apullin force-pushed the export-D97528110 branch from 9640089 to db3ec5a Compare June 16, 2026 17:00
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 16, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request Jun 16, 2026
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
@apullin apullin force-pushed the export-D97528110 branch from db3ec5a to 1619c04 Compare June 16, 2026 17:06
Andrew Pullin and others added 2 commits June 18, 2026 21:37
Summary:
Adds stock model (non-sleep) profiling tests to the modai lowering
profiling suite. These serve as a baseline/validation for the ExportPass
speedup work (D97528110) without requiring sleep/FBLearner dependencies.

## New profiling functions (sleepmodels_lowering_profile.py)

- _profile_arm_model_lowering(): generic helper using the same modai
  pipeline (Input → recipe → PTQ → Manager → export → lower) so timings
  are directly comparable to the sleep model profiling
- profile_resnet8_lowering(): ResNet8 (MLPerf Tiny CIFAR-10), ~77K params,
  32x32 input — small residual CNN with skip connections
- profile_mobilenet_v1_025_lowering(): MobileNetV1-0.25 (MLPerf Tiny VWW),
  ~217K params, 96x96 input — depthwise-separable CNN

## New test methods

- test_profile_resnet8_lowering()
- test_profile_mobilenet_v1_025_lowering()

Both confirmed passing:
  ResNet8: https://www.internalfb.com/intern/testinfra/testrun/20266198338067913
  MobileNetV1-0.25: https://www.internalfb.com/intern/testinfra/testrun/32088147347033640

## Buck changes

- fbcode/executorch/examples/models/TARGETS + BUCK: add mlperf_tiny target
  (wraps xplat/executorch/examples/models/mlperf_tiny/*.py)
- fbcode/healthtech/common/tests/BUCK: add //executorch/examples/models:mlperf_tiny dep

Differential Revision: D101254299
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported module: arm Issues related to arm backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant