Add nxp backend profiling support#19225
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19225
Note: Links to docs will display an error until the docs builds have been completed. ❌ 6 New Failures, 1 Cancelled Job, 2 Unrelated Failures, 2 Unclassified FailuresAs of commit 54cf6d4 with merge base e88fd04 ( NEW FAILURES - The following jobs have failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
| PATTERN_VERBOSE_KERNELS = (r"\"subgraph_(?P<subgraph>\d+)\"\:\s*" | ||
| r"Operators:[\s\S]*?" | ||
| r"Kernels:\s*(?P<kernels>[\s\S]*?)" | ||
| r"\s*(NeutronOperator|^$|=)") |
There was a problem hiding this comment.
Why is the r"\s*(NeutronOperator|^$|=)") at the end of the pattern, when in the example above the NeutronOperator is in the beginning?
There was a problem hiding this comment.
I will add one more NeutronOperator entry to the example above. Each subgraph block starts with a line containing NeutronOperator "subgraph_xxx" and includes all following lines (such as Operators and Kernels) until the next NeutronOperator line. This way, the entire subgraph information is captured.
There was a problem hiding this comment.
Added one more NeutronOperator entry to the example above the pattern
|
|
||
| [name, inputs[], outputs[]] | ||
|
|
||
| """ |
There was a problem hiding this comment.
Nit: This doc-string does not seem very useful. Is it intentionally like this?
| if name_matches == len(tf_node.outputs): | ||
| subgraph_idxs.append(subgraph.location) | ||
| if subgraph_idxs: | ||
| tflite_to_neutron_dict[tf_idx] = tuple(subgraph_idxs) |
There was a problem hiding this comment.
So if a TFLite operator has the same inputs as a Neutron operator, they are matched in the dictionary?
- I think the doc-string for this method should mention the approach.
- Is it reliable? What about models with residual connections (or any parallel nodes). You can have many different parallel operators which have the exact same inputs.
There was a problem hiding this comment.
It is a slightly more complicated than just the same. When a single TFLite layer is mapped to multiple Neutron layers, the inputs are named using the format "tflite_input/additional_name". For this reason, I use "in" for comparison.
- I will add the approach to the doc-string.
- Reliability: some times the algorithm cannot find corresponding layers according inputs/outputs, but I can see them with additional effort. It concerns mainly PAD operators. In this case they just not mapped.
Residual connections and parallel nodes: the algorithm compares all inputs. For example, a convolution has three inputs: input tensor, filter tensor and bias tensor. For parallel nodes input tensor can be the same, but filter and bias differ.
There was a problem hiding this comment.
Thank you for the explanation regarding the tensor names. It makes more sense now.
Regarding point 2:
You could still have a MaxPool/AvgPool (or something else that only has 1 input tensor) in parallel consuming the same input. Wouldn't the algorithm fail in this case?
There was a problem hiding this comment.
Fixed the issue by adding a post-check to ensure that a TFLite node is not mapped to more than one single-input node. If it is mapped to multiple single-input nodes, only the node with the corresponding name is retained.
| for additional details. | ||
|
|
||
| The following command creates a profilable `cifar10_nxp_delegate.pte` model and the corresponding `ETRecord` file, | ||
| compatible with the **i.MX RT700** board using **MCUXpresso SDK 25.06**: |
There was a problem hiding this comment.
Is 25.06 correct?
I have no insights here, it just seems a bit old.
There was a problem hiding this comment.
The latest version is 26.3. But the overview section references minimum SDK version 25.12. I tested everything using version 25.06, and the dim-order section also refers to 25.06. I prefer to keep the version I tested with.
There was a problem hiding this comment.
... or remove the sentence.
There was a problem hiding this comment.
And what eIQ Neutron SDK you used?
What is the dependency on MCUX 25.06?
There was a problem hiding this comment.
Removed the MCUX version from documentation. I used 25.06.
| # Check if ETRecord was created and saved. | ||
| assert f"The ETRecord for the model was saved to" in process_output | ||
| assert etrecord_file.exists(), f"ETRecord file not created at {etrecord_file}" | ||
|
|
There was a problem hiding this comment.
How about also calling the analyzing_with_inspector.py here to make sure everything works as expected?
There was a problem hiding this comment.
To run Inspector, an etdump file is required. It contains profiling information from the board. The generated etrecord is only an optional input for Inspector and cannot be opened without the corresponding etdump file.
I added profiling tests in generic_tests/tests_profiling.py. These tests verify the correctness of the generated mapping for simpler models. But don't store any etrecords.
This test should only confirm that the record is created and stored successfully during model convertion.
There was a problem hiding this comment.
Makes sense, thanks for the explanation. If this test focuses on checking that the ETRecord is generated, can't we add another test somewhere which also runs the profiling and checks the results? I didn't find such a test in test_profiling.py, sorry if I missed something.
There was a problem hiding this comment.
test_aot_example.py checks that an ETRecord is generated and stored when profiling is enabled.
test_profiling.py checks that the debug handle map is generated correctly.
It may look like ETRecord is not generated in test_profiling.py, but this is not true. It is generated, just not saved to a file.
How it works:
- I enable automatic ETRecord generation in
to_edge_transform_and_lower(in executorch_pipeline.py):
edge_program_manager = to_edge_transform_and_lower( ..., **generate_etrecord=use_profiling** ) - Then I compute the neutron-to-edge map. This part is covered by the
test_profiling.pytests. - I return the map as part of PreprocessResult in the preprocess method (in nxp_backend.py):
return PreprocessResult( processed_bytes=binary, debug_handle_map=neutron_to_edge_map )
ETRecord generation and attaching the map are handled by devtools, which already have their own tests.
My task is only to check that the ETRecord contains the correct map. I print the map in the logs, so it is enough to verify it there. And also verify that profiling does not affect the model output.
The reason why I take map from a log is that I use lower_run_compare() function and it doesn't return exec_prog, which contain ETRecord.
I can move the CIFAR test from test_profiling.py into test_aot_example.py and check the map correctness directly from exec_prog and not from log. However, I think that test_aot_example.py and test_profiling.py together already provide sufficient coverage of profiling.
There was a problem hiding this comment.
I'm sorry, I guess my comment wasn't clear. What I meant is that I don't see any tests that actually run the model. You have made changes to the NeutronBackend.cpp file and I don't see any test verifying those changes.
There was a problem hiding this comment.
Understood now.
The CModel doesn’t support the profiling feature (only FPGA and the board do) because it doesn’t calculate ticks. Therefore, I didn’t consider testing the NeutronBackend.cpp changes with NSYS and instead tested them on the board.
Another issue is that etdump has to be saved manually at a breakpoint in MCUX.
One possible approach is to print the etdump binary to the console and parse it from the test. However, even in that case, there would be no neutron kernel information in the etdump — only CPU kernels and delegate call.
There was a problem hiding this comment.
Ok, thank you for the explanation :)
| for additional details. | ||
|
|
||
| The following command creates a profilable `cifar10_nxp_delegate.pte` model and the corresponding `ETRecord` file, | ||
| compatible with the **i.MX RT700** board using **MCUXpresso SDK 25.06**: |
There was a problem hiding this comment.
And what eIQ Neutron SDK you used?
What is the dependency on MCUX 25.06?
There was a problem hiding this comment.
Please fix the linting errors and look at our comments. For information how to run the lintrunner see https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md .
In short:
$ lintrunner init
$ lintrunner --merge-base-with <reference to last commit outside of your contribution>
|
Internal CI build: https://bamboo3.sw.nxp.com/browse/MLTECE-EXIGH146 |
7a2d30c to
f7c7896
Compare
|
9d924bb to
9a8a99b
Compare
9a8a99b to
d45d32c
Compare
| ) | ||
| # Two graphs are expected in the input log: original and converted. | ||
| EXPECTED_GRAPHS = 2 | ||
| # List of single-input nodes that shouldn't be mapped on the same TFLite node. |
There was a problem hiding this comment.
I don't really understand which nodes is this list supposed to contain. What does the comment mean? Are there any single input nodes which shouldn't be in this list?
Also, is this list complete? Or will we ever need to add operators to it?
There was a problem hiding this comment.
I created this list from the output of ./neutron-converter --show-kernel-kinds --target imxrt700. I selected all single-input kernels that can be executed in parallel with the same input.
As I understand, no additional kernels are expected for Neutron-C. If that is correct, this should be a complete list for Neutron-C.
The comment means that if two or more kernels from the list are mapped to the same TFLite node, this is incorrect. In such cases, we need to further analyze and filter the results to keep only one Neutron node that corresponds to the TFLite node name.
| assert neutron_map == { | ||
| 0: (6,), | ||
| 1: (), | ||
| 2: (7,), | ||
| 3: (), | ||
| 4: (), | ||
| 5: (), | ||
| 6: (), | ||
| 7: (), | ||
| 8: (), | ||
| 9: (), | ||
| 10: (), | ||
| } |
There was a problem hiding this comment.
Is this really the expected result? Only 2 operators are mapped?
I would expect 5 mapped operators (based on the model used).
There was a problem hiding this comment.
It is very "unhappy" example :) There are many mapping issues:
-
AvgPool → Conv2DDepthwiseV2 conversion
The mapping works correctly when there are no other single-input nodes in parallel. However, in this case there is also a MaxPool with the same input. As a result, the single-input node filter is triggered and removes all mapping entities if the Neutron and TFLite names do not match.
This is a simple way to eliminate incorrectly mapped nodes. While it has some side effects, it at least prevents incorrect mappings. I believe this is a relatively rare scenario. -
Concat + Conv2d transformation
These are transformed into multiple layers that are not represented by either the Neutron Concat kernel or the Neutron Conv2DStandardV2 kernel. Consequently, the inputs and outputs do not match.
In a “better” example, the concatenation node does not split into multiple nodes and is correctly converted into a Concat kernel.
I could modify the model so that Concat and Conv2d are converted into the corresponding kernels. However, I actually like this test case, as it can serve as useful input for further profiling improvements.
For better understanding (I will add these comments to tests as well):
0: (6,), # Conv2DStandardV2
1: (), # Conv2DDepthwiseV2 (AvgPool)
2: (7,), # MaxPool
3: (), # TransposeCHW
4: (), # TransposeCHW
5: (), # TransposeCHW
6: (), # Slice
7: (), # Pad
8: (), # Conv2DPointwise
9: (), # Slice
10: () # Neutron Dump
| # Check if ETRecord was created and saved. | ||
| assert f"The ETRecord for the model was saved to" in process_output | ||
| assert etrecord_file.exists(), f"ETRecord file not created at {etrecord_file}" | ||
|
|
There was a problem hiding this comment.
Makes sense, thanks for the explanation. If this test focuses on checking that the ETRecord is generated, can't we add another test somewhere which also runs the profiling and checks the results? I didn't find such a test in test_profiling.py, sorry if I missed something.
| @@ -0,0 +1,205 @@ | |||
| # NXP eIQ Profiling Support | |||
There was a problem hiding this comment.
Very nicely documented 👍🏻
Signed-off-by: Irina Korchakova <irina.trukhina@nxp.com>
c056d88 to
8ef1427
Compare
620af28 to
ec062cc
Compare
ec062cc to
54cf6d4
Compare
Summary
Add profiling support for the NXP backend.
Test plan
All CI tests passed including new test for the profiling feature.
cc @robert-kalmar @JakeStevens @digantdesai