Fixed the issue of incorrectly retrieving the root node of the backwa… #1167

Open
zhengchenyu wants to merge 1 commit into pytorch:main from zhengchenyu:fix.bwdroot

zhengchenyu commented Nov 6, 2025

When I load the profiles of some models in TensorBoard, the web page may fail to display. The error message is as follows:

W1106 15:30:54.044610 140717341988672 loader.py:109] Failed to parse profile data for Run tb_profiler on XX_HOST. Exception=<torch_tb_profiler.profiler.node.OperatorNode object at 0x7ffa62a07890>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/loader.py", line 95, in _process_data
    data = RunProfileData.parse(worker, span, local_file, self.caches.cache_dir)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/data.py", line 109, in parse
    profile = RunProfileData.from_json(worker, span, trace_json)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/data.py", line 118, in from_json
    profile.process()
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/data.py", line 178, in process
    self.tid2tree, self.pl_tid2tree = parser.parse(self.events, self.forward_backward_events)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/event_parser.py", line 458, in parse
    tid2tree = builder.build_tree(tid2list, tid2zero_rt_list, staled_device_nodes, fwd_bwd_map=fwd_bwd_map)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/op_tree.py", line 48, in build_tree
    fwd_bwd_root = self._get_backward_roots(fwd_bwd_map, ts2parent, agg_nodes)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch_tb_profiler/profiler/op_tree.py", line 261, in _get_backward_roots
    fwd_to_bwdroot[fwd] = backward_nodes.pop(parent)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: <torch_tb_profiler.profiler.node.OperatorNode object at 0x7ffa62a07890>

After debugging, I found that tid2tree sometimes contains a structure in which both the first-level and third-level nodes start with BACKWARD_ROOT_PREFIX. However, when _get_backward_roots looks up the root node of bwd, it walks up from the bottom and only reaches the third-level node. That node does not match the entries in backward_nodes, so backward_nodes.pop(parent) raises the KeyError shown above.

The following is part of tid2tree:

node:level-1   - autograd::engine::evaluate_function: CheckpointFunctionBackward (3100941346080.976, 3100941840450.396) [Operator]
node:level-2     - CheckpointFunctionBackward (3100941346082.426, 3100941840443.4756) [Operator]
node:level-3       - autograd::engine::evaluate_function: PreBackwardFunctionForModuleBackward (3100941714843.749, 3100941714966.6772) [Operator]
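
To make the mismatch concrete, here is a minimal, self-contained sketch. It does not use the real torch_tb_profiler classes: `Node`, `nearest_backward_root`, `outermost_backward_root`, and the leaf op name are hypothetical, the value of `BACKWARD_ROOT_PREFIX` is inferred from the node names above, and I am assuming that backward_nodes is keyed by the first-level node. It is only meant to illustrate the lookup problem, not necessarily the exact change in this PR.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value, inferred from the node names in the excerpt above.
BACKWARD_ROOT_PREFIX = "autograd::engine::evaluate_function: "


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so nodes can be dict keys
class Node:
    """Hypothetical stand-in for OperatorNode: a name and a parent link only."""
    name: str
    parent: Optional["Node"] = None


def is_backward_root(node: Node) -> bool:
    return node.name.startswith(BACKWARD_ROOT_PREFIX)


def nearest_backward_root(node: Node) -> Optional[Node]:
    """Walk upward and stop at the first ancestor carrying the prefix."""
    current = node.parent
    while current is not None and not is_backward_root(current):
        current = current.parent
    return current


def outermost_backward_root(node: Node) -> Optional[Node]:
    """Walk all the way up and keep the last ancestor carrying the prefix."""
    found = None
    current = node.parent
    while current is not None:
        if is_backward_root(current):
            found = current
        current = current.parent
    return found


# Rebuild the three levels from the excerpt, plus a hypothetical leaf op.
level1 = Node("autograd::engine::evaluate_function: CheckpointFunctionBackward")
level2 = Node("CheckpointFunctionBackward", parent=level1)
level3 = Node(
    "autograd::engine::evaluate_function: PreBackwardFunctionForModuleBackward",
    parent=level2,
)
leaf = Node("hypothetical_backward_op", parent=level3)

# Assumption: only the first-level node is a key in backward_nodes.
backward_nodes = {level1: [level2]}
```

With this setup, `nearest_backward_root(leaf)` returns the third-level node, which is not a key of `backward_nodes`, so `backward_nodes.pop(...)` on it would raise a KeyError. `outermost_backward_root(leaf)` returns the first-level node, which is a key.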

meta-cla bot added the cla signed label Nov 6, 2025

meta-codesync bot commented Dec 16, 2025

@sraikund16 has imported this pull request. If you are a Meta employee, you can view this in D89236335.

aaronenyeshi left a comment

Review automatically exported from Phabricator review in Meta.
