Validate correctness of forward_pre_hook + tensor backward hook for OOM attribution #18

@abhinavsriva

Description

A lightweight execution-entry tracking mechanism was recently added to help attribute CUDA OOMs and runtime failures to the module execution context active at failure time.

Code: https://github.com/traceopt-ai/traceml/blob/main/src/traceml/utils/entry_hook.py

The current implementation relies on:

  1. forward_pre_hook on leaf modules
  2. Tensor-level backward hooks (Tensor.register_hook)
  3. A shared execution-state pointer (EXECUTION_LAYER.current)
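The mechanism described above can be sketched in plain Python. This is a hypothetical minimal sketch based only on the issue text, not the actual implementation in `entry_hook.py`; the class name `ExecutionLayer` and the hook-factory helpers are illustrative assumptions.

```python
# Hypothetical sketch of the shared execution-state pointer described in
# this issue. The real implementation lives in
# src/traceml/utils/entry_hook.py; names here are illustrative.

class ExecutionLayer:
    """Tracks the module execution context active at any point in time."""

    def __init__(self):
        self.current = None  # (module_name, phase) or None

    def set(self, module_name, phase):
        self.current = (module_name, phase)


EXECUTION_LAYER = ExecutionLayer()


def make_forward_pre_hook(name):
    # Runs just before a leaf module's forward pass; marks it as current.
    def hook(module, inputs):
        EXECUTION_LAYER.set(name, "forward")
    return hook


def make_tensor_backward_hook(name):
    # Runs when the gradient for a tensor is computed; marks backward phase.
    def hook(grad):
        EXECUTION_LAYER.set(name, "backward")
        return grad
    return hook
```

On an OOM, whatever `(module_name, phase)` pair `EXECUTION_LAYER.current` holds at that moment is reported as the failure context.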

What to check: when an OOM occurs during a normal training run, do the reported

  1. module name
  2. phase (forward / backward)

roughly match expectations?

Does it work on at least a simple PyTorch training loop?

Does it behave sensibly when gradient accumulation is enabled (e.g. multiple forward/backward passes before the optimizer step)?

Exact attribution is not required; the goal is to identify a useful execution context, not exact tensor causality.
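The checks above could be exercised with a harness along these lines. This is a sketch, not the project's actual test: the `current` dict stands in for `EXECUTION_LAYER.current`, the model and accumulation schedule are arbitrary, and only the built-in `register_forward_pre_hook` / `Tensor.register_hook` APIs are assumed.

```python
# Sketch of a validation harness: a tiny training loop with
# forward_pre_hooks on leaf modules and a tensor-level backward hook,
# updating a shared "current context" pointer (stand-in for
# EXECUTION_LAYER.current in traceml).
import torch
import torch.nn as nn

current = {"ctx": None}

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Attach forward_pre_hooks to leaf modules only.
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        module.register_forward_pre_hook(
            lambda m, inp, name=name: current.update(ctx=(name, "forward"))
        )

opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 2  # multiple forward/backward passes per optimizer step

for step in range(4):
    x = torch.randn(2, 8)
    out = model(x)
    # Tensor-level backward hook: fires when out's gradient is computed.
    out.register_hook(lambda g: current.update(ctx=("output", "backward")))
    loss = out.sum() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()

# After the loop, the last event was a backward pass, so the pointer
# should read a backward-phase context rather than a forward one.
print(current["ctx"])
```

Injecting an artificial OOM (e.g. an oversized allocation inside one module's forward) and checking that the pointer names that module and phase would cover the first two questions; running with `accum_steps > 1`, as above, covers the third.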

Metadata

    Labels

bug (Something isn't working), good first issue (Good for newcomers)
