Validate correctness of forward_pre_hook + tensor backward hook for OOM attribution #18
Labels
bug (Something isn't working), good first issue (Good for newcomers)
Description
A lightweight execution-entry tracking mechanism was recently added to help attribute CUDA OOMs and runtime failures to the module execution context active at failure time.
Code: https://github.com/traceopt-ai/traceml/blob/main/src/traceml/utils/entry_hook.py
The current implementation relies on:
- forward_pre_hook on leaf modules
- Tensor-level backward hooks (Tensor.register_hook)
- A shared execution-state pointer (EXECUTION_LAYER.current)
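To make the mechanism concrete, here is a minimal self-contained sketch of the same idea. `ExecutionLayer` and `install_entry_hooks` are illustrative stand-ins, not traceml's actual API; see `entry_hook.py` linked above for the real implementation.

```python
import torch
import torch.nn as nn

# Stand-in for the shared execution-state pointer (EXECUTION_LAYER.current
# in traceml); the name and shape of the state here are illustrative.
class ExecutionLayer:
    def __init__(self):
        self.current = None  # (module_name, phase) active right now

EXECUTION_LAYER = ExecutionLayer()

def install_entry_hooks(model: nn.Module):
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # leaf modules only, as in the issue description

        def pre_hook(mod, inputs, _name=name):
            # forward_pre_hook: mark this leaf as the active forward context.
            EXECUTION_LAYER.current = (_name, "forward")

        def fwd_hook(mod, inputs, output, _name=name):
            # Attach a tensor-level backward hook (Tensor.register_hook) to
            # the output, so backward through this module flips the pointer.
            if isinstance(output, torch.Tensor) and output.requires_grad:
                def bwd_hook(grad, _n=_name):
                    EXECUTION_LAYER.current = (_n, "backward")
                output.register_hook(bwd_hook)

        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(fwd_hook)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
install_entry_hooks(model)
out = model(torch.randn(2, 4)).sum()
after_forward = EXECUTION_LAYER.current   # last leaf to run forward
out.backward()                            # tensor hooks fire in reverse order
after_backward = EXECUTION_LAYER.current
```

Whatever state the pointer holds when an OOM is raised is what gets reported, which is why the checks below focus on whether that state matches the module actually executing.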
What to check: when an OOM occurs during a normal training run, do the reported
- module name
- phase (forward / backward)
roughly match expectations?
Does it work on at least a simple PyTorch training loop?
Does it behave sensibly when gradient accumulation is enabled (e.g. multiple forward/backward passes before the optimizer step)?
Exact attribution is not required; the goal is to identify a useful execution context, not exact tensor causality.
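For the gradient-accumulation question, a minimal self-contained harness along these lines could help. The tracker here records phase events instead of a single pointer so the forward/backward interleaving across micro-batches can be inspected afterwards; all names are illustrative, not traceml's API.

```python
import torch
import torch.nn as nn

events = []  # (module_name, phase) tuples, appended as hooks fire

def track(model: nn.Module):
    for name, mod in model.named_modules():
        if list(mod.children()):
            continue  # leaf modules only
        mod.register_forward_pre_hook(
            lambda m, inp, _n=name: events.append((_n, "forward")))

        def fwd_hook(m, inp, out, _n=name):
            # Tensor-level backward hook on the output, as in entry_hook.py.
            if isinstance(out, torch.Tensor) and out.requires_grad:
                out.register_hook(
                    lambda g, _n=_n: events.append((_n, "backward")))
        mod.register_forward_hook(fwd_hook)

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
track(model)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4
opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(16, 8)
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()  # backward hooks fire here, before optimizer.step()
opt.step()

fwd_events = [e for e in events if e[1] == "forward"]
bwd_events = [e for e in events if e[1] == "backward"]
```

With 3 leaf modules and 4 micro-batches, each phase should produce 12 events, and each micro-batch's forward events should precede its backward events; a large divergence from that pattern would suggest the shared pointer is being clobbered across accumulation steps.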