Validate correctness of forward_pre_hook + tensor backward hook for OOM attribution #18

@abhinavsriva

Description

A lightweight execution-entry tracking mechanism was recently added to help attribute CUDA OOMs and runtime failures to the module execution context active at failure time.

Code: https://github.com/traceopt-ai/traceml/blob/main/src/traceml/utils/entry_hook.py

The current implementation relies on:

  1. forward_pre_hook on leaf modules
  2. Tensor-level backward hooks (Tensor.register_hook)
  3. A shared execution-state pointer (EXECUTION_LAYER.current)
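The mechanism described above can be sketched in plain Python. This is a hypothetical minimal sketch based only on the issue text, not the actual implementation in `entry_hook.py`; the class name `ExecutionLayer` and the hook-factory helpers are illustrative assumptions.

```python
# Hypothetical sketch of the shared execution-state pointer described in
# this issue. The real implementation lives in
# src/traceml/utils/entry_hook.py; names here are illustrative.

class ExecutionLayer:
    """Tracks the module execution context active at any point in time."""

    def __init__(self):
        self.current = None  # (module_name, phase) or None

    def set(self, module_name, phase):
        self.current = (module_name, phase)


EXECUTION_LAYER = ExecutionLayer()


def make_forward_pre_hook(name):
    # Runs just before a leaf module's forward pass; marks it as current.
    def hook(module, inputs):
        EXECUTION_LAYER.set(name, "forward")
    return hook


def make_tensor_backward_hook(name):
    # Runs when the gradient for a tensor is computed; marks backward phase.
    def hook(grad):
        EXECUTION_LAYER.set(name, "backward")
        return grad
    return hook
```

On an OOM, whatever `(module_name, phase)` pair `EXECUTION_LAYER.current` holds at that moment is reported as the failure context.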

What to check: when an OOM occurs during a normal training run, do the reported

  1. module name
  2. phase (forward / backward)

roughly match expectations?

Does it work on at least a simple PyTorch training loop?

Does it behave sensibly when gradient accumulation is enabled (e.g. multiple forward/backward passes before the optimizer step)?

Exact attribution is not required; the goal is to identify a useful execution context, not exact tensor causality.
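The checks above could be exercised with a harness along these lines. This is a sketch, not the project's actual test: the `current` dict stands in for `EXECUTION_LAYER.current`, the model and accumulation schedule are arbitrary, and only the built-in `register_forward_pre_hook` / `Tensor.register_hook` APIs are assumed.

```python
# Sketch of a validation harness: a tiny training loop with
# forward_pre_hooks on leaf modules and a tensor-level backward hook,
# updating a shared "current context" pointer (stand-in for
# EXECUTION_LAYER.current in traceml).
import torch
import torch.nn as nn

current = {"ctx": None}

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Attach forward_pre_hooks to leaf modules only.
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        module.register_forward_pre_hook(
            lambda m, inp, name=name: current.update(ctx=(name, "forward"))
        )

opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 2  # multiple forward/backward passes per optimizer step

for step in range(4):
    x = torch.randn(2, 8)
    out = model(x)
    # Tensor-level backward hook: fires when out's gradient is computed.
    out.register_hook(lambda g: current.update(ctx=("output", "backward")))
    loss = out.sum() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()

# After the loop, the last event was a backward pass, so the pointer
# should read a backward-phase context rather than a forward one.
print(current["ctx"])
```

Injecting an artificial OOM (e.g. an oversized allocation inside one module's forward) and checking that the pointer names that module and phase would cover the first two questions; running with `accum_steps > 1`, as above, covers the third.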

Metadata

    Labels

bug (Something isn't working), good first issue (Good for newcomers)
