Phase-level memory breakdown for forward / backward / optimizer #64

@abhinavsriva

Description

Problem

TraceML currently reports approximate step memory by resetting peak memory stats around the training step. This is useful at the step level, but it does not show where memory pressure occurs inside the step: forward, backward, or optimizer.

Goal

Add an estimated memory breakdown for forward, backward, and optimizer, using phase-scoped instrumentation similar to the existing timed-region instrumentation.

Proposed idea

Track memory around each phase and report the peak allocated memory per phase. For forward, one approach is to wrap the outermost nn.Module.__call__ (similar to the timed region) and measure memory within that scope.
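A minimal sketch of what phase-scoped peak tracking could look like. The helper name `phase_peak` and the injected `reset_peak`/`read_peak` callables are assumptions, not existing TraceML APIs; on a real CUDA run they would be `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`, while tests can pass in fakes:

```python
from contextlib import contextmanager

@contextmanager
def phase_peak(name, peaks, reset_peak, read_peak):
    """Hypothetical helper: record the peak allocated memory inside one phase.

    reset_peak/read_peak are injected so this works with the real CUDA
    counters (torch.cuda.reset_peak_memory_stats /
    torch.cuda.max_memory_allocated) or with fakes when no GPU is present.
    """
    reset_peak()  # clear the allocator's running peak at phase entry
    try:
        yield
    finally:
        peaks[name] = read_peak()  # peak allocated bytes seen inside this phase

# Usage with a fake allocator standing in for torch.cuda:
class FakeAlloc:
    def __init__(self):
        self.peak = 0
    def reset(self):
        self.peak = 0
    def bump(self, nbytes):  # pretend an allocation raised the peak
        self.peak = max(self.peak, nbytes)

alloc = FakeAlloc()
peaks = {}
with phase_peak("forward", peaks, alloc.reset, lambda: alloc.peak):
    alloc.bump(1200)  # simulated forward-pass allocation
```

Wrapping the outermost module's `__call__` would then just enter this context manager before delegating to the original call.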

Important detail

The step memory approximation works by resetting peak stats around the full step. That approach will no longer work by itself once we also reset peak stats inside forward/backward/optimizer, because those resets happen within the step and clobber the step-level peak. Because of this, step memory should now be computed as the max of:

the existing step-level memory approximation
forward phase peak
backward phase peak
optimizer phase peak
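The combination rule above can be sketched as a one-liner; the function name `step_peak_memory` is hypothetical:

```python
def step_peak_memory(step_approx, phase_peaks):
    """Combine the legacy step-level approximation with per-phase peaks.

    Per-phase resets invalidate the step-wide counter on its own, so the
    reported step memory is the max over all observations.
    """
    return max(step_approx, *phase_peaks.values())

# Example: per-phase resets hid a backward spike from the step-wide counter.
result = step_peak_memory(
    900, {"forward": 700, "backward": 1500, "optimizer": 400}
)
print(result)  # -> 1500
```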
Notes

This is still approximate, not exact attribution, because CUDA is asynchronous and CPU-side phase boundaries do not perfectly match GPU execution.

Requirements

Low overhead, with no forced synchronization in default mode
Instrument the outermost forward only, not submodules
Keep step memory reporting correct after adding per-phase resets
Document the approximation clearly

Labels: enhancement (New feature or request)