Memory snapshot: python-only stacks for fast dumps + configurable max_entries #3628

Open: SherlockNoMad wants to merge 1 commit into main from memory-snapshot-python-stacks

@SherlockNoMad

What

MemoryProfiler calls torch.cuda.memory._record_memory_history with torch's
default stacks="all", which captures and symbolizes C++ frames.
Symbolization runs at dump time and is pathologically slow for this workload.

This PR:

  1. Records Python-only stacks (stacks="python") for memory snapshots.
  2. Replaces the hardcoded MEMORY_SNAPSHOT_MAX_ENTRIES = 100000 with a
    Profiler.Config field memory_snapshot_max_entries (default 1_000_000),
    overridable via --profiler.memory_snapshot_max_entries.
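A minimal sketch of how the two changes could fit together. `ProfilerConfigSketch`, `record_history_kwargs`, and `start_recording` are hypothetical names used for illustration only; they are not the actual `Profiler.Config` or `MemoryProfiler` code. The `stacks` and `max_entries` keyword arguments are those of `torch.cuda.memory._record_memory_history`.

```python
from dataclasses import dataclass


@dataclass
class ProfilerConfigSketch:
    """Hypothetical stand-in for Profiler.Config; only the fields used here."""

    enable_memory_snapshot: bool = False
    memory_snapshot_max_entries: int = 1_000_000  # default proposed in this PR


def record_history_kwargs(cfg: ProfilerConfigSketch) -> dict:
    """Build keyword arguments for torch.cuda.memory._record_memory_history."""
    return {
        "max_entries": cfg.memory_snapshot_max_entries,  # cap on retained history
        "stacks": "python",  # Python frames only; no C++ capture or symbolization
    }


def start_recording(cfg: ProfilerConfigSketch) -> bool:
    """Start recording if enabled; return True when recording was started."""
    if not cfg.enable_memory_snapshot:
        return False
    import torch  # imported lazily so the sketch loads without torch or CUDA

    torch.cuda.memory._record_memory_history(**record_history_kwargs(cfg))
    return True
```

Keeping the keyword arguments in a separate function makes the recording settings easy to inspect or unit-test on a machine without a GPU.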

Why / evidence

On DeepSeek-v3 16B (8×H100, --profiler.enable_memory_snapshot), a single
_snapshot() dump:

| stacks | dump time |
| --- | --- |
| "all" (default) | ~461 s |
| "python" (this PR) | ~4 s |

~100× faster. With stacks="all" the dump dominated the run and made
--profiler.enable_memory_snapshot impractical; stacks="python" still produces
actionable per-allocation Python stacks in the PyTorch memory visualizer.

On max_entries: the old 100k ring buffer only retained the last ~2–3 steps of
history. With the much cheaper python-only stacks a larger buffer is affordable,
so the default is raised to 1M and exposed as config. (Torch's own default is
effectively unbounded — sys.maxsize — which is unsafe for long runs, so we keep
an explicit, configurable cap.)
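As a rough illustration of the buffer sizing, the sketch below models the history as a bounded buffer and extrapolates from the "100k covers ~2–3 steps" observation. The figure of 40,000 events per step is an assumption (100k divided by the midpoint of 2.5 steps), not a measurement, and it assumes the number of events per step stays roughly constant.

```python
from collections import deque


def steps_retained(max_entries: int, events_per_step: int) -> float:
    """Approximate number of training steps a bounded history can hold."""
    return max_entries / events_per_step


# A bounded buffer keeps only the most recent entries, like deque(maxlen=...).
history = deque(maxlen=5)
for event_id in range(8):
    history.append(event_id)
oldest_kept = history[0]  # events 0, 1, 2 were evicted

# Assumed rate: 100k entries over ~2.5 steps.
ASSUMED_EVENTS_PER_STEP = 40_000
old_coverage = steps_retained(100_000, ASSUMED_EVENTS_PER_STEP)
new_coverage = steps_retained(1_000_000, ASSUMED_EVENTS_PER_STEP)
```

Under that assumption the 1M default would retain on the order of 25 steps, ten times the old coverage; the actual figure depends on the model and the allocation pattern.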

Test plan

deepseek_v3 16B, 8×H100, --profiler.enable_memory_snapshot:

  • snapshot dump ~461 s (stacks="all") → ~4 s (stacks="python")
  • the .pickle opens in the PyTorch memory visualizer with Python stacks
  • snapshot ~10–12 MB at 1M max_entries
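One way the dump time in this test plan could be measured is sketched below. `timed_dump` is a hypothetical helper, not part of this PR; it takes the snapshot function as an argument so it can be exercised without a GPU. On a CUDA machine the callable would be `torch.cuda.memory._snapshot`.

```python
import pickle
import time
from typing import Any, Callable


def timed_dump(snapshot_fn: Callable[[], Any], path: str) -> float:
    """Take a snapshot, pickle it to `path`, and return elapsed seconds."""
    start = time.perf_counter()
    snapshot = snapshot_fn()  # per the PR, symbolization cost is paid here
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)
    return time.perf_counter() - start
```

For example, `timed_dump(torch.cuda.memory._snapshot, "memory_snapshot.pickle")` after recording has been enabled; the output file name here is illustrative.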

…le max_entries

`MemoryProfiler` calls `torch.cuda.memory._record_memory_history` with torch's
default `stacks="all"`, which captures and symbolizes C++ frames. Symbolization
happens at dump time and is extremely slow for this workload: on DeepSeek-v3 16B
(8xH100) a single `_snapshot()` dump took **~461 s**, which dominates the run and
makes `--profiler.enable_memory_snapshot` impractical.

Switch to `stacks="python"` (Python frames only): the dump drops to **~4 s**
(~100x faster) while still giving actionable per-allocation Python stacks in the
PyTorch memory visualizer.

Also make the history cap configurable: replace the hardcoded
`MEMORY_SNAPSHOT_MAX_ENTRIES = 100000` with a `Profiler.Config` field
`memory_snapshot_max_entries` (default 1_000_000). 100k only covered the last
~2-3 steps; python-only stacks make a larger buffer cheap, and exposing it lets
runs bound it explicitly (torch's own default is effectively unbounded).

Test plan:
- deepseek_v3 16B, 8xH100, `--profiler.enable_memory_snapshot`:
  snapshot dump ~461 s (stacks="all") -> ~4 s (stacks="python"); the .pickle
  opens in the PyTorch memory visualizer with Python stacks.
@meta-cla (bot) added the CLA Signed label on Jun 11, 2026
Labels

ciflow/8gpu, CLA Signed
