Memory snapshot: python-only stacks for fast dumps + configurable max_entries by SherlockNoMad · Pull Request #3628 · pytorch/torchtitan

SherlockNoMad · 2026-06-11T00:17:09Z

What

MemoryProfiler calls torch.cuda.memory._record_memory_history with torch's
default stacks="all", which captures and symbolizes C++ frames.
Symbolization runs at dump time and is pathologically slow for this workload.

This PR:

- Records Python-only stacks (stacks="python") for memory snapshots.
- Replaces the hardcoded MEMORY_SNAPSHOT_MAX_ENTRIES = 100000 with a
  Profiler.Config field memory_snapshot_max_entries (default 1_000_000),
  overridable via --profiler.memory_snapshot_max_entries.
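
As a rough illustration of the two changes above (not the actual diff), the sketch below shows how a config field could feed the recording call. `ProfilerConfig`, `record_history_kwargs`, and `start_recording` are hypothetical names invented for this example; only the `stacks` and `max_entries` keyword arguments of `torch.cuda.memory._record_memory_history` come from the PR description. Note that this is a private torch API, so its signature may change between releases.

```python
from dataclasses import dataclass


@dataclass
class ProfilerConfig:
    """Hypothetical stand-in for torchtitan's Profiler.Config."""

    enable_memory_snapshot: bool = False
    # Cap on the number of allocator events kept in the history buffer.
    memory_snapshot_max_entries: int = 1_000_000


def record_history_kwargs(config: ProfilerConfig) -> dict:
    """Build the keyword arguments for _record_memory_history."""
    return {
        "max_entries": config.memory_snapshot_max_entries,
        # Python frames only: avoids C++ symbolization at dump time.
        "stacks": "python",
    }


def start_recording(config: ProfilerConfig) -> bool:
    """Start recording if enabled and CUDA is usable; return whether it started."""
    if not config.enable_memory_snapshot:
        return False
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.memory._record_memory_history(**record_history_kwargs(config))
    return True
```

Keeping the keyword construction in its own function makes the chosen `stacks` mode and cap easy to check without a GPU.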

Why / evidence

On DeepSeek-v3 16B (8×H100, --profiler.enable_memory_snapshot), a single
_snapshot() dump:

| `stacks` | dump time |
|---|---|
| `"all"` (default) | ~461 s |
| `"python"` (this PR) | ~4 s |

That is roughly 100× faster (461 s / 4 s ≈ 115×). With stacks="all" the dump
dominated the run and made --profiler.enable_memory_snapshot impractical;
stacks="python" still produces actionable per-allocation Python stacks in the
PyTorch memory visualizer.

On max_entries: the old 100k ring buffer only retained the last ~2–3 steps of
history. With the much cheaper python-only stacks a larger buffer is affordable,
so the default is raised to 1M and exposed as config. (Torch's own default is
effectively unbounded — sys.maxsize — which is unsafe for long runs, so we keep
an explicit, configurable cap.)
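
To make the ring-buffer effect concrete, here is a small simulation using `collections.deque` with `maxlen` as a stand-in for the history buffer. The figure of 40,000 allocator events per step is an assumption picked to be consistent with "100k retains ~2–3 steps"; it was not measured, and `steps_retained` is a hypothetical helper.

```python
from collections import deque


def steps_retained(max_entries: int, events_per_step: int, num_steps: int) -> list:
    """Return the training steps that still have events in the buffer.

    Each event is represented only by the step index that produced it.
    """
    history = deque(maxlen=max_entries)  # oldest entries are evicted first
    for step in range(num_steps):
        history.extend([step] * events_per_step)
    return sorted(set(history))


# Assumed workload: 40_000 allocator events per training step.
old_cap = steps_retained(100_000, 40_000, num_steps=10)
new_cap = steps_retained(1_000_000, 40_000, num_steps=30)
print(old_cap)       # steps still visible with the old 100k cap
print(len(new_cap))  # number of steps visible with the 1M cap
```

Under that assumption the 100k cap keeps two full steps plus part of a third, while the 1M cap keeps 25 steps.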

Test plan

deepseek_v3 16B, 8×H100, --profiler.enable_memory_snapshot:

- snapshot dump ~461 s (stacks="all") → ~4 s (stacks="python")
- the .pickle opens in the PyTorch memory visualizer with Python stacks
- snapshot ~10–12 MB at 1M max_entries
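
Besides opening the .pickle in the visualizer, a dump can be sanity-checked from Python. The sketch below assumes the snapshot layout described in the PyTorch memory documentation: a dict with a `device_traces` list holding one list of trace-entry dicts per device, each entry carrying an `action` and a `frames` list. `summarize_snapshot` is a hypothetical helper, and the layout may differ across torch versions.

```python
import pickle
from collections import Counter


def summarize_snapshot(path: str) -> dict:
    """Count trace entries, actions, and entries with stack frames."""
    with open(path, "rb") as f:
        snapshot = pickle.load(f)
    # Assumed layout: snapshot["device_traces"][device] -> list of entry dicts.
    traces = snapshot.get("device_traces", [])
    entries = [entry for device_trace in traces for entry in device_trace]
    return {
        "entries": len(entries),
        "actions": dict(Counter(e.get("action") for e in entries)),
        "entries_with_frames": sum(1 for e in entries if e.get("frames")),
    }
```

If `entries` sits at the configured max_entries, the buffer has wrapped and older history was dropped.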

SherlockNoMad requested review from fegin, tianyu-l, wconstab and wwwjn as code owners June 11, 2026 00:17

pytorch-bot Bot added the ciflow/8gpu label Jun 11, 2026

meta-cla Bot added the CLA Signed label Jun 11, 2026

tianyu-l approved these changes Jun 11, 2026

yushangdi approved these changes Jun 11, 2026

SherlockNoMad wants to merge 1 commit into main from memory-snapshot-python-stacks