Reduce VRAM by leszko · Pull Request #242 · daydreamlive/DEMON

leszko · 2026-06-10T14:29:13Z

fix(vram): eliminate swap-time OOM and reclaim idle VRAM on long-uptime pods

Branch: rafal/reduce-vram → main
Commits: 1200019 (chunked VAE encode + engine pinning), 54a9141 (idle pool reclaim + upload encoder offload)

Problem

Fleet pods (32 GB RTX 5090) intermittently restart with:

swap_failed: GPU memory pressure prevented loading the vae_encode engine for this source duration.
stem_extract_failed context=swap error=CUDA out of memory. Tried to allocate 94.00 MiB ...

The pods sit at ~31.3/31.4 GiB and any swap-time allocation tips them over.
Three compounding causes were identified and fixed.

1. Swap-time vae_encode engine load (`1200019`)

The 240 s vae_encode TRT engine requests ~11.8 GB of activation workspace at
context creation (reproduced exactly; the 60 s engine needs ~4 GB), and it is
loaded mid-swap, after the old engine was already evicted — the unrecoverable
TRTProfileLoadError path that restarts the pod. VAE encode runs once per
source swap, not per realtime tick, so the capacity is wasted.

- `_trt_vae_encode` now chunks inputs longer than the engine's max profile
  shape (overlapping chunks, moments stitched at latent-frame granularity,
  single sampling pass). Bit-exact (fp16) against single-shot encode at
  ≥8-frame margins; shipped with a 64-frame (2.56 s) margin. Validation
  script: `scripts/benchmarks/validate_chunked_vae_encode.py`.
- `available_trt_engines` pins `vae_encode` to the smallest built encode
  engine for every duration; `decoder`/`vae_decode` still resolve by duration.
  Profile swaps therefore never reload `vae_encode`, so the highest-pressure
  engine load is gone from the swap path entirely.
- Walk mode no longer resolves the full-source profile (which wrongly required
  a big decoder on disk for long sources).
- Decoder TRT load: raise immediately when `create_execution_context()`
  returns None (TRT's silent OOM signal; previously crashed on the next
  tick), with one `empty_cache()` + retry, matching the `vae_encode` policy.
- LoRA delta materialization falls back to CPU on GPU OOM and returns the
  GPU matmul scratch to the driver when deltas store on CPU.
- Server defaults `PYTORCH_ALLOC_CONF=expandable_segments:True` (kills the
  "94 MiB failed with hundreds of MiB reserved-but-unallocated" fragmentation
  class; operator overrides respected).
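
The chunking idea in the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in `_trt_vae_encode`: the names `Chunk`, `plan_chunks`, and `encode_chunked` are hypothetical, and it assumes one output moment per input frame (a real VAE encoder may downsample in time). Each chunk is encoded with extra context on both sides, and only the margin-free core is kept, so the kept regions tile the input exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    enc_start: int   # first frame fed to the engine (includes left margin)
    enc_end: int     # one past the last frame fed to the engine
    keep_start: int  # first frame whose output is kept
    keep_end: int    # one past the last kept frame


def plan_chunks(n_frames: int, max_window: int, margin: int) -> list[Chunk]:
    """Split n_frames into windows of at most max_window frames.

    Hypothetical helper: every window carries up to `margin` frames of
    context on each side; only the core region is kept after encoding.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if n_frames <= max_window:
        # Fits the engine profile: single-shot encode, no chunking.
        return [Chunk(0, n_frames, 0, n_frames)]
    core = max_window - 2 * margin
    if core <= 0:
        raise ValueError("margin too large for the engine window")
    chunks = []
    keep_start = 0
    while keep_start < n_frames:
        keep_end = min(keep_start + core, n_frames)
        enc_start = max(keep_start - margin, 0)
        enc_end = min(keep_end + margin, n_frames)
        chunks.append(Chunk(enc_start, enc_end, keep_start, keep_end))
        keep_start = keep_end
    return chunks


def encode_chunked(frames, encode_fn, max_window: int, margin: int) -> list:
    """Encode each window, drop the margins, and concatenate the cores.

    `encode_fn` stands in for the engine call and is assumed to return
    one item per input frame. Sampling would happen once, afterwards,
    on the stitched result.
    """
    stitched = []
    for c in plan_chunks(len(frames), max_window, margin):
        moments = encode_fn(frames[c.enc_start:c.enc_end])
        lo = c.keep_start - c.enc_start
        stitched.extend(moments[lo:lo + (c.keep_end - c.keep_start)])
    return stitched


# Assumed numbers: 25 frames/s (from 64 frames = 2.56 s), so a 200 s
# source is 5000 frames and a 60 s engine window is 1500 frames.
plan = plan_chunks(n_frames=5000, max_window=1500, margin=64)
```

Keeping the plan separate from the encode call makes invariants such as "kept regions are contiguous" and "no window exceeds the engine profile" easy to unit-test without a GPU.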

2. Idle pool never reclaimed after session teardown (`54a9141`)

Session start/close cycles left pods sitting on the session's transient
allocation peak: measured ~3.6–4.7 GB idle after a plain 60 s session and
~6 GB after one that swapped to the 240 s profile, vs a true idle floor of
~1 GB. No objects leak — a live-server census showed 14 MiB of live tensors
but 3–5 GB reserved in PyTorch's caching allocator. Teardown frees its last
tensors asynchronously (recv-thread join timeout, TRT finalizer chains),
after every in-band empty_cache() has already run, and nothing trims
afterwards. The reserved pool is invisible to TensorRT's cudaMalloc
workspaces, so it consumed exactly the headroom the next swap needed — the
slow "OOM over time" pattern.

- Idle VRAM janitor (`ws_adapter.start_idle_vram_janitor`): when no
  session is registered and the pool holds >512 MiB of freed blocks, run
  `gc.collect()` + `torch.cuda.empty_cache()`. Idle-only by construction,
  so it never competes with the realtime loop.
- Final allocator trim at WS-handler exit, after the body frame (the last
  holder of session refs) is gone.
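
A rough sketch of the janitor's trim decision follows. It is not the `ws_adapter` implementation: `should_trim` and `janitor_tick` are hypothetical names, and it assumes "freed blocks" means reserved minus allocated bytes in PyTorch's caching allocator. `torch` is imported lazily so the decision logic can be exercised without a GPU.

```python
import gc

TRIM_THRESHOLD_BYTES = 512 * 1024 * 1024  # 512 MiB, the threshold quoted above


def should_trim(active_sessions: int, reserved_bytes: int,
                allocated_bytes: int,
                threshold: int = TRIM_THRESHOLD_BYTES) -> bool:
    """Return True only when idle and the cached-but-free pool is large.

    Assumption: cached-but-free bytes = reserved - allocated.
    """
    if active_sessions > 0:
        return False  # never trim while a realtime session is registered
    return (reserved_bytes - allocated_bytes) > threshold


def janitor_tick(active_sessions: int) -> bool:
    """One janitor pass; returns True if the pool was trimmed."""
    import torch  # lazy import keeps should_trim testable without torch

    if not torch.cuda.is_available():
        return False
    reserved = torch.cuda.memory_reserved()
    allocated = torch.cuda.memory_allocated()
    if not should_trim(active_sessions, reserved, allocated):
        return False
    gc.collect()              # drop unreachable objects still holding tensors
    torch.cuda.empty_cache()  # hand cached blocks back to the driver
    return True
```

A background task would call something like `janitor_tick` periodically with the current session count; the polling interval is not stated in this PR and is left out here.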

Verified over 9 session start/close cycles (plain and swap):
idle returns to 982–1008 MiB every time (previously 3.6–6 GB).

3. Permanent second model copy for uploads (`54a9141`)

The upload_track path lazily built a full eager Session and cached it
for the process lifetime — pinning a second DiT+VAE+text-encoder copy
(~6 GB) on the GPU from the first upload onward, next to the streaming
TRT engines. It only ever runs VAE encode + semantic extract, both of which
hop weights to the GPU per call via _load_model_context.

- The upload encoder now builds with `offload_to_cpu=True` +
  `offload_dit_to_cpu=True`: weights live in system RAM, uploads measured at
  14–22 s, process settles at ~1.4 GB instead of ~7 GB.
- `Session` now exposes `offload_dit_to_cpu` (passthrough to `ModelContext`);
  without it, `offload_to_cpu` lets the DiT go GPU-resident on first use.

Validation

- Unit: 159 passed (incl. new chunk-plan invariants and engine-selection
  tests in `tests/unit/test_engine_profiles.py`).
- Golden: same pass/fail profile as main. The only failure
  (`swap_fixture`) fails identically on main (uncalibrated thresholds +
  stochastic VAE-encode sampling; branch metrics were closer to reference
  than main's run).
- Full-stack integration (live server, WS protocol): 60s→240s→60s→240s
  profile swaps, vocal/instrument stem rips, and a realistic 1.2 GB-delta
  LoRA enabled across swaps. All passed, including one run under genuine
  pressure (GPU at 31/32.6 GB total).
- A/B pressure repro at identical 5.9 GiB free: the old swap behavior
  fails exactly like prod (240 s encode context requests 11.8 GB → context
  creation fails); the new path chunk-encodes the same 200 s source through
  the 60 s engine and completes a full stem extraction on top.
- Leak cycles: 5 plain + 4 swap session cycles, idle VRAM back to ~1 GB
  after every close.

Deploy notes

- Pods need `vae_encode_fp16_60s` built to get the encode saving (~30 s
  build). Pods with only 240 s engines keep working unchanged.
- `scripts/deploy/stage2_models_engines.sh` currently builds only
  `--duration 240`; worth adding the 60 s encode engine to provisioning.
- Combined recoverable VRAM on a 32 GB pod: roughly 10–15 GB (≈6.4 GB
  resident from encode-engine pinning at the 240 s profile, 3.6–6 GB idle
  pool, ~6 GB upload encoder), which is the difference between
  "31.31/31.36 GiB used, 94 MiB allocation fails" and comfortable headroom.

🤖 Generated with Claude Code

The 240s vae_encode engine reserves ~6.4GB more workspace than the 60s one at context creation, and loading it mid-swap under memory pressure is the fleet's top OOM/restart cause (swap_failed: vae_encode / stem_extract_failed). VAE encode runs once per source swap, not per realtime tick, so the capacity is wasted.

- _trt_vae_encode now chunks inputs longer than the engine's max profile shape (64-frame overlap margins, moments stitched at frame granularity, single sampling pass). Bit-exact vs single-shot encode at >=8-frame margins (validated on GPU; see scripts/benchmarks/validate_chunked_vae_encode.py).
- available_trt_engines pins vae_encode to the smallest BUILT encode engine for every duration; decoder/vae_decode still resolve by duration. Profile swaps therefore never reload vae_encode.
- ensure_walk_profile / walk-mode session init no longer resolve the full-source profile (which wrongly required a big decoder on disk).
- decoder TRT load: raise on create_execution_context()==None instead of crashing on the next tick, and retry once after empty_cache() (same policy as the vae_encode load).
- LoRA delta materialization falls back to CPU on GPU OOM and returns the GPU matmul scratch to the driver when deltas store on CPU.
- server: default PYTORCH_ALLOC_CONF=expandable_segments:True to kill the reserved-but-unallocated fragmentation OOM class.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
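
The decoder load policy named in this commit (treat a None context as an OOM signal, retry once after freeing cached memory) might look roughly like the sketch below. The names `ContextCreationError` and `create_context_with_retry` are hypothetical, and the ordering (retry first, raise only if the second attempt also returns None) is an assumption about how the two halves of that bullet combine.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ContextCreationError(RuntimeError):
    """Hypothetical error type for a context that could not be created."""


def create_context_with_retry(
    create_context: Callable[[], Optional[T]],
    release_cached_memory: Callable[[], None],
) -> T:
    """Create an execution context, retrying once after a cache trim.

    A None result is treated as a silent out-of-memory signal. Raising
    here surfaces the failure at load time instead of on a later tick.
    """
    ctx = create_context()
    if ctx is not None:
        return ctx
    release_cached_memory()  # e.g. torch.cuda.empty_cache
    ctx = create_context()
    if ctx is None:
        raise ContextCreationError(
            "execution context creation returned None after one retry"
        )
    return ctx
```

With TensorRT and PyTorch the two callables would plausibly be `engine.create_execution_context` and `torch.cuda.empty_cache`; passing them in keeps the policy testable with fakes.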

…encoder

Session start/close cycles left the pod sitting on the session's transient allocation peak: measured driver-level, ~3.6-4.7 GB idle after a plain 60s session and ~6 GB after one that swapped to the 240s decoder profile, on a server whose true idle floor is ~1 GB. No objects leak: teardown frees its last tensors asynchronously (recv thread joined with timeout, polygraphy/TRT finalizer chains), AFTER every in-band empty_cache() has already run, so the freed blocks stay reserved in PyTorch's caching pool indefinitely. That pool is invisible to TensorRT's cudaMalloc workspaces, so it eats exactly the headroom the next session's engine loads and swap-time stem extraction need. This is the slow 'OOM over time' pattern on fleet pods.

- ws_adapter: idle VRAM janitor. When no session is registered and the pool holds >512 MiB of freed blocks, gc.collect() + empty_cache(). Idle-only by construction; never competes with the realtime loop. Cycle-tested: idle returns to ~1 GB after every session (was 3.6-6 GB).
- ws_adapter: final trim in handle_client after the body frame (the last session-ref holder) is gone.
- ws_adapter: the upload-encoder Session now offloads to CPU (offload_to_cpu + offload_dit_to_cpu). It previously pinned a full eager DiT+VAE+text-encoder copy (~6 GB) on the GPU permanently from the first upload onward, next to the streaming TRT engines. prepare_source hops weights per call; uploads measured 14-22s and settle at ~1.4 GB.
- Session: expose offload_dit_to_cpu (passthrough to ModelContext); without it offload_to_cpu lets the DiT go GPU-resident on first use.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

leszko and others added 2 commits June 10, 2026 08:55
