
Reduce VRAM #242

Draft
leszko wants to merge 2 commits into main from rafal/reduce-vram



leszko commented Jun 10, 2026

Collaborator

fix(vram): eliminate swap-time OOM and reclaim idle VRAM on long-uptime pods

Branch: rafal/reduce-vram → main
Commits: 1200019 (chunked VAE encode + engine pinning), 54a9141 (idle pool reclaim + upload encoder offload)

Problem

Fleet pods (32 GB RTX 5090) intermittently restart with:

swap_failed: GPU memory pressure prevented loading the vae_encode engine for this source duration.
stem_extract_failed context=swap error=CUDA out of memory. Tried to allocate 94.00 MiB ...

The pods sit at ~31.3/31.4 GiB and any swap-time allocation tips them over.
Three compounding causes were identified and fixed.

1. Swap-time vae_encode engine load (1200019)

The 240 s vae_encode TRT engine requests ~11.8 GB of activation workspace at
context creation (reproduced exactly; the 60 s engine needs ~4 GB), and it is
loaded mid-swap, after the old engine has already been evicted. That is the
unrecoverable TRTProfileLoadError path that restarts the pod. VAE encode runs
once per source swap, not per realtime tick, so the capacity is wasted.

  • _trt_vae_encode now chunks inputs longer than the engine's max profile
    shape (overlapping chunks, moments stitched at latent-frame granularity,
    single sampling pass). Bit-exact (fp16) against single-shot encode at
    ≥8-frame margins; shipped with a 64-frame (2.56 s) margin. Validation
    script: scripts/benchmarks/validate_chunked_vae_encode.py.
  • available_trt_engines pins vae_encode to the smallest built encode
    engine for every duration; decoder/vae_decode still resolve by duration.
    Profile swaps therefore never reload vae_encode — the highest-pressure
    engine load is gone from the swap path entirely.
  • Walk mode no longer resolves the full-source profile (which wrongly required
    a big decoder on disk for long sources).
  • Decoder TRT load: raise immediately when create_execution_context()
    returns None (TRT's silent OOM signal — previously crashed on the next
    tick), with one empty_cache() + retry, matching the vae_encode policy.
  • LoRA delta materialization falls back to CPU on GPU OOM and returns the
    GPU matmul scratch to the driver when deltas store on CPU.
  • Server defaults PYTORCH_ALLOC_CONF=expandable_segments:True (kills the
    "94 MiB failed with hundreds of MiB reserved-but-unallocated" fragmentation
    class; operator overrides respected).
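
The chunking in the first bullet can be illustrated with a small planning sketch. This is not the code in _trt_vae_encode; the names plan_chunks, Chunk, toy_encode and encode_chunked are hypothetical, and the sketch assumes a 1:1 mapping between input and output frames to stay short. The frame counts 1500 and 6000 are illustrative, derived from the stated 64 frames = 2.56 s. The toy encoder has a finite receptive field, which is the property that lets overlapping chunks reproduce the single-shot result once the margin covers that field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start: int       # first frame fed to the engine (inclusive)
    stop: int        # last frame fed to the engine (exclusive)
    keep_start: int  # first frame whose output is kept (inclusive)
    keep_stop: int   # last frame whose output is kept (exclusive)


def plan_chunks(total: int, max_len: int, margin: int) -> list[Chunk]:
    """Tile [0, total) with windows of at most max_len frames.

    Every kept frame has at least `margin` frames of context on each side,
    except at the true start and end of the source.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if max_len <= 2 * margin:
        raise ValueError("max_len must exceed twice the margin")
    if total <= max_len:
        return [Chunk(0, total, 0, total)]

    chunks: list[Chunk] = []
    keep_start = 0
    while keep_start < total:
        start = max(0, keep_start - margin)
        stop = min(total, start + max_len)
        # If the window hit the end of the source, slide it back so the
        # final chunk is still full length (extra left context is harmless).
        start = max(0, stop - max_len)
        keep_stop = total if stop == total else stop - margin
        chunks.append(Chunk(start, stop, keep_start, keep_stop))
        keep_start = keep_stop
    return chunks


def toy_encode(frames: list[int], radius: int = 8) -> list[int]:
    """Stand-in for an encoder with a finite receptive field (assumption)."""
    n = len(frames)
    return [sum(frames[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def encode_chunked(frames: list[int], max_len: int, margin: int) -> list[int]:
    """Encode chunk by chunk and stitch only the kept region of each chunk."""
    out: list[int] = []
    for c in plan_chunks(len(frames), max_len, margin):
        moments = toy_encode(frames[c.start:c.stop])
        out.extend(moments[c.keep_start - c.start:c.keep_stop - c.start])
    return out


if __name__ == "__main__":
    # 6000 frames through a 1500-frame engine with the shipped 64-frame margin.
    for chunk in plan_chunks(6000, 1500, 64):
        print(chunk)
```

In the PR the stitched values are the posterior moments, and sampling happens once over the stitched result; the sketch stops at stitching.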

2. Idle pool never reclaimed after session teardown (54a9141)

Session start/close cycles left pods sitting on the session's transient
allocation peak: measured ~3.6–4.7 GB idle after a plain 60 s session and
~6 GB after one that swapped to the 240 s profile, vs a true idle floor of
~1 GB. No objects leak: a live-server census showed 14 MiB of live tensors
but 3–5 GB reserved in PyTorch's caching allocator. Teardown frees its last
tensors asynchronously (recv-thread join timeout, TRT finalizer chains),
after every in-band empty_cache() has already run, and nothing trims
afterwards. The reserved pool is invisible to TensorRT's cudaMalloc
workspaces, so it consumed exactly the headroom the next swap needed — the
slow "OOM over time" pattern.

  • Idle VRAM janitor (ws_adapter.start_idle_vram_janitor): when no
    session is registered and the pool holds >512 MiB of freed blocks, run
    gc.collect() + torch.cuda.empty_cache(). Idle-only by construction —
    it never competes with the realtime loop.
  • Final allocator trim at WS-handler exit, after the body frame (the last
    holder of session refs) is gone.

Verified over 9 session start/close cycles (plain and swap):
idle returns to 982–1008 MiB every time (previously 3.6–6 GB).
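
A minimal sketch of the janitor's trim decision, assuming the pool's freed blocks are estimated as reserved minus allocated bytes. The function names are hypothetical and this is not the code in ws_adapter.start_idle_vram_janitor; the decision logic is kept separate from torch so it can be exercised without a GPU.

```python
import gc
from typing import Callable

TRIM_THRESHOLD_BYTES = 512 * 1024 * 1024  # the 512 MiB threshold quoted above


def cached_but_free(reserved: int, allocated: int) -> int:
    """Bytes held by the caching allocator that no live tensor is using."""
    return max(0, reserved - allocated)


def maybe_trim(
    active_sessions: int,
    reserved: int,
    allocated: int,
    empty_cache: Callable[[], None],
    threshold: int = TRIM_THRESHOLD_BYTES,
) -> bool:
    """Trim only when idle and the freed pool exceeds the threshold."""
    if active_sessions > 0:
        return False  # never compete with the realtime loop
    if cached_but_free(reserved, allocated) <= threshold:
        return False
    gc.collect()   # drop unreachable Python objects that still hold tensors
    empty_cache()  # hand the freed blocks back to the driver
    return True


def trim_with_torch(active_sessions: int) -> bool:
    """Wire the decision to PyTorch's allocator counters (needs torch + CUDA)."""
    import torch  # imported lazily so the logic above has no hard dependency

    if not torch.cuda.is_available():
        return False
    return maybe_trim(
        active_sessions,
        torch.cuda.memory_reserved(),
        torch.cuda.memory_allocated(),
        torch.cuda.empty_cache,
    )
```

With the census figures above (14 MiB live, several GB reserved) and no registered session, maybe_trim takes the trim path; with a session registered it does nothing.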

3. Permanent second model copy for uploads (54a9141)

The upload_track path lazily built a full eager Session and cached it
for the process lifetime — pinning a second DiT+VAE+text-encoder copy
(~6 GB) on the GPU from the first upload onward, next to the streaming
TRT engines. It only ever runs VAE encode + semantic extract, both of which
hop weights to the GPU per call via _load_model_context.

  • The upload encoder now builds with offload_to_cpu=True +
    offload_dit_to_cpu=True: weights live in system RAM, uploads measured at
    14–22 s, process settles at ~1.4 GB instead of ~7 GB.
  • Session now exposes offload_dit_to_cpu (passthrough to ModelContext);
    without it, offload_to_cpu lets the DiT go GPU-resident on first use.
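
The per-call weight hop can be pictured as a context manager that moves a module to the GPU for one call and always moves it back. This is a generic sketch, not the project's _load_model_context, whose implementation is not shown here. on_device and FakeModule are hypothetical names; FakeModule stands in for a torch.nn.Module, whose .to(device) has the same shape.

```python
from contextlib import contextmanager


class FakeModule:
    """Stand-in for torch.nn.Module so the sketch runs without torch."""

    def __init__(self) -> None:
        self.device = "cpu"
        self.moves: list[str] = []

    def to(self, device: str) -> "FakeModule":
        self.device = device
        self.moves.append(device)
        return self


@contextmanager
def on_device(module, device: str = "cuda", home: str = "cpu"):
    """Move weights to `device` for the duration of one call, then back home."""
    module.to(device)
    try:
        yield module
    finally:
        # Runs on success and on error, so a failed upload cannot leave
        # the weights resident on the GPU.
        module.to(home)


if __name__ == "__main__":
    encoder = FakeModule()
    with on_device(encoder) as m:
        print("during call:", m.device)
    print("after call:", encoder.device)
```

The trade-off matches the measurements above: each upload pays a host-to-device transfer, and in exchange the weights do not occupy VRAM between uploads.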

Validation

  • Unit: 159 passed (incl. new chunk-plan invariants and engine-selection
    tests in tests/unit/test_engine_profiles.py).
  • Golden: same pass/fail profile as main — the only failure
    (swap_fixture) fails identically on main (uncalibrated thresholds +
    stochastic VAE-encode sampling; branch metrics were closer to reference
    than main's run).
  • Full-stack integration (live server, WS protocol): 60s→240s→60s→240s
    profile swaps, vocal/instrument stem rips, and a realistic 1.2 GB-delta
    LoRA enabled across swaps — all passed, including one run under genuine
    pressure (GPU at 31/32.6 GB total).
  • A/B pressure repro at identical 5.9 GiB free: the old swap behavior
    fails exactly like prod (240 s encode context requests 11.8 GB → context
    creation fails); the new path chunk-encodes the same 200 s source through
    the 60 s engine and completes a full stem extraction on top.
  • Leak cycles: 5 plain + 4 swap session cycles, idle VRAM back to ~1 GB
    after every close.

Deploy notes

  • Pods need vae_encode_fp16_60s built to get the encode saving (~30 s
    build). Pods with only 240 s engines keep working unchanged.
    scripts/deploy/stage2_models_engines.sh currently builds only
    --duration 240 — worth adding the 60 s encode engine to provisioning.
  • Combined recoverable VRAM on a 32 GB pod: roughly 10–15 GB (≈6.4 GB
    resident from encode-engine pinning at the 240 s profile, 3.6–6 GB idle
    pool, ~6 GB upload encoder), which is the difference between
    "31.31/31.36 GiB used, 94 MiB allocation fails" and comfortable headroom.

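The first deploy note follows from the pinning rule in section 1: the encode engine is always the smallest one that has been built, so a pod with only 240 s engines simply keeps using the 240 s encoder. A sketch under stated assumptions: both function names are hypothetical, and the decoder rule shown (smallest built engine that covers the source duration) is a guess at what "resolve by duration" means, not taken from available_trt_engines.

```python
def pick_encode_engine(built_durations: list[int]) -> int:
    """Pin vae_encode to the smallest built engine, whatever the source length."""
    if not built_durations:
        raise LookupError("no vae_encode engine has been built")
    return min(built_durations)


def pick_decoder_engine(built_durations: list[int], source_seconds: float) -> int:
    """Assumed rule: smallest built decoder engine that covers the source."""
    fits = [d for d in built_durations if d >= source_seconds]
    if not fits:
        raise LookupError(f"no decoder engine covers {source_seconds} s")
    return min(fits)


if __name__ == "__main__":
    # Pod provisioned with only --duration 240: unchanged behaviour.
    print(pick_encode_engine([240]))
    # Pod with the 60 s encode engine added: long sources still use it.
    print(pick_encode_engine([60, 240]), pick_decoder_engine([60, 240], 200))
```
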
🤖 Generated with Claude Code

leszko and others added 2 commits June 10, 2026 08:55
The 240s vae_encode engine reserves ~6.4GB more workspace than the
60s one at context creation, and loading it mid-swap under memory
pressure is the fleet's top OOM/restart cause (swap_failed:
vae_encode / stem_extract_failed). VAE encode runs once per source
swap, not per realtime tick, so the capacity is wasted.

- _trt_vae_encode now chunks inputs longer than the engine's max
  profile shape (64-frame overlap margins, moments stitched at frame
  granularity, single sampling pass). Bit-exact vs single-shot encode
  at >=8-frame margins (validated on GPU; see
  scripts/benchmarks/validate_chunked_vae_encode.py).
- available_trt_engines pins vae_encode to the smallest BUILT encode
  engine for every duration; decoder/vae_decode still resolve by
  duration. Profile swaps therefore never reload vae_encode.
- ensure_walk_profile / walk-mode session init no longer resolve the
  full-source profile (which wrongly required a big decoder on disk).
- decoder TRT load: raise on create_execution_context()==None instead
  of crashing on the next tick, and retry once after empty_cache()
  (same policy as the vae_encode load).
- LoRA delta materialization falls back to CPU on GPU OOM and returns
  the GPU matmul scratch to the driver when deltas store on CPU.
- server: default PYTORCH_ALLOC_CONF=expandable_segments:True to kill
  the reserved-but-unallocated fragmentation OOM class.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…encoder

Session start/close cycles left the pod sitting on the session's
transient allocation peak: measured driver-level, ~3.6-4.7 GB idle
after a plain 60s session and ~6 GB after one that swapped to the
240s decoder profile, on a server whose true idle floor is ~1 GB.
No objects leak — teardown frees its last tensors asynchronously
(recv thread joined with timeout, polygraphy/TRT finalizer chains),
AFTER every in-band empty_cache() has already run, so the freed
blocks stay reserved in PyTorch's caching pool indefinitely. That
pool is invisible to TensorRT's cudaMalloc workspaces, so it eats
exactly the headroom the next session's engine loads and swap-time
stem extraction need — the slow 'OOM over time' pattern on fleet
pods.

- ws_adapter: idle VRAM janitor — when no session is registered and
  the pool holds >512 MiB of freed blocks, gc.collect() +
  empty_cache(). Idle-only by construction; never competes with the
  realtime loop. Cycle-tested: idle returns to ~1 GB after every
  session (was 3.6-6 GB).
- ws_adapter: final trim in handle_client after the body frame (the
  last session-ref holder) is gone.
- ws_adapter: the upload-encoder Session now offloads to CPU
  (offload_to_cpu + offload_dit_to_cpu). It previously pinned a full
  eager DiT+VAE+text-encoder copy (~6 GB) on the GPU permanently
  from the first upload onward, next to the streaming TRT engines.
  prepare_source hops weights per call; uploads measured 14-22s and
  settle at ~1.4 GB.
- Session: expose offload_dit_to_cpu (passthrough to ModelContext);
  without it offload_to_cpu lets the DiT go GPU-resident on first
  use.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>