
VRAM Pressure - Melband #244

Open
BuffMcBigHuge wants to merge 4 commits into main from marco/feat/vram-pressure-2


BuffMcBigHuge commented Jun 11, 2026

Collaborator

Includes #238 (Latency Improvement Experimentation) as its base commit; everything below describes the VRAM-management and upload-pipeline work layered on top.

Problem

Uploading a track loads Mel-Band RoFormer on top of the resident ACE-Step 1.5 stack. On a 24 GB card running the TRT realtime session this caused a cluster of failures:

  • Memory-pressure spikes during every stem separation, with no relief when the card was full.
  • The shared eager upload-encoder (a full second copy of the ACE-Step weights, ~6 GB) loaded on the first upload and stayed resident for the process lifetime.
  • Doubled WS connections (page reload, stale tab reconnect) ran two StreamingSession.create() calls concurrently: one OOMs, its cleanup evicts shared TRT VAE cache entries out from under the healthy session, and both sessions die ('NoneType' object has no attribute 'decode').
  • Uploads longer than the live session's TRT profile (e.g. a 120 s track vs the cached 60 s vae_encode engine) were rejected outright (TRT VAE encode rejected input shape).
  • A second long upload OOM'd: the encoder restored its entire DiT (+4.7 GB), including the generation decoder that uploads never execute, on top of a 120 s-profile session (~19 GB resident).
  • Time-to-sound on a new upload was the full pipeline (analysis + encode + separation + persist), ~20–40 s.

Operating principle

The separator and the eager ACE-Step weights never occupy VRAM at the same time. Before Mel-Band RoFormer loads, the resident ACE-Step context parks its eager modules in system RAM; the separator runs in the vacated space, is released, and only then does ACE-Step return. This is unconditional by default — not gated on a free-VRAM heuristic.

Changes

1. Melband VRAM parking (acestep/engine/model_context.py, acestep/streaming/stems.py)

  • ModelContext gains a placement lock and vram_parked(): eager modules (DiT, VAE, text encoder) + silence latent move to CPU, freed pages return to CUDA, everything restores on exit (exception-safe, reentrant). TRT engines are untouched (their memory belongs to execution contexts).
  • Every eager-module consumer routes through _load_model_context(), which takes the same lock — a concurrent op (prompt re-encode, timbre/structure set) issued mid-park blocks until restore instead of running GPU inputs against CPU weights.
  • Park policy via DEMON_MELBAND_VRAM_PARK: always (default), auto (only when claimable VRAM < DEMON_MELBAND_VRAM_RESERVE_GB, default 6.0), never.
  • Structured stems_vram phase=... telemetry at every transition (before/parked/loaded/separated/released/restored).
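The park/restore protocol described above can be sketched roughly as follows. This is a minimal illustration, not the code in acestep/engine/model_context.py: FakeModule, ModelContextSketch, and load() are hypothetical stand-ins, and the device moves are simulated so the example runs without a GPU. The reentrancy counter and the shared lock are one plausible way to get the "exception-safe, reentrant" behavior the description mentions.

```python
import threading
from contextlib import contextmanager


class FakeModule:
    """Hypothetical stand-in for a torch.nn.Module; only tracks its device."""

    def __init__(self, device="cuda"):
        self.device = device

    def to(self, device):
        self.device = device
        return self


class ModelContextSketch:
    """Illustrative only: parks eager modules on CPU for the duration of a block."""

    def __init__(self, modules):
        self.modules = modules  # e.g. {"dit": ..., "vae": ..., "text_encoder": ...}
        # RLock so the parking thread can nest parks; other threads block.
        self._placement_lock = threading.RLock()
        self._park_depth = 0

    @contextmanager
    def vram_parked(self):
        with self._placement_lock:
            self._park_depth += 1
            try:
                if self._park_depth == 1:
                    for module in self.modules.values():
                        module.to("cpu")
                    # A real implementation would return freed pages here,
                    # e.g. with torch.cuda.empty_cache().
                yield self
            finally:
                self._park_depth -= 1
                if self._park_depth == 0:
                    # Restore runs even if the body raised.
                    for module in self.modules.values():
                        module.to("cuda")

    def load(self, name):
        # Consumers take the same lock, so a call from another thread
        # waits until the park has been restored.
        with self._placement_lock:
            return self.modules[name]
```

A caller would wrap the separator run in `with ctx.vram_parked(): ...`; any thread calling `ctx.load("vae")` in the meantime blocks until the block exits and the modules are back on the GPU.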

2. Upload-encoder lifecycle (ws_adapter.py, acestep/engine/session.py)

  • Built parked: constructed with offload_to_cpu=True, offload_dit_to_cpu=True (new Session passthrough) so weights land in system RAM — no ~6 GB construction spike next to the live session — then flipped to resident mode governed by the park protocol.
  • Generation stack stripped: uploads execute exactly three model surfaces (VAE encode, semantic extract, conditioning encoder). _strip_upload_encoder_generation_stack() drops the DiT decoder (1,575,458,880 params, ~3.2 GB) and the eager DiffusionEngine at construction. The per-upload GPU restore is the ~1.3 GB conditioning stack, not a 4.7 GB model copy — this is what fits uploads inside the headroom of a 120 s-profile session.
  • Persistently offloaded after each upload's background rip (offload_eager_to_cpu()); _load_model_context() lazily restores only the modules the next upload touches.
  • Phase-1 encode catches a CUDA OOM once, returns torch's cached pages, and retries before failing.
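The retry-once behavior of the phase-1 encode might look like the sketch below. The helper name and its parameters are hypothetical; in the real code the exception would presumably be torch.cuda.OutOfMemoryError and the cleanup torch.cuda.empty_cache(), but the example takes both as arguments so it runs without PyTorch.

```python
def run_with_one_oom_retry(fn, release_cache, oom_error=MemoryError):
    """Call fn(); on the first OOM, release cached pages and try exactly once more.

    Hypothetical helper: `oom_error` and `release_cache` are injected so the
    sketch stays independent of any GPU library.
    """
    try:
        return fn()
    except oom_error:
        release_cache()  # hand cached allocator pages back before retrying
        return fn()      # a second OOM propagates to the caller
```

Limiting the retry to a single attempt keeps the failure mode bounded: if releasing the cache does not free enough memory, the upload fails promptly instead of looping.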

3. Two-phase uploads + background stem rip (ws_adapter.py, acestep/streaming/stems.py, acestep/user_uploads.py, web client)

  • Phase 1 (synchronous, sub-second server-side when warm): analyze + VAE-encode the full source + persist → ack upload_ok with stems_pending: true. The client can swap — and hear audio — immediately.
  • Phase 2 (background thread): RoFormer separation under the park, per-stem sidecars, stem WAVs, metadata re-save (persist_user_upload_stems). Finished stems are pushed to the live session as a late stem_assets frame (source_mode: "" = overlay-only, never a mode change); failures push stem_failed. Results are discarded if the session-end wipe deleted the track mid-rip.
  • Pending-stems registry coordinates the swap path: a mode-full swap proceeds without stems (no duplicate separation); a vocals/instruments swap — where the stem is the inference source — waits for the rip, then loads from disk cache.
  • Client: upload_ok.stems_pending → stem status "processing"; the existing stem_assets/stem_failed listeners flip it to ready/failed whenever the push lands.
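The swap gating that the pending-stems registry provides can be illustrated with a small event-based sketch. The class and method names, and the mode strings "full", "vocals", and "instruments", are assumptions made for this example rather than the identifiers used in the PR.

```python
import threading


class PendingStemsSketch:
    """Illustrative registry: tracks background rips and gates stem swaps."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # track_id -> [Event, rip_succeeded]

    def begin(self, track_id):
        """Record that a background rip has started for this track."""
        with self._lock:
            self._pending[track_id] = [threading.Event(), False]

    def finish(self, track_id, ok):
        """Record the rip outcome and wake any swap waiting on it."""
        with self._lock:
            entry = self._pending.get(track_id)
        if entry is not None:
            entry[1] = ok
            entry[0].set()

    def wait_for_stems(self, track_id, mode, timeout=None):
        """Return True if a swap in `mode` may proceed."""
        if mode == "full":
            return True  # the full mix is the source; stems are not needed
        with self._lock:
            entry = self._pending.get(track_id)
        if entry is None:
            return True  # no rip recorded; caller loads stems from disk cache
        if not entry[0].wait(timeout):
            return False  # rip still running when the timeout expired
        return entry[1]
```

With this shape, a full-mode swap never triggers or waits on a separation, while a stem-mode swap blocks on the same rip that phase 2 already started instead of launching a duplicate.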

4. Single-active-session policy (ws_adapter.py, acestep/streaming/session.py, web client)

  • StreamingSession.create() calls are serialized; a new main-session connection preempts the active session: it stops the old runner, closes the old socket with close code 4001 (PREEMPTED_CLOSE_CODE), and waits on the old session's StreamingSession.closed event (added in this PR) until the old stack's VRAM is actually released before building the new one.
  • The web client treats 4001 as final ("another connection took over this pod") instead of entering the reconnect loop — no preemption ping-pong between tabs.
  • This removes both the dual-create OOM and the shared-TRT-cache eviction cascade at the root: teardown and creation never overlap.
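A minimal asyncio sketch of the single-active-session policy follows. Only PREEMPTED_CLOSE_CODE and the idea of a closed event come from the description; FakeSession, SessionGate, and acquire() are hypothetical, and the teardown is simulated.

```python
import asyncio

PREEMPTED_CLOSE_CODE = 4001


class FakeSession:
    """Hypothetical stand-in for a streaming session."""

    def __init__(self, name):
        self.name = name
        self.closed = asyncio.Event()
        self.close_code = None

    async def close(self, code):
        self.close_code = code
        await asyncio.sleep(0)  # stand-in for stopping the runner, freeing VRAM
        self.closed.set()


class SessionGate:
    """Serializes session creation; a newcomer preempts the active session."""

    def __init__(self):
        self._create_lock = asyncio.Lock()
        self.active = None

    async def acquire(self, name):
        async with self._create_lock:  # creations never overlap
            old = self.active
            if old is not None:
                await old.close(PREEMPTED_CLOSE_CODE)
                await old.closed.wait()  # teardown finishes before we build
            self.active = FakeSession(name)
            return self.active


async def demo():
    gate = SessionGate()
    first, second = await asyncio.gather(
        gate.acquire("tab-a"), gate.acquire("tab-b")
    )
    return first, second, gate.active
```

Running `asyncio.run(demo())` shows the second connection closing the first with code 4001 and becoming the active session; on the client side, treating that code as final is what prevents the two tabs from preempting each other in a loop.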

5. Shape-aware TRT VAE engine selection (acestep/nodes/vae_nodes.py)

  • The process-wide TRT VAE cache can hand the upload encoder an engine belonging to the live session whose optimization profile doesn't cover the upload's length. _trt_vae_profile_fits() checks the input shape against the cached engine's profile first; on a mismatch, a handler that carries an eager VAE falls back to it (both encode and decode nodes). 120 s+ uploads now work alongside a 60 s session.
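The profile-fit guard amounts to a per-dimension range check followed by a fallback decision. The sketch below assumes a profile can be represented as a (min_shape, max_shape) pair; the function names and the example shapes (stereo audio at 48 kHz) are illustrative, not taken from acestep/nodes/vae_nodes.py.

```python
def profile_fits(input_shape, min_shape, max_shape):
    """True if every dimension lies within the profile's [min, max] range."""
    if not (len(input_shape) == len(min_shape) == len(max_shape)):
        return False
    return all(
        lo <= dim <= hi
        for dim, lo, hi in zip(input_shape, min_shape, max_shape)
    )


def choose_vae_path(input_shape, profile, has_eager_vae):
    """Pick the TRT engine when the shape fits, else fall back to eager."""
    min_shape, max_shape = profile
    if profile_fits(input_shape, min_shape, max_shape):
        return "trt"
    if has_eager_vae:
        return "eager"
    raise ValueError(f"input shape {input_shape} is outside the TRT profile")


# Assumed example: an engine built for up to 60 s of 48 kHz stereo audio.
SIXTY_SECOND_PROFILE = ((1, 2, 48_000), (1, 2, 60 * 48_000))
```

Under these assumed shapes, `choose_vae_path((1, 2, 120 * 48_000), SIXTY_SECOND_PROFILE, has_eager_vae=True)` selects the eager path, which corresponds to a 120 s upload arriving while a 60 s engine is cached.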

6. Phase-1 latency (perceived performance)

  • librosa JIT warm at boot (server.py): the beat tracker's cold first call costs ~4–5 s of numba compilation; a boot-time daemon thread pays it before any user does.
  • Windowed + parallel analysis: BPM/key from a centered 60 s window (measured identical results at a fraction of the cost) running on a worker thread concurrently with the GPU source encode.
  • Separator RAM cache (DEMON_MELBAND_RAM_CACHE, default on): RoFormer weights stay in system RAM between rips; per-rip GPU residency is a ~0.2 s move instead of a ~2 s disk load. VRAM discipline unchanged.
  • TRT engine prewarm: during phase 1 (track duration is known) a background thread page-caches the engine files the post-upload swap will load — helps cold-page-cache pods, harmless elsewhere.
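The windowed, parallel analysis can be pictured as below: pick a centered 60 s slice for BPM/key analysis and run it on a worker thread while the caller's thread performs the source encode. The function names and the `analyze`/`encode` callables are placeholders for this sketch; the actual analysis and encode code is not shown in this description.

```python
from concurrent.futures import ThreadPoolExecutor


def centered_window(n_samples, sample_rate, window_s=60.0):
    """Return (start, stop) sample indices of a centered window.

    Tracks shorter than the window are analyzed in full.
    """
    window = int(round(window_s * sample_rate))
    if n_samples <= window:
        return 0, n_samples
    start = (n_samples - window) // 2
    return start, start + window


def analyze_and_encode(samples, sample_rate, analyze, encode):
    """Run `analyze` on the centered window while `encode` handles the full track."""
    start, stop = centered_window(len(samples), sample_rate)
    with ThreadPoolExecutor(max_workers=1) as pool:
        analysis = pool.submit(analyze, samples[start:stop], sample_rate)
        latents = encode(samples)  # runs concurrently with the analysis
        return analysis.result(), latents
```

For a 120 s track the window covers seconds 30 through 90, so the analysis cost stays roughly constant as uploads get longer.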

Wire / protocol changes

| Change | Notes |
| --- | --- |
| upload_ok.stems_pending: bool | Registry field + regenerated wireContract.gen.ts |
| WS close code 4001 | PREEMPTED_CLOSE_CODE, mirrored server (ws_adapter) ↔ client (web/sdk/types/protocol.ts); client treats as final |
| stem_assets.source_mode: "" | Overlay-only push semantics; real modes unchanged on init/swap paths |

Environment knobs

  • DEMON_MELBAND_VRAM_PARK = always (default) | auto | never
  • DEMON_MELBAND_VRAM_RESERVE_GB (auto-mode threshold, default 6.0)
  • DEMON_MELBAND_RAM_CACHE = 1 (default) | 0
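How the two park knobs combine can be sketched as a single decision function. The function name, the `env` parameter, and the handling of unrecognized values (treated as the default here) are assumptions of this example; only the variable names, the three policy values, and the 6.0 default come from the description.

```python
import os


def should_park(claimable_gb, env=None):
    """Decide whether to park ACE-Step's eager modules before a separation.

    `claimable_gb` is the VRAM currently claimable, in GB, however the caller
    measures it. `env` defaults to os.environ and is injectable for testing.
    """
    env = os.environ if env is None else env
    policy = env.get("DEMON_MELBAND_VRAM_PARK", "always").strip().lower()
    reserve_gb = float(env.get("DEMON_MELBAND_VRAM_RESERVE_GB", "6.0"))

    if policy == "never":
        return False
    if policy == "auto":
        return claimable_gb < reserve_gb
    # "always", and in this sketch any unrecognized value, parks unconditionally.
    return True
```

For example, with `DEMON_MELBAND_VRAM_PARK=auto` and the default reserve, a card with 5.9 GB claimable parks and one with 20 GB does not; with the default policy, both park.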

Measured results (RTX 4090 24 GB, acestep-v15-turbo, TRT)

| Metric | Before | After |
| --- | --- | --- |
| upload_ok (120 s track, warm) | ~20–40 s | 0.86 s server-side (~4 s incl. localhost transfer) |
| Time-to-sound after swap (same profile) | gated on full pipeline | ~1–2 s |
| Stems available (120 s, background) | blocking | +~10 s after upload_ok, mid-playback |
| Separator GPU attach per rip | 1.6–2 s (disk) | 0.1–0.2 s (RAM cache) |
| Park cycle during separation | none / heuristic | allocated e.g. 10.69 → 6.23 GB, every rip |
| VRAM after upload completes | +~6 GB permanent | encoder fully offloaded (back to session baseline) |
| Second consecutive 120 s upload | CUDA OOM (0 bytes free) | passes with ≥6.1 GB free at the worst point |
| Doubled connection at boot | OOM + both sessions dead | preempt (4001) + clean handoff |

Verification

  • Live end-to-end runs against the real TRT server: dual-connect preemption, 60 s and 120 s uploads, cross-profile (60 s → 120 s) swaps, instruments-stem swap racing the background rip (waits, no duplicate separation), three consecutive uploads, page-refresh teardown/rebuild — zero errors, telemetry confirming the park/restore cycle on every separation.
  • 200 unit tests passing locally (park/restore semantics, persistent offload + lazy restore, preemption incl. the 4001 cross-language constant guard, TRT profile-fit guard, pending-stems registry + swap gating, two-phase persistence, decoder strip, analysis windowing); wire-contract drift guards and npm run typecheck/build/test:unit green.
  • Full design notes in docs/VOCALSTEM.md (§ VRAM Management, § Two-Phase Uploads).

🤖 Generated with Claude Code

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

leszko commented Jun 11, 2026

Collaborator

@BuffMcBigHuge Is this related to the "VRAM Headroom" in this list? https://docs.google.com/spreadsheets/d/1QfDvH7Q1sKBQqaGI3Akqjpp4LmsKh_uhade32gL7HAU/edit?gid=45901989#gid=45901989

image

Asking, because I started working on the same thing, but I can assign it to you if you're already on it. Let me know.
