VRAM Pressure - Melband by BuffMcBigHuge · Pull Request #244 · daydreamlive/DEMON

BuffMcBigHuge · 2026-06-11T03:03:37Z

Includes #238 (Latency Improvement Experimentation) as its base commit; everything below describes the VRAM-management and upload-pipeline work layered on top.

Problem

Uploading a track loads Mel-Band RoFormer on top of the resident ACE-Step 1.5 stack. On a 24 GB card running the TRT realtime session this caused a cluster of failures:

Memory-pressure spikes during every stem separation, with no relief when the card was full.
The shared eager upload-encoder (a full second copy of the ACE-Step weights, ~6 GB) loaded on the first upload and stayed resident for the process lifetime.
Doubled WS connections (page reload, stale tab reconnect) ran two StreamingSession.create() calls concurrently. One call OOMs, its cleanup evicts shared TRT VAE cache entries out from under the healthy session, and both sessions die ('NoneType' object has no attribute 'decode').
Uploads longer than the live session's TRT profile (e.g. a 120 s track vs the cached 60 s vae_encode engine) were rejected outright (TRT VAE encode rejected input shape).
A second long upload OOM'd: the encoder restored its entire DiT (+4.7 GB), including the generation decoder that uploads never execute, on top of a 120 s-profile session (~19 GB resident).
Time-to-sound on a new upload was the full pipeline (analysis + encode + separation + persist), ~20–40 s.

Operating principle

The separator and the eager ACE-Step weights never occupy VRAM at the same time. Before Mel-Band RoFormer loads, the resident ACE-Step context parks its eager modules in system RAM; the separator runs in the vacated space, is released, and only then does ACE-Step return. This is unconditional by default — not gated on a free-VRAM heuristic.
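
The park/restore cycle can be illustrated with a minimal sketch. This is not the code in `acestep/engine/model_context.py`: `FakeModule` and `ParkableContext` are hypothetical stand-ins that only track a device string, so the example runs without a GPU. It shows the two properties the description relies on, exception-safe restore and reentrancy.

```python
import threading
from contextlib import contextmanager


class FakeModule:
    """Hypothetical stand-in for an eager model; it only records its device."""

    def __init__(self, name, device="cuda"):
        self.name = name
        self.device = device

    def to(self, device):
        self.device = device
        return self


class ParkableContext:
    """Hypothetical holder of eager modules with a reentrant park."""

    def __init__(self, modules):
        self.modules = list(modules)
        # RLock: the same thread may nest parks; other threads block until restore.
        self._lock = threading.RLock()
        self._depth = 0

    @contextmanager
    def vram_parked(self):
        with self._lock:
            self._depth += 1
            try:
                if self._depth == 1:
                    # Outermost park: move everything to system RAM.
                    for module in self.modules:
                        module.to("cpu")
                yield self
            finally:
                self._depth -= 1
                if self._depth == 0:
                    # Outermost exit: restore even if the body raised.
                    for module in self.modules:
                        module.to("cuda")
```

In a real torch setting the park step would presumably also return cached pages to the driver (for example with `torch.cuda.empty_cache()`); that detail is omitted here.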

Changes

1. Melband VRAM parking (`acestep/engine/model_context.py`, `acestep/streaming/stems.py`)

ModelContext gains a placement lock and vram_parked(): eager modules (DiT, VAE, text encoder) + silence latent move to CPU, freed pages return to CUDA, everything restores on exit (exception-safe, reentrant). TRT engines are untouched (their memory belongs to execution contexts).
Every eager-module consumer routes through _load_model_context(), which takes the same lock — a concurrent op (prompt re-encode, timbre/structure set) issued mid-park blocks until restore instead of running GPU inputs against CPU weights.
Park policy via DEMON_MELBAND_VRAM_PARK: always (default), auto (only when claimable VRAM < DEMON_MELBAND_VRAM_RESERVE_GB, default 6.0), never.
Structured stems_vram phase=... telemetry at every transition (before/parked/loaded/separated/released/restored).
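
A sketch of how the park policy could be resolved from the two environment variables. The function name `should_park` and the `env` parameter are illustrative, and treating an unrecognised mode as `always` is an assumption, not confirmed behaviour.

```python
import os


def should_park(claimable_gb, env=None):
    """Decide whether to park ACE-Step before loading the separator.

    claimable_gb: VRAM (in GB) the separator could claim right now.
    env: mapping to read settings from; defaults to os.environ.
    """
    env = os.environ if env is None else env
    mode = env.get("DEMON_MELBAND_VRAM_PARK", "always").strip().lower()
    reserve_gb = float(env.get("DEMON_MELBAND_VRAM_RESERVE_GB", "6.0"))

    if mode == "never":
        return False
    if mode == "auto":
        # Park only when the card cannot spare the reserve.
        return claimable_gb < reserve_gb
    # "always", plus (assumption) any unrecognised value.
    return True
```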

2. Upload-encoder lifecycle (`ws_adapter.py`, `acestep/engine/session.py`)

Built parked: constructed with offload_to_cpu=True, offload_dit_to_cpu=True (new Session passthrough) so weights land in system RAM — no ~6 GB construction spike next to the live session — then flipped to resident mode governed by the park protocol.
Generation stack stripped: uploads execute exactly three model surfaces (VAE encode, semantic extract, conditioning encoder). _strip_upload_encoder_generation_stack() drops the DiT decoder (1,575,458,880 params, ~3.2 GB) and the eager DiffusionEngine at construction. The per-upload GPU restore is the ~1.3 GB conditioning stack, not a 4.7 GB model copy — this is what fits uploads inside the headroom of a 120 s-profile session.
Persistently offloaded after each upload's background rip (offload_eager_to_cpu()); _load_model_context() lazily restores only the modules the next upload touches.
Phase-1 encode catches a CUDA OOM once, returns torch's cached pages, and retries before failing.
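
The catch-once-and-retry behaviour of the phase-1 encode can be sketched generically. `run_with_one_retry` is a hypothetical helper; it defaults to the built-in `MemoryError` so it runs anywhere, whereas with PyTorch the caller would pass `torch.cuda.OutOfMemoryError` as `oom_type` and `torch.cuda.empty_cache` as `release`.

```python
def run_with_one_retry(fn, release, oom_type=MemoryError):
    """Run fn(); on one out-of-memory error, free caches and try once more.

    A second failure propagates to the caller unchanged.
    """
    try:
        return fn()
    except oom_type:
        release()    # hand cached pages back before the second attempt
        return fn()  # no further handling: a repeat OOM is a real failure
```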

3. Two-phase uploads + background stem rip (`ws_adapter.py`, `acestep/streaming/stems.py`, `acestep/user_uploads.py`, web client)

Phase 1 (synchronous, sub-second server-side when warm): analyze + VAE-encode the full source + persist → ack upload_ok with stems_pending: true. The client can swap — and hear audio — immediately.
Phase 2 (background thread): RoFormer separation under the park, per-stem sidecars, stem WAVs, metadata re-save (persist_user_upload_stems). Finished stems are pushed to the live session as a late stem_assets frame (source_mode: "" = overlay-only, never a mode change); failures push stem_failed. Results are discarded if the session-end wipe deleted the track mid-rip.
Pending-stems registry coordinates the swap path: a mode-full swap proceeds without stems (no duplicate separation); a vocals/instruments swap — where the stem is the inference source — waits for the rip, then loads from disk cache.
Client: upload_ok.stems_pending → stem status "processing"; the existing stem_assets/stem_failed listeners flip it to ready/failed whenever the push lands.
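
The swap gating described above can be sketched with one `threading.Event` per in-flight rip. The class and method names are hypothetical and the real registry may differ; the point is that a full-mode swap never waits, while a stem-mode swap blocks until the background rip finishes.

```python
import threading


class PendingStems:
    """Hypothetical registry of tracks whose stem rip is still running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}

    def begin(self, track_id):
        """Called when phase 2 starts for a track."""
        with self._lock:
            self._events[track_id] = threading.Event()

    def finish(self, track_id):
        """Called when the rip ends, whether it succeeded or failed."""
        with self._lock:
            event = self._events.pop(track_id, None)
        if event is not None:
            event.set()

    def wait_if_needed(self, track_id, mode, timeout=None):
        """Return True when the swap may proceed, False on timeout."""
        if mode == "full":
            return True  # full-mix swaps do not need stems
        with self._lock:
            event = self._events.get(track_id)
        if event is None:
            return True  # no rip in flight; stems come from the disk cache
        return event.wait(timeout)
```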

4. Single-active-session policy (`ws_adapter.py`, `acestep/streaming/session.py`, web client)

StreamingSession.create() calls are serialized; a new main-session connection preempts the active session: it stops the old session's runner, closes its socket with close code 4001 (PREEMPTED_CLOSE_CODE), and waits on the old session's StreamingSession.closed event (added in this PR) until the old stack's VRAM is actually released before building the new one.
The web client treats 4001 as final ("another connection took over this pod") instead of entering the reconnect loop — no preemption ping-pong between tabs.
This removes both the dual-create OOM and the shared-TRT-cache eviction cascade at the root: teardown and creation never overlap.
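
A minimal asyncio sketch of the preempt-then-create ordering. `FakeSession` and `SessionManager` are hypothetical; only the 4001 close code and the idea of a `closed` event come from this PR. The lock serializes creation, and the new session is built only after the old one reports that teardown finished.

```python
import asyncio

PREEMPTED_CLOSE_CODE = 4001


class FakeSession:
    """Hypothetical session that records how it was closed."""

    def __init__(self, name):
        self.name = name
        self.closed = asyncio.Event()
        self.close_code = None

    async def shutdown(self, code):
        self.close_code = code
        await asyncio.sleep(0)  # stand-in for runner stop + VRAM release
        self.closed.set()


class SessionManager:
    """Allows one active session; a newcomer preempts the incumbent."""

    def __init__(self):
        self._create_lock = asyncio.Lock()
        self.active = None

    async def connect(self, name):
        async with self._create_lock:  # creations never overlap
            old = self.active
            if old is not None:
                await old.shutdown(PREEMPTED_CLOSE_CODE)
                await old.closed.wait()  # teardown completes before build
            self.active = FakeSession(name)
            return self.active
```

Construct the manager inside a running event loop (for example inside the coroutine passed to `asyncio.run`) so the lock and events bind to that loop on older Python versions.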

5. Shape-aware TRT VAE engine selection (`acestep/nodes/vae_nodes.py`)

The process-wide TRT VAE cache can hand the upload encoder an engine belonging to the live session whose optimization profile doesn't cover the upload's length. _trt_vae_profile_fits() checks the input shape against the cached engine's profile first; on a mismatch, a handler that carries an eager VAE falls back to it (both encode and decode nodes). 120 s+ uploads now work alongside a 60 s session.
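
The profile check amounts to a per-dimension bounds test followed by a fallback decision. The sketch below is illustrative only: `profile_fits` and `choose_vae_backend` are hypothetical names, and the shapes and the 48 kHz sample rate are placeholder values rather than the real engine profile.

```python
SAMPLE_RATE = 48_000  # assumption, used only to build example shapes


def profile_fits(shape, min_shape, max_shape):
    """True when every dimension lies within the profile's [min, max] range."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


def choose_vae_backend(shape, trt_profile, has_eager):
    """Pick 'trt' when the cached engine covers the input, else fall back."""
    min_shape, max_shape = trt_profile
    if profile_fits(shape, min_shape, max_shape):
        return "trt"
    if has_eager:
        return "eager"
    raise ValueError(f"input shape {shape} is outside the TRT profile and no eager VAE is available")


# A 60 s profile cannot serve a 120 s upload, so the eager VAE is used.
profile_60s = ((1, 2, SAMPLE_RATE), (1, 2, 60 * SAMPLE_RATE))
backend = choose_vae_backend((1, 2, 120 * SAMPLE_RATE), profile_60s, has_eager=True)
```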

6. Phase-1 latency (perceived performance)

librosa JIT warm at boot (server.py): the beat tracker's cold first call costs ~4–5 s of numba compilation; a boot-time daemon thread pays it before any user does.
Windowed + parallel analysis: BPM/key from a centered 60 s window (measured identical results at a fraction of the cost) running on a worker thread concurrently with the GPU source encode.
Separator RAM cache (DEMON_MELBAND_RAM_CACHE, default on): RoFormer weights stay in system RAM between rips; per-rip GPU residency is a ~0.2 s move instead of a ~2 s disk load. VRAM discipline unchanged.
TRT engine prewarm: during phase 1 (track duration is known) a background thread page-caches the engine files the post-upload swap will load — helps cold-page-cache pods, harmless elsewhere.
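
The windowed, parallel analysis can be sketched in two small helpers. Both function names are hypothetical: `centered_window` computes the sample range for a centered 60 s slice, and `run_concurrently` runs the analysis on a worker thread while the caller's thread does the encode.

```python
from concurrent.futures import ThreadPoolExecutor


def centered_window(num_samples, sample_rate, window_s=60.0):
    """Return (start, end) sample indices of a centered analysis window.

    Tracks shorter than the window are analysed whole.
    """
    window = int(round(window_s * sample_rate))
    if num_samples <= window:
        return 0, num_samples
    start = (num_samples - window) // 2
    return start, start + window


def run_concurrently(analyze, encode):
    """Run analyze() on a worker thread while encode() runs on this thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        analysis_future = pool.submit(analyze)
        encoded = encode()
        return analysis_future.result(), encoded
```

For a 120 s track at 44.1 kHz, `centered_window` selects seconds 30 through 90.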

Wire / protocol changes

| Change | Notes |
| --- | --- |
| `upload_ok.stems_pending: bool` | Registry field + regenerated `wireContract.gen.ts` |
| WS close code `4001` | `PREEMPTED_CLOSE_CODE`, mirrored server (ws_adapter) ↔ client (`web/sdk/types/protocol.ts`); client treats as final |
| `stem_assets.source_mode: ""` | Overlay-only push semantics; real modes unchanged on init/swap paths |

Environment knobs

DEMON_MELBAND_VRAM_PARK = always (default) | auto | never
DEMON_MELBAND_VRAM_RESERVE_GB (auto-mode threshold, default 6.0)
DEMON_MELBAND_RAM_CACHE = 1 (default) | 0

Measured results (RTX 4090 24 GB, acestep-v15-turbo, TRT)

| Metric | Before | After |
| --- | --- | --- |
| `upload_ok` (120 s track, warm) | ~20–40 s | 0.86 s server-side (~4 s incl. localhost transfer) |
| Time-to-sound after swap (same profile) | gated on full pipeline | ~1–2 s |
| Stems available (120 s, background) | blocking | +~10 s after `upload_ok`, mid-playback |
| Separator GPU attach per rip | 1.6–2 s (disk) | 0.1–0.2 s (RAM cache) |
| Park cycle during separation | none / heuristic | allocated e.g. 10.69 → 6.23 GB, every rip |
| VRAM after upload completes | +~6 GB permanent | encoder fully offloaded (back to session baseline) |
| Second consecutive 120 s upload | CUDA OOM (0 bytes free) | passes with ≥6.1 GB free at the worst point |
| Doubled connection at boot | OOM + both sessions dead | preempt (4001) + clean handoff |

Verification

Live end-to-end runs against the real TRT server: dual-connect preemption, 60 s and 120 s uploads, cross-profile (60 s → 120 s) swaps, instruments-stem swap racing the background rip (waits, no duplicate separation), three consecutive uploads, page-refresh teardown/rebuild — zero errors, telemetry confirming the park/restore cycle on every separation.
200 unit tests passing locally (park/restore semantics, persistent offload + lazy restore, preemption incl. the 4001 cross-language constant guard, TRT profile-fit guard, pending-stems registry + swap gating, two-phase persistence, decoder strip, analysis windowing); wire-contract drift guards and npm run typecheck/build/test:unit green.
Full design notes in docs/VOCALSTEM.md (§ VRAM Management, § Two-Phase Uploads).

🤖 Generated with Claude Code

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

leszko · 2026-06-11T07:46:03Z

@BuffMcBigHuge Is this related to the "VRAM Headroom" in this list? https://docs.google.com/spreadsheets/d/1QfDvH7Q1sKBQqaGI3Akqjpp4LmsKh_uhade32gL7HAU/edit?gid=45901989#gid=45901989

Asking, because I started working on the same thing, but I can assign it to you if you're already on it. Let me know.

BuffMcBigHuge added 4 commits June 9, 2026 20:23

Latency experiementation.

9955098

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

New attempt at model persistence and management in vram, parking.

a12aa8b

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

Fixes to multiple sequential uploads and vram pressure.

9574ea0

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

Phase 1 of improving percieved performance of upload.

bf783e3

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

BuffMcBigHuge marked this pull request as ready for review June 11, 2026 03:51

BuffMcBigHuge requested a review from leszko June 11, 2026 03:51

BuffMcBigHuge mentioned this pull request Jun 11, 2026

VRAM pressure work, multi connection handler, session management. #243

Draft

BuffMcBigHuge wants to merge 4 commits into main from marco/feat/vram-pressure-2
