VRAM Pressure - Melband #244
Open
BuffMcBigHuge wants to merge 4 commits into
Signed-off-by: BuffMcBigHuge <marco@bymar.co>
Collaborator
@BuffMcBigHuge Is this related to the "VRAM Headroom" in this list? https://docs.google.com/spreadsheets/d/1QfDvH7Q1sKBQqaGI3Akqjpp4LmsKh_uhade32gL7HAU/edit?gid=45901989#gid=45901989 Asking, because I started working on the same thing, but I can assign it to you if you're already on it. Let me know.
Includes #238 (Latency Improvement Experimentation) as its base commit; everything below describes the VRAM-management and upload-pipeline work layered on top.
Problem
Uploading a track loads Mel-Band RoFormer on top of the resident ACE-Step 1.5 stack. On a 24 GB card running the TRT realtime session this caused a cluster of failures:
- […] `StreamingSession.create()` calls concurrently → one OOMs → its cleanup evicts shared TRT VAE cache entries out from under the healthy session → both sessions die (`'NoneType' object has no attribute 'decode'`).
- […] `vae_encode` engine) were rejected outright (`TRT VAE encode rejected input shape`).
Operating principle
The separator and the eager ACE-Step weights never occupy VRAM at the same time. Before Mel-Band RoFormer loads, the resident ACE-Step context parks its eager modules in system RAM; the separator runs in the vacated space, is released, and only then does ACE-Step return. This is unconditional by default — not gated on a free-VRAM heuristic.
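As a rough illustration of that ordering (park, separate, release, restore), the sketch below uses a reentrant lock and a context manager whose `finally` clause restores the modules even when separation fails. It is not the PR's code: `FakeModule`, `ParkableContext`, `load_for_op` and `run_separator` are hypothetical stand-ins, and device moves are simulated with a string so the example runs without a GPU. Only the name `vram_parked()` is borrowed from the change list.

```python
import threading
from contextlib import contextmanager


class FakeModule:
    """Hypothetical stand-in for an eager module; it only tracks its device."""

    def __init__(self, name, device="cuda"):
        self.name = name
        self.device = device

    def to(self, device):
        self.device = device
        return self


class ParkableContext:
    """Hypothetical holder for the resident stack's eager modules."""

    def __init__(self, modules):
        self._modules = list(modules)
        # Reentrant, so a nested park on the same thread does not deadlock.
        self._lock = threading.RLock()
        self._depth = 0

    @contextmanager
    def vram_parked(self):
        # The lock is held for the whole park, so other threads that take it
        # wait until the modules are back on the GPU.
        with self._lock:
            self._depth += 1
            try:
                if self._depth == 1:  # only the outermost park moves weights
                    for module in self._modules:
                        module.to("cpu")
                yield self
            finally:
                self._depth -= 1
                if self._depth == 0:  # restore even if the body raised
                    for module in self._modules:
                        module.to("cuda")

    def load_for_op(self):
        """A GPU op takes the same lock, so it never sees parked weights."""
        with self._lock:
            return self.devices()

    def devices(self):
        return [module.device for module in self._modules]


def run_separator(ctx, separate):
    """Run `separate()` only while the eager modules are parked on the CPU."""
    with ctx.vram_parked():
        return separate()
```

With real torch modules the park step would typically be followed by `torch.cuda.empty_cache()` so that the freed pages are actually released from the caching allocator.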
Changes
1. Melband VRAM parking (`acestep/engine/model_context.py`, `acestep/streaming/stems.py`)
- `ModelContext` gains a placement lock and `vram_parked()`: eager modules (DiT, VAE, text encoder) + silence latent move to CPU, freed pages return to CUDA, everything restores on exit (exception-safe, reentrant). TRT engines are untouched (their memory belongs to execution contexts).
- […] `_load_model_context()`, which takes the same lock: a concurrent op (prompt re-encode, timbre/structure set) issued mid-park blocks until restore instead of running GPU inputs against CPU weights.
- `DEMON_MELBAND_VRAM_PARK`: `always` (default), `auto` (only when claimable VRAM < `DEMON_MELBAND_VRAM_RESERVE_GB`, default 6.0), `never`.
- `stems_vram phase=...` telemetry at every transition (before/parked/loaded/separated/released/restored).

2. Upload-encoder lifecycle (`ws_adapter.py`, `acestep/engine/session.py`)
- […] `offload_to_cpu=True, offload_dit_to_cpu=True` (new `Session` passthrough) so weights land in system RAM (no ~6 GB construction spike next to the live session), then flipped to resident mode governed by the park protocol.
- `_strip_upload_encoder_generation_stack()` drops the DiT decoder (1,575,458,880 params, ~3.2 GB) and the eager DiffusionEngine at construction. The per-upload GPU restore is the ~1.3 GB conditioning stack, not a 4.7 GB model copy; this is what fits uploads inside the headroom of a 120 s-profile session.
- […] `offload_eager_to_cpu()`); `_load_model_context()` lazily restores only the modules the next upload touches.

3. Two-phase uploads + background stem rip (`ws_adapter.py`, `acestep/streaming/stems.py`, `acestep/user_uploads.py`, web client)
- […] `upload_ok` with `stems_pending: true`. The client can swap, and hear audio, immediately.
- […] `persist_user_upload_stems`). Finished stems are pushed to the live session as a late `stem_assets` frame (`source_mode: ""` = overlay-only, never a mode change); failures push `stem_failed`. Results are discarded if the session-end wipe deleted the track mid-rip.
- `full` swap proceeds without stems (no duplicate separation); a `vocals`/`instruments` swap, where the stem is the inference source, waits for the rip, then loads from disk cache.
- `upload_ok.stems_pending` → stem status "processing"; the existing `stem_assets`/`stem_failed` listeners flip it to ready/failed whenever the push lands.

4. Single-active-session policy (`ws_adapter.py`, `acestep/streaming/session.py`, web client)
- `StreamingSession.create()` calls are serialized; a new main-session connection preempts the active session: stops its runner, closes its socket with close code 4001 (`PREEMPTED_CLOSE_CODE`), and waits on the new `StreamingSession.closed` event until the old stack's VRAM is actually released before building the new one.

5. Shape-aware TRT VAE engine selection (`acestep/nodes/vae_nodes.py`)
- `_trt_vae_profile_fits()` checks the input shape against the cached engine's profile first; on a mismatch, a handler that carries an eager VAE falls back to it (both encode and decode nodes). 120 s+ uploads now work alongside a 60 s session.

6. Phase-1 latency (perceived performance)
- […] `server.py`): the beat tracker's cold first call costs ~4–5 s of numba compilation; a boot-time daemon thread pays it before any user does.
- […] `DEMON_MELBAND_RAM_CACHE`, default on): RoFormer weights stay in system RAM between rips; per-rip GPU residency is a ~0.2 s move instead of a ~2 s disk load. VRAM discipline unchanged.

Wire / protocol changes
- `upload_ok.stems_pending: bool` […] `wireContract.gen.ts`
- `4001`: `PREEMPTED_CLOSE_CODE`, mirrored server (`ws_adapter`) ↔ client (`web/sdk/types/protocol.ts`); client treats as final
- `stem_assets.source_mode: ""` […]

Environment knobs
- `DEMON_MELBAND_VRAM_PARK=always` (default) | `auto` | `never`
- `DEMON_MELBAND_VRAM_RESERVE_GB` (auto-mode threshold, default 6.0)
- `DEMON_MELBAND_RAM_CACHE=1` (default) | `0`

Measured results (RTX 4090 24 GB, acestep-v15-turbo, TRT)
- `upload_ok` (120 s track, warm) […]
- `upload_ok`, mid-playback […]

Verification
- `npm run typecheck` / `build` / `test:unit` green.
- `docs/VOCALSTEM.md` (§ VRAM Management, § Two-Phase Uploads).

🤖 Generated with Claude Code
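The `DEMON_MELBAND_VRAM_PARK` modes listed under Environment knobs amount to a small decision rule. A minimal sketch, assuming the caller already knows the claimable VRAM in GB: `should_park` is a hypothetical name, and treating unrecognised values as `always` is an assumption of this sketch, not something the PR states.

```python
import os


def should_park(claimable_vram_gb, env=None):
    """Decide whether to park the eager modules before loading the separator.

    Hypothetical helper: only the knob names, modes and the 6.0 default
    come from the PR description.
    """
    env = os.environ if env is None else env
    mode = env.get("DEMON_MELBAND_VRAM_PARK", "always").strip().lower()
    reserve_gb = float(env.get("DEMON_MELBAND_VRAM_RESERVE_GB", "6.0"))

    if mode == "never":
        return False
    if mode == "auto":
        # Park only when there is less claimable VRAM than the reserve.
        return claimable_vram_gb < reserve_gb
    # "always", plus (by assumption) any unrecognised value.
    return True
```

Passing `env` explicitly keeps the rule testable without touching the process environment, for example `should_park(4.0, {"DEMON_MELBAND_VRAM_PARK": "auto"})`.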