Multi-GPU support: per-device VRAM management#31
- Add platform-specific mutex (pthread / CRITICAL_SECTION) to protect the global VBAR priority linked list and `vbars_dirty` state
- Make `vbars_free_for_vbar()` and the `vbar_fault()` stopgap only evict pages from the same device, preventing cross-GPU thrashing
- Add per-device VRAM accounting (`dev_vram_usage[]`) tracked in `three_stooges`, `mod1`, `account_alloc/free`, and `vrambuf_destroy`
- Add `budget_deficit_dev()` for per-device budget checks (infrastructure)
- Fix `build-linux-docker` missing `-Isrc` include path

Fixes segfault in `vbar_fault` when multiple GPUs page concurrently (e.g. CFG > 1 with a model larger than VRAM). Improves 2-GPU paging performance from 0.59x to 0.72x vs a single GPU.

Amp-Thread-ID: https://ampcode.com/threads/T-019d71c0-6054-7286-9b03-00dfb4506b59
Co-authored-by: Amp <amp@ampcode.com>
…, device-aware allocator, atomic counters

- Phase 1: `AimdoDeviceState` struct with per-device `vram_capacity`, `ctx`, `deficit_sync`, etc. Lazy `ensure_device_init()` called from `vbar_allocate` and allocator hooks. `vbar_allocate` uses per-device capacity instead of global.
- Phase 2: `with_device_ctx()`/`restore_ctx()` context save/restore helpers. `poll_budget_deficit_dev()` switches to the correct device context for `cuMemGetInfo`. `ensure_ctx()` fixed to not force device 0 context on multi-GPU.
- Phase 3: `budget_deficit_dev()` with per-device `cuMemGetInfo` hybrid backstop. `vbar_fault` stopgap and per-page checks now use `budget_deficit_dev` instead of global `budget_deficit`, eliminating phantom cross-GPU deficit.
- Phase 4: `aimdo_cuda_malloc`/`aimdo_cuda_malloc_async` determine the current device, call `ensure_device_init`, and use `vbars_free_dev` + `budget_deficit_dev`. `vrambuf_grow` is also device-aware. Exposed `vbars_free_dev()` as public API.
- Phase 5: `dev_vram_add/sub` use `__atomic_add_fetch`/`__atomic_sub_fetch` (Linux) and `InterlockedExchangeAdd64` (Windows) for thread-safe counters.

Benchmark: 2-GPU 41.29s avg (was 78.85s), stable 7/7 runs. Target: 37.32s (no-aimdo, unstable).

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
…x sync

- `ensure_device_init`: double-checked locking with the `dev_init_lock` mutex and an `__ATOMIC_RELEASE` store barrier on `s->inited` to prevent other threads from seeing partially-initialized state.
- `dev_vram_load`: new atomic read helper using `__atomic_load_n` (Linux) / `InterlockedOr64` (Windows). Used in `budget_deficit_dev` and `poll_budget_deficit_dev` instead of raw non-atomic reads.
- `vbars_free_locked_dev`: replaced the single `dirty` bool with a per-device `synced[]` array. Each device gets its own `cuCtxSynchronize` via `with_device_ctx`/`restore_ctx`, preventing cross-device sync issues when `device_filter == -1`.
- All `cuCtxSynchronize` calls in model-vbar.c now wrapped with `with_device_ctx`/`restore_ctx`: `vbars_free_for_vbar`, `vbar_unpin`, `vbar_free`, `vbar_free_memory`.
- Inlined `vbars_free_locked()` (trivial single-caller wrapper).

Benchmarks unchanged: 2-GPU 41.53s, 1-GPU 56.33s (no regression).

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
force-pushed from 472d1f5 to 2667e4c
`InterlockedExchangeAdd64` and `InterlockedOr64` are MSVC compiler intrinsics that require `<intrin.h>`. plat.h only included `<BaseTsd.h>` in the Windows block, causing unresolved external symbol link errors.

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
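The fix amounts to a header fragment along these lines (a sketch of the Windows block in plat.h, not the exact file contents):

```c
/* plat.h, Windows block (sketch) */
#ifdef _WIN32
#include <BaseTsd.h>
#include <intrin.h>  /* needed: InterlockedExchangeAdd64 / InterlockedOr64
                        are compiler intrinsics, not linkable library symbols */
#endif
```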
force-pushed from 2667e4c to ac7f22e
```c
/* model_vbar.c */
size_t vbars_free(size_t size);
size_t vbars_free_dev(size_t size, int device);
```
this appears to leave `vbars_free` behind as dead code:

```
(venv) rattus@rattus-box2:~/ComfyUI-MuiMui$ git grep "vbars_free[^_]"
src/model-vbar.c:size_t vbars_free(size_t size) {
src/plat.h:size_t vbars_free(size_t size);
```
If you support multi-GPU, the global API doesn't make sense any more. This should just be a signature change of `vbars_free` to require the new argument.
```c
#include <pthread.h>
static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER;
static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); }
static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); }
```
src/pyt-cu-plug-alloc-async.c has the same. This should be generalized if we still need this lock (more to come).
I think we are better off cleaning up the C and calling device_init() per device rather than keeping the asymmetry on the init paths. If we want to keep Comfy zero-change we can do the lazy init in the Python layer, although how hard would a device_init() call per device be in Comfy main, @Kosinkadink?
Amp is seeing cross-GPU state sharing of the `last_check` timestamp, which is a local static. I think this might need to be per-GPU state.
```c
if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || !ctx) {
    cuCtxSetCurrent(aimdo_cuda_ctx);
    /* No context set — try per-device init for device 0, then use its ctx */
    int dev = 0;
```
I think this might be a regression; Amp might have misunderstood the intent of this code.

The original bug here was that the poll loop in ComfyUI-Manager for the repositories could call a Python gc() from an alien thread. This in turn would cause the cudaFree to be completely detached from any context. Since we interpose between the RT and driver APIs we need to "guess" the context. `aimdo_cuda_ctx` is the best guess, as that is the user's elected GPU (`--cuda-device X`).

Defaulting to 0 as top priority would therefore break the case of:

- `--cuda-device 1`
- ComfyUI-Manager installed, online, and still doing its poll while running intensive workflows.

There's a little chaos-agent thread I added to comfy main.py in the original PR to make this more reproducible, which is more reliable than trying to use Manager for a repro.
The longer-term solution is to stop hooking the RT API on Linux and instead do it the Windows way. @Kosinkadink, if you hard-need this change for functionality we may want to prioritize and bite that bullet. For a 100% solution we need to adopt a detours framework for Linux. Was has a PR with a PLT/GOT-based solution that double-hooks, but that unfortunately doesn't solve for this, so we would have to go to the ultimate solution of detours.
Test case and chaos agent here:
ModelPatcher in ComfyUI needs `vbars_analyze` to return the VRAM per device so that `get_free_memory` does the right thing. Currently it will aggregate all GPUs and over-report freeable memory. From Amp:
Summary
Makes aimdo work properly with multiple GPUs. Previously, aimdo assumed a single GPU — `budget_deficit()` used global `total_vram_usage` (the sum of ALL GPUs) against one GPU's `vram_capacity`, creating a phantom deficit that triggered constant unnecessary eviction. With 2×RTX 4090, this made 2-GPU slower than 1-GPU.

Changes
Commit 1: Mutex + device-aware eviction (foundation)
- `vbars_free_locked_dev()` with `device_filter` parameter
- `dev_vram_usage[]`, `dev_vram_add/sub`, `current_cuda_device()`

Commit 2: Phases 1–5 (core multi-GPU logic)
- `AimdoDeviceState` struct with per-device `vram_capacity`, `ctx`, `deficit_sync`, etc. Lazy `ensure_device_init()` called from `vbar_allocate` and allocator hooks. `vbar_allocate` uses per-device capacity.
- `with_device_ctx()`/`restore_ctx()` helpers. `poll_budget_deficit_dev()` switches to the correct device context for `cuMemGetInfo`. Fixed `ensure_ctx()` to not force device 0 context on multi-GPU.
- `budget_deficit_dev()` uses per-device usage + a per-device `cuMemGetInfo` backstop (not just simple accounting). `vbar_fault` uses this instead of global `budget_deficit()`, eliminating phantom cross-GPU deficit.
- `aimdo_cuda_malloc`/`aimdo_cuda_malloc_async` determine the current device and use `vbars_free_dev` + `budget_deficit_dev`. `vrambuf_grow` is also device-aware.
- `dev_vram_add/sub` use `__atomic_add_fetch` (Linux) / `InterlockedExchangeAdd64` (Windows).

API Compatibility
Fully backward compatible — no changes to exported function signatures. ComfyUI does not need any modifications.
- `init(device_id)` still works; other devices are initialized lazily.

Benchmarks (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps)
Key design decisions
- `cudaFreeAsync` decrements counters before memory is actually reusable.
- `init()` for device 0. Other devices are initialized on first use via `ensure_device_init()`.
- `budget_deficit_dev()` falls back to global when device state isn't initialized.