Multi-GPU support: per-device VRAM management#31
- Add platform-specific mutex (pthread / CRITICAL_SECTION) to protect the global VBAR priority linked list and `vbars_dirty` state
- Make `vbars_free_for_vbar()` and the `vbar_fault()` stopgap only evict pages from the same device, preventing cross-GPU thrashing
- Add per-device VRAM accounting (`dev_vram_usage[]`) tracked in `three_stooges`, `mod1`, `account_alloc/free`, and `vrambuf_destroy`
- Add `budget_deficit_dev()` for per-device budget checks (infrastructure)
- Fix `build-linux-docker` missing `-Isrc` include path

Fixes segfault in `vbar_fault` when multiple GPUs page concurrently (e.g. CFG > 1 with a model larger than VRAM). Improves 2-GPU paging performance from 0.59x to 0.72x vs a single GPU.

Amp-Thread-ID: https://ampcode.com/threads/T-019d71c0-6054-7286-9b03-00dfb4506b59
Co-authored-by: Amp <amp@ampcode.com>
…, device-aware allocator, atomic counters

- Phase 1: `AimdoDeviceState` struct with per-device `vram_capacity`, `ctx`, `deficit_sync`, etc. Lazy `ensure_device_init()` called from `vbar_allocate` and allocator hooks. `vbar_allocate` uses per-device capacity instead of global.
- Phase 2: `with_device_ctx()`/`restore_ctx()` context save/restore helpers. `poll_budget_deficit_dev()` switches to the correct device context for `cuMemGetInfo`. `ensure_ctx()` fixed to not force device 0 context on multi-GPU.
- Phase 3: `budget_deficit_dev()` with per-device `cuMemGetInfo` hybrid backstop. `vbar_fault` stopgap and per-page checks now use `budget_deficit_dev` instead of global `budget_deficit`, eliminating phantom cross-GPU deficit.
- Phase 4: `aimdo_cuda_malloc`/`aimdo_cuda_malloc_async` determine the current device, call `ensure_device_init`, and use `vbars_free_dev` + `budget_deficit_dev`. `vrambuf_grow` is also device-aware. Exposed `vbars_free_dev()` as public API.
- Phase 5: `dev_vram_add/sub` use `__atomic_add_fetch`/`__atomic_sub_fetch` (Linux) and `InterlockedExchangeAdd64` (Windows) for thread-safe counters.

Benchmark: 2-GPU 41.29s avg (was 78.85s), stable 7/7 runs. Target: 37.32s (no-aimdo, unstable).

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
…x sync

- `ensure_device_init`: double-checked locking with the `dev_init_lock` mutex and an `__ATOMIC_RELEASE` store barrier on `s->inited` to prevent other threads from seeing partially-initialized state.
- `dev_vram_load`: new atomic read helper using `__atomic_load_n` (Linux) / `InterlockedOr64` (Windows). Used in `budget_deficit_dev` and `poll_budget_deficit_dev` instead of raw non-atomic reads.
- `vbars_free_locked_dev`: replaced the single `dirty` bool with a per-device `synced[]` array. Each device gets its own `cuCtxSynchronize` via `with_device_ctx`/`restore_ctx`, preventing cross-device sync issues when `device_filter == -1`.
- All `cuCtxSynchronize` calls in model-vbar.c now wrapped with `with_device_ctx`/`restore_ctx`: `vbars_free_for_vbar`, `vbar_unpin`, `vbar_free`, `vbar_free_memory`.
- Inlined `vbars_free_locked()` (trivial single-caller wrapper).

Benchmarks unchanged: 2-GPU 41.53s, 1-GPU 56.33s (no regression).

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
force-pushed from 472d1f5 to 2667e4c
`InterlockedExchangeAdd64` and `InterlockedOr64` are MSVC compiler intrinsics that require `<intrin.h>`. plat.h only included `<BaseTsd.h>` in the Windows block, causing unresolved external symbol link errors.

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
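The fix amounts to a header fragment along these lines (a sketch of the Windows block in plat.h, not the exact file contents):

```c
/* plat.h, Windows block (sketch) */
#ifdef _WIN32
#include <BaseTsd.h>
#include <intrin.h>  /* needed: InterlockedExchangeAdd64 / InterlockedOr64
                        are compiler intrinsics, not linkable library symbols */
#endif
```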
force-pushed from 2667e4c to ac7f22e
```c
/* model_vbar.c */
size_t vbars_free(size_t size);
size_t vbars_free_dev(size_t size, int device);
```
this appears to leave `vbars_free` behind as dead code:

```
(venv) rattus@rattus-box2:~/ComfyUI-MuiMui$ git grep "vbars_free[^_]"
src/model-vbar.c:size_t vbars_free(size_t size) {
src/plat.h:size_t vbars_free(size_t size);
```
If you support multi-GPU, the global API doesn't make sense any more. This should just be a signature change of `vbars_free` to require the new argument.
```c
#include <pthread.h>
static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER;
static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); }
static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); }
```
src/pyt-cu-plug-alloc-async.c has the same. This should be generalized if we still need this lock (more to come).
I think we are better off cleaning up the C and calling device_init() per device rather than keeping the asymmetry on the init paths. If we want to keep Comfy zero-change we can do the lazy init in the Python layer, although how hard would a device_init() call per device be in Comfy main, @Kosinkadink?
Amp is seeing cross-GPU state sharing of the `last_check` timestamp, which is a local static. I think this might need to be per-GPU state.
```c
if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || !ctx) {
    cuCtxSetCurrent(aimdo_cuda_ctx);
    /* No context set — try per-device init for device 0, then use its ctx */
    int dev = 0;
```
I think this might be a regression; Amp might have misunderstood the intent of this code.

The original bug here was that the poll loop in ComfyUI-Manager for the repositories could call a Python gc() from an alien thread. This in turn would cause the cudaFree to be completely detached from any context. Since we interpose between the RT and driver APIs we need to "guess" the context. `aimdo_cuda_ctx` is the best guess, as that is the user's elected GPU (`--cuda-device X`).

Defaulting to 0 as top priority would therefore break the case of:

- `--cuda-device 1`
- ComfyUI-Manager installed, online, and still doing its poll while running intensive workflows.

There's a little chaos-agent thread I added to comfy main.py in the original PR to make this more reproducible, which is more reliable than trying to use Manager for a repro.
The longer-term solution is to stop hooking the RT API on Linux and instead do it the Windows way. @Kosinkadink, if you hard-need this change for functionality we may want to prioritize and bite that bullet. For a 100% solution we need to adopt a detours framework for Linux. Was has a PR with a PLT/GOT-based solution that double-hooks, but that unfortunately doesn't solve for this, so we would have to go to the ultimate solution of detours.
Test case and chaos agent here:
ModelPatcher in ComfyUI needs `vbars_analyze` to return the VRAM per device so that `get_free_memory` does the right thing. Currently it will aggregate all GPUs and over-report freeable memory. From Amp:
Summary
Makes aimdo work properly with multiple GPUs. Previously, aimdo assumed a single GPU — `budget_deficit()` used global `total_vram_usage` (the sum of ALL GPUs) against one GPU's `vram_capacity`, creating a phantom deficit that triggered constant unnecessary eviction. With 2×RTX 4090, this made 2-GPU slower than 1-GPU.

Changes
Commit 1: Mutex + device-aware eviction (foundation)
- `vbars_free_locked_dev()` with `device_filter` parameter
- `dev_vram_usage[]`, `dev_vram_add/sub`, `current_cuda_device()`

Commit 2: Phases 1–5 (core multi-GPU logic)
- `AimdoDeviceState` struct with per-device `vram_capacity`, `ctx`, `deficit_sync`, etc. Lazy `ensure_device_init()` called from `vbar_allocate` and allocator hooks. `vbar_allocate` uses per-device capacity.
- `with_device_ctx()`/`restore_ctx()` helpers. `poll_budget_deficit_dev()` switches to the correct device context for `cuMemGetInfo`. Fixed `ensure_ctx()` to not force device 0 context on multi-GPU.
- `budget_deficit_dev()` uses per-device usage + a per-device `cuMemGetInfo` backstop (not just simple accounting). `vbar_fault` uses this instead of global `budget_deficit()`, eliminating phantom cross-GPU deficit.
- `aimdo_cuda_malloc`/`aimdo_cuda_malloc_async` determine the current device and use `vbars_free_dev` + `budget_deficit_dev`. `vrambuf_grow` is also device-aware.
- `dev_vram_add/sub` use `__atomic_add_fetch` (Linux) / `InterlockedExchangeAdd64` (Windows).

API Compatibility
Fully backward compatible — no changes to exported function signatures. ComfyUI does not need any modifications.
- `init(device_id)` still works; other devices are initialized lazily.

Benchmarks (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps)
Key design decisions
- `cudaFreeAsync` decrements counters before memory is actually reusable.
- `init()` for device 0. Other devices are initialized on first use via `ensure_device_init()`.
- `budget_deficit_dev()` falls back to global when device state isn't initialized.