
Multi-GPU support: per-device VRAM management#31

Open
Kosinkadink wants to merge 4 commits into master from multigpu-thread-safety

Conversation

@Kosinkadink
Member

Summary

Makes aimdo work properly with multiple GPUs. Previously, aimdo assumed a single GPU — budget_deficit() used global total_vram_usage (sum of ALL GPUs) against one GPU's vram_capacity, creating a phantom deficit that triggered constant unnecessary eviction. With 2×RTX 4090, this made 2-GPU slower than 1-GPU.

Changes

Commit 1: Mutex + device-aware eviction (foundation)

  • Added mutex synchronization (pthread/CRITICAL_SECTION) to protect the VBAR priority linked list
  • Made eviction device-aware: vbars_free_locked_dev() with device_filter parameter
  • Added per-device VRAM accounting: dev_vram_usage[], dev_vram_add/sub, current_cuda_device()

Commit 2: Phases 1–5 (core multi-GPU logic)

  • Phase 1 — Per-device state: AimdoDeviceState struct with per-device vram_capacity, ctx, deficit_sync, etc. Lazy ensure_device_init() called from vbar_allocate and allocator hooks. vbar_allocate uses per-device capacity.
  • Phase 2 — Context safety: with_device_ctx()/restore_ctx() helpers. poll_budget_deficit_dev() switches to correct device context for cuMemGetInfo. Fixed ensure_ctx() to not force device 0 context on multi-GPU.
  • Phase 3 — Per-device hybrid budget: budget_deficit_dev() uses per-device usage + per-device cuMemGetInfo backstop (not just simple accounting). vbar_fault uses this instead of global budget_deficit(), eliminating phantom cross-GPU deficit.
  • Phase 4 — Device-aware allocator hooks: aimdo_cuda_malloc/aimdo_cuda_malloc_async determine current device, use vbars_free_dev + budget_deficit_dev. vrambuf_grow also device-aware.
  • Phase 5 — Atomic counters: dev_vram_add/sub use __atomic_add_fetch (Linux) / InterlockedExchangeAdd64 (Windows).

API Compatibility

Fully backward compatible — no changes to exported function signatures. ComfyUI does not need any modifications. init(device_id) still works; other devices are initialized lazily.

Benchmarks (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps)

| Config | Time | vs 1-GPU | Status |
| --- | --- | --- | --- |
| 1-GPU + aimdo | 56.39s | 1.00× | ✅ stable |
| 2-GPU + aimdo (before, mutex only) | 95.39s | 0.59× | ✅ stable |
| 2-GPU + aimdo (before, dev-aware evict) | 78.85s | 0.72× | ✅ stable |
| 2-GPU + aimdo (this PR) | 41.29s | 1.37× | ✅ stable (7/7) |
| 2-GPU no aimdo | 37.32s | 1.51× | ❌ crashes after ~2 runs |
| 1-GPU + aimdo (this PR) | 56.41s | 1.00× | ✅ no regression |

Key design decisions

  1. Must keep cuMemGetInfo hybrid backstop — pure per-device accounting OOM'd because cudaFreeAsync decrements counters before memory is actually reusable.
  2. Lazy device init — ComfyUI only calls init() for device 0. Other devices are initialized on first use via ensure_device_init().
  3. Global budget_deficit() kept as a fallback: budget_deficit_dev() falls back to the global check when device state isn't initialized.

Kosinkadink and others added 3 commits April 9, 2026 14:09
- Add platform-specific mutex (pthread/CRITICAL_SECTION) to protect the
  global VBAR priority linked list and vbars_dirty state
- Make vbars_free_for_vbar() and vbar_fault() stopgap only evict pages
  from the same device, preventing cross-GPU thrashing
- Add per-device VRAM accounting (dev_vram_usage[]) tracked in
  three_stooges, mod1, account_alloc/free, and vrambuf_destroy
- Add budget_deficit_dev() for per-device budget checks (infrastructure)
- Fix build-linux-docker missing -Isrc include path

Fixes segfault in vbar_fault when multiple GPUs page concurrently
(e.g. CFG > 1 with model larger than VRAM). Improves 2-GPU paging
performance from 0.59x to 0.72x vs single GPU.

Amp-Thread-ID: https://ampcode.com/threads/T-019d71c0-6054-7286-9b03-00dfb4506b59
Co-authored-by: Amp <amp@ampcode.com>
…, device-aware allocator, atomic counters

- Phase 1: AimdoDeviceState struct with per-device vram_capacity, ctx, deficit_sync, etc.
  Lazy ensure_device_init() called from vbar_allocate and allocator hooks.
  vbar_allocate uses per-device capacity instead of global.

- Phase 2: with_device_ctx()/restore_ctx() context save/restore helpers.
  poll_budget_deficit_dev() switches to correct device context for cuMemGetInfo.
  ensure_ctx() fixed to not force device 0 context on multi-GPU.

- Phase 3: budget_deficit_dev() with per-device cuMemGetInfo hybrid backstop.
  vbar_fault stopgap and per-page checks now use budget_deficit_dev instead of
  global budget_deficit, eliminating phantom cross-GPU deficit.

- Phase 4: aimdo_cuda_malloc/aimdo_cuda_malloc_async determine current device,
  call ensure_device_init, use vbars_free_dev + budget_deficit_dev.
  vrambuf_grow also device-aware. Exposed vbars_free_dev() public API.

- Phase 5: dev_vram_add/sub use __atomic_add_fetch/__atomic_sub_fetch (Linux)
  and InterlockedExchangeAdd64 (Windows) for thread-safe counters.

Benchmark: 2-GPU 41.29s avg (was 78.85s), stable 7/7 runs.
Target: 37.32s (no-aimdo, unstable).
Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
…x sync

- ensure_device_init: double-checked locking with dev_init_lock mutex
  and __ATOMIC_RELEASE store barrier on s->inited to prevent other
  threads from seeing partially-initialized state.

- dev_vram_load: new atomic read helper using __atomic_load_n (Linux)
  / InterlockedOr64 (Windows). Used in budget_deficit_dev and
  poll_budget_deficit_dev instead of raw non-atomic reads.

- vbars_free_locked_dev: replaced single 'dirty' bool with per-device
  synced[] array. Each device gets its own cuCtxSynchronize via
  with_device_ctx/restore_ctx, preventing cross-device sync issues
  when device_filter == -1.

- All cuCtxSynchronize calls in model-vbar.c now wrapped with
  with_device_ctx/restore_ctx: vbars_free_for_vbar, vbar_unpin,
  vbar_free, vbar_free_memory.

- Inlined vbars_free_locked() (trivial single-caller wrapper).

Benchmarks unchanged: 2-GPU 41.53s, 1-GPU 56.33s (no regression).

Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
Co-authored-by: Amp <amp@ampcode.com>
@Kosinkadink force-pushed the multigpu-thread-safety branch from 472d1f5 to 2667e4c on April 10, 2026 00:39
InterlockedExchangeAdd64 and InterlockedOr64 are MSVC compiler
intrinsics that require <intrin.h>. plat.h only included <BaseTsd.h>
in the Windows block, causing unresolved external symbol link errors.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d7433-e7a7-762c-8d7e-7e8efac3f8bf
@Kosinkadink force-pushed the multigpu-thread-safety branch from 2667e4c to ac7f22e on April 10, 2026 00:42
Comment thread src/plat.h

/* model_vbar.c */
size_t vbars_free(size_t size);
size_t vbars_free_dev(size_t size, int device);
Collaborator

this appears to leave vbars_free behind as dead code:

(venv) rattus@rattus-box2:~/ComfyUI-MuiMui$ git grep "vbars_free[^_]"
src/model-vbar.c:size_t vbars_free(size_t size) {
src/plat.h:size_t vbars_free(size_t size);

If you support multi-GPU the global API doesn't make sense any more. This should just be a signature change of vbars_free to require the new argument.

Comment thread src/control.c
#include <pthread.h>
static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER;
static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); }
static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); }
Collaborator

src/pyt-cu-plug-alloc-async.c has the same. This should be generalized if we still need this lock (more to come).

@rattus128
Collaborator

I think we are better off cleaning up the C and calling device_init() per device rather than keeping the asymmetry on the init paths. If we want to keep comfy 0-change we can do the lazy init in the Python layer, although how hard would a device_init() call per device be in comfy main @Kosinkadink?

@rattus128
Collaborator

Amp is seeing cross-GPU sharing of the last_check timestamp, which is a local static. I think this might need to be per-GPU state.

if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || !ctx) {
cuCtxSetCurrent(aimdo_cuda_ctx);
/* No context set — try per-device init for device 0, then use its ctx */
int dev = 0;
Collaborator

I think this might be a regression. amp might have misunderstood the intent of this code.

The original bug here was the poll loop in ComfyUI-Manager for the repositories could call a python gc() from an alien thread. This in turn would cause the cudaFree to be completely detached from contexts. Since we interpose between the rt and driver APIs we need to "guess" the context. aimdo_cuda_ctx is the best guess as that is the users elected GPU (--cuda-device X).

Defaulting to 0 as top priority would therefore break the case of:

--cuda-device 1
ComfyUI-Manager installed, online and still doing its poll while doing intensive workflows.

There's a little chaos-agent thread I added to comfy main.py in the original PR to make it more reproducible, which is more reliable than trying to use Manager to repro.

Collaborator

The longer-term solution is to stop hooking the RT API in Linux and instead do it the Windows way. @Kosinkadink if you hard-needed this change for functionality we may want to prioritize and bite that bullet. For a 100% solution we need to adopt a detours framework for Linux. Was has a PR with a PLT/GOT-based solution that double hooks, but that doesn't solve for this unfortunately, so we would have to go to the ultimate solution of detours.

Collaborator

Test case and chaos agent here:

Comfy-Org/ComfyUI#12840

@rattus128
Collaborator

ModelPatcher in ComfyUI needs vbars_analyze to return the VRAM per device so that get_free_memory does the right thing. Currently this will aggregate all gpus and over-report free-able memory.

From amp:

Short answer: PR 31 did not make vbars_analyze itself device-scoped. It mainly made it thread-safe, while the multi-GPU changes around it caused its effective meaning to become “global VBAR residency across all devices,” not “the current device’s VBAR residency.”

What PR 31 Changed

From the PR diff, vbars_analyze itself kept the same core algorithm:

It still walks the VBAR list.
It still recomputes resident VRAM by counting rp->handle pages instead of trusting resident_count.
It still logs per-VBAR residency and returns the summed byte count.

What PR 31 directly changed in that function was:

It wrapped vbars_analyze in the new global vbar_list_lock.
It kept vbars_dirty as a single global dirty bit, now protected by that same lock.

So semantically, the function became safer to call concurrently, but its accounting logic did not change.

The Important Semantic Shift After PR 31

The real semantic change came from the surrounding PR 31 refactor:

PR 31 introduced multi-GPU behavior but still kept one global VBAR linked list and one global vbars_dirty.
Because of that, vbars_analyze() in PR 31 no longer implicitly meant “analyze the one device we have.”
It now meant “analyze every VBAR in the global list,” which in practice is “sum resident VBAR pages across all devices.”

That means the return value changed in meaning:

Before multi-GPU, it was effectively single-device VBAR residency.
After PR 31, it became aggregate cross-device VBAR residency.
