[DRAFT] Barebones ROCM support #2

Closed

asagi4 wants to merge 23 commits into Comfy-Org:master from asagi4:hack/rocm-support

Conversation

@asagi4 (Contributor) commented Feb 5, 2026

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.

This is not really intended for merging as-is, but for reference. hipify-clang can convert the CUDA code to HIP pretty easily with a few fixes, and it actually allows you to run aimdo on ROCm.

You might have to make sure your Python venv is using your system ROCm libraries for this to work.

It does not work perfectly (I'm still getting PyTorch OOMs when it should be freeing memory), but workflows can run and produce good output.

I am not able to test this, but the HIP code should be compilable as-is on NVIDIA platforms too. If you run build-rocm on an NVIDIA platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCm, and the result should be basically identical to the CUDA implementation.
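For reference, this is roughly the kind of mechanical renaming hipify-clang performs on the driver-API calls involved here (a sketch, not aimdo's actual code; on NVIDIA, hipcc maps the hip* symbols back onto CUDA, which is why the same source should build on both stacks):

```c
#include <hip/hip_runtime.h>

/* What hipify turns a CUDA VMM chunk allocation into: cuMemCreate()
 * becomes hipMemCreate(), CUmemAllocationProp becomes
 * hipMemAllocationProp, and so on, largely 1:1. */
static hipError_t create_chunk(size_t size, int device,
                               hipMemGenericAllocationHandle_t *handle) {
    hipMemAllocationProp prop = {0};
    prop.type = hipMemAllocationTypePinned;  /* CU_MEM_ALLOCATION_TYPE_PINNED */
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = device;
    return hipMemCreate(handle, size, &prop, 0);
}
```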

@0xDELUXA commented Feb 6, 2026

Oh, AMD support has entered the chat 🚀

@0xDELUXA commented Feb 7, 2026

Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built aimdo.dll locally, installed this custom wheel, and got:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
DynamicVRAM support detected and enabled

in the console.

So we can get past these warnings:
No working comfy-aimdo install detected. DynamicVRAM support disabled. Falling back to legacy ModelPatcher. VRAM estimates may be unreliable especially on Windows
NOTE: comfy-aimdo is currently only support for Nvidia GPUs

pip install comfy-aimdo automatically installs the Windows (Nvidia-only) package. It does include an aimdo.dll, but AMD gets the following error:

comfy-aimdo failed to load: E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll: Could not find module 'E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll' (or one of its dependencies). Try using the full path with constructor syntax.

I got curious and checked what Dependencies reports. Out of the three .dlls it requires, we AMD users are missing nvcuda.dll.

My custom-built aimdo.dll, which actually loads on AMD, replaces the nvcuda.dll dependency with amdhip64_7.dll.

Now that it loads, I'm curious whether it actually works as intended or just errors out.


I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows.

Summary:
VMM allocation APIs report success, but the GPU cannot reliably access the allocated memory.

  1. All hipMemCreate, hipMemMap, and hipMemSetAccess calls return success.
  2. hipMemsetD8 also returns success (the async operation is queued).
  3. hipDeviceSynchronize completes without errors.
  4. PyTorch kernel hangs when attempting to use the memory.

Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs.
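A minimal standalone repro of that sequence could look like the following (illustrative names, not aimdo's code; it assumes the size is already aligned to the allocation granularity). Steps (1)-(4) in the comments match the list above:

```c
#include <hip/hip_runtime.h>
#include <stdio.h>

#define CHK(x) do { hipError_t e_ = (x); if (e_ != hipSuccess) { \
    fprintf(stderr, "%s: %s\n", #x, hipGetErrorString(e_)); return 1; } } while (0)

int main(void) {
    size_t size = 64ULL << 20;                   /* 64 MiB, assumed granularity-aligned */
    hipMemAllocationProp prop = {0};
    prop.type = hipMemAllocationTypePinned;
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = 0;

    void *ptr;
    hipMemGenericAllocationHandle_t handle;
    CHK(hipMemAddressReserve(&ptr, size, 0, NULL, 0)); /* reserve VA range   */
    CHK(hipMemCreate(&handle, size, &prop, 0));        /* (1) physical chunk */
    CHK(hipMemMap(ptr, size, 0, handle, 0));           /* (1) map into range */

    hipMemAccessDesc desc = {0};
    desc.location = prop.location;
    desc.flags = hipMemAccessFlagsProtReadWrite;
    CHK(hipMemSetAccess(ptr, size, &desc, 1));         /* (1) grant access   */

    CHK(hipMemsetD8((hipDeviceptr_t)ptr, 0xAB, size)); /* (2) queues fine    */
    CHK(hipDeviceSynchronize());                       /* (3) returns OK ... */
    /* (4) ... yet a kernel touching ptr hangs on the affected driver. */

    CHK(hipMemUnmap(ptr, size));
    CHK(hipMemRelease(handle));
    CHK(hipMemAddressFree(ptr, size));
    return 0;
}
```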

@tvukovic-amd (Contributor)

If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

@0xDELUXA

> If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have comfy-aimdo running on ROCm Windows too. Theoretically, what is preventing it from working? I've tried many things, but it seems there’s something I haven’t been able to figure out.

@tvukovic-amd (Contributor)

@asagi4 Just wanted to check in - is there any update or further progress on this PR?

@asagi4 (Contributor Author) commented Feb 19, 2026

@tvukovic-amd Well, I can't do much beyond running hipify and making it compile. I don't know enough about ROCm to debug any issues.

I rebased against master to get it to compile again, but it's untested.

@asagi4 (Contributor Author) commented Feb 19, 2026

With latest master it seems to be completely broken: all VRAM allocations fail with `aimdo: hip_src/vrambuf.c:56:ERROR:VRAM Allocation failed (non OOM)` and torch throws an OOM exception immediately.

@0xDELUXA commented Feb 20, 2026

After @asagi4 confirmed that the latest updates break comfy-aimdo on AMD (Linux), I decided to try building the version checked out from the master branch. I have a very long, workaround-upon-workaround build script that I use on Windows (mainly for hipify; otherwise it just doesn't work), and somehow it magically avoids the GPU hang issue I was getting when comfy-aimdo was enabled.

I'm sure comfy-aimdo is actually in effect here, based on the (filtered) console output:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB) DynamicVRAM support detected and enabled
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 0 patches attached.
Model Initializing ...
Model Initialization complete!
Prompt executed in X seconds


After further benchmarking: some workloads still trigger GPU hangs, while others run fine; previously, neither ran successfully. It seems that the new Model Initializing... phase is quite heavy on AMD, which is where it occasionally hangs.

@asagi4 (Contributor Author) commented Feb 20, 2026

@0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

@0xDELUXA commented Feb 20, 2026

> @0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat

@asagi4 (Contributor Author) commented Feb 20, 2026

Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

@0xDELUXA commented Feb 20, 2026

> Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

ROCm: 7.12.0a20260218
PyTorch: 2.12.0a0+rocm7.12.0a20260218
OS: Windows 11

@asagi4 (Contributor Author) commented Feb 20, 2026

I managed to locally fix things so that aimdo works for me again. I think vrambuf_create has an alignment issue that appears with HIP. Diff for the hipified source here:

diff -ru hip_src/vrambuf.c hip_src_fixed2/vrambuf.c
--- hip_src/vrambuf.c   2026-02-20 20:34:56.698464966 +0200
+++ hip_src_fixed2/vrambuf.c    2026-02-20 20:32:52.685112770 +0200
@@ -7,8 +7,16 @@
 SHARED_EXPORT
 void *vrambuf_create(int device, size_t max_size) {
     VramBuffer *buf;
+    if ((max_size / VRAM_CHUNK_SIZE) * VRAM_CHUNK_SIZE < max_size) {
+       log(ERROR, "??? alignment %zu\n", max_size);
+       max_size = ((max_size / VRAM_CHUNK_SIZE) + 1) * VRAM_CHUNK_SIZE;
+       log(ERROR, "??? fixed alignment %zu\n", max_size);
+    }

-    buf = (VramBuffer *)calloc(1, sizeof(*buf) + sizeof(hipMemGenericAllocationHandle_t) * max_size / VRAM_CHUNK_SIZE);
+    size_t size = 0;
+    size = sizeof(*buf) + (sizeof(hipMemGenericAllocationHandle_t) * (max_size / VRAM_CHUNK_SIZE));
+    log(ERROR, "vrambuf_create calloc %zu\n", size);
+    buf = (VramBuffer *)calloc(1, size);
     if (!buf) {
         return NULL;
     }
@@ -53,7 +61,7 @@
         }
         if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != hipSuccess) {
             if (err != hipErrorOutOfMemory) {
-                log(ERROR, "VRAM Allocation failed (non OOM): %d\n", err);
+                log(ERROR, "VRAM Allocation failed (non OOM): %s\n", hipGetErrorString(err));
                 return false;
             }
             log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n");
Apparently vrambuf_create somehow works on CUDA without aligning to chunk size, but with HIP (on Linux?) it fails. I don't know why it works on Windows.
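For what it's worth, the round-up in the diff above could also be written as a small helper (equivalent behaviour, just a sketch):

```c
/* Round n up to the next multiple of chunk; works for any chunk > 0,
 * power of two or not. Equivalent to the if-branch in the diff above. */
static size_t align_up(size_t n, size_t chunk) {
    return ((n + chunk - 1) / chunk) * chunk;
}

/* e.g. in vrambuf_create: max_size = align_up(max_size, VRAM_CHUNK_SIZE); */
```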

@0xDELUXA commented Feb 20, 2026

I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar.

@asagi4 (Contributor Author) commented Feb 20, 2026

With these changes things work for me again on Linux, or at least one workflow ran successfully. Previously pretty much all allocations failed with "invalid argument" when mapping new VRAM allocations, presumably because the VRAM buffers weren't aligned to the defined chunk size.

@asagi4 (Contributor Author) commented Feb 22, 2026

Hm, with the latest changes to master the fixing has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions', and hipify/clang doesn't like that.

For example, they're defined to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresults...

I'll try to see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour.
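Roughly the shape of the problem and of the cast-based fix (illustrative names, not aimdo's actual declarations):

```c
#include <cuda.h>          /* CUresult, cuMemAlloc */
#include <cuda_runtime.h>  /* cudaError_t */

/* Broken shape: the header declares  int f(...);  the definition says
 * cudaError_t f(...) and then returns a CUresult. C lets the enum/int
 * mismatches slide; hipify-clang compiles the source as C++ and rejects
 * them. Fixed shape: one declared type everywhere, explicit cast on
 * return. Note that CUresult and cudaError_t only agree on 0 == success,
 * so the cast preserves success-vs-failure, not specific error codes. */
cudaError_t my_alloc_override(void **ptr, size_t size);

cudaError_t my_alloc_override(void **ptr, size_t size) {
    CUresult r = cuMemAlloc((CUdeviceptr *)ptr, size);
    return (cudaError_t)r;
}
```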

@rattus128 what do you think?

@asagi4 (Contributor Author) commented Feb 22, 2026

Now it compiles, loads and appears to work again.

Haven't stress-tested though.

@0xDELUXA commented Feb 22, 2026

Have you run any workload that exceeds VRAM and would OOM without comfy-aimdo?

Does the original example.py work on your system?

Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows.

@asagi4 (Contributor Author) commented Feb 22, 2026

@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as-is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that its failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know exactly what it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".

I wonder whether, since the pointer it's working with is vrambuf->base_addr + vrambuf->allocated, it ends up with an invalid pointer under some allocation patterns.
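One cheap way to test that suspicion would be asserting the mapping arguments' alignment before the call (a sketch; the parameter names mirror the diff above):

```c
#include <assert.h>
#include <stdint.h>

/* Both the mapping base and the mapping size must be multiples of the
 * allocation granularity, or hipMemMap/hipMemSetAccess can reject them
 * with "invalid argument". */
static void check_map_args(const char *base, size_t allocated,
                           size_t to_allocate, size_t chunk) {
    assert(((uintptr_t)(base + allocated) % chunk) == 0);
    assert((to_allocate % chunk) == 0);
}
```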

I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

@0xDELUXA commented Feb 22, 2026

> @0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as-is and fails under memory pressure, but at least it compiles and runs, so it's a start. […]

I see. I don’t really think the comfy-aimdo dev has much insight into the AMD side, so it’s just us. I assume there will still be things that work reliably on Nvidia but not as well on AMD.

> I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

Not a problem: with the build script from my fork on Windows, as you said, "at least it compiles and runs, so it's a start."

@0xDELUXA commented Feb 23, 2026

I'm rather curious about how your AMD Linux implementation behaves. Could you try running example.py, please? My output on Windows is this.

@asagi4 (Contributor Author) commented Feb 23, 2026

@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere.

@0xDELUXA commented Feb 23, 2026

I needed to modify it as well, and this one works for me. Commented out vbars_analyze, etc.

@asagi4 (Contributor Author) commented Feb 23, 2026

I fixed the script and it gives me this:

Init complete
aimdo: hip_src/control.c:67:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 7900 XTX (VRAM: 24560 MB)
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=131072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xabacef0
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xabacef0
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[First Load] Populated weight at offset: 400.0M
[First Load] Populated weight at offset: 800.0M
...
[First Load] Populated weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
...
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
...

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    16400 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     7820 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 16000 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=3072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xb135160
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xb135160
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!
[First Load] Populated weight at offset: 603.2421875M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17824 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6396 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 3544] Ptr: 0x7fa5bb000000 | Size:  622592k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     608 MB
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first interation
The rest will be offloaded again

aimdo: hip_src/model-vbar.c:234:DEBUG:vbar_prioritize vbar=0xabacef0
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[No Load Needed] Reusing weight at offset: 400.0M
...
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 1
...

Iteration 2
...

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17616 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6604 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Some of the ERROR logs from aimdo aren't actually errors; they're just things I added that I wanted to log without enabling debug logging.

@0xDELUXA commented Feb 23, 2026

I see. I've also added some debug output, but shouldn't the script also print [Offloaded] alongside [First Load] and [No Load Needed], given the "Some weights will be offloaded" and "The rest will be offloaded again" comments rattus128 included in the script? Based on the outputs, this is the main difference between comfy-aimdo on AMD Linux and Windows at present.

Which AMD GPU do you have, btw? Mine has 16 GB VRAM; if yours has more, that could explain the offload difference.

@asagi4 (Contributor Author) commented Feb 23, 2026

It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically.

@jnolck commented Mar 25, 2026

> I accidentally posted this comment in the ROCm issue first, the GitHub UI managed to confuse me.
>
> So, it looks like dynamic VRAM still doesn't allow me to run a standard WAN 2.2 14B workflow.
>
> It gets rid of swapping and improves memory behaviour, but it still fails to load the second model after running the first (the first model takes ~2 minutes to load and run to completion; the second model is stuck at "Initializing Model" at 100% CPU for 30 minutes and gets nowhere).
>
> So I guess despite dynamic VRAM, I'll still need a way to just tell ComfyUI to completely drop a model mid-workflow if I want to be able to run WAN models. If ComfyUI were able to drop the previous model completely from RAM and VRAM, the workflow would be able to finish. Unfortunately, it doesn't seem to be possible currently, even with a custom node :/
>
> The weird part about ComfyUI getting stuck is that it stops properly responding to Ctrl-C and I have to kill it from the outside.

Doesn't --disable-smart-memory --cache-none do that? Or is that incompatible with aimdo?

@asagi4 (Contributor Author) commented Mar 25, 2026

> Doesn't --disable-smart-memory --cache-none do that? Or is that incompatible with aimdo?

It does, but it also disables caching outputs, which makes re-running workflows with partial changes very inefficient, so it's not really a solution.

@leovanalphen

> So, it looks like dynamic VRAM still doesn't allow me to run a standard WAN 2.2 14B workflow. […]

I had WAN2.2 14B generation working and generated a couple of videos. However, I did have to drop the video size to 512x512; otherwise the VAE decode seems to fail and outputs an all-grey video. Here are some videos I generated: https://imgur.com/a/H5p3DBR. I don't have the ComfyUI logs anymore, but I could give it another try.

@asagi4 (Contributor Author) commented Mar 25, 2026

> I had WAN2.2 14B generation working and generated a couple of videos. However, I did have to drop the video size to 512x512; otherwise the VAE decode seems to fail and outputs an all-grey video. […]

How much RAM do you have? My system is capped at 32GB and it looks like it's just not enough to run WAN2.2.

I don't think the checkpoints themselves are corrupted or anything since I can swap them around and the first one will always successfully run, so it is still just a memory management problem.

@tvukovic-amd (Contributor)

> So, it looks like dynamic VRAM still doesn't allow me to run a standard WAN 2.2 14B workflow. […]

So, with the fix in the PR you still have issues while running WAN models?

@0xDELUXA commented Mar 26, 2026

I’m able to run WAN 2.2 with 32 GB of system RAM. It’s slow, but I don’t get any OOMs (on Windows).

@asagi4 (Contributor Author) commented Mar 26, 2026

> So, with the fix in the PR you still have issues while running WAN models?

Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models.

I think it might still be offloading from GPU to RAM instead of back to disk, so I'm probably just running out of RAM, and that causes breakage somewhere. I don't know how to find out what's breaking, though, since the only symptom is that ComfyUI gets stuck at 100% CPU usage, seemingly forever (the longest I've let it run is one hour).

@leovanalphen commented Mar 26, 2026

> > So, with the fix in the PR you still have issues while running WAN models?
>
> Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models. […]

I tried to run WAN again today and noticed it now crashes the Python process. I'm also noticing the 'load entire model into RAM first -> then into VRAM' behaviour, which I don't remember having earlier; it seems to be what crashes the workflow (I also have 32GB RAM): it runs out of RAM while the GPU still has 10GB dedicated and 20GB shared VRAM free.

The only thing I have changed in the meantime is that I switched to TheRock for PyTorch (from the one that is supplied with ComfyUI Desktop). I'll try switching back this evening and see if that 'fixes' it.

edit:

In the meantime, what would work as a workaround is splitting it into two workflows: the first one does the low-noise diffusion -> save latents; then the second workflow + manual flush -> load latents from workflow 1 into 2 -> continue.

@jeremymeyers

I've had luck with Wan2.2 on AMD with TheRock by switching to GGUF models and being aggressive with unloading. CLIP + VAE + LoRA (sometimes low+high) + models (low+high) is a mighty big ask for any consumer-grade GPU even with unloading, and once the models overflow into RAM, gen time goes through the roof.

@leovanalphen

> > So, with the fix in the PR you still have issues while running WAN models?
>
> Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models. […]

Reinstalled ComfyUI Desktop with the Adrenalin-driver ROCm and PyTorch. The workflow now completes again like before.

Workflow video_wan2_2_14B_i2v (default comfy) -> changed resolution from 640x640 to 512x512.

Output: https://imgur.com/a/s1pcLvt

Logs:

Found comfy_kitchen backend cuda: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_mxfp8', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_mxfp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': False, 'disabled': True, 'unavailable_reason': "ImportError: No module named 'triton'", 'capabilities': []}
Checkpoint files will always be loaded safely.
Total VRAM 16304 MB, total RAM 32693 MB
pytorch version: 2.9.1+rocmsdk20260116
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1201
ROCm version: (7, 2)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 9070 XT : native
Using async weight offloading with 2 streams
Enabled pinned memory 14711.0
Using pytorch attention
Python version: 3.12.11 (main, Aug 18 2025, 19:17:54) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.18.2
comfy-aimdo version: 0.2.99+rocm1
comfy-kitchen version: 0.2.8

got prompt
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
Requested to load WanTEModel
loaded completely;  6419.48 MB loaded, full load: True
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
Requested to load WanVAE
loaded completely; 5518.69 MB usable, 242.03 MB loaded, full load: True
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW
Requested to load WAN21
loaded partially; 10876.55 MB usable, 10701.51 MB loaded, 2929.91 MB offloaded, 175.03 MB buffer reserved, lowvram patches: 115
100%|██████████| 2/2 [00:46<00:00, 23.23s/it]
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW
Requested to load WAN21
loaded partially; 10732.30 MB usable, 10557.26 MB loaded, 3074.15 MB offloaded, 175.03 MB buffer reserved, lowvram patches: 120
100%|██████████| 2/2 [02:27<00:00, 73.63s/it]
Requested to load WanVAE
loaded completely; 1959.37 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 530.97 seconds

So it seems to me there is still an issue in either TheRock PyTorch or TheRock ROCm itself.

@asagi4 (Contributor Author) commented Mar 26, 2026

Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. It spams a lot of `Backend eager selected for dequantize_per_tensor_fp8`, and then it just stops for no apparent reason. CPU usage stays high after it has stopped, but it makes no progress.

@rattus128 do you have any idea what is going on there?

@0xDELUXA commented Mar 26, 2026

> Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. It spams a lot of `Backend eager selected for dequantize_per_tensor_fp8`, and then it just stops for no apparent reason. CPU usage stays high after it has stopped, but it makes no progress.

The `Backend eager selected for dequantize_per_tensor_fp8` message originates from comfy-kitchen. It currently has some issues on ROCm (e.g. Comfy-Org/comfy-kitchen#32). @tvukovic-amd is already looking into it.
I did some experiments to add a ROCm backend to it, though I'm definitely not suggesting that this would solve your issue.

@asagi4 (Contributor Author) commented Mar 26, 2026

> The `Backend eager selected for dequantize_per_tensor_fp8` message originates from comfy-kitchen. It currently has some issues on ROCm (e.g. Comfy-Org/comfy-kitchen#32). @tvukovic-amd is already looking into it. I did some experiments to add a ROCm backend to it, though I'm definitely not suggesting that this would solve your issue.

I'm aware they're from comfy-kitchen; it's just weird that the messages go from spamming fairly quickly to a sudden full stop. It looks like the model is running properly at first, but then ComfyUI hits some kind of threshold, gets stuck in an offload loop or something, and stops making any progress.

@manfreed

This happens when I start working on something without doing any research on prior art.

Anyway, I did make my own ROCm fork before realizing you had one already, so I'll just share it; maybe it can be of some use. It "#worksforme", although I only did some minimal testing with a handful of generations. However, I did not experience the OOMs and crash issues you seem to have had with your port, so maybe there is some implementation difference (I haven't had the chance to look into your PR yet).

(Also, disclaimer: I used some AI to help me out and learn; otherwise this would be way over my head.)

@asagi4 (Contributor Author) commented Mar 31, 2026

@norbert-sule It looks like your code does pretty much the same thing as mine.

Are you running it on Linux? If you are, you should be hitting the same ROCm virtual memory bug that I did and leak memory, since that's just a ROCm bug unrelated to aimdo.

With the memory bug fixed, aimdo pretty much works; I don't get OOMs or crashes with ComfyUI. It's just that on my system trying to run 2x 14B WAN models apparently just doesn't work (it didn't work pre-aimdo and it still doesn't).

@tvukovic-amd (Contributor)

> > Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. […]
>
> The `Backend eager selected for dequantize_per_tensor_fp8` message originates from comfy-kitchen. It currently has some issues on ROCm (e.g. Comfy-Org/comfy-kitchen#32). @tvukovic-amd is already looking into it. […]

The solution for issue Comfy-Org/comfy-kitchen#32 is merged in pytorch main (here is the PR with the solution).

@rattus128 (Collaborator)

Hey everyone, thanks for the huge efforts.

I've just merged a PR to master that is going to conflict. Rather than send you back to square one with those merge conflicts, though, feel free to leave this for a few days, as I will still analyze the approaches relative to your merge base. I have a few ideas on how to make this easier, especially from a builds point of view. Over the next couple of days I'm going to catch up on the history and approach and see where we are at. This is Aimdo's next planned feature as of this writing.

If there's any sense of "this is still an unresolved problem" on any front, let me know. There's a lot of history here!

@0xDELUXA commented Apr 16, 2026

> Hey everyone, thanks for the huge efforts. […]

Quite a lot, actually - I don’t remember seeing a conversation this long in any AMD-related PR XD

Great to see openness to AMD support! We’re happy to help test on our systems as things move forward.

At the moment, there don’t seem to be any unresolved issues on Windows AFAIK (thx to @tvukovic-amd).
@asagi4 will share their perspective on the Linux side.

Comment thread src-posix/model-mmap.c

@@ -1,4 +1,4 @@
-#include "plat.h"
+#include "../src/plat.h"
Collaborator:

this should be fixed now in the build scriptage.

Comment thread src/vrambuf.c
# define VRAM_CHUNK_SIZE CUDA_PAGE_SIZE
#else
# define VRAM_CHUNK_SIZE (16ULL * 1024 * 1024)
#endif
Collaborator:

Is this still needed after that ROCm fix for the leak, or is this a different thing?

@0xDELUXA commented Apr 17, 2026:

Looks like this was introduced in asagi4@9f2d2fa, with the commit message:

> Aligning up to chunk size is still needed, otherwise I get an immediate OOM

@asagi4 Is this still the case?

Comment thread .gitignore
env/
.vscode/
comfy_aimdo/_version.py
.clang-format
Collaborator:

We should remove the IDE-specific gitignore content.

Comment thread comfy_aimdo/control.py
if implementation == AimdoImpl.ROCM:
    try:
        from . import _rocm_init
        _rocm_init.initialize()
Collaborator:

What was the history of this, and why is the situation different for AMD? Can we just let pytorch load everything and hook after?

@0xDELUXA commented Apr 17, 2026:

This is @tvukovic-amd's solution to make aimdo use the DLL from rocm_sdk_core instead of the system-wide version (e.g., installed with the display driver/Adrenalin), which otherwise causes errors.
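Mechanically, the trick is just to get the bundled HIP runtime loaded by full path before anything resolves it by name. A hypothetical sketch of that (not the actual _rocm_init code; the directory and DLL names are illustrative):

```c
#include <windows.h>
#include <stdio.h>

/* Preload the ROCm-bundled HIP runtime from an explicit directory.
 * Windows reuses an already-loaded module for later by-name
 * LoadLibrary("amdhip64_7.dll") calls, so aimdo then binds to this
 * copy instead of the driver-installed system-wide one. */
static int preload_bundled_hip(const wchar_t *sdk_bin_dir) {
    wchar_t path[MAX_PATH];
    swprintf(path, MAX_PATH, L"%ls\\amdhip64_7.dll", sdk_bin_dir);
    return LoadLibraryExW(path, NULL, LOAD_WITH_ALTERED_SEARCH_PATH) != NULL;
}
```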

Collaborator:

@0xDELUXA @tvukovic-amd So IIUC pytorch will have the same logic, right? I'm currently working on converting this to linkless, to make pytorch the sole authority on what GPU libs get loaded, so if that's the only reason, we can drop this change in that approach.

Reply:

aimdo can be built with both DLLs, but it fails on the user’s side when using the system-wide DLL, which is preferred when this workaround is not applied.

Collaborator:

Does the system DLL solve a particular problem the pyt/portable-bundled one does not? If that bundled version sucks, we should fix the comfy build.

Reply:

We have issues with the system DLL, it causes hangs. aimdo should use the ROCm-bundled one.

Collaborator:

> We have issues with the system DLL, it causes hangs. aimdo should use the ROCm-bundled one.

This?

ComfyUI_windows_portable/python_embeded/Lib/site-packages/_rocm_sdk_core/bin/amdhip64_7.dll

For the moment I am assuming a comfy-portable installation on top of the portal-recommended driver:

As of the time of writing this you need this driver for best results:
https://www.amd.com/en/resources/support-articles/release-notes/RN-AMDGPU-WINDOWS-PYTORCH-7-1-1.html

@0xDELUXA commented Apr 20, 2026:

@rattus128
In my opinion, most AMD/Windows users update to the latest Adrenalin drivers (e.g. 26.3.1) and use TheRock. These driver-specific PyTorch versions feel unnecessary.

TheRock provides more up-to-date features and way broader hardware support compared to the driver release notes (e.g. RDNA4 support and limited RDNA3 coverage, mainly RX 7900 XTX), which is really inconsistent. Also, these driver PyTorch versions can't be considered more "stable" than TheRock at all.

AFAIK, in the future TheRock and these driver-PyTorch releases will converge; it doesn't make much sense for AMD to release PyTorch from two separate sources.

rattus128 added a commit that referenced this pull request Apr 17, 2026
rattus128 mentioned this pull request Apr 20, 2026
@rattus128 (Collaborator)

Merged to https://github.com/Comfy-Org/comfy-aimdo/pull/35/changes

@0xDELUXA @Apophis3158 please feel free to take a look on Windows.

Currently I crash example.py with:

Windows fatal exception: access violation

Stack (most recent call first):
  File "C:\users\rattu\ComfyUI_windows_portable_amd\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfy_aimdo\model_vbar.py", line 50 in __init__
  File "C:\users\rattu\example.py", line 96 in <module>
Traceback (most recent call last):
  File "C:\users\rattu\example.py", line 96, in <module>
    vbar2 = ModelVBAR(gpu_size * 5, device=0) #The vbar can be much bigger than VRAM
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\rattu\ComfyUI_windows_portable_amd\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfy_aimdo\model_vbar.py", line 50, in __init__
    self._ptr = lib.vbar_allocate(self._devctx, int(size), device)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: exception: access violation reading 0x00000000000000E0
aimdo[DEBUG] src\model-vbar.c:379: vbar_free: vbar=0000016FC2CFA3F0

But I also crash like this on the @Apophis3158 branch. So it's likely to be my combo of hardware and AMD stack.

I've made an effort to simplify the build and linkage approach across AMD and Nvidia, which is why it moves some distance from this PR.

@0xDELUXA commented Apr 20, 2026

@rattus128 Great work! Will give it a try shortly.

Local output of example.py on gfx1200 (Windows 11 with TheRock ROCm 7.13.0a + PyTorch 2.13.0a0)
(venv) PS C:\Users\deluxa> pip uninstall comfy-aimdo -y
Found existing installation: comfy-aimdo 0.2.12
Uninstalling comfy-aimdo-0.2.12:
  Successfully uninstalled comfy-aimdo-0.2.12

(venv) PS C:\Users\deluxa> pip install comfy_aimdo-0.0.214.dev33-cp39-abi3-win_amd64.whl
Processing comfy_aimdo-0.0.214.dev33-cp39-abi3-win_amd64.whl
Installing collected packages: comfy-aimdo
Successfully installed comfy-aimdo-0.0.214.dev33

(venv) PS C:\Users\deluxa> python C:\comfy-aimdo\examples\example.py
aimdo: src-win/cuda-detour.c:38:INFO:aimdo_setup_hooks: installing 6 hooks
aimdo: src-win/shmem-detect.c:80:INFO:comfy-aimdo WDDM adapter match: AMD Radeon RX 9060 XT runtime_luid=00000000:0001546b dxgi_luid=00000000:0001546b
aimdo: src/control.c:152:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded


Iteration 0
[First Load] Populated weight at offset: 0.0M
[First Load] Populated weight at offset: 814.0M
[First Load] Populated weight at offset: 1628.0M
[First Load] Populated weight at offset: 2442.0M
[First Load] Populated weight at offset: 3256.0M
[First Load] Populated weight at offset: 4070.0M
[First Load] Populated weight at offset: 4884.0M
[First Load] Populated weight at offset: 5698.0M
[First Load] Populated weight at offset: 6512.0M
[First Load] Populated weight at offset: 7326.0M
[First Load] Populated weight at offset: 8140.0M
[First Load] Populated weight at offset: 8954.0M
[First Load] Populated weight at offset: 9768.0M
[First Load] Populated weight at offset: 10582.0M
[First Load] Populated weight at offset: 11396.0M
[First Load] Populated weight at offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model


Iteration 0
[First Load] Populated weight at offset: 0.0M
[First Load] Populated weight at offset: 1628.0M
[First Load] Populated weight at offset: 3256.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 3256.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 3256.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first iteration
The rest will be offloaded again


Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[First Load] Populated weight at offset: 4884.0M
[First Load] Populated weight at offset: 5698.0M
[First Load] Populated weight at offset: 6512.0M
[First Load] Populated weight at offset: 7326.0M
[First Load] Populated weight at offset: 8140.0M
[First Load] Populated weight at offset: 8954.0M
[First Load] Populated weight at offset: 9768.0M
[First Load] Populated weight at offset: 10582.0M
[First Load] Populated weight at offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...
Exception ignored in: <function ModelVBAR.__del__ at 0x000001F081E9C0E0>
Traceback (most recent call last):
  File "C:\venv\Lib\site-packages\comfy_aimdo\model_vbar.py", line 122, in __del__
AttributeError: 'NoneType' object has no attribute 'lib'
Exception ignored in: <function ModelVBAR.__del__ at 0x000001F081E9C0E0>
Traceback (most recent call last):
  File "C:\venv\Lib\site-packages\comfy_aimdo\model_vbar.py", line 122, in __del__
AttributeError: 'NoneType' object has no attribute 'lib'

(venv) PS C:\Users\deluxa>

Looks good. The AttributeError at the end probably needs a fix similar to this.
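
For reference, the usual shape of that fix: __del__ can fire during interpreter shutdown, after module globals (such as the loaded library handle) have already been set to None, so it must not reach through them unguarded. A minimal sketch under that assumption; the names are illustrative, not the real model_vbar.py:

class ModelVBAR:
    def __del__(self):
        # At interpreter teardown, module globals and attributes may
        # already be None; this is exactly what produces the
        # AttributeError above. Guard every lookup and bail out quietly.
        lib = getattr(self, "lib", None)
        ptr = getattr(self, "_ptr", None)
        if lib is None or ptr is None:
            return
        lib.vbar_free(ptr)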

@asagi4
Copy link
Copy Markdown
Contributor Author

asagi4 commented Apr 20, 2026

Since #35 now exists, I'll close this. There are way too many comments in this one already anyway.

@asagi4 asagi4 closed this Apr 20, 2026
@Apophis3158
Copy link
Copy Markdown
Contributor

But I also crash like this on the @Apophis3158 branch. So it's likely to be my combo of hardware and AMD stack.

No problem on my end either, same setup using ROCm 7.13.0a + PyTorch 2.13.0a0.

I guess you might be using ROCm 7.2.1? With that release I would always get a BSOD at the beginning of the second example.py run. There could be issues with hipMalloc or hipFree in that release, which is why I switched to TheRock 7.12.

Give it a try: https://repo.amd.com/rocm/whl/ or the nightlies at https://rocm.nightlies.amd.com/v2/.
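
For example, something like this should pull a TheRock nightly into a venv; the per-architecture index path is my assumption from the index layout (gfx120X-all covering RDNA4 cards like the RX 9060 XT), so check the listing for your GPU family:

pip install --pre torch --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/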
