[DRAFT] Barebones ROCM support #2

Closed

asagi4 wants to merge 23 commits into Comfy-Org:master from asagi4:hack/rocm-support

Conversation

@asagi4 (Contributor) commented Feb 5, 2026

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.

This is not really intended for merging as-is, but for reference. hipify-clang can convert the CUDA code to HIP pretty easily with a few fixes, and it actually allows you to run aimdo on ROCm.

You might have to make sure your Python venv is using your system ROCm libraries for this to work.

It does not work perfectly (I'm still getting PyTorch OOMs when it should be freeing memory), but workflows can run and produce good output.

I am not able to test this, but the HIP code should be compilable as-is on NVIDIA platforms too. If you run build-rocm on an NVIDIA platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCm, and the result should be basically identical to the CUDA implementation.
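For reference, this is roughly the kind of mechanical renaming hipify-clang performs on the driver-API calls involved here (a sketch, not aimdo's actual code; on NVIDIA, hipcc maps the hip* symbols back onto CUDA, which is why the same source should build on both stacks):

```c
#include <hip/hip_runtime.h>

/* What hipify turns a CUDA VMM chunk allocation into: cuMemCreate()
 * becomes hipMemCreate(), CUmemAllocationProp becomes
 * hipMemAllocationProp, and so on, largely 1:1. */
static hipError_t create_chunk(size_t size, int device,
                               hipMemGenericAllocationHandle_t *handle) {
    hipMemAllocationProp prop = {0};
    prop.type = hipMemAllocationTypePinned;  /* CU_MEM_ALLOCATION_TYPE_PINNED */
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = device;
    return hipMemCreate(handle, size, &prop, 0);
}
```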

@0xDELUXA commented Feb 6, 2026

Oh, AMD support has entered the chat 🚀

@0xDELUXA commented Feb 7, 2026

Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built aimdo.dll locally, installed this custom wheel, and got:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
DynamicVRAM support detected and enabled

in the console.

So we can get past these warnings:
No working comfy-aimdo install detected. DynamicVRAM support disabled. Falling back to legacy ModelPatcher. VRAM estimates may be unreliable especially on Windows
NOTE: comfy-aimdo is currently only support for Nvidia GPUs

pip install comfy-aimdo automatically installs the Windows (Nvidia-only) package. It does include an aimdo.dll, but AMD gets the following error:

comfy-aimdo failed to load: E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll: Could not find module 'E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll' (or one of its dependencies). Try using the full path with constructor syntax.

I got curious and checked what Dependencies reports. Out of the three .dlls it requires, we AMD users are missing nvcuda.dll.

My custom-built aimdo.dll, which actually loads on AMD, replaces the nvcuda.dll dependency with amdhip64_7.dll.

Now that it loads, I'm curious whether it actually works as intended or just errors out.


I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows.

Summary:
VMM allocation APIs report success, but the GPU cannot reliably access the allocated memory.

  1. All hipMemCreate, hipMemMap, and hipMemSetAccess calls return success.
  2. hipMemsetD8 also returns success (the async operation is queued).
  3. hipDeviceSynchronize completes without errors.
  4. PyTorch kernel hangs when attempting to use the memory.

Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs.
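A minimal standalone repro of that sequence could look like the following (illustrative names, not aimdo's code; it assumes the size is already aligned to the allocation granularity). Steps (1)-(4) in the comments match the list above:

```c
#include <hip/hip_runtime.h>
#include <stdio.h>

#define CHK(x) do { hipError_t e_ = (x); if (e_ != hipSuccess) { \
    fprintf(stderr, "%s: %s\n", #x, hipGetErrorString(e_)); return 1; } } while (0)

int main(void) {
    size_t size = 64ULL << 20;                   /* 64 MiB, assumed granularity-aligned */
    hipMemAllocationProp prop = {0};
    prop.type = hipMemAllocationTypePinned;
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = 0;

    void *ptr;
    hipMemGenericAllocationHandle_t handle;
    CHK(hipMemAddressReserve(&ptr, size, 0, NULL, 0)); /* reserve VA range   */
    CHK(hipMemCreate(&handle, size, &prop, 0));        /* (1) physical chunk */
    CHK(hipMemMap(ptr, size, 0, handle, 0));           /* (1) map into range */

    hipMemAccessDesc desc = {0};
    desc.location = prop.location;
    desc.flags = hipMemAccessFlagsProtReadWrite;
    CHK(hipMemSetAccess(ptr, size, &desc, 1));         /* (1) grant access   */

    CHK(hipMemsetD8((hipDeviceptr_t)ptr, 0xAB, size)); /* (2) queues fine    */
    CHK(hipDeviceSynchronize());                       /* (3) returns OK ... */
    /* (4) ... yet a kernel touching ptr hangs on the affected driver. */

    CHK(hipMemUnmap(ptr, size));
    CHK(hipMemRelease(handle));
    CHK(hipMemAddressFree(ptr, size));
    return 0;
}
```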

@tvukovic-amd (Contributor)

If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

@0xDELUXA

> If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have comfy-aimdo running on ROCm Windows too. Theoretically, what is preventing it from working? I've tried many things, but it seems there’s something I haven’t been able to figure out.

@tvukovic-amd (Contributor)

@asagi4 Just wanted to check in - is there any update or further progress on this PR?

@asagi4 (Contributor Author) commented Feb 19, 2026

@tvukovic-amd Well, I can't do much beyond running hipify and making it compile. I don't know enough about ROCm to debug any issues.

I rebased against master to get it to compile again, but it's untested.

@asagi4 (Contributor Author) commented Feb 19, 2026

With latest master it seems to be completely broken: all VRAM allocations fail with `aimdo: hip_src/vrambuf.c:56:ERROR:VRAM Allocation failed (non OOM)` and torch throws an OOM exception immediately.

@0xDELUXA commented Feb 20, 2026

After @asagi4 confirmed that the latest updates break comfy-aimdo on AMD (Linux), I decided to try building the version checked out from the master branch. I have a very long, workaround-upon-workaround build script that I use on Windows (mainly for hipify; otherwise it just doesn't work), and somehow it magically avoids the GPU hang issue I was getting when comfy-aimdo was enabled.

I'm sure comfy-aimdo is actually in effect here, based on the (filtered) console output:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB) DynamicVRAM support detected and enabled
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 0 patches attached.
Model Initializing ...
Model Initialization complete!
Prompt executed in X seconds


After further benchmarking: some workloads still trigger GPU hangs, while others run fine; previously, neither ran successfully. It seems that the new Model Initializing... phase is quite heavy on AMD, which is where it occasionally hangs.

@asagi4 (Contributor Author) commented Feb 20, 2026

@0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

@0xDELUXA commented Feb 20, 2026

> @0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat

@asagi4 (Contributor Author) commented Feb 20, 2026

Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

@0xDELUXA commented Feb 20, 2026

> Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

ROCm: 7.12.0a20260218
PyTorch: 2.12.0a0+rocm7.12.0a20260218
OS: Windows 11

@asagi4 (Contributor Author) commented Feb 20, 2026

I managed to locally fix things so that aimdo works for me again. I think vrambuf_create has an alignment issue that appears with HIP. Diff for the hipified source here:

diff -ru hip_src/vrambuf.c hip_src_fixed2/vrambuf.c
--- hip_src/vrambuf.c   2026-02-20 20:34:56.698464966 +0200
+++ hip_src_fixed2/vrambuf.c    2026-02-20 20:32:52.685112770 +0200
@@ -7,8 +7,16 @@
 SHARED_EXPORT
 void *vrambuf_create(int device, size_t max_size) {
     VramBuffer *buf;
+    if ((max_size / VRAM_CHUNK_SIZE) * VRAM_CHUNK_SIZE < max_size) {
+       log(ERROR, "??? alignment %zu\n", max_size);
+       max_size = ((max_size / VRAM_CHUNK_SIZE) + 1) * VRAM_CHUNK_SIZE;
+       log(ERROR, "??? fixed alignment %zu\n", max_size);
+    }

-    buf = (VramBuffer *)calloc(1, sizeof(*buf) + sizeof(hipMemGenericAllocationHandle_t) * max_size / VRAM_CHUNK_SIZE);
+    size_t size = 0;
+    size = sizeof(*buf) + (sizeof(hipMemGenericAllocationHandle_t) * (max_size / VRAM_CHUNK_SIZE));
+    log(ERROR, "vrambuf_create calloc %zu\n", size);
+    buf = (VramBuffer *)calloc(1, size);
     if (!buf) {
         return NULL;
     }
@@ -53,7 +61,7 @@
         }
         if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != hipSuccess) {
             if (err != hipErrorOutOfMemory) {
-                log(ERROR, "VRAM Allocation failed (non OOM): %d\n", err);
+                log(ERROR, "VRAM Allocation failed (non OOM): %s\n", hipGetErrorString(err));
                 return false;
             }
             log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n");
Apparently vrambuf_create somehow works on CUDA without aligning to chunk size, but with HIP (on Linux?) it fails. I don't know why it works on Windows.
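For what it's worth, the round-up in the diff above could also be written as a small helper (equivalent behaviour, just a sketch):

```c
/* Round n up to the next multiple of chunk; works for any chunk > 0,
 * power of two or not. Equivalent to the if-branch in the diff above. */
static size_t align_up(size_t n, size_t chunk) {
    return ((n + chunk - 1) / chunk) * chunk;
}

/* e.g. in vrambuf_create: max_size = align_up(max_size, VRAM_CHUNK_SIZE); */
```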

@0xDELUXA commented Feb 20, 2026

I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar.

@asagi4 (Contributor Author) commented Feb 20, 2026

With these changes things work for me again on Linux, or at least one workflow ran successfully. Previously pretty much all allocations failed with "invalid argument" when mapping new VRAM allocations, presumably because the VRAM buffers weren't aligned to the defined chunk size.

@asagi4 (Contributor Author) commented Feb 22, 2026

Hm, with the latest changes to master the fixing has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions', and hipify/clang doesn't like that.

For example, they're defined to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresults...

I'll try to see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour.
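Roughly the shape of the problem and of the cast-based fix (illustrative names, not aimdo's actual declarations):

```c
#include <cuda.h>          /* CUresult, cuMemAlloc */
#include <cuda_runtime.h>  /* cudaError_t */

/* Broken shape: the header declares  int f(...);  the definition says
 * cudaError_t f(...) and then returns a CUresult. C lets the enum/int
 * mismatches slide; hipify-clang compiles the source as C++ and rejects
 * them. Fixed shape: one declared type everywhere, explicit cast on
 * return. Note that CUresult and cudaError_t only agree on 0 == success,
 * so the cast preserves success-vs-failure, not specific error codes. */
cudaError_t my_alloc_override(void **ptr, size_t size);

cudaError_t my_alloc_override(void **ptr, size_t size) {
    CUresult r = cuMemAlloc((CUdeviceptr *)ptr, size);
    return (cudaError_t)r;
}
```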

@rattus128 what do you think?

@asagi4 (Contributor Author) commented Feb 22, 2026

Now it compiles, loads and appears to work again.

Haven't stress-tested though.

@0xDELUXA commented Feb 22, 2026

Have you run any workload that exceeds VRAM and would OOM without comfy-aimdo?

Does the original example.py work on your system?

Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows.

@asagi4 (Contributor Author) commented Feb 22, 2026

@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as-is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that its failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know exactly what it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".

I wonder whether, since the pointer it's working with is vrambuf->base_addr + vrambuf->allocated, it ends up with an invalid pointer under some allocation patterns.
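One cheap way to test that suspicion would be asserting the mapping arguments' alignment before the call (a sketch; the parameter names mirror the diff above):

```c
#include <assert.h>
#include <stdint.h>

/* Both the mapping base and the mapping size must be multiples of the
 * allocation granularity, or hipMemMap/hipMemSetAccess can reject them
 * with "invalid argument". */
static void check_map_args(const char *base, size_t allocated,
                           size_t to_allocate, size_t chunk) {
    assert(((uintptr_t)(base + allocated) % chunk) == 0);
    assert((to_allocate % chunk) == 0);
}
```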

I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

@0xDELUXA commented Feb 22, 2026

> @0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as-is and fails under memory pressure, but at least it compiles and runs, so it's a start. […]

I see. I don’t really think the comfy-aimdo dev has much insight into the AMD side, so it’s just us. I assume there will still be things that work reliably on Nvidia but not as well on AMD.

> I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

Not a problem: with the build script from my fork on Windows, as you said, "at least it compiles and runs, so it's a start."

@0xDELUXA commented Feb 23, 2026

I'm rather curious about how your AMD Linux implementation behaves. Could you try running example.py, please? My output on Windows is this.

@asagi4 (Contributor Author) commented Feb 23, 2026

@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere.

@0xDELUXA commented Feb 23, 2026

I needed to modify it as well, and this one works for me. Commented out vbars_analyze, etc.

@asagi4 (Contributor Author) commented Feb 23, 2026

I fixed the script and it gives me this:

Init complete
aimdo: hip_src/control.c:67:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 7900 XTX (VRAM: 24560 MB)
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=131072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xabacef0
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xabacef0
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[First Load] Populated weight at offset: 400.0M
[First Load] Populated weight at offset: 800.0M
...
[First Load] Populated weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
...
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
...

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    16400 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     7820 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 16000 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=3072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xb135160
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xb135160
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!
[First Load] Populated weight at offset: 603.2421875M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17824 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6396 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 3544] Ptr: 0x7fa5bb000000 | Size:  622592k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     608 MB
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first interation
The rest will be offloaded again

aimdo: hip_src/model-vbar.c:234:DEBUG:vbar_prioritize vbar=0xabacef0
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[No Load Needed] Reusing weight at offset: 400.0M
...
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 1
...

Iteration 2
...

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17616 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6604 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Some of the ERROR logs from aimdo aren't actually errors; they're just things I added that I wanted to log without enabling debug logging.

@0xDELUXA commented Feb 23, 2026

I see. I've also added some debug output, but shouldn't the script also print [Offloaded] alongside [First Load] and [No Load Needed], given the "Some weights will be offloaded" and "The rest will be offloaded again" comments rattus128 included in the script? Based on the outputs, this is the main difference between comfy-aimdo on AMD Linux and Windows at present.

Which AMD GPU do you have, btw? Mine has 16 GB VRAM; if yours has more, that could explain the offload difference.

@asagi4 (Contributor Author) commented Feb 23, 2026

It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically.

@jnolck commented Mar 25, 2026

> I accidentally posted this comment in the ROCm issue first, the GitHub UI managed to confuse me.
>
> So, it looks like dynamic VRAM still doesn't allow me to run a standard WAN 2.2 14B workflow.
>
> It gets rid of swapping and improves memory behaviour, but it still fails to load the second model after running the first (the first model takes ~2 minutes to load and run to completion; the second model is stuck at "Initializing Model" at 100% CPU for 30 minutes and gets nowhere).
>
> So I guess despite dynamic VRAM, I'll still need a way to just tell ComfyUI to completely drop a model mid-workflow if I want to be able to run WAN models. If ComfyUI were able to drop the previous model completely from RAM and VRAM, the workflow would be able to finish. Unfortunately, it doesn't seem to be possible currently, even with a custom node :/
>
> The weird part about ComfyUI getting stuck is that it stops properly responding to Ctrl-C and I have to kill it from the outside.

Doesn't --disable-smart-memory --cache-none do that? Or is that incompatible with aimdo?

@asagi4 (Contributor Author) commented Mar 25, 2026

> Doesn't --disable-smart-memory --cache-none do that? Or is that incompatible with aimdo?

It does, but it also disables caching outputs, which makes re-running workflows with partial changes very inefficient, so it's not really a solution.

@leovanalphen

> So, it looks like dynamic VRAM still doesn't allow me to run a standard WAN 2.2 14B workflow. […]

I had WAN2.2 14B generation working and generated a couple of videos. However, I did have to drop the video size to 512x512; otherwise the VAE decode seems to fail and outputs an all-grey video. Here are some videos I generated: https://imgur.com/a/H5p3DBR. I don't have the ComfyUI logs anymore, but I could give it another try.

@asagi4 (Contributor Author) commented Mar 25, 2026

> I had WAN2.2 14B generation working and generated a couple of videos. However, I did have to drop the video size to 512x512; otherwise the VAE decode seems to fail and outputs an all-grey video. […]

How much RAM do you have? My system is capped at 32GB and it looks like it's just not enough to run WAN2.2.

I don't think the checkpoints themselves are corrupted or anything since I can swap them around and the first one will always successfully run, so it is still just a memory management problem.

@tvukovic-amd (Contributor)

> So, it looks like dynamic VRAM still doesn't allow me to run a standard WAN 2.2 14B workflow. […]

So, with the fix in the PR you still have issues while running WAN models?

@0xDELUXA commented Mar 26, 2026

I’m able to run WAN 2.2 with 32 GB of system RAM. It’s slow, but I don’t get any OOMs (on Windows).

@asagi4 (Contributor Author) commented Mar 26, 2026

> So, with the fix in the PR you still have issues while running WAN models?

Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models.

I think it might still be offloading from GPU to RAM instead of back to disk, so I'm probably just running out of RAM, and that causes breakage somewhere. I don't know how to find out what's breaking, though, since the only symptom is that ComfyUI gets stuck at 100% CPU usage, seemingly forever (the longest I've let it run is one hour).

@leovanalphen commented Mar 26, 2026

> > So, with the fix in the PR you still have issues while running WAN models?
>
> Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models. […]

I tried to run WAN again today and noticed it now crashes the Python process. I'm also noticing the 'load entire model into RAM first -> then into VRAM' behaviour, which I don't remember having earlier; it seems to be what crashes the workflow (I also have 32GB RAM): it runs out of RAM while the GPU still has 10GB dedicated and 20GB shared VRAM free.

The only thing I have changed in the meantime is that I switched to TheRock for PyTorch (from the one that is supplied with ComfyUI Desktop). I'll try switching back this evening and see if that 'fixes' it.

edit:

In the meantime, what would work as a workaround is splitting it into two workflows: the first one does the low-noise diffusion -> save latents; then the second workflow + manual flush -> load latents from workflow 1 into 2 -> continue.

@jeremymeyers

I've had luck with Wan2.2 on AMD with TheRock by switching to GGUF models and being aggressive with unloading. CLIP + VAE + LoRA (sometimes low+high) + models (low+high) is a mighty big ask for any consumer-grade GPU even with unloading, and once the models overflow into RAM, gen time goes through the roof.

@leovanalphen

> > So, with the fix in the PR you still have issues while running WAN models?
>
> Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models. […]

Reinstalled ComfyUI Desktop with the Adrenalin-driver ROCm and PyTorch. The workflow now completes again like before.

Workflow video_wan2_2_14B_i2v (default comfy) -> changed resolution from 640x640 to 512x512.

Output: https://imgur.com/a/s1pcLvt

Logs:

Found comfy_kitchen backend cuda: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_mxfp8', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_mxfp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': False, 'disabled': True, 'unavailable_reason': "ImportError: No module named 'triton'", 'capabilities': []}
Checkpoint files will always be loaded safely.
Total VRAM 16304 MB, total RAM 32693 MB
pytorch version: 2.9.1+rocmsdk20260116
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1201
ROCm version: (7, 2)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 9070 XT : native
Using async weight offloading with 2 streams
Enabled pinned memory 14711.0
Using pytorch attention
Python version: 3.12.11 (main, Aug 18 2025, 19:17:54) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.18.2
comfy-aimdo version: 0.2.99+rocm1
comfy-kitchen version: 0.2.8

got prompt
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
Requested to load WanTEModel
loaded completely;  6419.48 MB loaded, full load: True
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
Requested to load WanVAE
loaded completely; 5518.69 MB usable, 242.03 MB loaded, full load: True
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW
Requested to load WAN21
loaded partially; 10876.55 MB usable, 10701.51 MB loaded, 2929.91 MB offloaded, 175.03 MB buffer reserved, lowvram patches: 115
100%|██████████| 2/2 [00:46<00:00, 23.23s/it]
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW
Requested to load WAN21
loaded partially; 10732.30 MB usable, 10557.26 MB loaded, 3074.15 MB offloaded, 175.03 MB buffer reserved, lowvram patches: 120
100%|██████████| 2/2 [02:27<00:00, 73.63s/it]
Requested to load WanVAE
loaded completely; 1959.37 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 530.97 seconds

So it seems to me there is still an issue in either TheRock PyTorch or TheRock ROCm itself.

@asagi4 (Contributor Author) commented Mar 26, 2026

Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. It spams a lot of `Backend eager selected for dequantize_per_tensor_fp8`, and then it just stops for no apparent reason. CPU usage stays high after it has stopped, but it makes no progress.

@rattus128 do you have any idea what is going on there?

@0xDELUXA commented Mar 26, 2026

> Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. It spams a lot of `Backend eager selected for dequantize_per_tensor_fp8`, and then it just stops for no apparent reason. CPU usage stays high after it has stopped, but it makes no progress.

The `Backend eager selected for dequantize_per_tensor_fp8` message originates from comfy-kitchen. It currently has some issues on ROCm (e.g. Comfy-Org/comfy-kitchen#32). @tvukovic-amd is already looking into it.
I did some experiments to add a ROCm backend to it, though I'm definitely not suggesting that this would solve your issue.

@asagi4 (Contributor Author) commented Mar 26, 2026

> The `Backend eager selected for dequantize_per_tensor_fp8` message originates from comfy-kitchen. It currently has some issues on ROCm (e.g. Comfy-Org/comfy-kitchen#32). @tvukovic-amd is already looking into it. I did some experiments to add a ROCm backend to it, though I'm definitely not suggesting that this would solve your issue.

I'm aware they're from comfy-kitchen; it's just weird that the messages go from spamming fairly quickly to a sudden full stop. It looks like the model is running properly at first, but then ComfyUI hits some kind of threshold, gets stuck in an offload loop or something, and stops making any progress.

@manfreed

This happens when I start working on something without doing any research on prior art.

Anyway, I did make my own ROCm fork before realizing you had one already, so I'll just share it; maybe it can be of some use. It "#worksforme", although I only did some minimal testing with a handful of generations. However, I did not experience the OOMs and crash issues you seem to have had with your port, so maybe there is some implementation difference (I haven't had the chance to look into your PR yet).

(Also, disclaimer: I used some AI to help me out and learn; otherwise this would be way over my head.)

@asagi4 (Contributor Author) commented Mar 31, 2026

@norbert-sule It looks like your code does pretty much the same thing as mine.

Are you running it on Linux? If you are, you should be hitting the same ROCm virtual memory bug that I did and leak memory, since that's just a ROCm bug unrelated to aimdo.

With the memory bug fixed, aimdo pretty much works; I don't get OOMs or crashes with ComfyUI. It's just that on my system trying to run 2x 14B WAN models apparently just doesn't work (it didn't work pre-aimdo and it still doesn't).

@tvukovic-amd (Contributor)

> > Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. […]
>
> The `Backend eager selected for dequantize_per_tensor_fp8` message originates from comfy-kitchen. It currently has some issues on ROCm (e.g. Comfy-Org/comfy-kitchen#32). @tvukovic-amd is already looking into it. […]

The solution for issue Comfy-Org/comfy-kitchen#32 is merged in pytorch main (here is the PR with the solution).

@rattus128 (Collaborator)

Hey everyone, thanks for the huge efforts.

I've just merged a PR to master that is going to conflict. Rather than send you back to square one with those merge conflicts, though, feel free to leave this for a few days, as I will still analyze the approaches relative to your merge base. I have a few ideas on how to make this easier, especially from a builds point of view. Over the next couple of days I'm going to catch up on the history and approach and see where we are at. This is Aimdo's next planned feature as of this writing.

If there's any sense of "this is still an unresolved problem" on any front, let me know. There's a lot of history here!

@0xDELUXA commented Apr 16, 2026

> Hey everyone, thanks for the huge efforts. […]

Quite a lot, actually - I don’t remember seeing a conversation this long in any AMD-related PR XD

Great to see openness to AMD support! We’re happy to help test on our systems as things move forward.

At the moment, there don’t seem to be any unresolved issues on Windows AFAIK (thx to @tvukovic-amd).
@asagi4 will share their perspective on the Linux side.

Comment thread src-posix/model-mmap.c

@@ -1,4 +1,4 @@
-#include "plat.h"
+#include "../src/plat.h"
Collaborator:

this should be fixed now in the build scriptage.

Comment thread src/vrambuf.c
# define VRAM_CHUNK_SIZE CUDA_PAGE_SIZE
#else
# define VRAM_CHUNK_SIZE (16ULL * 1024 * 1024)
#endif
Collaborator:

Is this still needed after that ROCm fix for the leak, or is this a different thing?

@0xDELUXA commented Apr 17, 2026:

Looks like this was introduced in asagi4@9f2d2fa, with the commit message:

> Aligning up to chunk size is still needed, otherwise I get an immediate OOM

@asagi4 Is this still the case?

Comment thread .gitignore
env/
.vscode/
comfy_aimdo/_version.py
.clang-format
Collaborator:

We should remove the IDE-specific gitignore content.

Comment thread comfy_aimdo/control.py
if implementation == AimdoImpl.ROCM:
    try:
        from . import _rocm_init
        _rocm_init.initialize()
Collaborator:

What was the history of this, and why is the situation different for AMD? Can we just let pytorch load everything and hook after?

@0xDELUXA commented Apr 17, 2026:

This is @tvukovic-amd's solution to make aimdo use the DLL from rocm_sdk_core instead of the system-wide version (e.g., installed with the display driver/Adrenalin), which otherwise causes errors.
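Mechanically, the trick is just to get the bundled HIP runtime loaded by full path before anything resolves it by name. A hypothetical sketch of that (not the actual _rocm_init code; the directory and DLL names are illustrative):

```c
#include <windows.h>
#include <stdio.h>

/* Preload the ROCm-bundled HIP runtime from an explicit directory.
 * Windows reuses an already-loaded module for later by-name
 * LoadLibrary("amdhip64_7.dll") calls, so aimdo then binds to this
 * copy instead of the driver-installed system-wide one. */
static int preload_bundled_hip(const wchar_t *sdk_bin_dir) {
    wchar_t path[MAX_PATH];
    swprintf(path, MAX_PATH, L"%ls\\amdhip64_7.dll", sdk_bin_dir);
    return LoadLibraryExW(path, NULL, LOAD_WITH_ALTERED_SEARCH_PATH) != NULL;
}
```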

Collaborator:

@0xDELUXA @tvukovic-amd So IIUC pytorch will have the same logic, right? I'm currently working on converting this to linkless, to make pytorch the sole authority on what GPU libs get loaded, so if that's the only reason, we can drop this change in that approach.

Reply:

aimdo can be built with both DLLs, but it fails on the user’s side when using the system-wide DLL, which is preferred when this workaround is not applied.

Collaborator:

Does the system DLL solve a particular problem the pyt/portable-bundled one does not? If that bundled version sucks, we should fix the comfy build.

Reply:

We have issues with the system DLL, it causes hangs. aimdo should use the ROCm-bundled one.

Collaborator:

> We have issues with the system DLL, it causes hangs. aimdo should use the ROCm-bundled one.

This?

ComfyUI_windows_portable/python_embeded/Lib/site-packages/_rocm_sdk_core/bin/amdhip64_7.dll

For the moment I am assuming a comfy-portable installation on top of the portal-recommended driver:

As of the time of writing this you need this driver for best results:
https://www.amd.com/en/resources/support-articles/release-notes/RN-AMDGPU-WINDOWS-PYTORCH-7-1-1.html

@0xDELUXA commented Apr 20, 2026:

@rattus128
In my opinion, most AMD/Windows users update to the latest Adrenalin drivers (e.g. 26.3.1) and use TheRock. These driver-specific PyTorch versions feel unnecessary.

TheRock provides more up-to-date features and way broader hardware support compared to the driver release notes (e.g. RDNA4 support and limited RDNA3 coverage, mainly RX 7900 XTX), which is really inconsistent. Also, these driver PyTorch versions can't be considered more "stable" than TheRock at all.

AFAIK, in the future TheRock and these driver-PyTorch releases will converge; it doesn't make much sense for AMD to release PyTorch from two separate sources.

rattus128 added a commit that referenced this pull request Apr 17, 2026
rattus128 mentioned this pull request Apr 20, 2026
@rattus128 (Collaborator)

Merged to https://github.com/Comfy-Org/comfy-aimdo/pull/35/changes

@0xDELUXA @Apophis3158 please feel free to take a look on Windows.

Currently I crash example.py with:

Windows fatal exception: access violation

Stack (most recent call first):
  File "C:\users\rattu\ComfyUI_windows_portable_amd\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfy_aimdo\model_vbar.py", line 50 in __init__
  File "C:\users\rattu\example.py", line 96 in <module>
Traceback (most recent call last):
  File "C:\users\rattu\example.py", line 96, in <module>
    vbar2 = ModelVBAR(gpu_size * 5, device=0) #The vbar can be much bigger than VRAM
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\rattu\ComfyUI_windows_portable_amd\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfy_aimdo\model_vbar.py", line 50, in __init__
    self._ptr = lib.vbar_allocate(self._devctx, int(size), device)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: exception: access violation reading 0x00000000000000E0
aimdo[DEBUG] src\model-vbar.c:379: vbar_free: vbar=0000016FC2CFA3F0

But I also crash like this on the @Apophis3158 branch. So it's likely to be my combo of hardware and AMD stack.

I've made an effort to simplify the build and linkage approach across AMD and Nvidia, which is why it moves some distance from this PR.

@0xDELUXA commented Apr 20, 2026

@rattus128 Great work! Will give it a try shortly.

Local output of example.py on gfx1200 (Windows 11 with TheRock ROCm 7.13.0a + PyTorch 2.13.0a0)
(venv) PS C:\Users\deluxa> pip uninstall comfy-aimdo -y
Found existing installation: comfy-aimdo 0.2.12
Uninstalling comfy-aimdo-0.2.12:
  Successfully uninstalled comfy-aimdo-0.2.12

(venv) PS C:\Users\deluxa> pip install comfy_aimdo-0.0.214.dev33-cp39-abi3-win_amd64.whl
Processing comfy_aimdo-0.0.214.dev33-cp39-abi3-win_amd64.whl
Installing collected packages: comfy-aimdo
Successfully installed comfy-aimdo-0.0.214.dev33

(venv) PS C:\Users\deluxa> python C:\comfy-aimdo\examples\example.py
aimdo: src-win/cuda-detour.c:38:INFO:aimdo_setup_hooks: installing 6 hooks
aimdo: src-win/shmem-detect.c:80:INFO:comfy-aimdo WDDM adapter match: AMD Radeon RX 9060 XT runtime_luid=00000000:0001546b dxgi_luid=00000000:0001546b
aimdo: src/control.c:152:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded


Iteration 0
[First Load] Populated weight at offset: 0.0M
[First Load] Populated weight at offset: 814.0M
[First Load] Populated weight at offset: 1628.0M
[First Load] Populated weight at offset: 2442.0M
[First Load] Populated weight at offset: 3256.0M
[First Load] Populated weight at offset: 4070.0M
[First Load] Populated weight at offset: 4884.0M
[First Load] Populated weight at offset: 5698.0M
[First Load] Populated weight at offset: 6512.0M
[First Load] Populated weight at offset: 7326.0M
[First Load] Populated weight at offset: 8140.0M
[First Load] Populated weight at offset: 8954.0M
[First Load] Populated weight at offset: 9768.0M
[First Load] Populated weight at offset: 10582.0M
[First Load] Populated weight at offset: 11396.0M
[First Load] Populated weight at offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model


Iteration 0
[First Load] Populated weight at offset: 0.0M
[First Load] Populated weight at offset: 1628.0M
[First Load] Populated weight at offset: 3256.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 3256.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 3256.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first iteration
The rest will be offloaded again


Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[First Load] Populated weight at offset: 4884.0M
[First Load] Populated weight at offset: 5698.0M
[First Load] Populated weight at offset: 6512.0M
[First Load] Populated weight at offset: 7326.0M
[First Load] Populated weight at offset: 8140.0M
[First Load] Populated weight at offset: 8954.0M
[First Load] Populated weight at offset: 9768.0M
[First Load] Populated weight at offset: 10582.0M
[First Load] Populated weight at offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...
Exception ignored in: <function ModelVBAR.__del__ at 0x000001F081E9C0E0>
Traceback (most recent call last):
  File "C:\venv\Lib\site-packages\comfy_aimdo\model_vbar.py", line 122, in __del__
AttributeError: 'NoneType' object has no attribute 'lib'
Exception ignored in: <function ModelVBAR.__del__ at 0x000001F081E9C0E0>
Traceback (most recent call last):
  File "C:\venv\Lib\site-packages\comfy_aimdo\model_vbar.py", line 122, in __del__
AttributeError: 'NoneType' object has no attribute 'lib'

(venv) PS C:\Users\deluxa>

Looks good. The AttributeError at the end probably needs a fix similar to this.
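
For reference, the usual shape of that fix: __del__ can fire during interpreter shutdown, after module globals (such as the loaded library handle) have already been set to None, so it must not reach through them unguarded. A minimal sketch under that assumption; the names are illustrative, not the real model_vbar.py:

class ModelVBAR:
    def __del__(self):
        # At interpreter teardown, module globals and attributes may
        # already be None; this is exactly what produces the
        # AttributeError above. Guard every lookup and bail out quietly.
        lib = getattr(self, "lib", None)
        ptr = getattr(self, "_ptr", None)
        if lib is None or ptr is None:
            return
        lib.vbar_free(ptr)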

@asagi4
Copy link
Copy Markdown
Contributor Author

asagi4 commented Apr 20, 2026

Since #35 now exists, I'll close this. There are way too many comments in this one already anyway.

@asagi4 asagi4 closed this Apr 20, 2026
@Apophis3158
Copy link
Copy Markdown
Contributor

But I also crash like this on the @Apophis3158 branch. So it's likely to be my combo of hardware and AMD stack.

No problem on my end either, same setup using ROCm 7.13.0a + PyTorch 2.13.0a0.

I guess you might be using ROCm 7.2.1? With that release I would always get a BSOD at the beginning of the second example.py run. There could be issues with hipMalloc or hipFree in that release, which is why I switched to TheRock 7.12.

Give it a try: https://repo.amd.com/rocm/whl/ or the nightlies at https://rocm.nightlies.amd.com/v2/.
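
For example, something like this should pull a TheRock nightly into a venv; the per-architecture index path is my assumption from the index layout (gfx120X-all covering RDNA4 cards like the RX 9060 XT), so check the listing for your GPU family:

pip install --pre torch --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/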
