[DRAFT] Barebones ROCM support #2
Conversation
|
Oh, AMD support has entered the chat 🚀 |
|
Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built in the console. So we can get past these warnings:
I got curious and checked what Dependencies reports. Out of the three My custom-built Now that it loads, I'm curious whether it actually works as intended or just errors out. I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows. Summary:
Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs. |
|
If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us. |
Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have |
|
@asagi4 Just wanted to check in - is there any update or further progress on this PR? |
Force-pushed eb2e747 to e95bb5c
|
@tvukovic-amd Well, I can't do much beyond running hipify and making it compile; I don't know enough about ROCM to debug any issues. I rebased against master to get it to compile again, but it's untested. |
|
With latest master it seems to be completely broken. All VRAM allocations fail with |
|
After @asagi4 confirmed that the latest updates break I'm sure
After further benchmarking, some workloads still trigger GPU hangs, while others run fine. Previously, neither of them ran successfully. It seems that the new |
|
@0xDELUXA you mean you can run hipify without changes to master? How did you manage that? |
Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat |
|
Which version of ROCM do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all |
ROCm: |
|
I managed to locally fix things so that aimdo works for me again. Apparently vrambuf_create somehow works on CUDA without aligning to chunk size, but with HIP (on Linux?) it fails. I don't know why it works on Windows. |
|
I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar. |
Force-pushed e95bb5c to 9c4c215
|
with these changes things work for me again on Linux. Or at least one workflow ran successfully. Previously pretty much all allocations failed with "invalid argument" when mapping new vram allocations, presumably because the vram buffers weren't aligned to the defined chunk size. |
|
Hm, with the latest changes to master the fixing has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions they override, and hipify/clang doesn't like that. For example, they're defined to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresults... I'll try to see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour. @rattus128 what do you think? |
Force-pushed 9c4c215 to 51d4d2f
|
Now it compiles, loads and appears to work again. Haven't stress-tested though. |
|
Have you run any workload that exceeds VRAM and would OOM without Does the original example.py work on your system? Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows. |
|
@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour; I don't know what exactly it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument". I wonder if since the pointer it's working with is I can't help with Windows at all, unfortunately. It's been a long time since I last used it for anything. |
I see. I don’t really think the
Not a problem - the build script from my fork works on Windows; as you said, "at least it compiles and runs, so it's a start." |
|
I'm rather curious about how your AMD Linux implementation behaves. Could you try running example.py pls? My output on Windows is this. |
|
@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere. |
|
I needed to modify it as well, and this one works for me. Commented out |
|
I fixed the script and it gives me this: |
|
I see. I've also added some debug output, but shouldn't the script also print |
|
It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically. |
Doesn't --disable-smart-memory --cache-none do that? Or is that incompatible with aimdo? |
It does, but it also disables caching outputs which makes re-running workflows with partial changes very inefficient so it's not really a solution. |
I had WAN2.2 14B generation working, generated a couple of videos. However, I did have to drop the video size to 512x512, otherwise the VAE decode seems to fail and outputs all grey video. Here are some videos I generated: https://imgur.com/a/H5p3DBR. I don't have the ComfyUI logs anymore, but I could give it another try.
How much RAM do you have? My system is capped at 32GB and it looks like it's just not enough to run WAN2.2. I don't think the checkpoints themselves are corrupted or anything since I can swap them around and the first one will always successfully run, so it is still just a memory management problem. |
So, with the fix in the PR you still have issues while running WAN models? |
|
I’m able to run WAN 2.2 with 32 GB of system RAM. It’s slow, but I don’t get any OOMs (on Windows). |
Yeah. I think the ROCm bug is fixed, but even a working aimdo apparently isn't enough for me to run those models. I think it might still be offloading from GPU to RAM instead of back to disk so I'm probably just running out of RAM and that causes breakage somewhere. I don't know how to find out what's breaking though since the only symptom is that ComfyUI gets stuck at 100% CPU usage, seemingly forever (the longest I've let it run is one hour) |
I tried to run WAN again today and noticed it is now crashing the Python process for me. I'm also noticing the 'load entire model into RAM first -> then into VRAM' behaviour, which I don't remember having earlier; it seems to be what is crashing the workflow (I also have 32GB RAM): it runs out of RAM while the GPU still has 10GB dedicated and 20GB shared VRAM free. The only thing I have changed in the meantime is that I switched to TheRock for PyTorch (from the one that is supplied with comfyui-desktop). I'll try switching back this evening and see if that 'fixes' it. Edit: in the meantime, what would work as a workaround is splitting it into two workflows: the first does the low-noise diffusion -> save latents; the second workflow + manual flush -> load latents from workflow 1 into 2 -> continue. |
|
I've had luck with Wan2.2 on AMD with TheRock by switching to GGUF models and being aggressive with unloading. Clip + VAE + LoRA (sometimes low+high) + models (low+high) is a mighty big ask for any consumer-grade GPU even with unloading, and once the models overflow into RAM, gen time goes through the roof |
Reinstalled ComfyUI desktop with the Adrenalin driver ROCm and PyTorch. The workflow now completes again like before. Workflow: video_wan2_2_14B_i2v (default Comfy) -> changed resolution from 640x640 to 512x512. Output: https://imgur.com/a/s1pcLvt Logs: So it seems to me there is still an issue either in TheRock PyTorch, or TheRock ROCm itself. |
|
Running the workflow with some more verbose logging, it seems like it manages to do something with the second model. It spams a lot of @rattus128, do you have any idea what is going on there? |
The |
I'm aware they're from comfy-kitchen, it's just weird that the messages just go from spamming fairly quickly to full stop suddenly. It looks like the model is running properly at first, but then ComfyUI hits some kind of threshold and gets stuck in an offload loop or something and stops making any progress. |
|
This happens when I start working on something without doing any research on prior art. Anyway, I did make my own ROCm fork before realizing you had one already, so I'll just share it; maybe it can be of some use. It "#worksforme", although I only did some minimal testing with a handful of generations. However, I did not experience the OOMs and crash issues you seem to have had with your port, so maybe there is some implementation difference (I didn't have the chance to look into your PR yet). (Also, disclaimer: I used some AI to help me out and learn; this would otherwise be way over my head.) |
|
@norbert-sule It looks like your code does pretty much the same thing as mine. Are you running it on Linux? If you are, you should be hitting the same ROCm virtual memory bug that I did and leaking memory, since that's just a ROCm bug unrelated to aimdo. With the memory bug fixed, aimdo pretty much works; I don't get OOMs or crashes with ComfyUI. It's just that on my system trying to run 2x 14B WAN models apparently just doesn't work (it didn't work pre-aimdo and it still doesn't). |
The fix for issue Comfy-Org/comfy-kitchen#32 has been merged into pytorch main (here is the PR with the solution).
|
Hey everyone, thanks for the huge efforts. I've just merged a PR to master that is going to conflict. Rather than sending you back to square one with those merge conflicts, though, feel free to leave this a few days, as I will still analyze the approaches relative to your merge base. I have a few ideas on how to make this easier, especially from a builds point of view. Over the next couple of days I'm going to catch up on the history and approach and see where we are at. This is aimdo's next planned feature as of this writing. If there's any sense of "this is still an unresolved problem" on any front, let me know. There's a lot of history here! |
Quite a lot, actually - I don’t remember seeing a conversation this long in any AMD-related PR XD Great to see openness to AMD support! We’re happy to help test on our systems as things move forward. At the moment, there don’t seem to be any unresolved issues on Windows AFAIK (thx to @tvukovic-amd). |
@@ -1,4 +1,4 @@
#include "plat.h"
#include "../src/plat.h"
this should be fixed now in the build scriptage.
# define VRAM_CHUNK_SIZE CUDA_PAGE_SIZE
#else
# define VRAM_CHUNK_SIZE (16ULL * 1024 * 1024)
#endif
Is this still needed after that ROCm fix for the leak, or is it a different thing?
Looks like this was introduced in asagi4@9f2d2fa, with the commit message:
Aligning up to chunk size is still needed, otherwise I get an immediate OOM
@asagi4 Is this still the case?
env/
.vscode/
comfy_aimdo/_version.py
.clang-format
We should remove the IDE specific gitignore content.
if implementation == AimdoImpl.ROCM:
    try:
        from . import _rocm_init
        _rocm_init.initialize()
what was the history of this and why is the situation different for AMD? Can we just let pytorch load everything and hook after?
This is @tvukovic-amd's solution to make aimdo use the DLL from rocm_sdk_core instead of the system-wide version (e.g., installed with the display driver/Adrenalin), which otherwise causes errors.
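A minimal sketch of that kind of workaround (the module and DLL names here are assumptions based on this thread, not the actual _rocm_init implementation): locate the ROCm SDK wheel's bin directory and load the HIP runtime from there, before anything pulls in the system-wide copy.

```python
import ctypes
import importlib.util
import os
import sys

def load_bundled_hip(dll_name: str = "amdhip64_7.dll"):
    """Prefer the HIP runtime DLL shipped in the (assumed)
    _rocm_sdk_core wheel over the system-wide one installed by the
    Adrenalin driver. Returns the loaded library, or None when the
    wheel is not installed or we are not on Windows."""
    spec = importlib.util.find_spec("_rocm_sdk_core")
    if spec is None or sys.platform != "win32":
        return None
    bin_dir = os.path.join(os.path.dirname(spec.origin), "bin")
    # Let the wheel's bin/ directory take part in DLL resolution,
    # then load the runtime from it explicitly.
    os.add_dll_directory(bin_dir)
    return ctypes.CDLL(os.path.join(bin_dir, dll_name))
```

Loading the bundled DLL first means later loads of the same module name resolve to the already-loaded copy rather than the system-wide one.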
@0xDELUXA @tvukovic-amd so IIUC pytorch will have the same logic, right? I'm currently working on converting this to linkless to make pytorch the sole authority on what GPU libs get loaded, so if that's the only reason we can drop this change in that approach.
aimdo can be built with both DLLs, but it fails on the user’s side when using the system-wide DLL, which is preferred when this workaround is not applied.
Does the system DLL solve a particular problem the pyt/portable-bundled one does not? If that bundled version sucks we should fix comfy build.
We have issues with the system DLL, it causes hangs. aimdo should use the ROCm-bundled one.
We have issues with the system DLL, it causes hangs. aimdo should use the ROCm-bundled one.
This?
ComfyUI_windows_portable/python_embeded/Lib/site-packages/_rocm_sdk_core/bin/amdhip64_7.dll
For the moment I am assuming comfy-portable installation on top of the portal recommended driver:
As of the time of writing this you need this driver for best results:
https://www.amd.com/en/resources/support-articles/release-notes/RN-AMDGPU-WINDOWS-PYTORCH-7-1-1.html
@rattus128
In my opinion, most AMD/Windows users update to the latest Adrenalin drivers (e.g. 26.3.1) and use TheRock. These driver-specific PyTorch versions feel unnecessary.
TheRock provides more up-to-date features and way broader hardware support compared to the driver release notes (e.g. RDNA4 support and limited RDNA3 coverage, mainly RX 7900 XTX), which is really inconsistent. Also, these driver PyTorch versions can't be considered more "stable" than TheRock at all.
AFAIK, TheRock and these driver-PyTorch releases will converge in the future; it doesn't make much sense for AMD to release PyTorch from two separate sources.
|
Merged to https://github.com/Comfy-Org/comfy-aimdo/pull/35/changes @0xDELUXA @Apophis3158 please feel free to take a look on Windows. Currently I crash example.py with: But I also crash like this on the @Apophis3158 branch, so it's likely to be my combo of hardware and AMD stack. I've made an effort to simplify the build and linkage approach across AMD and Nvidia, which is why it moves a distance from this PR. |
|
@rattus128 Great work! Will give it a try shortly. Local output of example.py on gfx1200 (Windows 11 with TheRock ROCm 7.13.0a + PyTorch 2.13.0a0):
(venv) PS C:\Users\deluxa> pip uninstall comfy-aimdo -y
Found existing installation: comfy-aimdo 0.2.12
Uninstalling comfy-aimdo-0.2.12:
Successfully uninstalled comfy-aimdo-0.2.12
(venv) PS C:\Users\deluxa> pip install comfy_aimdo-0.0.214.dev33-cp39-abi3-win_amd64.whl
Processing comfy_aimdo-0.0.214.dev33-cp39-abi3-win_amd64.whl
Installing collected packages: comfy-aimdo
Successfully installed comfy-aimdo-0.0.214.dev33
(venv) PS C:\Users\deluxa> python C:\comfy-aimdo\examples\example.py
aimdo: src-win/cuda-detour.c:38:INFO:aimdo_setup_hooks: installing 6 hooks
aimdo: src-win/shmem-detect.c:80:INFO:comfy-aimdo WDDM adapter match: AMD Radeon RX 9060 XT runtime_luid=00000000:0001546b dxgi_luid=00000000:0001546b
aimdo: src/control.c:152:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded
Iteration 0
[First Load] Populated weight at offset: 0.0M
[First Load] Populated weight at offset: 814.0M
[First Load] Populated weight at offset: 1628.0M
[First Load] Populated weight at offset: 2442.0M
[First Load] Populated weight at offset: 3256.0M
[First Load] Populated weight at offset: 4070.0M
[First Load] Populated weight at offset: 4884.0M
[First Load] Populated weight at offset: 5698.0M
[First Load] Populated weight at offset: 6512.0M
[First Load] Populated weight at offset: 7326.0M
[First Load] Populated weight at offset: 8140.0M
[First Load] Populated weight at offset: 8954.0M
[First Load] Populated weight at offset: 9768.0M
[First Load] Populated weight at offset: 10582.0M
[First Load] Populated weight at offset: 11396.0M
[First Load] Populated weight at offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M
Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M
Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M
Iteration 3
...
Iteration 4
...
Iteration 5
...
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model
Iteration 0
[First Load] Populated weight at offset: 0.0M
[First Load] Populated weight at offset: 1628.0M
[First Load] Populated weight at offset: 3256.0M
Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 3256.0M
Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 3256.0M
Iteration 3
...
Iteration 4
...
Iteration 5
...
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first interation
The rest will be offloaded again
Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[First Load] Populated weight at offset: 4884.0M
[First Load] Populated weight at offset: 5698.0M
[First Load] Populated weight at offset: 6512.0M
[First Load] Populated weight at offset: 7326.0M
[First Load] Populated weight at offset: 8140.0M
[First Load] Populated weight at offset: 8954.0M
[First Load] Populated weight at offset: 9768.0M
[First Load] Populated weight at offset: 10582.0M
[First Load] Populated weight at offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M
Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M
Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 814.0M
[No Load Needed] Reusing weight at offset: 1628.0M
[No Load Needed] Reusing weight at offset: 2442.0M
[No Load Needed] Reusing weight at offset: 3256.0M
[No Load Needed] Reusing weight at offset: 4070.0M
[No Load Needed] Reusing weight at offset: 4884.0M
[No Load Needed] Reusing weight at offset: 5698.0M
[No Load Needed] Reusing weight at offset: 6512.0M
[No Load Needed] Reusing weight at offset: 7326.0M
[No Load Needed] Reusing weight at offset: 8140.0M
[No Load Needed] Reusing weight at offset: 8954.0M
[No Load Needed] Reusing weight at offset: 9768.0M
[No Load Needed] Reusing weight at offset: 10582.0M
[Offloaded] offset: 11396.0M
[Offloaded] offset: 12210.0M
[Offloaded] offset: 13024.0M
[Offloaded] offset: 13838.0M
[Offloaded] offset: 14652.0M
[Offloaded] offset: 15466.0M
[Offloaded] offset: 16280.0M
[Offloaded] offset: 17094.0M
[Offloaded] offset: 17908.0M
[Offloaded] offset: 18722.0M
[Offloaded] offset: 19536.0M
[Offloaded] offset: 20350.0M
[Offloaded] offset: 21164.0M
[Offloaded] offset: 21978.0M
[Offloaded] offset: 22792.0M
[Offloaded] offset: 23606.0M
Iteration 3
...
Iteration 4
...
Iteration 5
...
Exception ignored in: <function ModelVBAR.__del__ at 0x000001F081E9C0E0>
Traceback (most recent call last):
File "C:\venv\Lib\site-packages\comfy_aimdo\model_vbar.py", line 122, in __del__
AttributeError: 'NoneType' object has no attribute 'lib'
Exception ignored in: <function ModelVBAR.__del__ at 0x000001F081E9C0E0>
Traceback (most recent call last):
File "C:\venv\Lib\site-packages\comfy_aimdo\model_vbar.py", line 122, in __del__
AttributeError: 'NoneType' object has no attribute 'lib'
(venv) PS C:\Users\deluxa>
Looks good. The AttributeError at the end probably needs a fix similar to this. |
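That kind of shutdown-time AttributeError is usually fixed by making __del__ defensive, since at interpreter teardown attributes and module globals may already be gone. An illustrative pattern (hypothetical stand-in class, not the real ModelVBAR):

```python
class ModelVBAR:
    """Illustrative stand-in for an object that releases a native
    library handle in __del__."""

    def __init__(self, lib):
        self._lib = lib

    def __del__(self):
        # getattr with a default never raises AttributeError, even if
        # __init__ failed part-way or teardown already cleared state.
        lib = getattr(self, "_lib", None)
        if lib is not None:
            lib.close()

class FakeLib:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

lib = FakeLib()
v = ModelVBAR(lib)
del v              # CPython refcounting runs __del__ here
print(lib.closed)  # True: close ran via the guarded __del__
```

The same guard applies to any other state __del__ touches; anything that can be None at teardown gets checked before use.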
|
Since #35 now exists, I'll close this. There are way too many comments in this one already anyway. |
No problem on my end either, same using ROCm 7.13.0a + PyTorch 2.13.0a0. I guess you might be using ROCm 7.2.1? When using that release, I would always get a BSOD at the beginning of the second example.py run. There could be issues with hipMalloc or hipFree in that release, so I switched to TheRock 7.12. Give it a try: https://repo.amd.com/rocm/whl/ or the nightlies https://rocm.nightlies.amd.com/v2/. |
This is not really intended for merging as is, but for reference. hipify-clang can convert the CUDA code to HIP code pretty easily with a few fixes, and it actually allows you to run aimdo on ROCM.
You might have to make sure your Python venv is using your system ROCM libraries for this to work.
It does not work perfectly (I'm still getting pytorch OOMs when it should be freeing memory) but workflows can run and produce good output.
I am not able to test, but the HIP code should be compilable as is on Nvidia platforms too. If you run build-rocm on an Nvidia platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCM, and the result should be basically identical to the CUDA implementation.