fix: implement ELF GOT patching for CUDA hooks on Linux/WSL2 #27

Open
WASasquatch wants to merge 1 commit into Comfy-Org:master from WASasquatch:fix/linux-wsl2-cuda-hooks

@WASasquatch

On Linux, libcudart resolves driver API symbols via dlsym at runtime, so the existing no-op hook stubs left the GOT entries unpatched and VRAM tracking was broken.
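
For readers unfamiliar with the technique, here is a minimal sketch of what patching a GOT entry involves. It is not the code in src-posix/cuda-hook.c; it assumes x86-64 Linux with glibc, and every name in it (`got_patch`, `install_got_hook`, `fake_getpid`) is hypothetical. It uses `getpid` as a stand-in target so it runs without CUDA.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <elf.h>
#include <link.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical request record: symbol to hook and where to redirect it. */
struct got_patch {
    const char *symbol;
    void *replacement;
    int patched;
};

/* glibc usually rewrites dynamic-section pointers to absolute addresses, but
 * read-only dynamic sections (e.g. the vDSO) keep offsets. Treat values below
 * the load base as offsets. This is a heuristic, not a guarantee. */
static uintptr_t dyn_addr(uintptr_t base, uintptr_t v)
{
    return v < base ? base + v : v;
}

static void patch_table(uintptr_t base, const ElfW(Rela) *rela, size_t bytes,
                        const ElfW(Sym) *symtab, const char *strtab,
                        struct got_patch *req)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    for (size_t i = 0; i < bytes / sizeof(*rela); i++) {
        unsigned long type = ELF64_R_TYPE(rela[i].r_info);
        if (type != R_X86_64_JUMP_SLOT && type != R_X86_64_GLOB_DAT)
            continue;
        const ElfW(Sym) *sym = &symtab[ELF64_R_SYM(rela[i].r_info)];
        if (strcmp(strtab + sym->st_name, req->symbol) != 0)
            continue;
        void **slot = (void **)(base + rela[i].r_offset);
        uintptr_t start = (uintptr_t)slot & ~((uintptr_t)page - 1);
        /* RELRO may have made the slot read-only. Make its page writable
         * and leave it R+W afterwards. */
        if (mprotect((void *)start, (size_t)page, PROT_READ | PROT_WRITE) != 0)
            continue;
        *slot = req->replacement;
        req->patched++;
    }
}

/* dl_iterate_phdr callback: find the dynamic section of one loaded object
 * and patch matching slots in both its PLT and non-PLT relocation tables. */
static int patch_object(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    uintptr_t base = info->dlpi_addr;
    const ElfW(Dyn) *dyn = NULL;
    for (unsigned i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_DYNAMIC)
            dyn = (const ElfW(Dyn) *)(base + info->dlpi_phdr[i].p_vaddr);
    if (dyn == NULL)
        return 0;

    const ElfW(Sym) *symtab = NULL;
    const char *strtab = NULL;
    const ElfW(Rela) *plt = NULL, *rela = NULL;
    size_t plt_sz = 0, rela_sz = 0;
    for (; dyn->d_tag != DT_NULL; dyn++) {
        uintptr_t v = dyn->d_un.d_ptr;
        switch (dyn->d_tag) {
        case DT_SYMTAB:   symtab = (const void *)dyn_addr(base, v); break;
        case DT_STRTAB:   strtab = (const void *)dyn_addr(base, v); break;
        case DT_JMPREL:   plt = (const void *)dyn_addr(base, v); break;
        case DT_RELA:     rela = (const void *)dyn_addr(base, v); break;
        case DT_PLTRELSZ: plt_sz = dyn->d_un.d_val; break;
        case DT_RELASZ:   rela_sz = dyn->d_un.d_val; break;
        }
    }
    if (symtab != NULL && strtab != NULL) {
        if (plt != NULL)
            patch_table(base, plt, plt_sz, symtab, strtab, data);
        if (rela != NULL)
            patch_table(base, rela, rela_sz, symtab, strtab, data);
    }
    return 0; /* zero keeps the iteration going */
}

/* Returns the number of GOT slots rewritten across all loaded objects. */
int install_got_hook(const char *symbol, void *replacement)
{
    struct got_patch req = { symbol, replacement, 0 };
    dl_iterate_phdr(patch_object, &req);
    return req.patched;
}

/* Demo replacement, standing in for a real CUDA hook. */
static pid_t fake_getpid(void) { return 4242; }
```

Calling `install_got_hook("getpid", (void *)fake_getpid)` redirects later `getpid()` calls made through a patched slot to the replacement. A real hook would save the original slot value first so it can call through to it.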

This commit:

  • Adds src-posix/cuda-hook.c, which uses ELF GOT/PLT patching to hook both the CUDA runtime API (cudaMalloc/cudaFree/cudaMallocAsync/cudaFreeAsync) and the driver API (cuMemAlloc_v2/cuMemFree_v2/cuMemAllocAsync/cuMemFreeAsync)
  • Makes the runtime hooks call through to the real runtime API, preserving CUDA memory pools
  • Tracks allocations in a self-contained, thread-safe hash table
  • Resolves runtime symbols via RTLD_DEFAULT, which handles versioned sonames
  • Leaves GOT pages R+W after patching, for compatibility with lazy resolution
  • Adds -ldl to the build scripts for dlopen/dlsym/dl_iterate_phdr
  • Removes the dead symbol-interposition overrides from pyt-cu-plug-alloc-async.c
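
As an illustration of the allocation-tracking item above, here is a minimal sketch of a mutex-guarded hash table keyed by pointer. It is an assumption about the general shape, not the PR's implementation; `track_alloc`, `track_free`, `track_total`, and the bucket count are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TRACK_BUCKETS 1024 /* must be a power of two */

struct track_node {
    void *ptr;               /* allocation address used as the key */
    size_t size;             /* bytes recorded for that allocation */
    struct track_node *next; /* chain for colliding keys */
};

static struct track_node *buckets[TRACK_BUCKETS];
static size_t total_bytes;
static pthread_mutex_t track_lock = PTHREAD_MUTEX_INITIALIZER;

static size_t bucket_of(const void *ptr)
{
    /* Allocations are aligned, so drop the low bits before mixing. */
    uintptr_t h = (uintptr_t)ptr >> 4;
    h ^= h >> 16;
    return (size_t)h & (TRACK_BUCKETS - 1);
}

/* Record an allocation. Returns 0 on success, -1 if out of host memory. */
int track_alloc(void *ptr, size_t size)
{
    assert(ptr != NULL);
    struct track_node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->ptr = ptr;
    n->size = size;
    size_t b = bucket_of(ptr);
    pthread_mutex_lock(&track_lock);
    n->next = buckets[b];
    buckets[b] = n;
    total_bytes += size;
    pthread_mutex_unlock(&track_lock);
    return 0;
}

/* Forget an allocation. Returns its recorded size, or 0 if it was unknown. */
size_t track_free(void *ptr)
{
    size_t size = 0;
    size_t b = bucket_of(ptr);
    pthread_mutex_lock(&track_lock);
    for (struct track_node **pp = &buckets[b]; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->ptr == ptr) {
            struct track_node *n = *pp;
            *pp = n->next;
            size = n->size;
            total_bytes -= size;
            free(n);
            break;
        }
    }
    pthread_mutex_unlock(&track_lock);
    return size;
}

/* Total bytes currently tracked. */
size_t track_total(void)
{
    pthread_mutex_lock(&track_lock);
    size_t t = total_bytes;
    pthread_mutex_unlock(&track_lock);
    return t;
}
```

In a hook, `track_alloc` would run after a successful allocation call and `track_free` before or after the matching free. A single global lock keeps the sketch simple; per-bucket locks would reduce contention at the cost of more code.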

Tested on RTX 4090 under WSL2 (Ubuntu 24.04) with PyTorch 2.10+cu128.

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.
