
Implement threaded loader components (CORE-43) #46

Merged
alexisrolland merged 17 commits into master from dev/threaded-load on May 12, 2026

Conversation

@rattus128
Collaborator

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.

Currently, loading models from disk suffers from a bad mix of single-core CPU work and serialized disk activity.

So this PR creates two thread pools to:

1. Prefault memory in the background while the main thread continues.
2. Do chunked, threaded transfers from disk to RAM (see the sketch after this list).
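
For illustration, here is a minimal Python sketch of the two-pool pattern. The pool sizes, chunk size, and helper names are illustrative assumptions, not the PR's actual API, and the `os.preadv` path is POSIX-only:

```python
import os
from concurrent.futures import ThreadPoolExecutor, wait

import numpy as np

PAGE = 4096        # assumed OS page size
CHUNK = 32 << 20   # 32 MiB per read job; purely illustrative

prefault_pool = ThreadPoolExecutor(max_workers=1)  # background page-touching
io_pool = ThreadPoolExecutor(max_workers=8)        # parallel chunked reads

def prefault(buf: np.ndarray) -> None:
    # Write one byte per page so the OS commits the memory now, instead of
    # page-faulting in the middle of the copy (or of cudaHostRegister).
    buf[::PAGE] = 0

def read_chunk(fd: int, view: memoryview, offset: int, length: int) -> None:
    # os.preadv reads straight into our buffer and releases the GIL, so the
    # jobs in io_pool genuinely run in parallel. POSIX-only in this sketch.
    end = offset + length
    while offset < end:
        offset += os.preadv(fd, [view[offset:end]], offset)

def load_file(path: str) -> np.ndarray:
    size = os.path.getsize(path)
    buf = np.empty(size, dtype=np.uint8)
    fault = prefault_pool.submit(prefault, buf)  # main thread keeps going
    fault.result()  # in the real design, prefaulting runs ahead over headroom
    fd = os.open(path, os.O_RDONLY)
    try:
        view = memoryview(buf)
        jobs = [io_pool.submit(read_chunk, fd, view, off, min(CHUNK, size - off))
                for off in range(0, size, CHUNK)]
        wait(jobs)
        for j in jobs:
            j.result()  # re-raise any I/O error
    finally:
        os.close(fd)
    return buf
```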

If you prefault first, cudaHostRegister is really fast (far faster than cudaHostAlloc). So we use the background prefaulting of some headroom to completely hide this latency. The chunked transfer is also significantly faster than a pageable cudaMemcpyAsync.
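
A rough sketch of the prefault-then-register idea using PyTorch's cudart bindings; the buffer size and page-touch stride are assumptions, not the PR's code:

```python
import torch

N = 1 << 30  # 1 GiB staging buffer; illustrative size

# Plain pageable CPU tensor; its pages are not committed yet.
t = torch.empty(N, dtype=torch.uint8)

# Prefault: touch one byte per page so the OS backs every page with RAM.
# Once the pages already exist, cudaHostRegister only has to lock them,
# which is far cheaper than cudaHostAlloc creating and pinning new pages.
t[::4096] = 0

cudart = torch.cuda.cudart()
torch.cuda.check_error(
    cudart.cudaHostRegister(t.data_ptr(), t.numel() * t.element_size(), 0)
)
assert t.is_pinned()

# Copies from t now take the fast pinned-DMA path instead of the slower
# pageable cudaMemcpyAsync path, and non_blocking=True actually overlaps.
gpu = torch.empty(N, dtype=torch.uint8, device="cuda")
gpu.copy_(t, non_blocking=True)
torch.cuda.synchronize()

cudart.cudaHostUnregister(t.data_ptr())
```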

In addition to the above, the hostbuf API is expanded to be growable, backed by the same background prefaulter (a hypothetical sketch follows below).
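
A hypothetical sketch of what a growable pinned hostbuf could look like; the class and method names are invented for illustration and are not the PR's actual API:

```python
import torch

class GrowablePinnedBuffer:
    """Pinned staging buffer that grows geometrically on demand."""

    def __init__(self, initial_bytes: int = 64 << 20):
        self._buf = torch.empty(initial_bytes, dtype=torch.uint8, pin_memory=True)
        self._used = 0

    def _grow(self, needed: int) -> None:
        if needed <= self._buf.numel():
            return
        # Grow geometrically to amortize re-pinning. In the PR's design the
        # new headroom would be prefaulted by the background pool ahead of
        # time, hiding the pinning latency from the caller.
        new_size = max(needed, 2 * self._buf.numel())
        new_buf = torch.empty(new_size, dtype=torch.uint8, pin_memory=True)
        new_buf[: self._used].copy_(self._buf[: self._used])
        self._buf = new_buf

    def view(self, nbytes: int) -> torch.Tensor:
        # Hand out a pinned slice to stage one transfer through.
        self._grow(nbytes)
        self._used = nbytes
        return self._buf[:nbytes]
```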

So the usage pattern is: use the new hostbuf APIs to create pinned temporary buffers, and always transfer via that pinned memory (usage sketch below).
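
Continuing the hypothetical sketch above, usage might look like this: every host-to-device upload is staged through the pinned buffer, so the device copy always takes the pinned path:

```python
staging = GrowablePinnedBuffer()

def upload(cpu_tensor: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # Flatten to raw bytes so any dtype can share the same staging buffer.
    flat = cpu_tensor.contiguous().view(torch.uint8).view(-1)
    pinned = staging.view(flat.numel())
    pinned.copy_(flat)                    # memcpy into pinned memory
    out = torch.empty_like(pinned, device=device)
    out.copy_(pinned, non_blocking=True)  # async DMA from pinned memory
    return out.view(cpu_tensor.dtype).view(cpu_tensor.shape)
```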

With the accompanying ComfyUI changes, I get the following preliminary speedups.

Windows, RTX 5060, 64 GB RAM, PCIe 4.0 x4 NVMe, LTX2.3 FP8 (no LoRA), 360p, first run on launch:

Before:

Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 120.73 seconds

After:

Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 90.43 seconds

Linux, RTX 5090, 96 GB RAM, PCIe 5.0 x4 NVMe, LTX2.3 FP8 (no LoRA), 720p, first run on launch, caches cold:

Before:

Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 49.66 seconds

After:

Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 35.50 seconds

This is an Nsight trace of the model load on a warm run. The gaps between the green blocks are about what you would expect, given that the GPU can DMA faster than the memcpy into pinned memory.

[Nsight trace of the model load]

More data to come on the Comfy side, along with how to use this in code.

@rattus128 force-pushed the dev/threaded-load branch from 203d104 to d26273a on May 7, 2026 08:54
@alexisrolland alexisrolland merged commit 2579f02 into master May 12, 2026
12 checks passed
@rattus128 rattus128 deleted the dev/threaded-load branch May 12, 2026 04:37
