Implement threaded loader components (CORE-43)#46
Currently, loading models from disk suffers from a bad mix of single-core CPU work and serialized disk activity.
So this change creates two thread pools to:
1: Prefault memory in the background while the main thread continues.
2: Do chunked + threaded transfers from disk to RAM (a rough sketch follows below).
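Roughly, the split looks like the sketch below. This is a minimal illustration only, not the actual loader code: the helper names (`prefault_async`, `chunked_read`), the use of `std::async` in place of real pools, and the chunk size are all assumptions.

```cpp
// Minimal sketch of the two-pool split; std::async stands in for real pools.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Pool 1: touch one byte per page so the OS commits pages ahead of use,
// while the calling thread keeps working.
static std::future<void> prefault_async(char* base, std::size_t bytes) {
    return std::async(std::launch::async, [base, bytes] {
        const std::size_t page = 4096;
        for (std::size_t off = 0; off < bytes; off += page)
            base[off] = 0;  // the write fault commits the page
    });
}

// Pool 2: fill a RAM buffer from disk in fixed-size chunks, several in flight.
// Each worker opens its own handle so the reads proceed independently.
static void chunked_read(const std::string& path, char* dst, std::size_t bytes,
                         std::size_t chunk = std::size_t{16} << 20) {
    std::vector<std::future<void>> jobs;
    for (std::size_t off = 0; off < bytes; off += chunk) {
        const std::size_t len = std::min(chunk, bytes - off);
        jobs.push_back(std::async(std::launch::async, [&path, dst, off, len] {
            std::ifstream f(path, std::ios::binary);
            f.seekg(static_cast<std::streamoff>(off));
            f.read(dst + off, static_cast<std::streamsize>(len));
        }));
    }
    for (auto& j : jobs) j.get();  // wait for all chunks to land
}
```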
If the memory is prefaulted, cudaHostRegister is really fast (far faster than cudaHostAlloc). So we use the background prefault of some headroom to completely hide this latency. The chunked transfer is also significantly faster than a pageable cudaMemcpyAsync.
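As a rough illustration of that path (not the actual implementation; the function name and error handling are placeholders), pinning an already-resident buffer and copying from it looks roughly like this:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Pin an already-prefaulted host buffer, upload it, then unpin.
// Registration is cheap here because no pages have to be faulted in.
int upload_via_registered(const void* host, std::size_t bytes, void* device,
                          cudaStream_t stream) {
    if (cudaHostRegister(const_cast<void*>(host), bytes,
                         cudaHostRegisterDefault) != cudaSuccess)
        return -1;

    // With a pinned source this is a real async DMA, instead of the driver
    // staging the data through a pageable bounce buffer behind the scenes.
    cudaError_t err = cudaMemcpyAsync(device, host, bytes,
                                      cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);               // finish before unpinning
    cudaHostUnregister(const_cast<void*>(host));
    return err == cudaSuccess ? 0 : -1;
}
```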
In addition to the above, the hostbuf API is also expanded to be growable, backed by the same prefaulter.
So the usage pattern is to use the new hostbuf APIs to create temporary buffers that are pinned, and to always transfer via that pinned memory.
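To make that pattern concrete, here is a self-contained toy version of such a pinned staging buffer. It is not the PR's hostbuf API; the class name and methods are invented for illustration, and the background prefault is collapsed into a direct cudaHostRegister for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

// Toy pinned staging buffer: hands out pinned space for temporary transfers.
class PinnedStaging {
public:
    explicit PinnedStaging(std::size_t capacity)
        : cap_(capacity), base_(static_cast<char*>(std::malloc(capacity))) {
        if (!base_) throw std::bad_alloc();
        // The real design prefaults this headroom on a background pool first;
        // here cudaHostRegister simply faults and pins in one step.
        if (cudaHostRegister(base_, cap_, cudaHostRegisterDefault) != cudaSuccess) {
            std::free(base_);
            throw std::runtime_error("cudaHostRegister failed");
        }
    }
    ~PinnedStaging() {
        cudaHostUnregister(base_);
        std::free(base_);
    }
    // Bump-allocate pinned space; a genuinely growable version would register
    // additional headroom here instead of failing.
    char* grow(std::size_t bytes) {
        if (used_ + bytes > cap_) throw std::runtime_error("staging exhausted");
        char* p = base_ + used_;
        used_ += bytes;
        return p;
    }
private:
    std::size_t cap_ = 0;
    std::size_t used_ = 0;
    char* base_ = nullptr;
};
```

Disk reads then land in space handed out by the staging buffer (e.g. via the chunked reader above), and uploads go to the device from that same pinned region.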
With the accompanying ComfyUI changes I get the following preliminary speedups.
Windows, RTX 5060, 64 GB, PCIe 4.0 x4 NVMe, LTX2.3 FP8 (no LoRA), 360p, first run on launch:
Before:
After:
Linux, RTX 5090, 96 GB, PCIe 5.0 x4 NVMe, LTX2.3 FP8 (no LoRA), 720p, first run on launch, caches cold:
Before:
After:
This is an Nsight trace of the model load on a warm load. The gaps between the green blocks are about what you would expect, given that the GPU can DMA faster than the memcpy into pinned memory.
More data to come on the Comfy side, along with how to use this in code.