Implement threaded loader components (CORE-43)#46
Currently, loading models from disk suffers from a bad mix of single-core CPU work and serialized disk activity.
So this change creates two thread pools to:
1: Prefault memory in the background while the main thread continues.
2: Do chunked + threaded transfers from disk to RAM (a rough sketch follows below).
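Roughly, the split looks like the sketch below. This is a minimal illustration only, not the actual loader code: the helper names (`prefault_async`, `chunked_read`), the use of `std::async` in place of real pools, and the chunk size are all assumptions.

```cpp
// Minimal sketch of the two-pool split; std::async stands in for real pools.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Pool 1: touch one byte per page so the OS commits pages ahead of use,
// while the calling thread keeps working.
static std::future<void> prefault_async(char* base, std::size_t bytes) {
    return std::async(std::launch::async, [base, bytes] {
        const std::size_t page = 4096;
        for (std::size_t off = 0; off < bytes; off += page)
            base[off] = 0;  // the write fault commits the page
    });
}

// Pool 2: fill a RAM buffer from disk in fixed-size chunks, several in flight.
// Each worker opens its own handle so the reads proceed independently.
static void chunked_read(const std::string& path, char* dst, std::size_t bytes,
                         std::size_t chunk = std::size_t{16} << 20) {
    std::vector<std::future<void>> jobs;
    for (std::size_t off = 0; off < bytes; off += chunk) {
        const std::size_t len = std::min(chunk, bytes - off);
        jobs.push_back(std::async(std::launch::async, [&path, dst, off, len] {
            std::ifstream f(path, std::ios::binary);
            f.seekg(static_cast<std::streamoff>(off));
            f.read(dst + off, static_cast<std::streamsize>(len));
        }));
    }
    for (auto& j : jobs) j.get();  // wait for all chunks to land
}
```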
If the memory is prefaulted, cudaHostRegister is really fast (far faster than cudaHostAlloc). So we use the background prefault of some headroom to completely hide this latency. The chunked transfer is also significantly faster than a pageable cudaMemcpyAsync.
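As a rough illustration of that path (not the actual implementation; the function name and error handling are placeholders), pinning an already-resident buffer and copying from it looks roughly like this:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Pin an already-prefaulted host buffer, upload it, then unpin.
// Registration is cheap here because no pages have to be faulted in.
int upload_via_registered(const void* host, std::size_t bytes, void* device,
                          cudaStream_t stream) {
    if (cudaHostRegister(const_cast<void*>(host), bytes,
                         cudaHostRegisterDefault) != cudaSuccess)
        return -1;

    // With a pinned source this is a real async DMA, instead of the driver
    // staging the data through a pageable bounce buffer behind the scenes.
    cudaError_t err = cudaMemcpyAsync(device, host, bytes,
                                      cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);               // finish before unpinning
    cudaHostUnregister(const_cast<void*>(host));
    return err == cudaSuccess ? 0 : -1;
}
```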
In addition to the above, the hostbuf API is also expanded to be growable, backed by the same prefaulter.
So the usage pattern is to use the new hostbuf APIs to create temporary buffers that are pinned, and to always transfer via that pinned memory.
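To make that pattern concrete, here is a self-contained toy version of such a pinned staging buffer. It is not the PR's hostbuf API; the class name and methods are invented for illustration, and the background prefault is collapsed into a direct cudaHostRegister for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

// Toy pinned staging buffer: hands out pinned space for temporary transfers.
class PinnedStaging {
public:
    explicit PinnedStaging(std::size_t capacity)
        : cap_(capacity), base_(static_cast<char*>(std::malloc(capacity))) {
        if (!base_) throw std::bad_alloc();
        // The real design prefaults this headroom on a background pool first;
        // here cudaHostRegister simply faults and pins in one step.
        if (cudaHostRegister(base_, cap_, cudaHostRegisterDefault) != cudaSuccess) {
            std::free(base_);
            throw std::runtime_error("cudaHostRegister failed");
        }
    }
    ~PinnedStaging() {
        cudaHostUnregister(base_);
        std::free(base_);
    }
    // Bump-allocate pinned space; a genuinely growable version would register
    // additional headroom here instead of failing.
    char* grow(std::size_t bytes) {
        if (used_ + bytes > cap_) throw std::runtime_error("staging exhausted");
        char* p = base_ + used_;
        used_ += bytes;
        return p;
    }
private:
    std::size_t cap_ = 0;
    std::size_t used_ = 0;
    char* base_ = nullptr;
};
```

Disk reads then land in space handed out by the staging buffer (e.g. via the chunked reader above), and uploads go to the device from that same pinned region.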
With the accompanying ComfyUI changes I get the following preliminary speedups.
Windows, RTX 5060, 64 GB, PCIe 4.0 x4 NVMe, LTX2.3 FP8 (no LoRA), 360p, first run on launch:
Before:
After:
Linux, RTX 5090, 96 GB, PCIe 5.0 x4 NVMe, LTX2.3 FP8 (no LoRA), 720p, first run on launch, caches cold:
Before:
After:
This is an Nsight trace of the model load on a warm load. The gaps between the green blocks are about what you would expect, given that the GPU can DMA faster than the memcpy into pinned memory.
More data to come on the Comfy side, along with how to use this in code.