Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117)#13802
Disable smart memory outright for non-dynamic models. This is a minor step towards deprecation of --disable-dynamic-vram and the legacy ModelPatcher. It is needed for estimate-free model development, where new models can opt out of supplying a memory estimate and not have to worry about hard VRAM allocations due to legacy non-dynamic model patchers. It is also a general stability improvement for the many stray use cases where estimates may still be off, and going forward we are not going to maintain such estimates accurately.
Use a single growable buffer so we can do threaded pre-warming on pinned memory.
Aimdo implements a faster threaded loader.
Introduce per-offload-stream HostBuffer reuse for pinned staging, include it in cast buffer reset synchronization. Defer actual casts that go via this pin path to a separate pass such that the buffer can be allocated monolithically (to avoid cudaHostRegister thrash).
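A minimal sketch of that two-pass idea, assuming hypothetical names (`deferred_cast_pass`, a plain pinned torch tensor standing in for the aimdo HostBuffer); the real buffer API differs:

```python
import torch

ALIGN = 256  # byte alignment so dtype views into the staging buffer are legal

def deferred_cast_pass(tensors, device):
    # Pass 1: just record sizes; defer the actual staging casts.
    sizes = [(t.numel() * t.element_size() + ALIGN - 1) // ALIGN * ALIGN
             for t in tensors]
    # One monolithic pinned allocation (one host-register call) for the pass,
    # instead of growing the buffer per tensor.
    staging = torch.empty(sum(sizes), dtype=torch.uint8, pin_memory=True)
    # Pass 2: stage through the buffer and kick off async GPU copies.
    out, offset = [], 0
    for t, padded in zip(tensors, sizes):
        nbytes = t.numel() * t.element_size()
        view = staging[offset:offset + nbytes].view(t.dtype).view(t.shape)
        view.copy_(t)                                   # CPU -> pinned staging
        out.append(view.to(device, non_blocking=True))  # pinned -> GPU DMA
        offset += padded
    return out
```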
Replace the predictive pin pressure mechanism with JIT pin memory pressure.
This was defeatured in the aimdo iteration.
Force-pushed from 1d792e1 to 44c0a06.
📝 Walkthrough
The PR rewrites prefetch_prepared_value (new signature: counter, destination, stream, copy) and updates low-VRAM patch prepare calls. It adds mmap dirtiness tracking (DIRTY_MMAPS), per-stream host pin buffers (STREAM_PIN_BUFFERS), and pin-budget eviction (ensure_pin_budget, get_pin_buffer, resize_pin_buffer) in model_management. Pinning APIs changed to subset-aware forms (get_pin, pin_memory) and unpin was removed. Cast and ops logic now use stream pin buffers and STREAM_PIN_BUFFER_HEADROOM. read_tensor_file_slice_into gained a hostbuf fast path; extra_ram_release accepts free_active. The Windows-specific module was removed; comfy-aimdo was bumped to 0.4.0. execution.py now uses psutil and frees pins on RAM shortfall.
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@comfy/model_management.py`:
- Around line 1193-1200: The call to offload_stream.synchronize() inside
get_pin_buffer causes unwanted stream-wide blocking when reusing
STREAM_PIN_BUFFERS and should be removed to allow overlapping disk reads,
pinned-memory staging and GPU transfers (matching the non-blocking pattern used
in get_aimdo_cast_buffer()); update get_pin_buffer to drop the
offload_stream.synchronize() call when reusing an existing buffer, or if
synchronization is truly required, document in a clear comment why it is
necessary and add a targeted synchronization mechanism (e.g., per-buffer
event/query) instead of synchronizing the entire offload_stream; reference
get_pin_buffer, STREAM_PIN_BUFFERS, offload_stream.synchronize(),
get_offload_stream and get_aimdo_cast_buffer when making the change.
In `@comfy/pinned_memory.py`:
- Around line 19-29: The HostBuffer.extend call is using a relative increment
but expects an absolute target size, causing truncation and out-of-bounds
slices; update the extend invocation in the block that sets module._pin so it
grows to the absolute new size (current hostbuf.size + requested size) instead
of passing size directly. Specifically, compute the absolute target (e.g.,
offset + size or hostbuf.size + size) and pass that to hostbuf.extend(size=...),
keeping the subsequent
comfy_aimdo.torch.hostbuf_to_tensor(hostbuf)[offset:offset+size] slice and the
assignment to module._pin and module._pin.untyped_storage()._comfy_hostbuf
unchanged. Ensure use of the same behavior as
resize_pin_buffer()/pin_buffer.extend(...).
In `@requirements.txt`:
- Line 26: The requirements file pins a non-existent package version
"comfy-aimdo==0.4.0" which will fail installs; update the dependency line for
the comfy-aimdo package (the literal "comfy-aimdo" entry) to a published version
such as "comfy-aimdo==0.3.0" or remove the strict pin (e.g.,
"comfy-aimdo>=0.3.0") so installs succeed until 0.4.0 is published.
📒 Files selected for processing (9)
- comfy/lora.py
- comfy/memory_management.py
- comfy/model_management.py
- comfy/model_patcher.py
- comfy/ops.py
- comfy/pinned_memory.py
- comfy/utils.py
- comfy/windows.py
- requirements.txt
💤 Files with no reviewable changes (2)
- comfy/windows.py
- comfy/utils.py
This was syncing with the offload stream, which itself is synced with the compute stream, so this was transitively syncing the CPU with compute. Define an event to sync it more gently.
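A minimal sketch of the gentler sync, using plain `torch.cuda.Event`; the `buf.ready` attribute is illustrative, not the actual comfy structure:

```python
import torch

def mark_buffer_ready(buf, offload_stream):
    # Record an event that fences only this buffer's pending work on the
    # offload stream, instead of synchronizing the whole stream later.
    buf.ready = torch.cuda.Event()
    buf.ready.record(offload_stream)

def reuse_buffer(buf):
    # The CPU waits only for this buffer's event, not for compute work the
    # offload stream is transitively synced with.
    if getattr(buf, "ready", None) is not None:
        buf.ready.synchronize()
    return buf
```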
Pinning is more important than inactive intermediates and the stream pin buffer is more important than even active intermediates.
Add back proper pin freeing on RAM pressure
Actionable comments posted: 1
🧹 Nitpick comments (1)
comfy/model_management.py (1)
1398-1399: 💤 Low value. Behavior change: `pin_memory` no longer hard-rejects above `MAX_PINNED_MEMORY`.
Previously the function returned `False` once `TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY`. Now it best-effort frees RAM cache + dynamic pins via `ensure_pin_budget(size)` and then unconditionally proceeds to `cudaHostRegister`. If `ensure_pin_budget` cannot make enough room (e.g., everything is the active stream pin, which the priority hierarchy keeps), the cap is silently exceeded — and `PIN_PRESSURE_HYSTERESIS` makes the actual ceiling effectively `MAX_PINNED_MEMORY + 128 MiB` even on the happy path.
This appears to be intentional given the PR's "pin priority hierarchy" notes, but it would be helpful to either (a) document `MAX_PINNED_MEMORY` as a soft target rather than a hard cap, or (b) bail when `ensure_pin_budget` couldn't free the shortfall, so we don't quietly drift past the user-visible budget on systems with marginal RAM.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@comfy/model_management.py`:
- Around line 516-525: In free_pins(), avoid indexing into
model.model.dynamic_pins[model.load_device] directly; instead safely
.get(model.load_device) and check for None before accessing ["active"] so
incomplete initialization won't raise; update the condition in free_pins
(referencing current_loaded_models, loaded_model.model, model.is_dynamic(),
model.partially_unload_ram) and apply the same defensive .get() pattern to the
other two occurrences noted around lines where dynamic pin checks occur (the
other methods that reference model.model.dynamic_pins[...] at the end of file)
to ensure you only call partially_unload_ram when pin_state is present and its
"active" flag is evaluated.
---
Nitpick comments:
In `@comfy/model_management.py`:
- Around line 1398-1399: pin_memory's behavior changed to allow exceeding
MAX_PINNED_MEMORY because it calls ensure_pin_budget(size) then proceeds to
cudaHostRegister regardless; update pin_memory to verify the budget after
calling ensure_pin_budget and bail if the shortfall remains: after
ensure_pin_budget(size) check if TOTAL_PINNED_MEMORY + size <= MAX_PINNED_MEMORY
(or account for PIN_PRESSURE_HYSTERESIS if intended) and if not, return False
(or raise) instead of proceeding to cudaHostRegister so the cap remains a hard
limit; reference the pin_memory function, ensure_pin_budget,
TOTAL_PINNED_MEMORY, MAX_PINNED_MEMORY, PIN_PRESSURE_HYSTERESIS, and
cudaHostRegister when making the change.
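For illustration, the suggested guard could look roughly like the following; `TOTAL_PINNED_MEMORY`, `MAX_PINNED_MEMORY` and `ensure_pin_budget` are the names the review cites, treated here as module globals, and the signature is a placeholder, not the actual comfy API:

```python
def pin_memory(module, size):  # placeholder signature
    ensure_pin_budget(size)  # best-effort: evicts lower-priority pins
    if TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY:
        return False  # shortfall remains: keep the cap hard, don't register
    ...  # proceed to cudaHostRegister as before
```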
📒 Files selected for processing (5)
- comfy/memory_management.py
- comfy/model_management.py
- comfy/ops.py
- comfy/pinned_memory.py
- execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy/ops.py
- comfy/pinned_memory.py
```python
def free_pins(size, evict_active=False):
    if size <= 0:
        return

    for loaded_model in reversed(current_loaded_models):
        model = loaded_model.model
        if model is not None and model.is_dynamic() and (evict_active or not model.model.dynamic_pins[model.load_device]["active"]):
            size -= model.partially_unload_ram(size)
            if size <= 0:
                break
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Confirm dynamic_pins is always populated for load_device before any code path that might
# trigger ensure_pin_budget/free_pins.
ast-grep --pattern 'dynamic_pins[$_] = $_'
rg -nP --type=py -C3 '\bdynamic_pins\b'
```
Repository: Comfy-Org/ComfyUI
Length of output: 3537
🏁 Script executed:
```bash
#!/bin/bash
# Understand the class structure where dynamic_pins is initialized
cd comfy
# Find the class containing the constructor with the initialization
rg -B20 'if self.load_device not in self.model.dynamic_pins:' model_patcher.py | head -40
# Check what function/method contains the problematic line 522
rg -B10 'def free_pins' model_management.py
# Find where current_loaded_models is populated
rg 'current_loaded_models.append|current_loaded_models +=' model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 1511
🏁 Script executed:
```bash
#!/bin/bash
# Find is_dynamic() method definition
rg 'def is_dynamic' comfy/
# Check where models are added to current_loaded_models with more context
rg -B5 -A5 'current_loaded_models.append|current_loaded_models +=' comfy/model_management.py | head -50
# Check if ModelPatcherDynamic is the only class with is_dynamic
rg 'class.*ModelPatcher' comfy/model_patcher.py
# Look for any guard/check before accessing dynamic_pins in other places
rg -C2 'is_dynamic\(\)' comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 1513
🏁 Script executed:
```bash
#!/bin/bash
# Get full definition of is_dynamic methods
rg -A5 'def is_dynamic' comfy/model_patcher.py
# Check the LoadedModel class and when models are added
rg -B10 -A10 'class LoadedModel' comfy/model_management.py
# Find where models are loaded and added to current_loaded_models
rg -B8 -A8 'current_loaded_models.append.*LoadedModel' comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 969
🏁 Script executed:
```bash
#!/bin/bash
# Find where models get loaded/added to current_loaded_models
rg -B15 'loaded_model = LoadedModel' comfy/model_management.py | head -50
# Check for load_model function and how models are prepared
rg 'def load_model' comfy/model_management.py
# Look for any git history or PR context - check if this is new code
git log --oneline comfy/model_management.py 2>/dev/null | head -5 || echo "Git log not available"
git diff HEAD^ comfy/model_management.py 2>/dev/null | grep -A10 -B10 "free_pins\|dynamic_pins" | head -50 || echo "Git diff not available"
```
Repository: Comfy-Org/ComfyUI
Length of output: 656
🏁 Script executed:
```bash
#!/bin/bash
# Check if there's any code that changes load_device after construction
rg 'load_device\s*=' comfy/model_patcher.py | head -20
# Look for any re-attachment or device change logic
rg 'attach|detach|reload' comfy/model_patcher.py | grep -i 'def '
# Check if dynamic_pins gets updated for new devices anywhere
rg -B5 -A5 'dynamic_pins\[.*\] = {' comfy/
# Look at the full context of line 522 and around it
sed -n '510,535p' comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 1454
🏁 Script executed:
```bash
#!/bin/bash
# Fix the regex and search for dynamic_pins initialization pattern
rg -C5 'dynamic_pins\[' comfy/model_patcher.py | grep -E '(def |class |dynamic_pins)'
# Look for any code that might change load_device or reuse models across devices
rg 'model\.load_device' comfy/model_management.py | head -20
# Check if there's any scenario where a model might be used with different load_devices
rg -B10 'for loaded_model in.*current_loaded_models' comfy/model_management.py | head -40
# Search for attach/detach patterns that might change device association
rg -B3 -A8 'def.*attach|def.*prepare' comfy/model_patcher.py | head -60
```
Repository: Comfy-Org/ComfyUI
Length of output: 3858
🏁 Script executed:
```bash
#!/bin/bash
# Check if load_device can change or if models can be reused across devices
rg -B5 -A5 'load_device.*=' comfy/model_patcher.py | grep -A8 -B8 '__init__'
# Search for any scenario where a model might be attached/prepared with a new device
rg -C10 'load_device\s*=' comfy/model_management.py
# Check the git log to understand when this code was added
git log --all --full-history -p -- comfy/model_management.py 2>/dev/null | grep -A30 'def free_pins' | head -50 || echo "Checking commit history..."
# Look for any recent changes to dynamic_pins handling
git show HEAD:comfy/model_management.py 2>/dev/null | sed -n '516,525p' || git log -1 --format=%h comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 4627
🏁 Script executed:
```bash
#!/bin/bash
# Verify if load_device is ever reassigned after __init__
rg 'self\.load_device\s*=' comfy/model_patcher.py
# Check if there's any code path where a model could have load_device changed
git log -1 --format="%H %s" comfy/model_management.py
# Verify the exact changes in this commit
git show --stat comfy/model_management.py 2>/dev/null | head -20
# Check if prepare() method modifies load_device or affects dynamic_pins initialization
rg -B5 -A15 'def prepare\(' comfy/model_patcher.py | head -40
```
Repository: Comfy-Org/ComfyUI
Length of output: 1494
🏁 Script executed:
```bash
#!/bin/bash
# Final check: verify there's no other code path that could create a dynamic model without full initialization
rg 'is_dynamic\s*=\s*True|class.*Dynamic' comfy/ --type py
# Confirm that every place accessing dynamic_pins for write operations has it initialized
rg 'dynamic_pins\[.*\]\s*=' comfy/
```
Repository: Comfy-Org/ComfyUI
Length of output: 533
Use .get() to safely access dynamic_pins[load_device] in case initialization is incomplete.
While the current initialization logic in ModelPatcherDynamic.__init__() should populate dynamic_pins[load_device] before any dynamic model reaches free_pins(), defensive access is prudent in this RAM-pressure code path. Consider:
```python
if model is not None and model.is_dynamic():
    pin_state = model.model.dynamic_pins.get(model.load_device)
    if pin_state is not None and (evict_active or not pin_state["active"]):
        size -= model.partially_unload_ram(size)
        if size <= 0:
            break
```
This same pattern should also be applied at lines 1249 and 1251 in the same file for consistency.
Uncap the Windows pins from 50% by extending the pool, and add a pressure mechanism to move the pin reservations on demand. This unfortunately implies a GPU sync to do the freeing, so significant hysteresis needs to be added to consolidate these pressure events.
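A minimal sketch of that hysteresis idea, not the comfy code; the 128 MiB figure matches the `PIN_PRESSURE_HYSTERESIS` value quoted in the review above, and `free_pins` stands in for the eviction path:

```python
import torch

PIN_PRESSURE_HYSTERESIS = 128 * 1024 * 1024  # 128 MiB, per the review note

def on_pin_pressure(shortfall):
    # Freeing pin reservations implies a GPU sync, so consolidate pressure
    # events by freeing well past the immediate shortfall.
    torch.cuda.synchronize()
    free_pins(shortfall + PIN_PRESSURE_HYSTERESIS)
```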
Limit the amount of RAM background intermediates can use, so that switching workflows doesn't degrade performance too much.
threaded loader 2 + ram cache (CORE-43,CORE-117)
Flux 2 gets crazy stalls: its mix of tiny and giant weights creates lopsided stream buffer rotations.
Aimdo is able to chunk and slice these transfers on the way over for better CPU->GPU overlap. The main advantage is the ability to shorten the bus contention window between the previous weight transfer and the next weight's vbar fault.
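A toy sketch of the chunking idea (not the aimdo kernel): splitting one giant weight's host-to-GPU copy into slices keeps any single bus occupancy window short. `CHUNK` is an arbitrary illustrative size:

```python
import torch

CHUNK = 1 << 24  # elements per slice (arbitrary)

def chunked_h2d(pinned_src, gpu_dst, stream):
    # Issue the copy as many small slices; the bus is never held for the
    # whole weight, so the next weight's work can slot in sooner.
    src, dst = pinned_src.view(-1), gpu_dst.view(-1)
    with torch.cuda.stream(stream):
        for off in range(0, src.numel(), CHUNK):
            dst[off:off + CHUNK].copy_(src[off:off + CHUNK], non_blocking=True)
```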


DGX Spark users, please try this and comment with your results - I am getting major improvements in load behaviour.
Switch to this ComfyUI PR's branch, also available here: https://github.com/rattus128/ComfyUI/tree/dev/threaded-loader
Don't forget to update your pip packages from requirements.txt.
Some people have so little RAM they can't fit even a single large model in it.
Others have systems that are so fast, loading the model cold from disk is by far the slowest part of their comfy experience.
So make loading models from disk a lot faster.
Modern NVMe disks require a small fleet of CPU threads to actually saturate their read bandwidth. At the same time, we are using mmap + cudaMemcpy(pageable), which is single-threaded, per-page synchronous faulting. This limits progress to one disk thread serialized with OS disk activity to service mmap page faults, which is pretty slow. Here is current comfy loading a model (this profile is from when the page cache is hot):
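A small, hedged benchmark sketch of the difference the profile shows, contrasting a pageable host-to-GPU copy with a pinned one (sizes illustrative):

```python
import time
import torch

def time_h2d(src, device="cuda"):
    torch.cuda.synchronize()
    start = time.perf_counter()
    src.to(device, non_blocking=True)
    torch.cuda.synchronize()
    return time.perf_counter() - start

weight = torch.randn(1 << 26)             # ~256 MiB, pageable host memory
print("pageable:", time_h2d(weight))      # synchronous, page-fault bound
print("pinned:  ", time_h2d(weight.pin_memory()))  # true async DMA path
```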
How to interpret this (left to right):
The bright green is unpinned transfer - these are slow, as they imply full CPU synchronization with a (presumably) single-threaded memcpy (notice the CUDA API track with the long red calls to cuMemcpy).
After that there is the cuMemAllocHost to allocate pinned memory, which is not a lot better. A huge amount of time is tied up in the ioctl for the allocation, and it turns out this is delayed because the kernel needs to prefault the pinned range synchronously and single-threaded. It takes more than double the time of actually copying the weight from the page cache (the following read()).
The time to copy from the pinned memory to GPU is then tiny (the darker green).
The blue is the only time the GPU is doing computational work.
So let's fix it by dumping mmap and torch.to completely, and instead implement a fleet of threads to do the whole lot in comfy-aimdo. This will be released in comfy-aimdo 0.4.0 (CORE-43).
Comfy-Org/comfy-aimdo#46
How it works:
Transfers to the GPU always go via pinned memory allocated via the aimdo HostBuffer API. HostBuffers that are expected to grow (like the set of pins for an active model over the course of model init) have a speculative prefetcher, which will prefault the pages for the subsequent allocations. This keeps single-threaded cuMemHostRegister from assuming the page-faulting burden.
When allocating in the growing hostbuf, cuMemHostRegister should be fast if the RAM is already prefaulted (confirmed with extensive experiments).
To actually copy the data, a straight chunked multi-threaded cuMemcpy is used (see the sketch after this list).
Non-blocking GPU transfer is then used from there.
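A toy sketch of the overall pattern, using PyTorch-level stand-ins rather than the comfy-aimdo internals; the real system reuses a growable, prefaulted HostBuffer instead of allocating pinned memory per weight:

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def load_all(cpu_weights, device, threads=8):
    # One worker pool for the staging memcpys, one stream for the DMA.
    pool = ThreadPoolExecutor(threads)
    stream = torch.cuda.Stream()
    out = []
    for weight in cpu_weights:  # contiguous tensors read from disk
        pinned = torch.empty_like(weight, pin_memory=True)
        # Chunked multi-threaded copy into pinned staging.
        pairs = zip(torch.chunk(pinned.view(-1), threads),
                    torch.chunk(weight.view(-1), threads))
        list(pool.map(lambda p: p[0].copy_(p[1]), pairs))
        with torch.cuda.stream(stream):
            # Non-blocking DMA: the next iteration's disk read and staging
            # overlap this transfer.
            out.append(pinned.to(device, non_blocking=True))
    stream.synchronize()
    return out
```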
Here is what it looks like after:
TLDR: Denser blue. Denser dark green. No bright green. Smaller gaps == Faster load.
The pthread_cond_wait() is the main thread parking itself as it waits for the fleet of copy threads to do the copy. Notice the cond_wait is overlapping the GPU transfer (dark green), so it is effectively using non-blocking GPU DMA to read the next weight from the disk as the previous one copies. The goal is as much time reading from disk as possible (more pthread_cond_wait == good).
Finer details and other changes:
Even for non-offloaded weights, this system is still used. The offload streams now have an associated reusable pin_stream_buffer that each transfer can use to stage and transfer weights. This is used for first-step loads, or when normal pinned memory is fully exhausted.
This created the need for a deeper priority hierarchy for pinned memory with JIT memory-pressure release. The stream pin buffer is top priority and will trash all other pins to make space on demand. Active models then take precedence over inactive models. This also fixes a bug where pins were not being freed from inactive models on Windows (CORE-164). Since model loading now fully bypasses the mmap, there is no need for the RAM-based pressure mechanism on model load, so it is removed along with the pin memory pressure mechanism. Also clean up the now-unused Windows-specific memory logic.
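A hypothetical sketch of that hierarchy; names and structure are illustrative, not the model_management implementation:

```python
# Eviction order on pin pressure: lowest priority first. The stream pin
# buffer is top priority and is never evicted; it trashes others instead.
EVICTION_ORDER = ["inactive_model", "active_model"]

def make_space(request_bytes, pins_by_priority):
    freed = 0
    for level in EVICTION_ORDER:
        for pin in pins_by_priority.get(level, []):
            freed += pin.release()
            if freed >= request_bytes:
                return freed
    return freed  # may fall short; the caller decides whether to proceed
```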
Making this new pinning logic interact with regular ModelPatcher pins is awkward, so we take a first step towards deprecating non-dynamic VRAM by switching off smart memory completely for non-dynamic models. Going forward, new models should not need VRAM estimates (dynamic VRAM handles this dynamically), so we avoid needing such estimates while solving our pin interaction problem (CORE-152).
Performance following this enhancement was then immediately gated on Loras still being unpinned, so we follow up with general pinned-memory support for Loras, which was deferred in the original Lora async work in #13618 (CORE-165).
Example test conditions:
Windows RTX5060, 16GB RAM, PCIE4x4 NVME
LTX2.3 FP8 template workflow (with Lora), 720x520x121f
Before:
After:
When running without the lora:
LTX2.3 FP8 T2V 360Px121f, Windows, RTX5060, 16GB RAM
Before:
After:
Example test conditions:
Linux, DGX spark
Flux2 FP8 in 3 different quants (forces 3x models)
Before (step speed ok but huge time spent in loading):
Before, with --disable-dynamic-vram (OOM-killed):
After:
More test data to come.