Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117)#13802
Disable smart memory outright for non-dynamic models. This is a minor step towards deprecation of --disable-dynamic-vram and the legacy ModelPatcher. It is needed for estimate-free model development, where new models can opt out of supplying a memory estimate and not have to worry about hard VRAM allocations due to legacy non-dynamic model patchers. It is also a general stability improvement for the many stray use cases where estimates may still be off, and going forward we are not going to maintain such estimates accurately.
Use a single growable buffer so we can do threaded pre-warming on pinned memory.
Aimdo implements a faster threaded loader.
Introduce per-offload-stream HostBuffer reuse for pinned staging, include it in cast buffer reset synchronization. Defer actual casts that go via this pin path to a separate pass such that the buffer can be allocated monolithically (to avoid cudaHostRegister thrash).
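A minimal sketch of that two-pass idea, assuming hypothetical names (`deferred_cast_pass`, a plain pinned torch tensor standing in for the aimdo HostBuffer); the real buffer API differs:

```python
import torch

ALIGN = 256  # byte alignment so dtype views into the staging buffer are legal

def deferred_cast_pass(tensors, device):
    # Pass 1: just record sizes; defer the actual staging casts.
    sizes = [(t.numel() * t.element_size() + ALIGN - 1) // ALIGN * ALIGN
             for t in tensors]
    # One monolithic pinned allocation (one host-register call) for the pass,
    # instead of growing the buffer per tensor.
    staging = torch.empty(sum(sizes), dtype=torch.uint8, pin_memory=True)
    # Pass 2: stage through the buffer and kick off async GPU copies.
    out, offset = [], 0
    for t, padded in zip(tensors, sizes):
        nbytes = t.numel() * t.element_size()
        view = staging[offset:offset + nbytes].view(t.dtype).view(t.shape)
        view.copy_(t)                                   # CPU -> pinned staging
        out.append(view.to(device, non_blocking=True))  # pinned -> GPU DMA
        offset += padded
    return out
```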
Replace the predictive pin pressure mechanism with JIT pin memory pressure.
This was defeatured in the aimdo iteration.
Force-pushed from 1d792e1 to 44c0a06.
📝 Walkthrough
The PR rewrites prefetch_prepared_value (new signature: counter, destination, stream, copy) and updates low-VRAM patch prepare calls. It adds mmap dirtiness tracking (DIRTY_MMAPS), per-stream host pin buffers (STREAM_PIN_BUFFERS), and pin-budget eviction (ensure_pin_budget, get_pin_buffer, resize_pin_buffer) in model_management. Pinning APIs changed to subset-aware forms (get_pin, pin_memory) and unpin was removed. Cast and ops logic now use stream pin buffers and STREAM_PIN_BUFFER_HEADROOM. read_tensor_file_slice_into gained a hostbuf fast path; extra_ram_release accepts free_active. The Windows-specific module was removed; comfy-aimdo was bumped to 0.4.0. execution.py now uses psutil and frees pins on RAM shortfall.
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@comfy/model_management.py`:
- Around line 1193-1200: The call to offload_stream.synchronize() inside
get_pin_buffer causes unwanted stream-wide blocking when reusing
STREAM_PIN_BUFFERS and should be removed to allow overlapping disk reads,
pinned-memory staging and GPU transfers (matching the non-blocking pattern used
in get_aimdo_cast_buffer()); update get_pin_buffer to drop the
offload_stream.synchronize() call when reusing an existing buffer, or if
synchronization is truly required, document in a clear comment why it is
necessary and add a targeted synchronization mechanism (e.g., per-buffer
event/query) instead of synchronizing the entire offload_stream; reference
get_pin_buffer, STREAM_PIN_BUFFERS, offload_stream.synchronize(),
get_offload_stream and get_aimdo_cast_buffer when making the change.
In `@comfy/pinned_memory.py`:
- Around line 19-29: The HostBuffer.extend call is using a relative increment
but expects an absolute target size, causing truncation and out-of-bounds
slices; update the extend invocation in the block that sets module._pin so it
grows to the absolute new size (current hostbuf.size + requested size) instead
of passing size directly. Specifically, compute the absolute target (e.g.,
offset + size or hostbuf.size + size) and pass that to hostbuf.extend(size=...),
keeping the subsequent
comfy_aimdo.torch.hostbuf_to_tensor(hostbuf)[offset:offset+size] slice and the
assignment to module._pin and module._pin.untyped_storage()._comfy_hostbuf
unchanged. Ensure use of the same behavior as
resize_pin_buffer()/pin_buffer.extend(...).
In `@requirements.txt`:
- Line 26: The requirements file pins a non-existent package version
"comfy-aimdo==0.4.0" which will fail installs; update the dependency line for
the comfy-aimdo package (the literal "comfy-aimdo" entry) to a published version
such as "comfy-aimdo==0.3.0" or remove the strict pin (e.g.,
"comfy-aimdo>=0.3.0") so installs succeed until 0.4.0 is published.
📒 Files selected for processing (9)
- comfy/lora.py
- comfy/memory_management.py
- comfy/model_management.py
- comfy/model_patcher.py
- comfy/ops.py
- comfy/pinned_memory.py
- comfy/utils.py
- comfy/windows.py
- requirements.txt
💤 Files with no reviewable changes (2)
- comfy/windows.py
- comfy/utils.py
This was syncing with the offload stream, which itself is synced with the compute stream, so this was transitively syncing the CPU with compute. Define an event to sync it more gently.
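A minimal sketch of the gentler sync, using plain `torch.cuda.Event`; the `buf.ready` attribute is illustrative, not the actual comfy structure:

```python
import torch

def mark_buffer_ready(buf, offload_stream):
    # Record an event that fences only this buffer's pending work on the
    # offload stream, instead of synchronizing the whole stream later.
    buf.ready = torch.cuda.Event()
    buf.ready.record(offload_stream)

def reuse_buffer(buf):
    # The CPU waits only for this buffer's event, not for compute work the
    # offload stream is transitively synced with.
    if getattr(buf, "ready", None) is not None:
        buf.ready.synchronize()
    return buf
```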
Pinning is more important than inactive intermediates and the stream pin buffer is more important than even active intermediates.
Add back proper pin freeing on RAM pressure
Actionable comments posted: 1
🧹 Nitpick comments (1)
comfy/model_management.py (1)
1398-1399: 💤 Low value. Behavior change: `pin_memory` no longer hard-rejects above `MAX_PINNED_MEMORY`.
Previously the function returned `False` once `TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY`. Now it best-effort frees RAM cache + dynamic pins via `ensure_pin_budget(size)` and then unconditionally proceeds to `cudaHostRegister`. If `ensure_pin_budget` cannot make enough room (e.g., everything is the active stream pin, which the priority hierarchy keeps), the cap is silently exceeded — and `PIN_PRESSURE_HYSTERESIS` makes the actual ceiling effectively `MAX_PINNED_MEMORY + 128 MiB` even on the happy path.
This appears to be intentional given the PR's "pin priority hierarchy" notes, but it would be helpful to either (a) document `MAX_PINNED_MEMORY` as a soft target rather than a hard cap, or (b) bail when `ensure_pin_budget` couldn't free the shortfall, so we don't quietly drift past the user-visible budget on systems with marginal RAM.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@comfy/model_management.py`:
- Around line 516-525: In free_pins(), avoid indexing into
model.model.dynamic_pins[model.load_device] directly; instead safely
.get(model.load_device) and check for None before accessing ["active"] so
incomplete initialization won't raise; update the condition in free_pins
(referencing current_loaded_models, loaded_model.model, model.is_dynamic(),
model.partially_unload_ram) and apply the same defensive .get() pattern to the
other two occurrences noted around lines where dynamic pin checks occur (the
other methods that reference model.model.dynamic_pins[...] at the end of file)
to ensure you only call partially_unload_ram when pin_state is present and its
"active" flag is evaluated.
---
Nitpick comments:
In `@comfy/model_management.py`:
- Around line 1398-1399: pin_memory's behavior changed to allow exceeding
MAX_PINNED_MEMORY because it calls ensure_pin_budget(size) then proceeds to
cudaHostRegister regardless; update pin_memory to verify the budget after
calling ensure_pin_budget and bail if the shortfall remains: after
ensure_pin_budget(size) check if TOTAL_PINNED_MEMORY + size <= MAX_PINNED_MEMORY
(or account for PIN_PRESSURE_HYSTERESIS if intended) and if not, return False
(or raise) instead of proceeding to cudaHostRegister so the cap remains a hard
limit; reference the pin_memory function, ensure_pin_budget,
TOTAL_PINNED_MEMORY, MAX_PINNED_MEMORY, PIN_PRESSURE_HYSTERESIS, and
cudaHostRegister when making the change.
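For illustration, the suggested guard could look roughly like the following; `TOTAL_PINNED_MEMORY`, `MAX_PINNED_MEMORY` and `ensure_pin_budget` are the names the review cites, treated here as module globals, and the signature is a placeholder, not the actual comfy API:

```python
def pin_memory(module, size):  # placeholder signature
    ensure_pin_budget(size)  # best-effort: evicts lower-priority pins
    if TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY:
        return False  # shortfall remains: keep the cap hard, don't register
    ...  # proceed to cudaHostRegister as before
```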
📒 Files selected for processing (5)
- comfy/memory_management.py
- comfy/model_management.py
- comfy/ops.py
- comfy/pinned_memory.py
- execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy/ops.py
- comfy/pinned_memory.py
```python
def free_pins(size, evict_active=False):
    if size <= 0:
        return

    for loaded_model in reversed(current_loaded_models):
        model = loaded_model.model
        if model is not None and model.is_dynamic() and (evict_active or not model.model.dynamic_pins[model.load_device]["active"]):
            size -= model.partially_unload_ram(size)
            if size <= 0:
                break
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Confirm dynamic_pins is always populated for load_device before any code path that might
# trigger ensure_pin_budget/free_pins.
ast-grep --pattern 'dynamic_pins[$_] = $_'
rg -nP --type=py -C3 '\bdynamic_pins\b'
```
Repository: Comfy-Org/ComfyUI
Length of output: 3537
🏁 Script executed:
```bash
#!/bin/bash
# Understand the class structure where dynamic_pins is initialized
cd comfy
# Find the class containing the constructor with the initialization
rg -B20 'if self.load_device not in self.model.dynamic_pins:' model_patcher.py | head -40
# Check what function/method contains the problematic line 522
rg -B10 'def free_pins' model_management.py
# Find where current_loaded_models is populated
rg 'current_loaded_models.append|current_loaded_models +=' model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 1511
🏁 Script executed:
```bash
#!/bin/bash
# Find is_dynamic() method definition
rg 'def is_dynamic' comfy/
# Check where models are added to current_loaded_models with more context
rg -B5 -A5 'current_loaded_models.append|current_loaded_models +=' comfy/model_management.py | head -50
# Check if ModelPatcherDynamic is the only class with is_dynamic
rg 'class.*ModelPatcher' comfy/model_patcher.py
# Look for any guard/check before accessing dynamic_pins in other places
rg -C2 'is_dynamic\(\)' comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 1513
🏁 Script executed:
```bash
#!/bin/bash
# Get full definition of is_dynamic methods
rg -A5 'def is_dynamic' comfy/model_patcher.py
# Check the LoadedModel class and when models are added
rg -B10 -A10 'class LoadedModel' comfy/model_management.py
# Find where models are loaded and added to current_loaded_models
rg -B8 -A8 'current_loaded_models.append.*LoadedModel' comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 969
🏁 Script executed:
```bash
#!/bin/bash
# Find where models get loaded/added to current_loaded_models
rg -B15 'loaded_model = LoadedModel' comfy/model_management.py | head -50
# Check for load_model function and how models are prepared
rg 'def load_model' comfy/model_management.py
# Look for any git history or PR context - check if this is new code
git log --oneline comfy/model_management.py 2>/dev/null | head -5 || echo "Git log not available"
git diff HEAD^ comfy/model_management.py 2>/dev/null | grep -A10 -B10 "free_pins\|dynamic_pins" | head -50 || echo "Git diff not available"
```
Repository: Comfy-Org/ComfyUI
Length of output: 656
🏁 Script executed:
```bash
#!/bin/bash
# Check if there's any code that changes load_device after construction
rg 'load_device\s*=' comfy/model_patcher.py | head -20
# Look for any re-attachment or device change logic
rg 'attach|detach|reload' comfy/model_patcher.py | grep -i 'def '
# Check if dynamic_pins gets updated for new devices anywhere
rg -B5 -A5 'dynamic_pins\[.*\] = {' comfy/
# Look at the full context of line 522 and around it
sed -n '510,535p' comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 1454
🏁 Script executed:
```bash
#!/bin/bash
# Fix the regex and search for dynamic_pins initialization pattern
rg -C5 'dynamic_pins\[' comfy/model_patcher.py | grep -E '(def |class |dynamic_pins)'
# Look for any code that might change load_device or reuse models across devices
rg 'model\.load_device' comfy/model_management.py | head -20
# Check if there's any scenario where a model might be used with different load_devices
rg -B10 'for loaded_model in.*current_loaded_models' comfy/model_management.py | head -40
# Search for attach/detach patterns that might change device association
rg -B3 -A8 'def.*attach|def.*prepare' comfy/model_patcher.py | head -60
```
Repository: Comfy-Org/ComfyUI
Length of output: 3858
🏁 Script executed:
```bash
#!/bin/bash
# Check if load_device can change or if models can be reused across devices
rg -B5 -A5 'load_device.*=' comfy/model_patcher.py | grep -A8 -B8 '__init__'
# Search for any scenario where a model might be attached/prepared with a new device
rg -C10 'load_device\s*=' comfy/model_management.py
# Check the git log to understand when this code was added
git log --all --full-history -p -- comfy/model_management.py 2>/dev/null | grep -A30 'def free_pins' | head -50 || echo "Checking commit history..."
# Look for any recent changes to dynamic_pins handling
git show HEAD:comfy/model_management.py 2>/dev/null | sed -n '516,525p' || git log -1 --format=%h comfy/model_management.py
```
Repository: Comfy-Org/ComfyUI
Length of output: 4627
🏁 Script executed:
```bash
#!/bin/bash
# Verify if load_device is ever reassigned after __init__
rg 'self\.load_device\s*=' comfy/model_patcher.py
# Check if there's any code path where a model could have load_device changed
git log -1 --format="%H %s" comfy/model_management.py
# Verify the exact changes in this commit
git show --stat comfy/model_management.py 2>/dev/null | head -20
# Check if prepare() method modifies load_device or affects dynamic_pins initialization
rg -B5 -A15 'def prepare\(' comfy/model_patcher.py | head -40
```
Repository: Comfy-Org/ComfyUI
Length of output: 1494
🏁 Script executed:
```bash
#!/bin/bash
# Final check: verify there's no other code path that could create a dynamic model without full initialization
rg 'is_dynamic\s*=\s*True|class.*Dynamic' comfy/ --type py
# Confirm that every place accessing dynamic_pins for write operations has it initialized
rg 'dynamic_pins\[.*\]\s*=' comfy/
```
Repository: Comfy-Org/ComfyUI
Length of output: 533
Use .get() to safely access dynamic_pins[load_device] in case initialization is incomplete.
While the current initialization logic in ModelPatcherDynamic.__init__() should populate dynamic_pins[load_device] before any dynamic model reaches free_pins(), defensive access is prudent in this RAM-pressure code path. Consider:
```python
if model is not None and model.is_dynamic():
    pin_state = model.model.dynamic_pins.get(model.load_device)
    if pin_state is not None and (evict_active or not pin_state["active"]):
        size -= model.partially_unload_ram(size)
        if size <= 0:
            break
```
This same pattern should also be applied at lines 1249 and 1251 in the same file for consistency.
Uncap the Windows pins from 50% by extending the pool, and add a pressure mechanism to move the pin reservations on demand. This unfortunately implies a GPU sync to do the freeing, so significant hysteresis needs to be added to consolidate these pressure events.
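A minimal sketch of that hysteresis idea, not the comfy code; the 128 MiB figure matches the `PIN_PRESSURE_HYSTERESIS` value quoted in the review above, and `free_pins` stands in for the eviction path:

```python
import torch

PIN_PRESSURE_HYSTERESIS = 128 * 1024 * 1024  # 128 MiB, per the review note

def on_pin_pressure(shortfall):
    # Freeing pin reservations implies a GPU sync, so consolidate pressure
    # events by freeing well past the immediate shortfall.
    torch.cuda.synchronize()
    free_pins(shortfall + PIN_PRESSURE_HYSTERESIS)
```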
Limit the amount of RAM background intermediates can use, so that switching workflows doesn't degrade performance too much.
threaded loader 2 + ram cache (CORE-43,CORE-117)
Flux 2 gets crazy stalls: its mix of tiny and giant weights creates lopsided stream buffer rotations.
Aimdo is able to chunk and slice these transfers on the way over for better CPU->GPU overlap. The main advantage is the ability to shorten the bus contention window between the previous weight transfer and the next weight's vbar fault.
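A toy sketch of the chunking idea (not the aimdo kernel): splitting one giant weight's host-to-GPU copy into slices keeps any single bus occupancy window short. `CHUNK` is an arbitrary illustrative size:

```python
import torch

CHUNK = 1 << 24  # elements per slice (arbitrary)

def chunked_h2d(pinned_src, gpu_dst, stream):
    # Issue the copy as many small slices; the bus is never held for the
    # whole weight, so the next weight's work can slot in sooner.
    src, dst = pinned_src.view(-1), gpu_dst.view(-1)
    with torch.cuda.stream(stream):
        for off in range(0, src.numel(), CHUNK):
            dst[off:off + CHUNK].copy_(src[off:off + CHUNK], non_blocking=True)
```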


DGX Spark users, please try this and comment with your results - I am getting major improvements in load behaviour.
Switch to this ComfyUI PR's branch, also available here: https://github.com/rattus128/ComfyUI/tree/dev/threaded-loader
Don't forget to update your pip packages from requirements.txt.
Some people have so little RAM they can't fit even a single large model in it.
Others have systems that are so fast, loading the model cold from disk is by far the slowest part of their comfy experience.
So make loading models from disk a lot faster.
Modern NVMe disks require a small fleet of CPU threads to actually saturate their read bandwidth. At the same time, we are using mmap + cudaMemcpy(pageable), which is single-threaded, per-page synchronous faulting. This limits progress to one disk thread serialized with OS disk activity to service mmap page faults, which is pretty slow. Here is current comfy loading a model (this profile is from when the page cache is hot):
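A small, hedged benchmark sketch of the difference the profile shows, contrasting a pageable host-to-GPU copy with a pinned one (sizes illustrative):

```python
import time
import torch

def time_h2d(src, device="cuda"):
    torch.cuda.synchronize()
    start = time.perf_counter()
    src.to(device, non_blocking=True)
    torch.cuda.synchronize()
    return time.perf_counter() - start

weight = torch.randn(1 << 26)             # ~256 MiB, pageable host memory
print("pageable:", time_h2d(weight))      # synchronous, page-fault bound
print("pinned:  ", time_h2d(weight.pin_memory()))  # true async DMA path
```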
How to interpret this (left to right):
The bright green is unpinned transfer - these are slow, as they imply full CPU synchronization with a (presumably) single-threaded memcpy (notice the CUDA API track with the long red calls to cuMemcpy).
After that there is the cuMemAllocHost to allocate pinned memory, which is not a lot better. A huge amount of time is tied up in the ioctl for the allocation, and it turns out this is delayed because the kernel needs to prefault the pinned range synchronously and single-threaded. It takes more than double the time of actually copying the weight from the page cache (the following read()).
The time to copy from the pinned memory to GPU is then tiny (the darker green).
The blue is the only time the GPU is doing computational work.
So let's fix it by dumping mmap and torch.to completely, and instead implement a fleet of threads to do the whole lot in comfy-aimdo. This will be released in comfy-aimdo 0.4.0 (CORE-43).
Comfy-Org/comfy-aimdo#46
How it works:
Transfers to the GPU always go via pinned memory allocated via the aimdo HostBuffer API. HostBuffers that are expected to grow (like the set of pins for an active model over the course of model init) have a speculative prefetcher, which will prefault the pages for the subsequent allocations. This keeps single-threaded cuMemHostRegister from assuming the page-faulting burden.
When allocating in the growing hostbuf, cuMemHostRegister should be fast if the RAM is already prefaulted (confirmed with extensive experiments).
To actually copy the data, a straight chunked multi-threaded cuMemcpy is used (see the sketch after this list).
Non-blocking GPU transfer is then used from there.
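A toy sketch of the overall pattern, using PyTorch-level stand-ins rather than the comfy-aimdo internals; the real system reuses a growable, prefaulted HostBuffer instead of allocating pinned memory per weight:

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def load_all(cpu_weights, device, threads=8):
    # One worker pool for the staging memcpys, one stream for the DMA.
    pool = ThreadPoolExecutor(threads)
    stream = torch.cuda.Stream()
    out = []
    for weight in cpu_weights:  # contiguous tensors read from disk
        pinned = torch.empty_like(weight, pin_memory=True)
        # Chunked multi-threaded copy into pinned staging.
        pairs = zip(torch.chunk(pinned.view(-1), threads),
                    torch.chunk(weight.view(-1), threads))
        list(pool.map(lambda p: p[0].copy_(p[1]), pairs))
        with torch.cuda.stream(stream):
            # Non-blocking DMA: the next iteration's disk read and staging
            # overlap this transfer.
            out.append(pinned.to(device, non_blocking=True))
    stream.synchronize()
    return out
```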
Here is what it looks like after:
TLDR: Denser blue. Denser dark green. No bright green. Smaller gaps == Faster load.
The pthread_cond_wait() is the main thread parking itself as it waits for the fleet of copy threads to do the copy. Notice the cond_wait is overlapping the GPU transfer (dark green), so it is effectively using non-blocking GPU DMA to read the next weight from the disk as the previous one copies. The goal is as much time reading from disk as possible (more pthread_cond_wait == good).
Finer details and other changes:
Even for non-offloaded weights, this system is still used. The offload streams now have an associated reusable pin_stream_buffer that each transfer can use to stage and transfer weights. This is used for first-step loads, or when normal pinned memory is fully exhausted.
This created the need for a deeper priority hierarchy for pinned memory with JIT memory-pressure release. The stream pin buffer is top priority and will trash all other pins to make space on demand. Active models then take precedence over inactive models. This also fixes a bug where pins were not being freed from inactive models on Windows (CORE-164). Since model loading now fully bypasses the mmap, there is no need for the RAM-based pressure mechanism on model load, so it is removed along with the pin memory pressure mechanism. Also clean up the now-unused Windows-specific memory logic.
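A hypothetical sketch of that hierarchy; names and structure are illustrative, not the model_management implementation:

```python
# Eviction order on pin pressure: lowest priority first. The stream pin
# buffer is top priority and is never evicted; it trashes others instead.
EVICTION_ORDER = ["inactive_model", "active_model"]

def make_space(request_bytes, pins_by_priority):
    freed = 0
    for level in EVICTION_ORDER:
        for pin in pins_by_priority.get(level, []):
            freed += pin.release()
            if freed >= request_bytes:
                return freed
    return freed  # may fall short; the caller decides whether to proceed
```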
Making this new pinning logic interact with regular ModelPatcher pins is awkward, so we take a first step towards deprecating non-dynamic VRAM by switching off smart memory completely for non-dynamic models. Going forward, new models should not need VRAM estimates (dynamic VRAM handles this dynamically), so we avoid needing such estimates while solving our pin interaction problem (CORE-152).
Performance following this enhancement was then immediately gated on Loras still being unpinned, so we follow up with general pinned-memory support for Loras, which was deferred in the original Lora async work in #13618 (CORE-165).
Example test conditions:
Windows RTX5060, 16GB RAM, PCIE4x4 NVME
LTX2.3 FP8 template workflow (with Lora), 720x520x121f
Before:
After:
When running without the lora:
LTX2.3 FP8 T2V 360Px121f, Windows, RTX5060, 16GB RAM
Before:
After:
Example test conditions:
Linux, DGX spark
Flux2 FP8 in 3 different quants (forces 3x models)
Before (step speed ok but huge time spent in loading):
Before, with --disable-dynamic-vram (OOM-killed):
After:
More test data to come.