
Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117)#13802

Draft
rattus128 wants to merge 24 commits into Comfy-Org:master from rattus128:dev/threaded-loader

Conversation

@rattus128
Contributor

@rattus128 rattus128 commented May 8, 2026

DGX Spark users: please try this and comment with your results. I am getting major improvements in load behaviour.

The change is in this ComfyUI PR, also available here: https://github.com/rattus128/ComfyUI/tree/dev/threaded-loader
Don't forget to update your pip packages from requirements.txt.


Some people have so little RAM they can't fit a single large model in all their RAM.
Others have systems so fast that loading the model cold from disk is by far the slowest part of their Comfy experience.

So make loading models from disk a lot faster.

Modern NVMe disks require a small fleet of CPU threads to actually saturate their read bandwidth. At the same time, we are using mmap + cudaMemcpy (pageable), which does single-threaded, per-page synchronous faulting. This limits progress to one disk thread, serialized with OS disk activity to service mmap page faults, which is pretty slow. Here is current ComfyUI loading a model (this capture is from when the page cache is hot):

image

How to interpret this (left to right):

The bright green is unpinned transfer. These are slow, as they imply full CPU synchronization with a (presumably) single-threaded memcpy (notice the CUDA API track with the long red calls to cuMemcpy).

After that there is the cuMemAllocHost to allocate pinned memory, which is not much better. A huge amount of time is tied up in the ioctl for the allocation, and it turns out this is delayed because the kernel needs to prefault the pinned range synchronously and single-threaded. It takes more than double the time of actually copying the weight from the page cache (the following read()).

The time to copy from the pinned memory to GPU is then tiny (the darker green).

The blue is the only time the GPU is doing computational work.
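The core of the fix, using many readers instead of one faulting reader, can be illustrated with a small sketch in plain Python. This is an illustration only, not the comfy-aimdo implementation; it uses POSIX positional reads (os.pread) so each worker can read its own byte range without a shared file position:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def threaded_read(path, num_threads=8, chunk_size=1 << 20):
    """Read a file with a fleet of threads using positional reads.

    Each worker reads an independent byte range, so the kernel can keep
    several NVMe requests in flight instead of serializing on one
    page-faulting reader.
    """
    size = os.path.getsize(path)
    buf = bytearray(size)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset):
        end = min(offset + chunk_size, size)
        # os.pread is thread-safe: it takes an explicit offset and does
        # not touch the shared file position.
        data = os.pread(fd, end - offset, offset)
        buf[offset:offset + len(data)] = data

    try:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
    return bytes(buf)
```

Compared with one thread faulting mmap pages, the chunked readers keep the disk queue full and reads complete in parallel.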

So let's fix it by dumping mmap and torch.to() completely and instead implementing a fleet of threads to do the whole lot in comfy-aimdo. This will be released in comfy-aimdo 0.4.0 (CORE-43).

Comfy-Org/comfy-aimdo#46

How it works:

Transfers to the GPU always go via pinned memory allocated via the aimdo HostBuffer API. HostBuffers that are expected to grow (like the set of pins for an active model over the course of model init) have a speculative prefetcher, which prefaults the pages for subsequent allocations. This keeps the single-threaded cuMemHostRegister from assuming the page-faulting burden.

When allocating in the growing hostbuf, cuMemHostRegister should be fast if the RAM is already prefaulted (confirmed with extensive experiments).

To actually copy the data, a straight chunked multi-threaded cuMemcpy is used.

Non-blocking GPU transfer is then used from there.
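The growing-buffer-with-prefetcher idea can be sketched as follows. All names here (GrowableStagingBuffer, prefault_ahead) are made up for illustration and are not the aimdo HostBuffer API; the bytearray stands in for host memory, and touching one byte per page models the prefault:

```python
import threading

class GrowableStagingBuffer:
    """Sketch of a growing host staging buffer with speculative prefaulting.

    A background thread can call prefault_ahead() to touch pages beyond the
    current allocation point, so that a later registration of that range
    (cuMemHostRegister in the real system) does not have to fault pages
    single-threaded on the critical path.
    """
    PAGE = 4096

    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.used = 0          # bytes handed out so far
        self.prefaulted = 0    # bytes already touched
        self.lock = threading.Lock()

    def prefault_ahead(self, lookahead):
        """Touch one byte per page up to `used + lookahead`."""
        with self.lock:
            target = min(self.used + lookahead, len(self.data))
            for off in range(self.prefaulted, target, self.PAGE):
                self.data[off] = 0  # forces the page to be backed by RAM
            self.prefaulted = max(self.prefaulted, target)

    def alloc(self, size):
        """Hand out the next `size` bytes; ideally already prefaulted."""
        with self.lock:
            offset = self.used
            self.used += size
        return offset
```

In the real system the prefetcher runs concurrently with the copy threads, so allocations almost always land in already-faulted memory.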

Here is what it looks like after:

image

TLDR: Denser blue. Denser dark green. No bright green. Smaller gaps == Faster load.

The pthread_cond_wait() is the main thread parking itself as it waits for the fleet of copy threads to do the copy. Notice the cond_wait is overlapping the GPU transfer (dark green), so it is effectively using non-blocking GPU DMA to read the next weight from the disk as the previous one copies. The goal is as much time reading from disk as possible (more pthread_cond_wait == good).
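The overlap described here is a classic bounded producer/consumer pipeline: the reader fills the next staging buffer while the main thread drains the previous one. A minimal sketch, where read_fn and transfer_fn are stand-ins for the real disk read and non-blocking GPU copy:

```python
import threading
from queue import Queue

def pipelined_load(weights, read_fn, transfer_fn, depth=2):
    """Overlap "disk reads" with "GPU transfers" via a bounded queue.

    The reader thread stages the next weight while the main thread
    transfers the previous one, so disk time hides behind transfer time.
    `depth` bounds how many staged buffers are in flight, like a small
    rotation of pinned staging buffers.
    """
    staged = Queue(maxsize=depth)

    def reader():
        for w in weights:
            staged.put(read_fn(w))  # blocks when `depth` buffers are staged
        staged.put(None)            # sentinel: no more weights

    threading.Thread(target=reader, daemon=True).start()

    results = []
    while (item := staged.get()) is not None:
        # In the real loader this is a non-blocking GPU DMA; the reader
        # is already fetching the next weight while this one transfers.
        results.append(transfer_fn(item))
    return results
```

The main thread blocking on `staged.get()` plays the role of the pthread_cond_wait seen in the trace: idle main thread, busy reader fleet.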


Finer details and other changes:

Even for non-offloaded weights, this system is still used. The offload streams now have an associated reusable pin_stream_buffer that each transfer can use to stage and transfer weights. This is used for first-step loads, or when normal pinned memory is fully exhausted.

This created the need for a deeper priority hierarchy for pinned memory with JIT memory-pressure release. The stream pin buffer is top priority and will trash all other pins to make space on demand. Active models then take precedence over inactive models. This also fixes a bug where pins were not freeing from inactive models on Windows (CORE-164). Since model loading now fully bypasses mmap, there is no need for the RAM-based pressure mechanism on model load, so it is removed along with the pin memory pressure mechanism. The now-unused Windows-specific memory logic is also cleaned up.
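The priority hierarchy amounts to an eviction loop over tiers, lowest priority first. A hedged sketch with made-up names and structure, not the real ensure_pin_budget signature:

```python
def ensure_pin_budget_sketch(request, budget, pools):
    """Illustrative eviction order for the pin priority hierarchy.

    `pools` maps tier name -> currently pinned bytes. Inactive models are
    evicted first, then active models; the stream pin buffer is never
    evicted to satisfy someone else, since it is top priority.
    Returns True if `request` now fits within `budget`.
    """
    order = ["inactive_models", "active_models"]  # stream_pin_buffer is exempt
    total = sum(pools.values())
    for tier in order:
        if total + request <= budget:
            break
        # Free only as much as needed from this tier, capped by what it holds.
        freed = min(pools[tier], total + request - budget)
        pools[tier] -= freed
        total -= freed
    return total + request <= budget
```

If even evicting both model tiers cannot cover the request, the caller has to decide whether the budget is a hard cap or a soft target (see the review discussion below).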

Making this new pinning logic interact with regular ModelPatcher pins is awkward, so we take a first step towards deprecating non-dynamic VRAM by switching off smart memory completely for non-dynamic models. Going forward, new models should not need VRAM estimates (dynamic handles this dynamically), so we avoid needing such estimates while solving our pin interaction problem (CORE-152).

Performance following this enhancement was then immediately gated on LoRAs still being unpinned, so we follow up with general pinned-memory support for LoRAs, which was deferred in the original LoRA async work in #13618 (CORE-165).


Example test conditions:

Windows RTX5060, 16GB RAM, PCIE4x4 NVME
LTX2.3 FP8 template workflow (with Lora), 720x520x121f

Before:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [08:59<00:00, 67.40s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:37<00:00, 72.58s/it]
Prompt executed in 00:13:59

After:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [02:11<00:00, 16.43s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.96s/it]
Prompt executed in 246.86 seconds

When running without the lora:

LTX2.3 FP8 T2V 360Px121f, Windows, RTX5060, 16GB RAM

Before:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:48<00:00, 28.56s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:42<00:00, 34.26s/it]
Prompt executed in 464.36 seconds

After:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:08<00:00,  8.60s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.95s/it]
Prompt executed in 143.79 seconds

Example test conditions:

Linux, DGX spark
Flux2 FP8 in 3 different quants (forces 3x models)

image image image image

Before (step speed ok but huge time spent in loading):

model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:48<00:00,  2.42s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:50<00:00,  2.51s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:48<00:00,  2.41s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Prompt executed in 00:14:10

Before --disable-dynamic-vram (OOM killered):

Using mixed precision operations
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Unloaded partially: 33080.59 MB freed, 0.00 MB remains loaded, 4480.00 MB buffer reserved, lowvram patches: 0
Killed

After:

model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.35s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.36s/it]
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.37s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Prompt executed in 226.51 seconds
image

More test data to come.

@socket-security

socket-security Bot commented May 8, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Updated: comfy-aimdo 0.3.0 ⏵ 0.4.2
Scores: Supply Chain Security 99 (+1), Vulnerability 100, Quality 100, Maintenance 100, License 70

View full report

@rattus128
Contributor Author

rattus128 commented May 9, 2026

This is bad:

qwen 2048x2048 on my 5060.

image

There is a sync in there that shouldn't be happening. Pretty sure it's the per-stream pre-buffer use syncing in too dumb a way.

rattus128 added 12 commits May 12, 2026 11:04
Disable smart memory outright for non-dynamic models.

This is a minor step towards deprecation of --disable-dynamic-vram
and the legacy ModelPatcher.

This is needed for estimate-free model development, where new models
can opt-out of supplying a memory estimate and not have to worry
about hard VRAM allocations due to legacy non-dynamic model patchers

This is also a general stability increase for a lot of stray use cases
where estimates may still be off and going forward we are not going
to accurately maintain such estimates.
Use a single growable buffer so we can do threaded pre-warming on
pinned memory.
Aimdo implements a faster threaded loader.
Introduce per-offload-stream HostBuffer reuse for pinned staging,
include it in cast buffer reset synchronization.

Defer actual casts that go via this pin path to a separate pass
such that the buffer can be allocated monolithically (to avoid
cudaHostRegister thrash).
Replace the predictive pin pressure mechanism with JIT pin memory
pressure.
This was defeatured in aimdo iteration
@rattus128 rattus128 force-pushed the dev/threaded-loader branch from 1d792e1 to 44c0a06 Compare May 12, 2026 01:55
@rattus128 rattus128 marked this pull request as ready for review May 12, 2026 13:30
@rattus128
Contributor Author

This is bad:

qwen 2048x2048 on my 5060.
image

There is a sync in there that shouldn't be happening. Pretty sure it's the per-stream pre-buffer use syncing in too dumb a way.

This is fixed.

@coderabbitai

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

The PR rewrites prefetch_prepared_value (new signature: counter, destination, stream, copy) and updates low‑VRAM patch prepare calls. It adds mmap dirtiness tracking (DIRTY_MMAPS), per‑stream host pin buffers (STREAM_PIN_BUFFERS), and pin‑budget eviction (ensure_pin_budget, get_pin_buffer, resize_pin_buffer) in model_management. pinning APIs changed to subset-aware forms (get_pin, pin_memory) and unpin was removed. cast and ops logic now use stream pin buffers and STREAM_PIN_BUFFER_HEADROOM. read_tensor_file_slice_into gained a hostbuf fast path; extra_ram_release accepts free_active. Windows-specific module removed; comfy-aimdo bumped to 0.4.0. execution.py now uses psutil and frees pins on RAM shortfall.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description check ✅ Passed: the description comprehensively explains the motivation, implementation approach, performance improvements, and provides concrete test results demonstrating the changes across multiple platforms.
  • Linked Issues check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Title check ✅ Passed: the title accurately reflects the main objective: implementing multi-threaded model loading from disk with performance improvements and disk offload capabilities.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 1193-1200: The call to offload_stream.synchronize() inside
get_pin_buffer causes unwanted stream-wide blocking when reusing
STREAM_PIN_BUFFERS and should be removed to allow overlapping disk reads,
pinned-memory staging and GPU transfers (matching the non-blocking pattern used
in get_aimdo_cast_buffer()); update get_pin_buffer to drop the
offload_stream.synchronize() call when reusing an existing buffer, or if
synchronization is truly required, document in a clear comment why it is
necessary and add a targeted synchronization mechanism (e.g., per-buffer
event/query) instead of synchronizing the entire offload_stream; reference
get_pin_buffer, STREAM_PIN_BUFFERS, offload_stream.synchronize(),
get_offload_stream and get_aimdo_cast_buffer when making the change.

In `@comfy/pinned_memory.py`:
- Around line 19-29: The HostBuffer.extend call is using a relative increment
but expects an absolute target size, causing truncation and out-of-bounds
slices; update the extend invocation in the block that sets module._pin so it
grows to the absolute new size (current hostbuf.size + requested size) instead
of passing size directly. Specifically, compute the absolute target (e.g.,
offset + size or hostbuf.size + size) and pass that to hostbuf.extend(size=...),
keeping the subsequent
comfy_aimdo.torch.hostbuf_to_tensor(hostbuf)[offset:offset+size] slice and the
assignment to module._pin and module._pin.untyped_storage()._comfy_hostbuf
unchanged. Ensure use of the same behavior as
resize_pin_buffer()/pin_buffer.extend(...).

In `@requirements.txt`:
- Line 26: The requirements file pins a non-existent package version
"comfy-aimdo==0.4.0" which will fail installs; update the dependency line for
the comfy-aimdo package (the literal "comfy-aimdo" entry) to a published version
such as "comfy-aimdo==0.3.0" or remove the strict pin (e.g.,
"comfy-aimdo>=0.3.0") so installs succeed until 0.4.0 is published.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33b744a4-878e-4d18-84e1-79c21c2872d0

📥 Commits

Reviewing files that changed from the base of the PR and between 20e4394 and 44c0a06.

📒 Files selected for processing (9)
  • comfy/lora.py
  • comfy/memory_management.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/ops.py
  • comfy/pinned_memory.py
  • comfy/utils.py
  • comfy/windows.py
  • requirements.txt
💤 Files with no reviewable changes (2)
  • comfy/windows.py
  • comfy/utils.py

Comment thread comfy/model_management.py
Comment thread comfy/pinned_memory.py
Comment thread requirements.txt Outdated
rattus128 added 3 commits May 13, 2026 22:23
This was syncing with the offload stream which itself is synced with the
compute stream, so this was syncing CPU with compute transitively. Define
the event to sync it more gently.
Pinning is more important than inactive intermediates and the stream
pin buffer is more important than even active intermediates.
Add back proper pin freeing on RAM pressure

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
comfy/model_management.py (1)

1398-1399: 💤 Low value

Behavior change: pin_memory no longer hard‑rejects above MAX_PINNED_MEMORY.

Previously the function returned False once TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY. Now it best‑effort frees RAM cache + dynamic pins via ensure_pin_budget(size) and then unconditionally proceeds to cudaHostRegister. If ensure_pin_budget cannot make enough room (e.g., everything is the active stream pin, which the priority hierarchy keeps), the cap is silently exceeded — and PIN_PRESSURE_HYSTERESIS makes the actual ceiling effectively MAX_PINNED_MEMORY + 128 MiB even on the happy path.

This appears to be intentional given the PR's "pin priority hierarchy" notes, but it would be helpful to either (a) document MAX_PINNED_MEMORY as a soft target rather than a hard cap, or (b) bail when ensure_pin_budget couldn't free the shortfall, so we don't quietly drift past the user‑visible budget on systems with marginal RAM.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy/model_management.py` around lines 1398 - 1399, pin_memory's behavior
changed to allow exceeding MAX_PINNED_MEMORY because it calls
ensure_pin_budget(size) then proceeds to cudaHostRegister regardless; update
pin_memory to verify the budget after calling ensure_pin_budget and bail if the
shortfall remains: after ensure_pin_budget(size) check if TOTAL_PINNED_MEMORY +
size <= MAX_PINNED_MEMORY (or account for PIN_PRESSURE_HYSTERESIS if intended)
and if not, return False (or raise) instead of proceeding to cudaHostRegister so
the cap remains a hard limit; reference the pin_memory function,
ensure_pin_budget, TOTAL_PINNED_MEMORY, MAX_PINNED_MEMORY,
PIN_PRESSURE_HYSTERESIS, and cudaHostRegister when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 516-525: In free_pins(), avoid indexing into
model.model.dynamic_pins[model.load_device] directly; instead safely
.get(model.load_device) and check for None before accessing ["active"] so
incomplete initialization won't raise; update the condition in free_pins
(referencing current_loaded_models, loaded_model.model, model.is_dynamic(),
model.partially_unload_ram) and apply the same defensive .get() pattern to the
other two occurrences noted around lines where dynamic pin checks occur (the
other methods that reference model.model.dynamic_pins[...] at the end of file)
to ensure you only call partially_unload_ram when pin_state is present and its
"active" flag is evaluated.

---

Nitpick comments:
In `@comfy/model_management.py`:
- Around line 1398-1399: pin_memory's behavior changed to allow exceeding
MAX_PINNED_MEMORY because it calls ensure_pin_budget(size) then proceeds to
cudaHostRegister regardless; update pin_memory to verify the budget after
calling ensure_pin_budget and bail if the shortfall remains: after
ensure_pin_budget(size) check if TOTAL_PINNED_MEMORY + size <= MAX_PINNED_MEMORY
(or account for PIN_PRESSURE_HYSTERESIS if intended) and if not, return False
(or raise) instead of proceeding to cudaHostRegister so the cap remains a hard
limit; reference the pin_memory function, ensure_pin_budget,
TOTAL_PINNED_MEMORY, MAX_PINNED_MEMORY, PIN_PRESSURE_HYSTERESIS, and
cudaHostRegister when making the change.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 79af7e61-7191-4ca1-9529-c80ae1a90494

📥 Commits

Reviewing files that changed from the base of the PR and between 44c0a06 and 3f71781.

📒 Files selected for processing (5)
  • comfy/memory_management.py
  • comfy/model_management.py
  • comfy/ops.py
  • comfy/pinned_memory.py
  • execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • comfy/ops.py
  • comfy/pinned_memory.py

Comment thread comfy/model_management.py
Comment on lines +516 to +525
def free_pins(size, evict_active=False):
    if size <= 0:
        return

    for loaded_model in reversed(current_loaded_models):
        model = loaded_model.model
        if model is not None and model.is_dynamic() and (evict_active or not model.model.dynamic_pins[model.load_device]["active"]):
            size -= model.partially_unload_ram(size)
            if size <= 0:
                break


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win


Use .get() to safely access dynamic_pins[load_device] in case initialization is incomplete.

While the current initialization logic in ModelPatcherDynamic.__init__() should populate dynamic_pins[load_device] before any dynamic model reaches free_pins(), defensive access is prudent in this RAM-pressure code path. Consider:

if model is not None and model.is_dynamic():
    pin_state = model.model.dynamic_pins.get(model.load_device)
    if pin_state is not None and (evict_active or not pin_state["active"]):
        size -= model.partially_unload_ram(size)
        if size <= 0:
            break

This same pattern should also be applied at lines 1249 and 1251 in the same file for consistency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy/model_management.py` around lines 516 - 525, In free_pins(), avoid
indexing into model.model.dynamic_pins[model.load_device] directly; instead
safely .get(model.load_device) and check for None before accessing ["active"] so
incomplete initialization won't raise; update the condition in free_pins
(referencing current_loaded_models, loaded_model.model, model.is_dynamic(),
model.partially_unload_ram) and apply the same defensive .get() pattern to the
other two occurrences noted around lines where dynamic pin checks occur (the
other methods that reference model.model.dynamic_pins[...] at the end of file)
to ensure you only call partially_unload_ram when pin_state is present and its
"active" flag is evaluated.

@rattus128 rattus128 marked this pull request as draft May 13, 2026 15:14
rattus128 and others added 5 commits May 15, 2026 00:35
Uncap the Windows pins from 50% by extending the pool and add a pressure
mechanism to move the pin reservations on demand.

This unfortunately implies a GPU sync to do the freeing, so significant
hysteresis needs to be added to consolidate these pressure events.
Limit the amount of RAM background intermediates can use, so that
switching workflows doesn't degrade performance too much.
threaded loader 2 + ram cache (CORE-43,CORE-117)
@rattus128 rattus128 changed the title Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165) Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117) May 15, 2026
rattus128 added 4 commits May 16, 2026 03:19
Flux 2 gets crazy stalls due to a mix of tiny and giant weights
creating lopsided stream buffer rotations.
Aimdo is able to chunk and slice this on the way for better CPU->GPU
overlap. The main advantage is the ability to shorten the bus contention
window between the previous weight transfer and the next weight's vbar
fault.
