Versions
- mlx-lm 0.31.3, mlx 0.31.2 (also present on current
main: models/cache.py:1678/1684/1692)
- macOS 26 (Darwin 25.5.0), M5 / 16 GB unified memory, Python 3.14
Summary
LRUPromptCache.fetch_nearest_cache returns a copy.deepcopy of the matched entry's full KV cache (models/cache.py:1678, 1684, 1692). For the duration of the fetch, the process holds two full copies of that conversation's KV.
On memory-tight machines this doubles peak memory exactly when it's scarcest: continuing the longest-running conversation. A ~6.3 GB cached sequence (roughly a 12k-token context on an 8B model with fp16 KV) dies instantly on fetch on a 16 GB Mac — the Metal command buffer fails with kIOGPUCommandBufferCallbackErrorOutOfMemory, mlx::core::gpu::check_error throws inside Metal's completion-handler thread where nothing can catch it, and the server aborts (SIGABRT).
This amplifies #1390: even once --prompt-cache-bytes is enforced everywhere (#1118 / #1392 wire it into the LRUPromptCache constructor), a single conversation whose KV fits comfortably under max_bytes can still kill the server when it is reused, because the transient peak is 2× the entry size.
Suggested direction
Move semantics for the common case: pop the entry from the cache on fetch instead of deep-copying, hand the caller the original, and re-insert the updated cache after generation. That halves the transient peak. The deepcopy is only load-bearing when two concurrent requests share a prefix — which the sequential serving path (the only path when a draft model is loaded, server.py:371) cannot do anyway.
Versions
main:models/cache.py:1678/1684/1692)Summary
LRUPromptCache.fetch_nearest_cachereturns acopy.deepcopyof the matched entry's full KV cache (models/cache.py:1678,1684,1692). For the duration of the fetch, the process holds two full copies of that conversation's KV.On memory-tight machines this doubles peak memory exactly when it's scarcest: continuing the longest-running conversation. A ~6.3 GB cached sequence (roughly a 12k-token context on an 8B model with fp16 KV) dies instantly on fetch on a 16 GB Mac — the Metal command buffer fails with
kIOGPUCommandBufferCallbackErrorOutOfMemory,mlx::core::gpu::check_errorthrows inside Metal's completion-handler thread where nothing can catch it, and the server aborts (SIGABRT).This amplifies #1390: even once
--prompt-cache-bytesis enforced everywhere (#1118 / #1392 wire it into theLRUPromptCacheconstructor), a single conversation whose KV fits comfortably undermax_bytescan still kill the server when it is reused, because the transient peak is 2× the entry size.Suggested direction
Move semantics for the common case: pop the entry from the cache on fetch instead of deep-copying, hand the caller the original, and re-insert the updated cache after generation. That halves the transient peak. The deepcopy is only load-bearing when two concurrent requests share a prefix — which the sequential serving path (the only path when a draft model is loaded,
server.py:371) cannot do anyway.