`fetch_nearest_cache` deep-copies the cached KV, doubling peak memory exactly when a cached conversation is reused

## Versions

- mlx-lm 0.31.3, mlx 0.31.2 (also present on current `main`: `models/cache.py:1678/1684/1692`)
- macOS 26 (Darwin 25.5.0), M5 / 16 GB unified memory, Python 3.14

## Summary

`LRUPromptCache.fetch_nearest_cache` returns a `copy.deepcopy` of the matched entry's full KV cache (`models/cache.py:1678`, `1684`, `1692`). For the duration of the fetch, the process holds **two** full copies of that conversation's KV.

On memory-tight machines this doubles peak memory exactly when it's scarcest: continuing the longest-running conversation. A ~6.3 GB cached sequence (roughly a 12k-token context on an 8B model with fp16 KV) dies instantly on fetch on a 16 GB Mac — the Metal command buffer fails with `kIOGPUCommandBufferCallbackErrorOutOfMemory`, `mlx::core::gpu::check_error` throws inside Metal's completion-handler thread where nothing can catch it, and the server aborts (SIGABRT).

This amplifies #1390: even once `--prompt-cache-bytes` is enforced everywhere (#1118 / #1392 wire it into the `LRUPromptCache` constructor), a single conversation whose KV fits comfortably under `max_bytes` can still kill the server when it is *reused*, because the transient peak is 2× the entry size.

## Suggested direction

Move semantics for the common case: pop the entry from the cache on fetch instead of deep-copying, hand the caller the original, and re-insert the updated cache after generation. That halves the transient peak. The deepcopy is only load-bearing when two concurrent requests share a prefix — which the sequential serving path (the only path when a draft model is loaded, `server.py:371`) cannot do anyway.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`fetch_nearest_cache` deep-copies the cached KV, doubling peak memory exactly when a cached conversation is reused #1395

Versions

Summary

Suggested direction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

fetch_nearest_cache deep-copies the cached KV, doubling peak memory exactly when a cached conversation is reused #1395

Description

Versions

Summary

Suggested direction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`fetch_nearest_cache` deep-copies the cached KV, doubling peak memory exactly when a cached conversation is reused #1395