fetch_nearest_cache deep-copies the cached KV, doubling peak memory exactly when a cached conversation is reused #1395

@VitorLudke

Versions

  • mlx-lm 0.31.3, mlx 0.31.2 (also present on current main: models/cache.py:1678/1684/1692)
  • macOS 26 (Darwin 25.5.0), M5 / 16 GB unified memory, Python 3.14

Summary

LRUPromptCache.fetch_nearest_cache returns a copy.deepcopy of the matched entry's full KV cache (models/cache.py:1678, 1684, 1692). For the duration of the fetch, the process holds two full copies of that conversation's KV.

On memory-tight machines this doubles peak memory exactly when it is scarcest: when continuing the longest-running conversation. A ~6.3 GB cached sequence (roughly a 12k-token context on an 8B model with fp16 KV) dies instantly on fetch on a 16 GB Mac. The Metal command buffer fails with kIOGPUCommandBufferCallbackErrorOutOfMemory, mlx::core::gpu::check_error throws inside Metal's completion-handler thread where nothing can catch it, and the server aborts (SIGABRT).
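For a sense of scale, here is a back-of-the-envelope sketch of the arithmetic. The architecture numbers (32 layers, 32 KV heads, head dimension 128) are assumptions picked for illustration, not the configuration of the model used above; a model with grouped-query attention would have fewer KV heads and a proportionally smaller cache. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV size: one K and one V vector per token, layer, and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


# Assumed (hypothetical) architecture; fp16 means 2 bytes per element.
entry = kv_cache_bytes(tokens=12_000, layers=32, kv_heads=32, head_dim=128)

# During a deep-copying fetch, the original and the copy are alive together.
peak_during_fetch = 2 * entry

print(f"entry: {entry / 1e9:.2f} GB")              # entry: 6.29 GB
print(f"peak:  {peak_during_fetch / 1e9:.2f} GB")  # peak:  12.58 GB
```

Under these assumptions the KV alone peaks at roughly 12.6 GB during the fetch, before counting model weights or anything else resident in the 16 GB of unified memory.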

This amplifies #1390: even once --prompt-cache-bytes is enforced everywhere (#1118 / #1392 wire it into the LRUPromptCache constructor), a single conversation whose KV fits comfortably under max_bytes can still kill the server when it is reused, because the transient peak is 2× the entry size.

Suggested direction

Use move semantics for the common case: pop the entry from the cache on fetch instead of deep-copying it, hand the caller the original, and re-insert the updated cache after generation. That removes the second copy, so the transient peak drops from 2× the entry size to 1×. The deepcopy is only needed when two concurrent requests share a prefix, which the sequential serving path (the only path when a draft model is loaded, server.py:371) cannot do anyway.
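A minimal sketch of the difference, using a toy cache rather than mlx-lm's actual LRUPromptCache. The class `ToyPromptCache` and its methods `fetch_copy` and `fetch_move` are hypothetical names for illustration, the KV is stood in for by a list of bytearrays, and nearest-prefix matching is left out (lookups here are exact-key only).

```python
import copy
from collections import OrderedDict


class ToyPromptCache:
    """Toy LRU keyed by the exact token tuple. Not mlx-lm's LRUPromptCache."""

    def __init__(self):
        self._entries = OrderedDict()

    def __len__(self):
        return len(self._entries)

    def insert(self, tokens, kv):
        key = tuple(tokens)
        self._entries[key] = kv
        self._entries.move_to_end(key)  # mark as most recently used

    def fetch_copy(self, tokens):
        # Copying fetch: the cache keeps its entry and the caller gets a
        # second full copy, so both are alive at the same time.
        return copy.deepcopy(self._entries[tuple(tokens)])

    def fetch_move(self, tokens):
        # Moving fetch: the entry leaves the cache and the caller owns the
        # only copy until it re-inserts the updated KV.
        return self._entries.pop(tuple(tokens))


cache = ToyPromptCache()
original = [bytearray(1024)]  # stand-in for per-layer KV buffers
cache.insert([1, 2, 3], original)

copied = cache.fetch_copy([1, 2, 3])
assert copied is not original and copied[0] is not original[0]  # duplicate buffers

moved = cache.fetch_move([1, 2, 3])
assert moved is original and len(cache) == 0  # same buffers, no duplicate

moved[0].extend(bytearray(256))    # "generation" grows the KV in place
cache.insert([1, 2, 3, 4], moved)  # re-insert under the extended prompt
assert len(cache) == 1
```

One trade-off with the moving fetch is that the entry is out of the cache while generation runs, so a request that fails or is cancelled would need to re-insert it (for example in a `finally` block) to avoid losing the cached conversation.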
