fix(server): honor prompt cache byte limit by agisilaos · Pull Request #1392 · ml-explore/mlx-lm

agisilaos · 2026-06-11T05:46:14Z

Summary

mlx_lm.server parsed --prompt-cache-bytes, but the server-created LRUPromptCache was only initialized with prompt_cache_size. That left the LRU cache itself with its default effectively-unbounded byte limit, so sequential serving could retain prompt cache entries beyond the configured byte budget.

This wires prompt_cache_bytes into LRUPromptCache(max_bytes=...) when the server constructs the prompt cache, keeping byte-limit enforcement centralized for all cache insertion paths.
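As a rough illustration of why putting the limit inside the cache keeps enforcement centralized, here is a minimal sketch of an LRU cache bounded by both entry count and total bytes. `ByteBoundedLRU` and its methods are hypothetical names for this example only; this is not the mlx-lm `LRUPromptCache` implementation, and the `None` default simply stands in for the effectively unbounded byte limit described above.

```python
from collections import OrderedDict


class ByteBoundedLRU:
    """Toy LRU cache bounded by entry count and total bytes (illustrative only)."""

    def __init__(self, max_size=10, max_bytes=None):
        self.max_size = max_size
        # Assumption: None means "no byte limit", standing in for an
        # effectively unbounded default.
        self.max_bytes = max_bytes
        self._entries = OrderedDict()  # key -> (value, nbytes), oldest first
        self.nbytes = 0

    def insert(self, key, value, nbytes):
        # Drop any existing entry first so its bytes are not counted twice.
        if key in self._entries:
            self.nbytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, nbytes)
        self.nbytes += nbytes
        self._evict()

    def get(self, key):
        value, _ = self._entries[key]
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def _over_budget(self):
        too_many = len(self._entries) > self.max_size
        too_big = self.max_bytes is not None and self.nbytes > self.max_bytes
        return too_many or too_big

    def _evict(self):
        # Every insertion funnels through here, so both limits hold no
        # matter which code path added the entry.
        while self._entries and self._over_budget():
            _, (_, nbytes) = self._entries.popitem(last=False)
            self.nbytes -= nbytes

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


bounded = ByteBoundedLRU(max_size=10, max_bytes=100)
bounded.insert("a", "kv-a", 60)
bounded.insert("b", "kv-b", 60)  # 120 bytes exceeds the budget, so "a" is evicted

unbounded = ByteBoundedLRU(max_size=10)  # count limit only
unbounded.insert("a", "kv-a", 60)
unbounded.insert("b", "kv-b", 60)  # both entries are retained
```

With only the count limit set, the second cache keeps growing in bytes until ten entries exist, which is the shape of the gap this PR describes.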

Changes

- Add make_lru_prompt_cache() to construct the server prompt cache with both sequence and byte limits.
- Use the helper from run() so --prompt-cache-bytes is honored by the LRU cache itself.
- Add a regression test showing a server-created prompt cache evicts entries according to the configured byte budget.
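The wiring described above might look roughly like the sketch below. The helper's signature, the `StubLRUPromptCache` stand-in, the `int` flag type, and the argument defaults are all assumptions made for illustration; the actual helper and parser in mlx_lm/server.py may differ.

```python
import argparse


class StubLRUPromptCache:
    """Stand-in for the server's LRUPromptCache; it only records its limits."""

    def __init__(self, max_size=10, max_bytes=None):
        self.max_size = max_size
        self.max_bytes = max_bytes


def make_lru_prompt_cache(args, cache_cls=StubLRUPromptCache):
    # Hypothetical signature: forward both parsed limits to the cache so the
    # byte budget is enforced by the cache itself rather than by its callers.
    kwargs = {"max_size": args.prompt_cache_size}
    if args.prompt_cache_bytes is not None:
        kwargs["max_bytes"] = args.prompt_cache_bytes
    return cache_cls(**kwargs)


# Assumed flag types and defaults, for the example only.
parser = argparse.ArgumentParser()
parser.add_argument("--prompt-cache-size", type=int, default=10)
parser.add_argument("--prompt-cache-bytes", type=int, default=None)

with_limit = make_lru_prompt_cache(parser.parse_args(["--prompt-cache-bytes", "1024"]))
without_limit = make_lru_prompt_cache(parser.parse_args([]))
```

A regression test in this spirit would build the cache through the helper, insert entries whose combined size exceeds the byte budget, and assert that the oldest entries were evicted.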

Alternatives considered

Adding another trim call only in the sequential serve path. This was rejected because passing the byte limit into LRUPromptCache keeps enforcement centralized and applies consistently to every insertion path.

Notes

- Fixes the server prompt-cache byte-limit enforcement gap related to #883 ("mlx_lm.server causes macOS kernel panic (IOGPUMemory crash) due to unbounded memory growth").
- Related to #854 ("mlx_lm.server crashes on Metal OOM instead of returning an HTTP error") and #1015 ("generate() crashes on Metal OOM instead of recovering gracefully"), which track broader Metal OOM recovery behavior. This PR reduces one server-side OOM trigger but does not attempt to make arbitrary Metal OOMs recoverable.
- Overlaps with #1118 ("fix: honor --prompt-cache-bytes in sequential serve mode").

Testing

python3 -m unittest tests.test_server.TestLRUPromptCache
black --check mlx_lm/server.py tests/test_server.py
isort --profile=black --check-only mlx_lm/server.py tests/test_server.py

What changed:
- Construct the server prompt cache through a helper that passes prompt_cache_bytes into LRUPromptCache as max_bytes.
- Add a regression test showing a server-created prompt cache evicts entries by byte budget.

Why:
- --prompt-cache-bytes was parsed and partially used in batching but not applied to the LRU cache itself.
- Sequential serving could therefore keep prompt caches unbounded by bytes and contribute to Metal OOM aborts as long prompts accumulated.

Alternatives considered:
- Trimming only in the sequential serve path was considered, but wiring the limit into LRUPromptCache keeps enforcement centralized for every insertion path.

This was referenced Jun 11, 2026:
- fetch_nearest_cache deep-copies the cached KV, doubling peak memory exactly when a cached conversation is reused #1395 (Open)
- mlx_lm.server aborts with Metal OOM after prompt cache grows to ~23-26 GB #1390 (Open)


agisilaos wants to merge 1 commit into ml-explore:main from agisilaos:fix/issue-aborts-metal-oom-after-prompt-cache-grows
