fix(server): honor prompt cache byte limit#1392

Open
agisilaos wants to merge 1 commit into ml-explore:main from agisilaos:fix/issue-aborts-metal-oom-after-prompt-cache-grows

agisilaos commented Jun 11, 2026


Summary

mlx_lm.server parsed --prompt-cache-bytes, but the server-created LRUPromptCache was only initialized with prompt_cache_size. That left the LRU cache itself with its default effectively-unbounded byte limit, so sequential serving could retain prompt cache entries beyond the configured byte budget.

This wires prompt_cache_bytes into LRUPromptCache(max_bytes=...) when the server constructs the prompt cache, keeping byte-limit enforcement centralized for all cache insertion paths.
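
To illustrate why the missing argument matters, here is a minimal sketch of a byte-bounded LRU cache. `ToyLRUCache` is a hypothetical stand-in written for this example, not the real `LRUPromptCache`; the `1 << 63` default and the choice to always keep the newest entry are assumptions made for illustration.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy LRU cache bounded by entry count and total bytes.

    Hypothetical stand-in for illustration only; not mlx_lm's LRUPromptCache.
    """

    def __init__(self, max_size, max_bytes=1 << 63):
        self.max_size = max_size
        self.max_bytes = max_bytes  # assumed "effectively unbounded" default
        self.nbytes = 0
        self._entries = OrderedDict()  # key -> entry size in bytes

    def insert(self, key, size):
        # Re-inserting a key replaces the old entry and refreshes its recency.
        if key in self._entries:
            self.nbytes -= self._entries.pop(key)
        self._entries[key] = size
        self.nbytes += size
        # Evict least recently used entries until both limits hold.
        # The newest entry is always kept (an assumption of this sketch).
        while len(self._entries) > 1 and (
            len(self._entries) > self.max_size or self.nbytes > self.max_bytes
        ):
            _, evicted_size = self._entries.popitem(last=False)
            self.nbytes -= evicted_size

    def keys(self):
        return list(self._entries)


# Only the entry-count limit is passed, so the byte total grows freely.
unbounded = ToyLRUCache(max_size=10)
# The byte budget is passed as well, so every insert enforces it.
bounded = ToyLRUCache(max_size=10, max_bytes=250)

for i in range(5):
    unbounded.insert(f"prompt-{i}", 100)
    bounded.insert(f"prompt-{i}", 100)

print(unbounded.nbytes, unbounded.keys())  # 500 bytes, all five entries kept
print(bounded.nbytes, bounded.keys())      # 200 bytes, only the two newest kept
```

Because eviction lives inside `insert`, any caller that adds an entry gets the byte limit applied without a separate trim step.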

Changes

  • Add make_lru_prompt_cache() to construct the server prompt cache with both sequence and byte limits.
  • Use the helper from run() so --prompt-cache-bytes is honored by the LRU cache itself.
  • Add a regression test showing a server-created prompt cache evicts entries according to the configured byte budget.
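
The helper and a test around it might look roughly like the sketch below. The signature of `make_lru_prompt_cache`, the `RecordingCache` stand-in, and the handling of a missing byte limit are all assumptions for illustration; the PR's actual regression test checks eviction on a real server-created cache rather than argument forwarding.

```python
import unittest
from types import SimpleNamespace


class RecordingCache:
    """Hypothetical stand-in that only records its constructor arguments."""

    def __init__(self, max_size, max_bytes=1 << 63):
        self.max_size = max_size
        self.max_bytes = max_bytes


def make_lru_prompt_cache(cli_args, cache_cls=RecordingCache):
    # Forward both limits to the cache constructor. When no byte budget is
    # given, fall back to the class default (assumed behaviour, not
    # confirmed by the PR).
    kwargs = {"max_size": cli_args.prompt_cache_size}
    if cli_args.prompt_cache_bytes is not None:
        kwargs["max_bytes"] = cli_args.prompt_cache_bytes
    return cache_cls(**kwargs)


class TestMakeLRUPromptCache(unittest.TestCase):
    def test_byte_limit_is_forwarded(self):
        args = SimpleNamespace(prompt_cache_size=10, prompt_cache_bytes=1024)
        cache = make_lru_prompt_cache(args)
        self.assertEqual(cache.max_size, 10)
        self.assertEqual(cache.max_bytes, 1024)

    def test_missing_byte_limit_keeps_default(self):
        args = SimpleNamespace(prompt_cache_size=10, prompt_cache_bytes=None)
        cache = make_lru_prompt_cache(args)
        self.assertEqual(cache.max_bytes, 1 << 63)
```

Saved as a module, the sketch can be run with `python3 -m unittest <module name>`.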

Alternatives considered

  • Adding another trim call only in the sequential serve path. This was rejected because passing the byte limit into LRUPromptCache keeps enforcement centralized and applies consistently to every insertion path.

Notes

Testing

  • python3 -m unittest tests.test_server.TestLRUPromptCache
  • black --check mlx_lm/server.py tests/test_server.py
  • isort --profile=black --check-only mlx_lm/server.py tests/test_server.py

What changed:
- Construct the server prompt cache through a helper that passes prompt_cache_bytes into LRUPromptCache as max_bytes.
- Add a regression test showing a server-created prompt cache evicts entries by byte budget.

Why:
- --prompt-cache-bytes was parsed and partially used in batching but not applied to the LRU cache itself.
- Sequential serving could therefore keep prompt caches unbounded by bytes and contribute to Metal OOM aborts as long prompts accumulated.

Alternatives considered:
- Trimming only in the sequential serve path was considered, but wiring the limit into LRUPromptCache keeps enforcement centralized for every insertion path.