fix(server): honor prompt cache byte limit #1392
Open
agisilaos wants to merge 1 commit into
What changed:
- Construct the server prompt cache through a helper that passes prompt_cache_bytes into LRUPromptCache as max_bytes.
- Add a regression test showing a server-created prompt cache evicts entries by byte budget.

Why:
- --prompt-cache-bytes was parsed and partially used in batching but not applied to the LRU cache itself.
- Sequential serving could therefore keep prompt caches unbounded by bytes and contribute to Metal OOM aborts as long prompts accumulated.

Alternatives considered:
- Trimming only in the sequential serve path was considered, but wiring the limit into LRUPromptCache keeps enforcement centralized for every insertion path.
Summary
mlx_lm.server parsed --prompt-cache-bytes, but the server-created LRUPromptCache was only initialized with prompt_cache_size. That left the LRU cache itself with its default, effectively unbounded byte limit, so sequential serving could retain prompt cache entries beyond the configured byte budget.

This wires prompt_cache_bytes into LRUPromptCache(max_bytes=...) when the server constructs the prompt cache, keeping byte-limit enforcement centralized for all cache insertion paths.

Changes
- Add make_lru_prompt_cache() to construct the server prompt cache with both sequence and byte limits.
- Use it in run() so --prompt-cache-bytes is honored by the LRU cache itself.

Alternatives considered
- Trimming only in the sequential serve path was considered, but wiring the limit into LRUPromptCache keeps enforcement centralized and applies consistently to every insertion path.

Notes
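For readers skimming the PR, the wiring described under Changes can be sketched roughly as follows. This is not the code from the diff: ToyCacheConfig, build_parser and make_cache_config are hypothetical names, and the integer flag type and the defaults are assumptions made only so the example runs on its own.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCacheConfig:
    """Hypothetical stand-in for the limits a prompt cache is built with."""

    max_size: int
    max_bytes: Optional[int]  # None means no byte budget is enforced


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the PR text; types and defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt-cache-size", type=int, default=10)
    parser.add_argument("--prompt-cache-bytes", type=int, default=None)
    return parser


def make_cache_config(args: argparse.Namespace) -> ToyCacheConfig:
    # Forward BOTH limits. Passing only the size here reproduces the kind
    # of bug the PR describes: the byte flag is parsed but never enforced.
    return ToyCacheConfig(
        max_size=args.prompt_cache_size,
        max_bytes=args.prompt_cache_bytes,
    )


args = build_parser().parse_args(["--prompt-cache-bytes", "1024"])
config = make_cache_config(args)
print(config)
```

Routing construction through a single helper means a future caller cannot build the cache with the size limit alone, which is the centralization argument made under Alternatives considered.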
Testing
- python3 -m unittest tests.test_server.TestLRUPromptCache
- black --check mlx_lm/server.py tests/test_server.py
- isort --profile=black --check-only mlx_lm/server.py tests/test_server.py
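The behaviour the regression test is meant to cover, eviction by byte budget, can be illustrated with a toy cache. This is a minimal sketch and not the LRUPromptCache implementation: ToyByteLRU, its methods, and the use of len(value) as the entry size are all assumptions chosen for the example.

```python
from collections import OrderedDict
from typing import List, Optional


class ToyByteLRU:
    """Hypothetical LRU cache bounded by entry count and total bytes."""

    def __init__(self, max_size: int, max_bytes: Optional[int] = None):
        self.max_size = max_size
        self.max_bytes = max_bytes  # None disables the byte budget
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.nbytes = 0

    def insert(self, key: str, value: bytes) -> None:
        # Remove any existing entry first so its bytes are not double-counted.
        if key in self._entries:
            self.nbytes -= len(self._entries.pop(key))
        self._entries[key] = value  # newest entry goes to the end
        self.nbytes += len(value)
        self._evict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def _over_budget(self) -> bool:
        if len(self._entries) > self.max_size:
            return True
        return self.max_bytes is not None and self.nbytes > self.max_bytes

    def _evict(self) -> None:
        # Drop least recently used entries until both limits hold
        # (or the cache is empty).
        while self._entries and self._over_budget():
            _, evicted = self._entries.popitem(last=False)
            self.nbytes -= len(evicted)

    def keys(self) -> List[str]:
        return list(self._entries)


cache = ToyByteLRU(max_size=10, max_bytes=100)
cache.insert("a", b"x" * 60)
cache.insert("b", b"y" * 60)  # 120 bytes total exceeds 100, so "a" is evicted
print(cache.keys(), cache.nbytes)  # ['b'] 60
```

Because the check runs inside insert(), every caller gets the same enforcement. With max_bytes left at None the same two inserts keep both entries, which mirrors the unbounded behaviour described in the summary.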