Summary
mlx_lm.server crashed twice while serving OpenAI-compatible chat completion requests for mlx-community/Qwen3.5-4B-MLX-8bit.
The immediate crash is an uncaught Metal out-of-memory exception:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
zsh: abort mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit
Before each abort, the server logs show the prompt cache growing substantially:
- First run: prompt cache reached 23.35 GB, then aborted shortly after.
- Second run: prompt cache reached 26.28 GB, then aborted shortly after.
There were also repeated BrokenPipeError: [Errno 32] Broken pipe traces when clients disconnected during streaming/progress callbacks. Those did not kill the server by themselves, but may be relevant because interrupted requests appear to leave cache state behind.
Environment
- macOS: 27.0, build 26A5353q
- Architecture: arm64
- RAM: 48 GB (51539607552 bytes)
- Python: 3.14.6
- mlx-lm: 0.31.3
- mlx: 0.31.2
- huggingface_hub: 1.18.0
- Model: mlx-community/Qwen3.5-4B-MLX-8bit
Command
mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit
The server started on the default host/port:
Starting httpd at 127.0.0.1 on port 8080...
Observed Behavior
The server accepted multiple POST /v1/chat/completions requests. Some requests used very large prompts, for example:
Prompt processing progress: 2048/55817
...
Prompt processing progress: 55817/55817
and after restart:
Prompt processing progress: 2048/64523
...
Prompt processing progress: 22765/64523
BrokenPipeError: [Errno 32] Broken pipe
The prompt cache then grew over time:
Prompt Cache: 10 sequences, 20.50 GB
Prompt Cache: 10 sequences, 23.35 GB
and on the second run:
Prompt Cache: 10 sequences, 20.11 GB
Prompt Cache: 10 sequences, 22.07 GB
Prompt Cache: 10 sequences, 23.53 GB
Prompt Cache: 10 sequences, 24.93 GB
Prompt Cache: 10 sequences, 25.69 GB
Prompt Cache: 10 sequences, 26.28 GB
Then the process aborted:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
zsh: abort mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit
Expected Behavior
The server should not abort the process when memory pressure is high. Ideally it would do at least one of the following:
- enforce a safer default prompt cache memory limit,
- evict prompt cache entries before the Metal command buffer fails,
- return an HTTP error for requests that cannot be served within available memory,
- or catch/report the OOM in a controlled way instead of terminating the process.
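To illustrate the eviction option above, here is a minimal sketch of a least-recently-used cache bounded by total bytes rather than by entry count. It is not mlx_lm's implementation: the class name `ByteBudgetLRU`, its methods, and the idea that each cached sequence reports its size as `nbytes` are all assumptions made for illustration.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Hypothetical LRU cache that evicts by total byte size, not entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._entries = OrderedDict()  # key -> (value, nbytes), oldest first

    def put(self, key, value, nbytes):
        # Drop any existing entry first so its size is not counted twice.
        if key in self._entries:
            self.total_bytes -= self._entries.pop(key)[1]
        if nbytes > self.max_bytes:
            return False  # larger than the whole budget: do not cache it
        # Evict least recently used entries until the new entry fits.
        while self.total_bytes + nbytes > self.max_bytes:
            _, (_, old_bytes) = self._entries.popitem(last=False)
            self.total_bytes -= old_bytes
        self._entries[key] = (value, nbytes)
        self.total_bytes += nbytes
        return True

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def __len__(self):
        return len(self._entries)
```

With a budget like this, ten cached sequences could not grow to 20+ GB unless the budget allowed it; the oldest sequences would be dropped as new ones arrive.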
It would also be helpful if BrokenPipeError from disconnected clients during keepalive/final SSE writes were handled quietly, or triggered cleanup of the associated request and cache state.
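A rough sketch of what quiet handling could look like, assuming a handler that writes SSE chunks to a file-like `wfile` (as Python's `http.server` request handlers do). The function name `write_sse_chunk` and the `on_disconnect` cleanup hook are hypothetical and are not part of mlx_lm.

```python
def write_sse_chunk(wfile, payload, on_disconnect=None):
    """Write one SSE chunk; return False if the client has gone away.

    `on_disconnect` is a hypothetical hook where the server could cancel
    generation and release any cache state tied to the request.
    """
    try:
        wfile.write(payload)
        wfile.flush()
        return True
    except (BrokenPipeError, ConnectionResetError):
        # The client disconnected mid-stream: no traceback, just clean up.
        if on_disconnect is not None:
            on_disconnect()
        return False
```

The caller would stop generating as soon as this returns False, rather than continuing to process a prompt nobody is waiting for.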
Notes
The help text shows cache limiting options exist:
--prompt-cache-size PROMPT_CACHE_SIZE
--prompt-cache-bytes PROMPT_CACHE_BYTES
No explicit cache limit was passed in this run, so this was default behavior.
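As a possible workaround until the defaults change, the cache could be capped explicitly. The sketch below derives a byte budget from the machine's RAM and builds the server command line. Two things here are assumptions, not confirmed behavior: that `--prompt-cache-bytes` accepts a plain integer byte count, and that 25% of RAM is a sensible budget (the fraction is arbitrary). The helper name `server_command` is hypothetical.

```python
import shlex


def server_command(model, total_ram_bytes, cache_fraction=0.25):
    """Build an mlx_lm.server argv with an explicit prompt cache byte limit."""
    # Assumption: the flag takes an integer number of bytes.
    cache_bytes = int(total_ram_bytes * cache_fraction)
    return [
        "mlx_lm.server",
        "--model", model,
        "--prompt-cache-bytes", str(cache_bytes),
    ]


# RAM size taken from the Environment section of this report.
cmd = server_command("mlx-community/Qwen3.5-4B-MLX-8bit", 51539607552)
print(shlex.join(cmd))
```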