mlx_lm.server aborts with Metal OOM after prompt cache grows to ~23-26 GB #1390

@agisilaos

Summary

mlx_lm.server crashed twice while serving OpenAI-compatible chat completion requests for mlx-community/Qwen3.5-4B-MLX-8bit.

The immediate crash is an uncaught Metal out-of-memory exception:

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
zsh: abort      mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit

Before each abort, the server logs show the prompt cache growing substantially:

  • First run: prompt cache reached 23.35 GB, then aborted shortly after.
  • Second run: prompt cache reached 26.28 GB, then aborted shortly after.

There were also repeated BrokenPipeError: [Errno 32] Broken pipe tracebacks when clients disconnected during streaming or progress callbacks. These did not kill the server on their own, but they may be relevant because interrupted requests appear to leave cache state behind.

Environment

  • macOS: 27.0, build 26A5353q
  • Architecture: arm64
  • RAM: 48 GB (51539607552 bytes)
  • Python: 3.14.6
  • mlx-lm: 0.31.3
  • mlx: 0.31.2
  • huggingface_hub: 1.18.0
  • Model: mlx-community/Qwen3.5-4B-MLX-8bit

Command

mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit

The server started on the default host/port:

Starting httpd at 127.0.0.1 on port 8080...

Observed Behavior

The server accepted multiple POST /v1/chat/completions requests. Some requests used very large prompts, for example:

Prompt processing progress: 2048/55817
...
Prompt processing progress: 55817/55817

and after restart:

Prompt processing progress: 2048/64523
...
Prompt processing progress: 22765/64523
BrokenPipeError: [Errno 32] Broken pipe

The prompt cache then grew over time:

Prompt Cache: 10 sequences, 20.50 GB
Prompt Cache: 10 sequences, 23.35 GB

and on the second run:

Prompt Cache: 10 sequences, 20.11 GB
Prompt Cache: 10 sequences, 22.07 GB
Prompt Cache: 10 sequences, 23.53 GB
Prompt Cache: 10 sequences, 24.93 GB
Prompt Cache: 10 sequences, 25.69 GB
Prompt Cache: 10 sequences, 26.28 GB
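To quantify the growth, the log lines from the second run can be parsed with a short script. This is only a sketch: the regular expression assumes the exact "Prompt Cache: N sequences, X GB" format shown above, and the helper name `parse_cache_log` is made up for illustration.

```python
import re

# Log lines copied verbatim from the second run above.
LOG = """\
Prompt Cache: 10 sequences, 20.11 GB
Prompt Cache: 10 sequences, 22.07 GB
Prompt Cache: 10 sequences, 23.53 GB
Prompt Cache: 10 sequences, 24.93 GB
Prompt Cache: 10 sequences, 25.69 GB
Prompt Cache: 10 sequences, 26.28 GB
"""

# Assumed line format; adjust if the server log format differs.
PATTERN = re.compile(r"Prompt Cache: (\d+) sequences, ([\d.]+) GB")


def parse_cache_log(text):
    """Return a list of (sequence_count, size_gb) tuples, one per log line."""
    return [(int(count), float(size)) for count, size in PATTERN.findall(text)]


samples = parse_cache_log(LOG)
sizes = [size for _, size in samples]

# Total growth between the first and last sample.
growth_gb = round(sizes[-1] - sizes[0], 2)

# Average size per cached sequence at the last sample.
per_sequence_gb = round(sizes[-1] / samples[-1][0], 3)

print(f"growth: {growth_gb} GB, average per sequence: {per_sequence_gb} GB")
```

The sequence count stays at 10 in every sample while the total size keeps rising, which suggests that a limit on the number of entries alone does not bound memory use here.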

Then the process aborted:

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
zsh: abort      mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit

Expected Behavior

The server should avoid aborting the process when memory pressure is high. Ideally it would do at least one of the following:

  • enforce a safer default prompt cache memory limit,
  • evict prompt cache entries before the Metal command buffer fails,
  • return an HTTP error for requests that cannot be served within available memory,
  • or catch/report the OOM in a controlled way instead of terminating the process.
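As an illustration of the first three options, here is a minimal sketch of a cache that evicts by total bytes rather than by entry count. It is not mlx-lm's implementation: the class `ByteBoundedLRU` and its methods are hypothetical, and the caller is assumed to know the size of each entry in bytes.

```python
from collections import OrderedDict


class ByteBoundedLRU:
    """Hypothetical LRU cache bounded by total bytes instead of entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._entries = OrderedDict()  # key -> (value, nbytes)

    def get(self, key):
        """Return the cached value and mark it as most recently used."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key][0]

    def put(self, key, value, nbytes):
        """Insert an entry, evicting least recently used ones until it fits."""
        if nbytes > self.max_bytes:
            # A server could map this to an HTTP error instead of crashing.
            raise MemoryError(f"entry of {nbytes} bytes exceeds the cache budget")
        if key in self._entries:
            # Replacing an entry: release its old size first.
            self.total_bytes -= self._entries.pop(key)[1]
        while self._entries and self.total_bytes + nbytes > self.max_bytes:
            _, (_, evicted_bytes) = self._entries.popitem(last=False)
            self.total_bytes -= evicted_bytes
        self._entries[key] = (value, nbytes)
        self.total_bytes += nbytes
```

With this shape, eviction happens before a new entry is admitted, and an entry that can never fit is rejected with an exception the request handler can turn into an error response.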

It would also be helpful if a BrokenPipeError raised by a disconnected client during keepalive or final SSE writes were handled quietly, or triggered cleanup of the associated request and cache state.
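One possible shape for that handling is sketched below. The helper `safe_sse_write` and its `on_disconnect` callback are hypothetical names, not part of mlx-lm; the sketch only assumes the response stream is a file-like object with `write` and `flush`, as `wfile` is in Python's `http.server`.

```python
import io


def safe_sse_write(wfile, payload, on_disconnect):
    """Write one SSE chunk.

    Returns True on success. If the client has gone away, runs the
    cleanup callback and returns False instead of raising.
    """
    try:
        wfile.write(payload)
        wfile.flush()
        return True
    except (BrokenPipeError, ConnectionResetError):
        # The client disconnected: release per-request state quietly.
        on_disconnect()
        return False


# Usage with an in-memory stream standing in for the socket file.
stream = io.BytesIO()
ok = safe_sse_write(stream, b"data: hello\n\n", on_disconnect=lambda: None)
```

The generation loop could stop as soon as the helper returns False, so an abandoned request does not keep extending its cache entry.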

Notes

The help text shows cache limiting options exist:

--prompt-cache-size PROMPT_CACHE_SIZE
--prompt-cache-bytes PROMPT_CACHE_BYTES

No explicit cache limit was passed in either run, so the crashes reflect the default behavior.
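As a possible mitigation to try, the command can be rebuilt with an explicit byte limit. This sketch makes two assumptions that are not confirmed by the help text above: that --prompt-cache-bytes takes a plain integer number of bytes, and that 8 GiB is a sensible budget for this machine (the value is arbitrary and chosen only for illustration).

```python
import shlex

GIB = 1024 ** 3


def server_command(model, cache_bytes):
    """Build the mlx_lm.server argument list with an explicit cache byte limit.

    Assumes --prompt-cache-bytes accepts an integer byte count.
    """
    return [
        "mlx_lm.server",
        "--model", model,
        "--prompt-cache-bytes", str(cache_bytes),
    ]


# 8 GiB is an illustrative budget, not a recommended value.
cmd = server_command("mlx-community/Qwen3.5-4B-MLX-8bit", 8 * GIB)
print(shlex.join(cmd))
```

Whether this prevents the abort has not been tested here; it would only show whether the default limit is the cause.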
