`mlx_lm.server` aborts with Metal OOM after prompt cache grows to ~23-26 GB

## Summary

`mlx_lm.server` crashed twice while serving OpenAI-compatible chat completion requests for `mlx-community/Qwen3.5-4B-MLX-8bit`.

The immediate crash is an uncaught Metal out-of-memory exception:

```text
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
zsh: abort      mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit
```

Before each abort, the server logs show the prompt cache growing substantially:

- First run: prompt cache reached `23.35 GB`, then aborted shortly after.
- Second run: prompt cache reached `26.28 GB`, then aborted shortly after.

There were also repeated `BrokenPipeError: [Errno 32] Broken pipe` traces when clients disconnected during streaming/progress callbacks. Those did not kill the server by themselves, but may be relevant because interrupted requests appear to leave cache state behind.

## Environment

- macOS: 27.0, build `26A5353q`
- Architecture: `arm64`
- RAM: 48 GB (`51539607552` bytes)
- Python: 3.14.6
- `mlx-lm`: 0.31.3
- `mlx`: 0.31.2
- `huggingface_hub`: 1.18.0
- Model: `mlx-community/Qwen3.5-4B-MLX-8bit`

## Command

```bash
mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit
```

The server started on the default host/port:

```text
Starting httpd at 127.0.0.1 on port 8080...
```

## Observed Behavior

The server accepted multiple `POST /v1/chat/completions` requests. Some requests used very large prompts, for example:

```text
Prompt processing progress: 2048/55817
...
Prompt processing progress: 55817/55817
```

and after restart:

```text
Prompt processing progress: 2048/64523
...
Prompt processing progress: 22765/64523
BrokenPipeError: [Errno 32] Broken pipe
```

The prompt cache then grew over time:

```text
Prompt Cache: 10 sequences, 20.50 GB
Prompt Cache: 10 sequences, 23.35 GB
```

and on the second run:

```text
Prompt Cache: 10 sequences, 20.11 GB
Prompt Cache: 10 sequences, 22.07 GB
Prompt Cache: 10 sequences, 23.53 GB
Prompt Cache: 10 sequences, 24.93 GB
Prompt Cache: 10 sequences, 25.69 GB
Prompt Cache: 10 sequences, 26.28 GB
```

Then the process aborted:

```text
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
zsh: abort      mlx_lm.server --model mlx-community/Qwen3.5-4B-MLX-8bit
```

## Expected Behavior

The server should avoid aborting the process when memory pressure is high. Ideally it would either:

- enforce a safer default prompt cache memory limit,
- evict prompt cache entries before the Metal command buffer fails,
- return an HTTP error for requests that cannot be served within available memory,
- or catch/report the OOM in a controlled way instead of terminating the process.

It would also be helpful if `BrokenPipeError` from disconnected clients during keepalive/final SSE writes were handled quietly or caused associated request/cache cleanup.

## Notes

The help text shows cache limiting options exist:

```text
--prompt-cache-size PROMPT_CACHE_SIZE
--prompt-cache-bytes PROMPT_CACHE_BYTES
```

No explicit cache limit was passed in this run, so this was default behavior.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`mlx_lm.server` aborts with Metal OOM after prompt cache grows to ~23-26 GB #1390

Summary

Environment

Command

Observed Behavior

Expected Behavior

Notes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

mlx_lm.server aborts with Metal OOM after prompt cache grows to ~23-26 GB #1390

Description

Summary

Environment

Command

Observed Behavior

Expected Behavior

Notes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`mlx_lm.server` aborts with Metal OOM after prompt cache grows to ~23-26 GB #1390