Faster, more reliable local inference on Apple Silicon, with evidence-backed client compatibility and operator-first serving workflows.
This project builds on waybarrios/vllm-mlx and pushes it toward a more practical Apple Silicon backend for real local workflows. It grew out of systematic model evaluation work that repeatedly surfaced inference-engine defects and runtime-policy issues as practical blockers for otherwise promising local models.
| Area | What you get | Why it matters |
|---|---|---|
| Performance | +50.25% token throughput vs upstream baseline | Measurably faster local serving on Apple Silicon |
| Client compatibility | Goose and Open WebUI evidence-backed; Jan and AnythingLLM queued next | Real tools connect today without guessing the backend contract |
| Tool reliability | Thinking-model tools: 6/9 -> 9/9. MLLM tools: 0/9 -> 9/9 | Agent workflows that previously failed now work |
| Model compatibility | Qwen3.5 4B / 9B validated in 4-bit and 8-bit | Current trending Qwen family runs on the fork today |
| Serving ergonomics | Validated profiles for text, deterministic, tools, JSON, and multimodal | Shorter path from clone to working backend |
| Runtime controls | Divergence monitoring, strict model id, frequency_penalty, enable_thinking | Better debugging and safer correctness-sensitive operation |
| Upstream leverage | Useful upstream fixes integrated; fork-side hardening ships faster | Faster iteration without cutting off upstream value |
Benchmarked on Apple Silicon using mlx-community/Qwen3-0.6B-8bit. Full configuration appears in Detailed Benchmarks And Validation.
Confirmed on the current repo checkout:
- `mlx-community/Qwen3.5-4B-4bit`
- `mlx-community/Qwen3.5-4B-8bit`
- `mlx-community/Qwen3.5-9B-4bit`
- `mlx-community/Qwen3.5-9B-8bit`
Validated behaviors on the Qwen3.5 family:
- server startup
- `/health`
- `/v1/models`
- text-only chat
- multimodal chat
- structured output / JSON-schema with safe default handling when `enable_thinking` is omitted (see the sketch after this list)
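For a concrete sense of the structured-output path, the sketch below lists the served models and then asks for a JSON-schema-constrained completion. It assumes a local server on port 8000 and the standard OpenAI `response_format` request shape; the schema and model id are illustrative, so swap in whatever you are actually serving.

```python
from openai import OpenAI

# Assumes a local vllm-mlx server on port 8000; adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Confirm which models the backend reports before pinning a model id.
print([m.id for m in client.models.list().data])

# JSON-schema-constrained request (standard OpenAI response_format shape).
# The schema below is illustrative, not a fork-specific default.
response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-4B-4bit",
    messages=[{"role": "user", "content": "Give me a city and its country."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
)
print(response.choices[0].message.content)
```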
Additional Stage 3 distilled Qwen3.5 runtime qualification (2026-03-11):
- all 9 models promoted after JSON-mode remediation follow-up
- full model list and status notes: Supported Models
Recommended first profile:
```bash
scripts/serve_profile.sh mllm-default mlx-community/Qwen3.5-4B-4bit
```

Install and serve in two commands:
```bash
# Install
uv tool install git+https://github.com/swaylenhayes/vllm-mlx.git

# Serve (single user, max throughput)
vllm-mlx serve mlx-community/Qwen3-4B-Instruct-2507-4bit --port 8000
```

Or with continuous batching for multiple users:

```bash
vllm-mlx serve mlx-community/Qwen3-4B-Instruct-2507-4bit --port 8000 --continuous-batching
```

For the full operator workflow, see the Fork Operator Guide.
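To exercise the continuous-batching path from the client side, a rough sketch that fires several requests concurrently is enough; this assumes the server started above on port 8000, and the prompt set and counts are purely illustrative.

```python
import asyncio

from openai import AsyncOpenAI

# Illustrative concurrency test against a --continuous-batching server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return response.choices[0].message.content


async def main() -> None:
    prompts = [f"Summarize the number {i} in one sentence." for i in range(8)]
    # With continuous batching the server interleaves these requests
    # instead of serving them strictly one at a time.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(prompt, "->", answer)


asyncio.run(main())
```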
For validated serving profiles with tuned defaults, use the launcher scripts from a git checkout:
```bash
# Daily text serving
scripts/serve_profile.sh text-default mlx-community/Qwen3-4B-Instruct-2507-4bit

# Connect Goose
scripts/serve_client_profile.sh goose-text mlx-community/Qwen3-4B-Instruct-2507-4bit

# Connect Open WebUI
scripts/serve_client_profile.sh open-webui-text mlx-community/Qwen3-4B-Instruct-2507-4bit
```

| Target | Status | Validated capabilities |
|---|---|---|
| Goose | evidence-backed | text, streaming, system prompt, tools, auth, strict model id |
| Open WebUI | evidence-backed | text, streaming, system prompt, multimodal, auth |
| Jan | queued next | checklist, corpus, and guide ready |
| AnythingLLM | queued next | checklist, corpus, and guide ready |
Open WebUI note:
- tool use remains `conditional` until the backend OpenAI tool-call request shape is independently captured
Details: Client Compatibility Guide · Client Settings Crosswalk
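Both evidence-backed clients lean on token streaming, so a quick way to sanity-check that path before wiring up a frontend is a plain OpenAI-style streaming request. This is a minimal sketch assuming the default localhost:8000 setup and the stock OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream a short completion and print tokens as they arrive, exercising the
# same SSE-style path that Goose and Open WebUI use.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```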
Reliability hardening. Clearer startup and runtime behavior, better defaults, explicit operator guidance, and deterministic serving paths for correctness-sensitive work.
Evidence-backed compatibility. Client validation against actual frontends and agent tools, not just raw endpoint claims. Each client status is backed by a documented test protocol.
Known-good serving paths. Profile and launcher helpers for common Mac-local workflows. Pick a profile, start the backend, connect a client.
Observability and control. Divergence monitoring with confidence intervals, strict model-id enforcement, request-level reasoning control, and frequency_penalty mapping.
Stronger tool and multimodal support. Tool-calling reliability improvements across thinking models and VLMs, metadata-based multimodal detection, validated Qwen3.5 family support, and proprietary format parser support (LiquidAI / WaveCut).
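A minimal tool-calling request against the backend uses the standard OpenAI `tools` shape; the weather function below is purely hypothetical and only the request structure matters. Serve a tool-capable model and substitute its id for `"default"` if strict model ids are enforced.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool definition; only the request shape is the point here.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# A tool-capable model should respond with a tool call rather than prose.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```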
Selective upstream sync. Upstream fixes are integrated where they help. Missing features and hardening ship faster on the fork side.
Benchmark configuration: mlx-community/Qwen3-0.6B-8bit, 10 prompts, max_tokens=64, max_num_seqs=32, prefill_batch_size=8, completion_batch_size=16. Measured 2026-02-24.
| Snapshot | Commit | Total time (s) | Prompts/s | Tokens/s | Throughput (tok/s) |
|---|---|---|---|---|---|
| Upstream baseline | 1fd1c9a | 1.94 | 5.16 | 330.11 | 366.22 |
| Early fork hardening | a00ec35 | 1.31 | 7.62 | 487.72 | 541.06 |
| Current published fork | 26b143b | 1.29 | 7.75 | 496.00 | 550.25 |
| Comparison | Total time | Prompts/s | Tokens/s | Throughput |
|---|---|---|---|---|
| Early fork hardening vs upstream | -32.47% | +47.67% | +47.74% | +47.74% |
| Current published fork vs upstream | -33.51% | +50.19% | +50.25% | +50.25% |
| Current published fork vs early hardening | -1.53% | +1.71% | +1.70% | +1.70% |
Repeated-run protocol: 5 runs, 95% CI, concurrency=2, max_tokens=32.
| Model | Token agreement (mean, 95% CI) | Exact match (mean) | Verdict |
|---|---|---|---|
| Qwen3-4B-Instruct-2507-4bit | 97.86% (97.86-97.86) | 80.00% | Passes >=95% token gate |
| ZwZ-8B-VL-MLX-4bit | 68.65% (65.82-71.49) | 34.00% | Severe divergence |
| Qwen3-VL-30B-A3B-Instruct-4bit | 63.85% (57.20-70.50) | 32.00% | Severe divergence |
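For reference, the two metrics in the table can be read roughly as follows. This is only an illustration of the definitions, not the fork's monitoring code, and it assumes a token-level comparison between a batched and a serialized decoding of the same prompt.

```python
# Illustrative metric definitions (not the fork's implementation).
def token_agreement(batched: list[str], serial: list[str]) -> float:
    """Fraction of positions where both decodings emit the same token,
    taken over the length of the shorter sequence."""
    n = min(len(batched), len(serial))
    if n == 0:
        return 1.0
    return sum(b == s for b, s in zip(batched, serial)) / n


def exact_match(batched: list[str], serial: list[str]) -> bool:
    """True only when the two decodings are identical token for token."""
    return batched == serial


# One differing token out of four -> 75% agreement, no exact match.
print(token_agreement(["The", "sky", "is", "blue"], ["The", "sky", "is", "clear"]))
print(exact_match(["The", "sky", "is", "blue"], ["The", "sky", "is", "clear"]))
```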
Recommended runtime policy:
- Default: `--batch-divergence-threshold 0.95 --batch-divergence-action warn`
- Correctness-sensitive VLM workloads: `--batch-divergence-action serialize`
Workload: 10 prompts, max_tokens=64, concurrency=10 (Qwen3-4B-Instruct-2507-4bit).
| Profile | Total time (s) | Prompts/s | Tokens/s (completion) |
|---|---|---|---|
| Default | 6.95 | 1.44 | 92.04 |
| Deterministic (`--deterministic`) | 7.16 | 1.40 | 89.44 |
The deterministic path costs only 2.82% throughput relative to the default profile, a small price for reproducible output in debugging and correctness-sensitive workflows.
| Area | Before | After |
|---|---|---|
| Thinking-model tool reliability | 6/9 ceiling | 9/9 in validated set |
| MLLM tool-calling | 0/9 on validated VLMs | 9/9 on validated VLMs |
| LiquidAI/WaveCut parsing | Unparsed proprietary format | Parser aliases liquidai / liquid / lfm |
| Decode controls | No frequency control | frequency_penalty mapped to repetition penalty |
| Model ID policy | Passthrough only | Optional strict enforcement via --strict-model-id |
| Reasoning control | Profile-level only | Request-level enable_thinking |
| Multimodal detection | Repo-name heuristics | Metadata-based detection |
| Client launch ergonomics | Manual flag mapping | Published profile and client launchers |
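Client-side, several of these controls surface through ordinary OpenAI-style requests. The sketch below uses `frequency_penalty` (a standard field that the fork maps onto a repetition penalty) and passes `enable_thinking` through `extra_body`; the exact request-level field name and placement for `enable_thinking` is an assumption here, so check the Server Guide / CLI Reference for the authoritative shape. With `--strict-model-id` enabled on the server, `model` must be the exact served id rather than a placeholder.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    # With --strict-model-id, this must match the served model id exactly.
    model="mlx-community/Qwen3.5-4B-4bit",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    # Standard OpenAI field; the fork maps it onto a repetition penalty.
    frequency_penalty=0.3,
    # ASSUMPTION: request-level reasoning control passed as a vendor extension;
    # the actual field name and placement may differ (see the server docs).
    extra_body={"enable_thinking": False},
)
print(response.choices[0].message.content)
```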
Upstream fixes are integrated selectively. The latest upstream sync includes MLLM serialization hardening (exclude_none=True), tool-call argument coercion, UTF-8 streaming decode, and MLLM prefill override clarity. Full validation snapshots (1000+ tests passing) are recorded in the improvement log.
Operator: Fork Operator Guide · Known-Good Model/Profile Matrix · Client Compatibility · Client Settings Crosswalk
Evidence: Fork Benefits · Improvement Log · Phase artifacts: benchmarks/phase-results/
Platform: Installation · Quick Start · Server Guide · Anthropic Messages API · Multimodal · Audio · Embeddings · Reasoning Models · MCP & Tool Calling · Continuous Batching · CLI Reference · Supported Models · Configuration · LLM Benchmarks · Image Benchmarks · Video Benchmarks · Audio Benchmarks
vLLM-MLX brings native Apple Silicon GPU acceleration to vLLM-style inference by integrating MLX, mlx-lm, mlx-vlm, mlx-audio, and mlx-embeddings. It supports text, image, video, and audio inference with OpenAI and Anthropic API compatibility, continuous batching, paged KV cache, MCP tool calling, reasoning model support, and native TTS in 10+ languages.
For full upstream platform documentation, see the original project: waybarrios/vllm-mlx.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.content[0].text)
```

```bash
# Claude Code
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
```

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             vLLM API Layer                              │
│                     (OpenAI + Anthropic compatible)                     │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               MLXPlatform                               │
│                 (vLLM platform plugin for Apple Silicon)                 │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
          ┌─────────────────┬────────┴────────┬─────────────────┐
          ▼                 ▼                 ▼                 ▼
  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
  │    mlx-lm     │ │    mlx-vlm    │ │   mlx-audio   │ │mlx-embeddings │
  │(LLM inference)│ │ (Vision+LLM)  │ │  (TTS + STT)  │ │ (Embeddings)  │
  └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                   MLX                                   │
│                  (Apple ML Framework - Metal kernels)                   │
└─────────────────────────────────────────────────────────────────────────┘
```
Contributions welcome. See the Contributing Guide for details.
Areas where help is especially useful: performance benchmarks on different Apple Silicon chips, client compatibility validation (especially Jan and AnythingLLM), documentation improvements, and bug fixes.
Submit PRs to: github.com/swaylenhayes/vllm-mlx
Apache 2.0 — see LICENSE for details.
If you use this project in your research, please cite the original:
```bibtex
@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
```

- MLX — Apple's ML framework
- mlx-lm — LLM inference library
- mlx-vlm — Vision-language models
- mlx-audio — Text-to-Speech and Speech-to-Text
- mlx-embeddings — Text embeddings
- vLLM — High-throughput LLM serving
- waybarrios/vllm-mlx — Original project this work builds on