Local inference server for LLM, audio, and vision models on Apple Silicon.
FastAPI backend · React + Vite frontend · mlx-lm · mlx-whisper · mlx-vlm
Screenshots: LLM chat with sampler controls · Vision (image & video understanding) · Live inference metrics
- macOS on Apple Silicon (MLX requires Metal)
- Python 3.14
- Node 20+
- uv
```bash
bash scripts/setup.sh
```

This installs Python deps via uv and frontend deps via npm.
```bash
# Terminal 1 — backend
uv run uvicorn main:app --reload --port 8000 --app-dir backend

# Terminal 2 — frontend
cd frontend && npm run dev
```

- UI: http://localhost:3000
- API docs: http://localhost:8000/docs
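Once both are running, you can confirm the backend is up via the health endpoints (the health router exposes /health/, /health/ready, and /health/info; see the backend layout below). A minimal Python sketch, assuming the default port and JSON responses:

```python
import requests

BASE = "http://localhost:8000"

# Liveness and readiness checks served by the health router
print(requests.get(f"{BASE}/health/").json())
print(requests.get(f"{BASE}/health/ready").json())
```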
Load a model from mlx-community on HuggingFace:
```bash
curl -X POST http://localhost:8000/llm/load \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit"}'
```

Stream a chat response:
```bash
curl -X POST http://localhost:8000/llm/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model_key": "llm::mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
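The chat endpoint streams tokens over Server-Sent Events (llm_service.py handles the SSE streaming; see the backend layout below). A rough Python client sketch; the exact event payload format is an assumption here, so adjust the parsing to whatever the server actually sends:

```python
import requests

payload = {
    "model_key": "llm::mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

# Stream the response and print SSE `data:` lines as they arrive
with requests.post("http://localhost:8000/llm/chat", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue  # skip blank keep-alive lines and non-data SSE fields
        print(line[len("data:"):].strip())  # each event may be JSON or plain text
```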
Load a Whisper model:

```bash
curl -X POST http://localhost:8000/audio/load \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "mlx-community/whisper-small-mlx"}'
```

Transcribe an audio file:
```bash
curl -X POST http://localhost:8000/audio/transcribe \
  -F 'model_key=audio::mlx-community/whisper-small-mlx' \
  -F 'file=@recording.mp3'
```
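The same call from Python as a multipart upload, a sketch mirroring the curl command above:

```python
import requests

# Multipart form upload: model_key as a form field, the audio as a file part
with open("recording.mp3", "rb") as f:
    r = requests.post(
        "http://localhost:8000/audio/transcribe",
        data={"model_key": "audio::mlx-community/whisper-small-mlx"},
        files={"file": ("recording.mp3", f, "audio/mpeg")},
    )
r.raise_for_status()
print(r.json())  # assumes the endpoint returns the transcript as JSON
```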
Load a vision model:

```bash
curl -X POST http://localhost:8000/video/load \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "mlx-community/SmolVLM-256M-Instruct-bf16"}'
```

Describe an image or video:
```bash
curl -X POST http://localhost:8000/video/generate \
  -F 'model_key=video::mlx-community/SmolVLM-256M-Instruct-bf16' \
  -F 'prompt=Describe this in detail.' \
  -F 'file=@photo.jpg'
```

```bash
# List all loaded models
curl http://localhost:8000/models/

# Unload a model
curl -X DELETE http://localhost:8000/models/llm::mlx-community/Llama-3.2-1B-Instruct-4bit
```
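Model keys follow a `<kind>::<model_id>` pattern (`llm::`, `audio::`, `video::`), as in the examples above. The same list/unload calls from Python, a sketch assuming JSON responses:

```python
import requests

BASE = "http://localhost:8000"

# List every loaded model across llm / audio / video
print(requests.get(f"{BASE}/models/").json())

# Unload one model by its key (kind::model_id)
key = "llm::mlx-community/Llama-3.2-1B-Instruct-4bit"
requests.delete(f"{BASE}/models/{key}").raise_for_status()
```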
```
backend/
  main.py                 FastAPI app and lifespan
  config.py               pydantic-settings, .env support
  routers/
    llm.py                /llm/load, /llm/load-stream, /llm/chat, /llm/metrics
    audio.py              /audio/load, /audio/load-stream, /audio/transcribe
    video.py              /video/load, /video/load-stream, /video/generate
    models.py             /models/ (list), DELETE /models/{key}
    health.py             /health/, /health/ready, /health/info
  services/
    model_registry.py     Loaded-model cache with streaming download progress
    llm_service.py        mlx-lm wrapper with chat templating and SSE streaming
    audio_service.py      mlx-whisper wrapper
    vision_service.py     mlx-vlm wrapper
    mlx_runtime.py        Single-threaded MLX executor (Metal GPU stream safety)
    load_progress.py      tqdm → SSE bridge for download progress
    metrics.py            In-process ring buffer for inference metrics
  utils/
    logger.py
frontend/
  src/
    App.tsx               Router and nav
    api/                  axios client and SSE helper
    store/                Zustand state (model selection, sampler, chat history)
    components/           Sidebar nav
    pages/                Chat, Models, Metrics views
```
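The inference metrics behind the frontend's Metrics view are kept in the in-process ring buffer (metrics.py) and served over /llm/metrics, so they can also be polled directly. A minimal sketch; the response shape is not documented here, so it is simply printed as-is:

```python
import requests

# Recent inference metrics kept by the backend's ring buffer
print(requests.get("http://localhost:8000/llm/metrics").json())
```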
All settings are optional and can be set in a .env file at the project root:
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | — | HuggingFace token for gated models |
| `UPLOAD_DIR` | `/tmp/mlx_uploads` | Temp dir for uploaded audio/video files |
| `MAX_UPLOAD_MB` | `500` | Max upload size (MB) |
| `MLX_MAX_TOKENS` | `2048` | Default max tokens for generation |
| `MLX_TEMPERATURE` | `0.7` | Default temperature |
| `MLX_TOP_P` | `0.9` | Default top-p |
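For example, a .env at the project root could look like this (values shown are just the defaults from the table; the HF_TOKEN value is a placeholder):

```env
HF_TOKEN=hf_xxxxxxxxxxxx
UPLOAD_DIR=/tmp/mlx_uploads
MAX_UPLOAD_MB=500
MLX_MAX_TOKENS=2048
MLX_TEMPERATURE=0.7
MLX_TOP_P=0.9
```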
Contributions are welcome. Please open an issue before submitting a pull request for non-trivial changes so we can align on the approach.
- Keep changes focused — one thing per PR.
- All code must run locally on Apple Silicon without any external services.
MIT — see LICENSE.


