MLX Workbench

Local inference server for LLM, audio, and vision models on Apple Silicon.

FastAPI backend · React + Vite frontend · mlx-lm · mlx-whisper · mlx-vlm

License: MIT

Screenshots: LLM chat with sampler controls · Vision (image & video understanding) · Live inference metrics

Requirements

  • macOS on Apple Silicon (MLX requires Metal)
  • Python 3.14
  • Node 20+
  • uv

Setup

bash scripts/setup.sh

This installs Python deps via uv and frontend deps via npm.

Run

# Terminal 1 — backend
uv run uvicorn main:app --reload --port 8000 --app-dir backend

# Terminal 2 — frontend
cd frontend && npm run dev

Usage

LLM (text generation)

Load a model from mlx-community on HuggingFace:

curl -X POST http://localhost:8000/llm/load \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit"}'

Stream a chat response:

curl -X POST http://localhost:8000/llm/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model_key": "llm::mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
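
The chat endpoint streams tokens as server-sent events (see llm_service.py under Layout). A minimal Python sketch for consuming the stream, assuming the events arrive as standard "data: ..." lines; the exact payload schema is whatever the backend emits:

# Hedged sketch: read the /llm/chat SSE stream with requests
import requests

resp = requests.post(
    "http://localhost:8000/llm/chat",
    json={
        "model_key": "llm::mlx-community/Llama-3.2-1B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    # SSE events are emitted as lines of the form "data: <payload>"
    if line.startswith(b"data: "):
        print(line[len(b"data: "):].decode())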

Audio (transcription)

Load a Whisper model:

curl -X POST http://localhost:8000/audio/load \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "mlx-community/whisper-small-mlx"}'

Transcribe an audio file:

curl -X POST http://localhost:8000/audio/transcribe \
  -F 'model_key=audio::mlx-community/whisper-small-mlx' \
  -F 'file=@recording.mp3'
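
The same request from Python, as a sketch that mirrors the multipart form fields in the curl call above:

# Hedged sketch: upload an audio file for transcription
import requests

with open("recording.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/audio/transcribe",
        data={"model_key": "audio::mlx-community/whisper-small-mlx"},
        files={"file": ("recording.mp3", f, "audio/mpeg")},
    )
print(resp.json())  # response shape is defined by audio_service.py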

Vision (image / video understanding)

Load a vision model:

curl -X POST http://localhost:8000/video/load \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "mlx-community/SmolVLM-256M-Instruct-bf16"}'

Describe an image or video:

curl -X POST http://localhost:8000/video/generate \
  -F 'model_key=video::mlx-community/SmolVLM-256M-Instruct-bf16' \
  -F 'prompt=Describe this in detail.' \
  -F 'file=@photo.jpg'
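
/video/generate takes the same multipart shape plus a prompt field. A hedged Python equivalent of the curl call above:

# Hedged sketch: send an image and a prompt to the vision endpoint
import requests

with open("photo.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/video/generate",
        data={
            "model_key": "video::mlx-community/SmolVLM-256M-Instruct-bf16",
            "prompt": "Describe this in detail.",
        },
        files={"file": ("photo.jpg", f, "image/jpeg")},
    )
print(resp.json())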

Model management

# List all loaded models
curl http://localhost:8000/models/

# Unload a model
curl -X DELETE http://localhost:8000/models/llm::mlx-community/Llama-3.2-1B-Instruct-4bit
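
The same operations from Python (a sketch; as in the curl example, the model key is passed verbatim in the path):

# Hedged sketch: list loaded models, then unload one
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/models/").json())

key = "llm::mlx-community/Llama-3.2-1B-Instruct-4bit"
requests.delete(f"{BASE}/models/{key}")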

Layout

backend/
  main.py               FastAPI app and lifespan
  config.py             pydantic-settings, .env support
  routers/
    llm.py              /llm/load, /llm/load-stream, /llm/chat, /llm/metrics
    audio.py            /audio/load, /audio/load-stream, /audio/transcribe
    video.py            /video/load, /video/load-stream, /video/generate
    models.py           /models/ (list), DELETE /models/{key}
    health.py           /health/, /health/ready, /health/info
  services/
    model_registry.py   Loaded-model cache with streaming download progress
    llm_service.py      mlx-lm wrapper with chat templating and SSE streaming
    audio_service.py    mlx-whisper wrapper
    vision_service.py   mlx-vlm wrapper
    mlx_runtime.py      Single-threaded MLX executor (Metal GPU stream safety)
    load_progress.py    tqdm → SSE bridge for download progress
    metrics.py          In-process ring buffer for inference metrics
  utils/
    logger.py
frontend/
  src/
    App.tsx             Router and nav
    api/                axios client and SSE helper
    store/              Zustand state (model selection, sampler, chat history)
    components/         Sidebar nav
    pages/              Chat, Models, Metrics views

Environment variables

All settings are optional and can be set in a .env file at the project root:

Variable         Default           Description
HF_TOKEN         (unset)           HuggingFace token for gated models
UPLOAD_DIR       /tmp/mlx_uploads  Temp dir for uploaded audio/video files
MAX_UPLOAD_MB    500               Max upload size in MB
MLX_MAX_TOKENS   2048              Default max tokens for generation
MLX_TEMPERATURE  0.7               Default temperature
MLX_TOP_P        0.9               Default top-p
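
For example, a .env at the project root might look like this (illustrative values):

HF_TOKEN=hf_xxxxxxxxxxxxxxxx
UPLOAD_DIR=/tmp/mlx_uploads
MAX_UPLOAD_MB=500
MLX_MAX_TOKENS=4096
MLX_TEMPERATURE=0.6
MLX_TOP_P=0.95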

Contributing

Contributions are welcome. Please open an issue before submitting a pull request for non-trivial changes so we can align on the approach.

  • Keep changes focused — one thing per PR.
  • All code must run locally on Apple Silicon without any external services.

License

MIT — see LICENSE.
