GitHub - LLMSystems/vLLMux: One-Stop LLM Model Management and Monitoring Platform

One-stop platform to deploy, route, monitor & evaluate your vLLM cluster

vLLMux is a self-hosted control plane for serving many LLMs on vLLM. Paste a vllm serve … command and it becomes a routable model; the router load-balances across instances; and a bundled Prometheus + Grafana stack monitors everything — all behind one Vue dashboard.

Highlights

One router controls the whole fleet — a single OpenAI- & Anthropic-compatible origin fronts every model. Route by the model field across /v1/chat/completions, /v1/messages, /v1/embeddings, /v1/rerank, /v1/score, /tokenize and more; the router resolves the group and load-balances its instances, so clients never address an instance directly.
Add a model by pasting vllm serve … — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no config.yaml edits.
Lifecycle + self-healing — per-instance state machine (stopped → starting → ready → sleeping → failed), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff.
Autoscaling with a warm-standby tier — per group, keep min_ready replicas warm and scale up on queue depth (wake first, else cold-start) to max_ready; fold idle replicas back down ready → sleep → stop. vLLM sleep mode (level-1) frees a replica's VRAM but wakes in seconds, so scaling down needn't mean a minute-long cold start. Set it from config.yaml or the dashboard; a live Grafana dashboard + alerts are bundled.
Cross-model fallback — give a group a fallback chain so a request degrades to another compatible model when the whole group is down, instead of failing.
Pluggable routing strategies — pick the load-balancing policy per model group or globally: least_load (default), round_robin, random, least_inflight, p2c, plus session_affinity / prefix_affinity for cache reuse on multi-turn chat & shared prompts. Switch it live from the dashboard; transparent failover + per-backend cooldown apply to every strategy.
Cross-instance KV-cache sharing — toggle it per model group in the editor: instances offload/load KV blocks to a shared store (vLLM OffloadingConnector) so a prefix computed on one replica is reused by another (verified ~99% external prefix-cache hit, 31% lower TTFT on a warmed prompt). Traffic & topology views show which groups pool KV vs keep their own.
Live observability — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats.
Bundled Grafana monitoring — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts.
Lifecycle alerting — discrete model events (crash, restart-budget exhausted, recovered) pushed to Slack / Discord / a generic webhook, with per-sink severity floors and per-model cooldown; configured via env or the admin Notifications page (one-click test). Complements Grafana's metric alerts.
Playground — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display.
Benchmark & evaluate — evalscope load tests (concurrency, arrival-rate, SLA auto-tune) plus 30+ accuracy datasets with LLM-as-judge.
Libraries — browse / pre-download HF model weights & datasets from the UI; tool-calling parser helper; LoRA support.
Multi-user & audit — role-based control (viewer/operator/admin) via named operator credentials, with a redacted audit log of every change; plus mint/revoke API keys with per-key usage attribution, rate limits and token quotas (total / daily / monthly). The env admin token and open local-dev still work unchanged.
Config versioning & backup — the dynamic-model overlay (where every runtime change lives) is snapshotted on each mutation; export it as a portable file, import to restore, and roll back to any past version with a side-by-side diff — all from the admin Config Versions page. config.yaml is never rewritten.

See docs/features.md for the full breakdown.

Quick start

Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in Docker Desktop).

cp deploy/.env.example deploy/.env   # set HF_TOKEN, which GPUs, the admin token
make up                              # build + start the whole stack
# open http://localhost:8884

make down stops it · make logs tails all services · make ps shows status.

curl http://localhost:8887/v1/models     # router: configured model groups
curl http://localhost:5000/api/models    # backend: lifecycle state of each instance
# http://localhost:8884/grafana          # dashboards + alerts

Every deploy/.env setting — host ports (FRONTEND_PORT/ROUTER_PORT/…), auth tokens, GPU selection, and cache locations — is documented inline in deploy/.env.example and tabulated in docs/deployment.md#environment-variables-deployenv. Full topology, the shared-netns rationale, volumes, and a manual run are in the same docs/deployment.md.

One endpoint for the whole fleet

Every model is reachable through a single OpenAI-compatible origin — the router on :8887 (or /v1 via the dashboard's nginx). Pick the model by its model field and the router load-balances across that group's instances:

Endpoint	Purpose
`POST /v1/chat/completions`	Chat — streaming supported; load-balanced across the model group
`POST /v1/completions`	Text completion
`POST /v1/messages`	Anthropic-compatible Messages API — streaming supported (Anthropic SSE)
`POST /v1/messages/count_tokens`	Count tokens for a Messages request (no generation)
`POST /v1/embeddings`	Embeddings — OpenAI-compatible (forwarded to the embedding server)
`POST /v1/rerank`	Reranking — Jina/Cohere-compatible (`query` + `documents` → sorted `results`)
`POST /v1/score`	Pairwise relevance scoring (`text_1` × `text_2`)
`POST /tokenize` · `POST /detokenize`	Token utilities — text ⇄ token ids (any model kind)
`GET /v1/models`	List configured model groups

curl http://localhost:8887/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "hi"}]}'

Request/response shapes and auth details are in docs/API.md.

Architecture

flowchart LR
    Client([Clients])
    FE["<b>frontend</b><br/>nginx · :8884<br/>single origin"]
    VLLM["<b>vLLM instances</b><br/>:800x"]
    GF["<b>grafana</b><br/>/grafana"]
    DCGM["dcgm-exporter<br/>:9400 · GPU"]
    NODE["node-exporter<br/>:9100 · host"]

    subgraph netns["shared network namespace"]
        BE["<b>backend</b> · :5000<br/>model lifecycle"]
        RT["<b>router</b> · :8887<br/>OpenAI-compatible LB"]
        PR["<b>prometheus</b> · :9090"]
    end

    Client --> FE
    FE -->|/api| BE
    FE -->|/v1| RT
    FE -->|/grafana| GF
    BE -->|launch on demand| VLLM
    RT -->|route + balance| VLLM
    PR -->|scrape /metrics| VLLM
    PR --> DCGM
    PR --> NODE
    GF -->|query| PR

The router only routes — the backend owns model lifecycle. The frontend, router, backend, and Grafana sit behind nginx on a single origin; backend, router, and Prometheus share one network namespace so the spawned vLLM instances are reachable on localhost.

Documentation

Topic
Deployment & topology	docs/deployment.md
Configuration (`config.yaml`)	docs/configuration.md
Features in depth	docs/features.md
Monitoring (Prometheus + Grafana)	docs/monitoring.md
HTTP API	docs/API.md

Requirements

NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.

Tip — running multiple instances on limited RAM. Each vLLM instance runs torch.compile + CUDA-graph capture on startup, which is heavy on system RAM (not VRAM). On a small box (e.g. WSL2 with ~8GB RAM), launching a second instance of the same model can exhaust RAM and thrash swap, leaving the new instance stuck in starting. Add --enforce-eager to the launch command to skip compilation: startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at a small inference-latency cost. RAM — not VRAM — is usually the bottleneck for multi-instance, so give WSL more memory (.wslconfig → memory=12GB, then wsl --shutdown) before scaling out.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 153 Commits
.claude		.claude
.github/workflows		.github/workflows
apps		apps
assets		assets
deploy		deploy
docs		docs
packages		packages
.dockerignore		.dockerignore
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
README_zh-CN.md		README_zh-CN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Highlights

Quick start

One endpoint for the whole fleet

Architecture

Documentation

Requirements

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Highlights

Quick start

One endpoint for the whole fleet

Architecture

Documentation

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages