An OpenAI-compatible reverse proxy that fans one endpoint out across many backends — local LLM servers (llama.cpp / llama-swap / vLLM / Ollama …), cloud APIs (together.ai, OpenAI, OpenRouter …), and ComfyUI image-generation servers. Callers see a single OpenAI endpoint; the gateway handles discovery, priority routing, failover, virtual aliases, per-backend concurrency, an optional multi-user layer, call parking, and a built-in management console.
It sits between OpenAI-compatible clients (N8N, LibreChat, Open WebUI, LangChain code, image clients like anima-verse, …) and a fleet of backends.
- Why
- Quick start
- Configuration: backends, aliases, knobs
- Authentication & multi-user: keys, allow-lists, quotas
- Routing: priority, prefixing, `local`, concurrency, parking
- Call parking: queue instead of `503` when busy
- Reasoning control: thinking on/off per request, per alias, per model×backend
- Media generation: ComfyUI image/video/audio, aliases, mapping, LoRA, jobs
- The `/ui` console
- Stats & routing dashboard
- Endpoint reference
- Try it
- Running & deploying
- One endpoint for many backends. Point your tools at one URL; add/remove backends without touching clients. Chat, embeddings, the Responses API, and image generation all go through the same gateway.
- Auto-discovery. Each backend's catalog is polled: `/v1/models` for LLMs, `/object_info` for ComfyUI (models + installed LoRAs). No manual registry.
- Strict priority routing + failover. `priority` is a first-class deployment ordering: alias `fast` routes to a local box first, a cloud provider as fallback. Exactly that, every time, no routing-strategy ceremony.
- Virtual aliases. `fast`, `vision`, `translator` map to different real model IDs per backend; an alias can even override a backend's priority for itself.
- Cloud-as-backend. A per-backend `api_key` wires in any OpenAI-compatible provider as just another prioritised backend.
- Multi-user. Optional per-user API keys with model/alias/backend allow-lists (which also filter what each key sees in `/v1/models`) and monthly cost quotas.
- Call parking (default). When every matching backend is busy, the call queues until one frees instead of returning `503`; no client change needed. Park time is per-alias; async is the standard Responses background mode.
- Media generation. ComfyUI workflows exposed as OpenAI image endpoints + a native job API (image, video, and audio outputs) with a convention-free node mapping, dynamic LoRAs, and LoRA-aware backend routing.
- Built-in console at `/ui`. Manage backends, aliases, workflow mappings, users, server settings; run a chat/media playground; watch jobs, stats, parked calls, and the live routing map. Server-rendered, zero JS framework.
- Hot config reload. `config.yaml` changes apply live; most management also lives in a writable store edited from the console.
```bash
git clone https://github.com/KaletoAI/llm-gateway.git
cd llm-gateway
python3 -m venv venv && venv/bin/pip install -r requirements.txt
cp config.example.yaml config.yaml
$EDITOR config.yaml                                    # set backends + api_key
venv/bin/uvicorn main:app --host 0.0.0.0 --port 4000   # add --reload for dev
```

Point any OpenAI-compatible client at `http://<host>:4000/v1` with the `api_key`
you set. Open `http://<host>:4000/ui` for the management console.

`requirements.txt` omits `watchfiles`; it ships transitively with `uvicorn[standard]` and powers the hot-reload of `config.yaml`. Keep that extra.
config.example.yaml is the documented template. Copy to config.yaml
(gitignored, hot-reloaded on save). Two things read only at startup:
stats.enabled and the jobs/stats DB paths.
Config vs. store. config.yaml is the bootstrap source. Once the console is
used, most state (backends, chat aliases, generation aliases + mappings, users,
server settings) lives in a writable SQLite store (store.db) which then
becomes the source of truth. The store is seeded once from config and merged
over it; you can run almost entirely from config or almost entirely from the
console — both work.
```yaml
api_key: "sk-change-me"            # master/admin key (see Authentication)
health_check_interval: 30          # seconds between backend liveness polls
log_per_call: true                 # one log line per forwarded request
model_prefix: true                 # list models as <backend>/<model>
# max_concurrent: 1                # global default in-flight cap per backend

backends:
  - name: local-gpu
    url: http://192.168.1.10:8080  # llama-swap / llama.cpp / vLLM / …
    priority: 1                    # 1 = preferred
    max_concurrent: 1              # single-slot llama.cpp → one at a time
    # local: true                  # ALSO list its models bare (see below)
  - name: together                 # cloud fallback (OpenAI-compatible)
    url: https://api.together.xyz
    priority: 99
    api_key: "tgp_v1_…"            # sent as Bearer to this backend
    chat_only: true                # drop non-chat models at discovery
    serverless_only: true          # drop dedicated-endpoint-only models

virtual_models:                    # chat aliases
  "translator": "Aya-Expanse-8B"   # same model on every backend
  "fast":                          # per-backend mapping
    local-gpu: "Qwen3.5-9B"
    together: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
  "cheap":                         # per-alias priority override
    local-cpu: { model: "gemma-3-9b-it", priority: 1 }
    local-gpu: { model: "Qwen3.5-9B", priority: 2 }
```

A backend's value under an alias is normally just the model name; make it
`{model, priority}` to override that backend's priority for this alias only.
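The two accepted shapes of a per-backend alias entry can be normalised as in the minimal Python sketch below. The helper name `resolve_alias_entry` and its tuple return value are illustrative assumptions, not the gateway's actual code.

```python
from typing import Union

# Hypothetical helper; the gateway's real internals are not shown in this README.
AliasEntry = Union[str, dict]

def resolve_alias_entry(entry: AliasEntry, backend_priority: int) -> tuple:
    """Return (model, effective_priority) for one backend under one alias."""
    if isinstance(entry, str):
        # Plain model name: the backend keeps its own priority.
        return entry, backend_priority
    # {model, priority} form: the alias overrides the backend priority.
    return entry["model"], int(entry.get("priority", backend_priority))

print(resolve_alias_entry("Qwen3.5-9B", 1))
print(resolve_alias_entry({"model": "gemma-3-9b-it", "priority": 1}, 5))
```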
Two layers, both optional:
- Master key: the top-level `api_key`. Clients send `Authorization: Bearer <key>`. Also unlocks the `/ui` console (sign in with an admin key). Leave empty to run fully open (bootstrap mode).
- Per-user keys: created in the Users tab. Each user has its own API key (generate one in the form, or paste your own), a role (`user`/`admin`), an enabled flag, an optional model allow-list, and an optional monthly cost quota. Calls are attributed to the user (stats source, job owner).
Bootstrap-open → locked. With no users and no master key, the gateway and console are fully open. Add an admin user (or set a master key) to lock it down.
A user's allow-list can contain any mix of:
| Entry kind | Grants |
|---|---|
| chat alias (`fast`) | that chat alias |
| image alias (`Qwen`) | that generation alias |
| backend name (`together`) | all of that backend's models |
| model id (`together/llama-3…` or bare) | that specific model |
An empty allow-list = everything allowed (the default). A non-empty list both
restricts usage (a disallowed model → 403) and filters /v1/models so
the key only sees what it's allowed. This is how you point an image client at the
gateway and have it see just the image aliases instead of the whole 400-model
catalog: give it a key whose allow-list is the image alias(es) (or the ComfyUI
backend), and GET /v1/models returns only those. ?type=image / ?type=chat
narrows by namespace too.
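The allow-list semantics described above (empty means everything, otherwise any matching entry grants access) can be sketched as follows. This is a simplified illustration: the function names and the `(backend, model)` catalog shape are assumptions, and the real gateway may resolve aliases differently.

```python
def is_allowed(allow_list, *, alias=None, backend=None, model=None):
    """Empty allow-list permits everything; otherwise any matching entry grants access."""
    if not allow_list:
        return True
    candidates = {alias, backend, model}
    if backend and model:
        candidates.add(f"{backend}/{model}")   # prefixed model id
    candidates.discard(None)
    return any(entry in candidates for entry in allow_list)

def visible_models(allow_list, catalog):
    """Filter (backend, model) pairs the way a filtered /v1/models listing might."""
    return [f"{b}/{m}" for b, m in catalog if is_allowed(allow_list, backend=b, model=m)]

catalog = [("together", "llama-3"), ("local-gpu", "Qwen3.5-9B")]   # made-up catalog
print(visible_models(["together"], catalog))
```

A request that `is_allowed` rejects would map to the `403` mentioned above.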
- `quota_req_day`: requests per day (in-memory counter) → `429` when exceeded.
- `quota_cost_month`: summed USD cost for the month (from the stats log) → blocked when exceeded. Needs stats enabled and priced backends; streaming calls count as `0` (no `usage` in stream chunks).
For each request the gateway walks backends in priority order (1 = first) and
takes the first that is (1) enabled, (2) healthy (last discovery poll ok), (3)
not busy (below its max_concurrent in-flight cap), (4) mapped for the alias
(or exposes the bare/real model), and (5) actually has the resolved model. If
that backend errors on the forward, the remaining matching backends are tried in
order. When every match is busy the call parks (queues) by default until a
backend frees, and only 503s if the park time runs out (below).
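The selection walk above can be sketched in a few lines of Python. The `Backend` dataclass, the `pick` function, and the status strings are hypothetical names chosen for this illustration; per-alias priority overrides and failover on forward errors are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    name: str
    priority: int
    enabled: bool = True
    healthy: bool = True
    max_concurrent: int = 0              # 0 = unlimited
    in_flight: int = 0
    models: set = field(default_factory=set)

    @property
    def busy(self):
        return self.max_concurrent > 0 and self.in_flight >= self.max_concurrent

def candidates(backends, alias_map, requested):
    """Yield (backend, model) in priority order for backends that could serve the request."""
    for b in sorted(backends, key=lambda b: b.priority):
        if not (b.enabled and b.healthy):
            continue
        mapping = alias_map.get(requested)
        # Alias: use this backend's mapped model; otherwise treat the name as a real model id.
        model = mapping.get(b.name) if mapping is not None else requested
        if model is None or model not in b.models:
            continue
        yield b, model

def pick(backends, alias_map, requested):
    """Dispatch to the first non-busy match, park if all matches are busy, else report none."""
    matches = list(candidates(backends, alias_map, requested))
    for b, model in matches:
        if not b.busy:
            return ("dispatch", b.name, model)
    return ("park", None, None) if matches else ("no_backend", None, None)
```

Keeping "all busy" and "no match" as separate outcomes mirrors the distinction the gateway draws between parking and a genuine `503`.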
With model_prefix: true (default), /v1/models lists every model as
<backend>/<model> so the provider is visible. Input is liberal: a prefixed id
routes to exactly that backend; a bare id or an alias routes by priority. Backend
names never collide with vendor prefixes (moonshotai/…), so the leading segment
disambiguates. model_prefix: false → legacy bare, de-duplicated listing.
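A minimal sketch of the "leading segment disambiguates" rule, assuming the gateway knows its set of backend names. The function name and the vendor-prefixed id used in the example are placeholders.

```python
def split_model_id(model_id, backend_names):
    """Return (backend, model); backend is None for bare or vendor-prefixed ids."""
    head, sep, rest = model_id.partition("/")
    if sep and head in backend_names:
        return head, rest            # prefixed id: route to exactly that backend
    return None, model_id            # bare id or alias: route by priority

names = {"together", "local-gpu"}
print(split_model_id("together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", names))
print(split_model_id("moonshotai/example-model", names))   # vendor prefix, not a backend
```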
local: true on a backend additionally lists its models bare (alongside
the prefixed id). A bare request then routes by priority across every local
backend that serves it — same failover/busy-spill as a virtual alias; shared ids
collapse to one entry. Independent of model_prefix.
A live per-backend in-flight counter; at/above the cap the backend is busy and
skipped, so the request spills to the next backend instead of overloading a slow
one. Match it to real parallelism (1 for llama.cpp --parallel 1; unset for a
cloud API). Missing/0 = unlimited. The counter is released when the response
completes — including when a streamed response finishes, not when headers are
sent. Busy state shows in /health and the Routing tab.
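The "released when the stream finishes, not when headers are sent" behaviour can be illustrated with an async generator whose `finally` block frees the slot. This is a sketch under assumptions: `InFlight` and `stream_through` are hypothetical names, and the upstream is faked.

```python
import asyncio

class InFlight:
    def __init__(self, max_concurrent=0):
        self.max_concurrent = max_concurrent   # 0 = unlimited
        self.count = 0

    @property
    def busy(self):
        return self.max_concurrent > 0 and self.count >= self.max_concurrent

async def stream_through(slot, upstream):
    """Hold the slot for the whole stream; release when the stream ends or is closed."""
    slot.count += 1
    try:
        async for chunk in upstream:
            yield chunk
    finally:
        slot.count -= 1

async def demo():
    async def fake_upstream():
        for part in ("a", "b"):
            yield part

    slot = InFlight(max_concurrent=1)
    seen = []
    async for chunk in stream_through(slot, fake_upstream()):
        seen.append((chunk, slot.busy))      # busy while chunks are still flowing
    return seen, slot.busy                   # free again once the stream is done

result = asyncio.run(demo())
print(result)
```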
| Flag | Effect (at discovery) |
|---|---|
| `chat_only` | keep only `type == "chat"` models (drops image/video/embedding). Understands Together's `type` and OpenRouter's `architecture.output_modalities`. Backends without those fields (llama-swap, vLLM) are unaffected, so don't set it on a backend whose embedding models you want routable. |
| `serverless_only` | keep only models with non-zero pricing (Together's dedicated-only models are `0/0`; on OpenRouter this also drops `:free`). |
Naming an alias the same as a real model id shadows that model. /health's
alias_model_conflicts and the Routing tab flag every collision, split into
covered (in the mapping → still routable) and shadowed (hosts the model
but isn't mapped → unreachable by that name) — the latter is the actionable case.
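The covered/shadowed split can be expressed as a small classification. The sketch below assumes a mapping of backend name to hosted model ids; the function name and return shape are illustrative and do not reflect the actual layout of `alias_model_conflicts` in `/health`.

```python
def alias_conflicts(alias, mapping, backend_models):
    """Split backends hosting a model named like the alias into covered vs shadowed."""
    covered, shadowed = [], []
    for backend, models in backend_models.items():
        if alias not in models:
            continue                      # no collision on this backend
        (covered if backend in mapping else shadowed).append(backend)
    return {"covered": covered, "shadowed": shadowed}

backend_models = {"local-gpu": {"fast", "Qwen"}, "together": {"fast"}, "cpu": {"x"}}
print(alias_conflicts("fast", {"local-gpu": "Qwen"}, backend_models))
```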
When all backends that map an alias are busy (at their in-flight cap), the
call is held in a FIFO queue until a mapping backend frees (then dispatched)
instead of returning 503. This is the default — no client field needed, so
callers stay plain-OpenAI. A standard request just sees a slightly slower 200,
or a 503 (with Retry-After) if the wait runs out.
- Park time is per-alias (`park_s` in the chat-alias editor, or config `alias_park`): blank = the global default (`park_timeout_s`, 60 s, Server tab), `0` = parking off for that alias (immediate `503` when busy). `max_parked` caps the queue.
- Fair: when a slot frees, the oldest waiting call whose alias can use that backend is dispatched first (no head-of-line blocking across aliases). Live queue is visible in the Parked calls panel on the Dashboard.
- On timeout the call leaves the queue with a `503` + `Retry-After`.
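The fairness rule (oldest waiting call whose alias can use the freed backend goes first) can be sketched with a deque. `ParkQueue`, its method names, and the default cap of 100 are assumptions for illustration; park timeouts are omitted.

```python
from collections import deque

class ParkQueue:
    def __init__(self, max_parked=100):       # cap value is an arbitrary example
        self.max_parked = max_parked
        self.waiting = deque()                 # oldest call on the left

    def park(self, call_id, alias):
        """Queue a call; False means the queue is full and the caller should answer 503."""
        if len(self.waiting) >= self.max_parked:
            return False
        self.waiting.append((call_id, alias))
        return True

    def on_slot_free(self, backend, alias_backends):
        """Pop the oldest waiting call whose alias maps to the freed backend."""
        for i, (call_id, alias) in enumerate(self.waiting):
            if backend in alias_backends.get(alias, ()):
                break
        else:
            return None                        # nobody waiting can use this backend
        del self.waiting[i]
        return call_id
```

Scanning past calls that cannot use the freed backend is what avoids head-of-line blocking across aliases.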
Applies to /v1/chat/completions, /v1/completions, /v1/embeddings,
/v1/responses, and the generation path (a busy ComfyUI backend queues the job
rather than 503-ing). The distinction "all busy" vs. "no backend at all" is
explicit — a genuine no-backend still 503s.
Async LLM requests don't use a custom field: they follow the official OpenAI
Responses background mode — POST /v1/responses with background:true
returns immediately with {id, status:"queued"}; poll GET /v1/responses/{id}
(→ in_progress/completed/failed/cancelled) and cancel via
POST /v1/responses/{id}/cancel. The background worker parks in the same queue.
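On the client side, the background flow reduces to a poll loop over the status values listed above. The sketch below keeps HTTP out of the picture: `fetch` is a caller-supplied function (an assumption of this example) that performs `GET /v1/responses/{id}` and returns the decoded JSON body.

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def poll_until_done(fetch, response_id, interval=1.0, timeout=300.0, sleep=time.sleep):
    """Call fetch(response_id) until the response reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch(response_id)
        if body.get("status") in TERMINAL:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"response {response_id} still {body.get('status')!r}")
        sleep(interval)

# Usage with a fake fetch that walks through queued → in_progress → completed.
states = iter([{"status": "queued"}, {"status": "in_progress"},
               {"status": "completed", "id": "r1"}])
final = poll_until_done(lambda rid: next(states), "r1", interval=0, sleep=lambda s: None)
print(final["status"])
```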
One normalized switch turns a thinking model's reasoning off or on, regardless of which mechanism the model actually needs. Clients send a single field, `reasoning`, set to `"off"`, `"on"`, or `"auto"`.
"auto" (or omitting the field) leaves the request untouched. The OpenAI
reasoning_effort field works as an alias (minimal → off, anything else → on),
and /v1/responses also accepts the reasoning: {effort} object shape.
Rules decide the mechanism. The switch is translated per (model × backend) by an ordered rule list (UI → Reasoning tab; stored, hot). The first enabled rule whose model glob matches the real model and whose backend set contains the serving backend wins; its adapter does the work:
| Adapter | What it does |
|---|---|
| `enable_thinking` | sets `chat_template_kwargs: {enable_thinking: bool}` (vLLM; llama.cpp with `--jinja`) |
| `reasoning_effort` | sets `reasoning_effort` (off → `minimal`, on → `high`; overridable per rule) |
| `nothink_token` | appends a token (default `/nothink`) to the last user message |
| `prefill` | appends a closed `<think>…</think>` assistant turn |
| `none` | no mechanism, reported as unsupported |
No matching rule → the request is forwarded unchanged and the control is
reported as unsupported — it never fails a call. What was actually applied
comes back in the x-reasoning-control response header (e.g. off:prefill,
on:noop, unsupported) and is logged per call in the LLM Calls tab.
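First-match rule resolution with model globs can be sketched as below. The rule dictionaries, the function name, and the exact header strings are assumptions made for this example; only two of the adapters are implemented, and the `nothink_token` and `prefill` adapters are left out.

```python
from fnmatch import fnmatchcase

RULES = [  # hypothetical rule list; order matters, first enabled match wins
    {"model": "Qwen3*", "backends": {"local-gpu"}, "adapter": "enable_thinking", "enabled": True},
    {"model": "*", "backends": {"together"}, "adapter": "reasoning_effort", "enabled": True},
]

def apply_reasoning(body, switch, model, backend, rules=RULES):
    """Return (new_body, header_value) for a requested switch of 'on', 'off' or 'auto'."""
    if switch == "auto":
        return body, None                        # untouched, nothing to report
    rule = next((r for r in rules
                 if r["enabled"] and fnmatchcase(model, r["model"])
                 and backend in r["backends"]), None)
    if rule is None or rule["adapter"] == "none":
        return body, "unsupported"               # forwarded unchanged, never fails
    out = dict(body)                             # do not mutate the caller's request
    if rule["adapter"] == "enable_thinking":
        out["chat_template_kwargs"] = {"enable_thinking": switch == "on"}
    elif rule["adapter"] == "reasoning_effort":
        out["reasoning_effort"] = "high" if switch == "on" else "minimal"
    else:
        return body, "unsupported"               # adapters not covered by this sketch
    return out, f"{switch}:{rule['adapter']}"

print(apply_reasoning({"model": "tool", "messages": []}, "off", "Qwen3.5-9B", "local-gpu"))
```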
Per-alias default. A chat alias can carry a reasoning default (chat-alias
editor → reasoning: auto|on|off), applied when the client sends nothing — so
tool (off) and tool-thinking (auto) can point at the same backend and
model. An explicit client reasoning field always wins.
Thinking output on /v1/responses. Models that stream their thinking in the
reasoning delta channel are translated to Responses-API reasoning events
(response.reasoning_summary_text.delta) and a reasoning output item —
output_text stays answer-only; clients that don't know reasoning events simply
ignore them.
A ComfyUI backend speaks a different protocol, so it declares type: comfyui.
Discovery is via /object_info (checkpoints/UNETs/VAEs and installed LoRAs);
dispatch submits a parametrised workflow, polls /history, and fetches whatever
artifacts it produced — image, video (e.g. SaveVideo), or audio — with each
artifact's kind/mime carried through the API and rendered in the console.
```yaml
backends:
  - name: gpu-3090
    type: comfyui
    url: http://192.168.1.20:8188
    priority: 1
    max_concurrent: 1        # one generation at a time on this GPU
    # poll_interval: 1.0     # seconds between /history polls
    # max_wait: 600          # hard cap for a single generation
    # read_timeout: 60       # per-HTTP-request read timeout (hung read → failover)
    # disconnect_grace: 30   # tolerated unreachability before failing over
```

A generation alias (the `model` of a generation request) maps to an ordered
list of candidate backends. Each candidate carries the workflow (a ComfyUI
API-format JSON) and a mapping that binds logical params to concrete
workflow nodes+fields:
```yaml
image_models:
  "flux":
    - backend: gpu-3090
      task: text2img
      workflow_json: { … }   # the ComfyUI API JSON (owned by the gateway)
      mapping:
        prompt: { node: "6", field: "text" }
        width: { node: "5", field: "width" }
        seed: { node: "3", field: "seed" }
      fixed:                 # pinned node values (models, switches, …)
        - { node: "4", field: "unet_name", value: "flux1-dev.safetensors" }
```

The mapping is convention-free: it works with any workflow regardless of node naming. (An auto-detect heuristic pre-fills it for templated workflows; the explicit mapping always wins.) In practice you author all this in the Mapping tab of the console rather than by hand: paste the ComfyUI API JSON, the gateway owns it, auto-suggests the mapping, and gives you discovery-fed dropdowns.
Key mapping concepts:
- Workflow + mapping are backend-independent (shared across an alias's candidates). Only Pinned values are per-backend (one tab per backend), so the same alias can use a different checkpoint on each GPU while looking identical from outside. A request param that targets a pinned node/field is ignored — a pin is authoritative; the API can't override it.
- Image input slots (a `LoadImage`/`LoadImageMask` node) become file-upload request fields. By default an unfilled slot gets an 8×8 placeholder; mark a slot `required` (Mapping checkbox) to leave it empty instead so ComfyUI errors clearly when a needed image/mask is missing (inpaint).
- Numeric fields (strength, steps, cfg) render with `min`/`max`/`step` pulled live from `/object_info`.
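How a mapping and pinned values might be applied to a ComfyUI API-format workflow (node id → `{class_type, inputs}`) is sketched below. The function name is hypothetical, and dropping unmapped params is a simplification of this example rather than documented gateway behaviour.

```python
import copy

def build_workflow(workflow, mapping, fixed, params):
    """Return a copy of the workflow with pinned values and request params applied."""
    wf = copy.deepcopy(workflow)
    pinned = set()
    for pin in fixed:
        wf[pin["node"]]["inputs"][pin["field"]] = pin["value"]
        pinned.add((pin["node"], pin["field"]))
    for name, value in params.items():
        target = mapping.get(name)
        if target is None:
            continue                               # unmapped param: skipped in this sketch
        if (target["node"], target["field"]) in pinned:
            continue                               # a pin is authoritative
        wf[target["node"]]["inputs"][target["field"]] = value
    return wf

workflow = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
    "4": {"class_type": "UNETLoader", "inputs": {"unet_name": "default.safetensors"}},
}
mapping = {"prompt": {"node": "6", "field": "text"},
           "model": {"node": "4", "field": "unet_name"}}
fixed = [{"node": "4", "field": "unet_name", "value": "flux1-dev.safetensors"}]
wf = build_workflow(workflow, mapping, fixed,
                    {"prompt": "a red apple", "model": "other.safetensors"})
print(wf["4"]["inputs"]["unet_name"])   # the pinned value survives the request param
```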
LoRAs are first-class:
- Pinned LoRA: a `fixed` binding on a LoRA-loader slot; the API can't change it.
- Dynamic LoRAs: the client sends `lora_1`, `lora_2`, … (+ optional `strength_N`). The gateway cascades them into the next free slots of the workflow's LoRA stack, never overwriting a pinned/occupied slot, so a client needn't know which slot is reserved.
- LoRA-aware routing: a backend that lacks a requested LoRA is dropped from the candidate set (decided over all candidates incl. busy, so the request parks for the backend that has it rather than spilling to one that doesn't). A LoRA installed on no backend is ignored (priority decides). An explicit `backend` pin is never overridden.
- `GET /v1/generations/{alias}/loras` returns the LoRA filenames valid for an alias (the union installed across its backends), for building a correct picker.
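The slot cascade and the LoRA-aware candidate filter can be sketched as two small functions. Both names and the data shapes (a list of slots with `None` for free, a dict of backend → installed LoRA set) are assumptions for illustration; strengths are omitted.

```python
def cascade_loras(slots, requested):
    """Fill free slots (None) in order with requested LoRAs; occupied slots stay untouched."""
    filled = list(slots)
    pending = list(requested)
    for i, current in enumerate(filled):
        if not pending:
            break
        if current is None:
            filled[i] = pending.pop(0)
    return filled, pending            # pending = LoRAs that found no free slot

def backends_with_loras(installed, requested):
    """Keep backends that have every requested LoRA; LoRAs installed nowhere are ignored."""
    known = set().union(*installed.values())
    needed = {name for name in requested if name in known}
    return [b for b, loras in installed.items() if needed <= loras]

print(cascade_loras(["pinned.safetensors", None, None], ["a.safetensors"]))
print(backends_with_loras({"gpu-1": {"a", "b"}, "gpu-2": {"b"}}, ["a"]))
```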
Every generation is a job: SQLite metadata + on-disk artifacts under
jobs/<id>/<n>.<ext> (image, video, or audio — the artifact's kind/mime flow
through the API and the console), lifecycle queued → running → done|failed,
retrievable by id until its TTL (default 24 h), then pruned. The job also keeps
its inputs (prompt, params, reference images) so it stays inspectable in the
Media Jobs tab. A running job can be cancelled (POST /v1/jobs/{id}/cancel or the ✕ button),
which interrupts the ComfyUI prompt to free the GPU. On a restart, any job left
running/queued is reconciled to failed.
- OpenAI Images API (for OpenAI image clients):
  - `POST /v1/images/generations`: JSON, text→image. Bonus: LocalAI-style `ref_images` (base64/URL list). Extra keys pass through as workflow params.
  - `POST /v1/images/edits`: multipart; `image` file(s) + the OpenAI `mask` field map positionally onto the workflow's image slots.
  - `response_format=url` (job-result URL, needs the Bearer key to fetch) or `b64_json` (inline). These are synchronous and block until the image is ready (and park if the backend is busy rather than `503`-ing).
- Native job API: `POST /v1/generations` with `{model, prompt, mode, params}`. `mode: "async"` returns `202 {job_id}`; poll `GET /v1/jobs/{id}` and fetch `GET /v1/jobs/{id}/result/{n}`. `mode: "sync"` blocks and returns inline.
See docs/anima-versa-integration.md for a full client-integration walkthrough.
A server-rendered console mounted at /ui (sign in with an admin key once
locked). Tabs:
| Tab | What |
|---|---|
| Dashboard | live per-backend status + in-flight, parked calls, media-job counts/recent, recent LLM calls |
| Server | runtime + restart-required settings (API key, caps, park time/queue, stats/jobs, TTL/prune) |
| Backends | add/edit/remove backends (LLM + ComfyUI) |
| Input | what clients can call — chat aliases, generation models, endpoints |
| Routing Overview | the live alias→backend map + collisions (searchable) |
| Mapping | register a ComfyUI workflow, wire its node mapping, pin values; chat-alias editor (per-alias park_s + reasoning default) |
| Reasoning | the normalized-thinking rule list (model glob × backend set → adapter) + test resolver |
| Chat Playground | send a chat completion — as a real API client through /v1/chat/completions (auth, routing, parking, stats all apply) |
| Media Playground | run a generation via POST /v1/generations (pick alias/backend, upload refs, set params) — image/video/audio |
| Media Jobs | list + detail of generation jobs (inputs + outputs, within TTL) |
| LLM Calls | per-call history with stored request/response bodies |
| Statistic | the call-stats dashboard (search, aggregates, drilldown) |
| Users | multi-user keys, allow-lists, quotas, IP aliases |
Opt-in SQLite call log, surfaced in the Statistic and Routing tabs of the
console (no separate port — the old standalone dashboard was folded into /ui).
Every call records timestamp, duration, backend, source, alias, model, endpoint,
HTTP status, tokens, and USD cost.
```yaml
stats:
  enabled: false       # read at startup only; toggling needs a restart
  db_path: stats.db
  retention_days: 0    # 0 = keep forever; else prune older rows hourly
```

- Cost comes from each backend's pricing (cached at discovery, normalised to USD/million tokens from Together's per-million and OpenRouter's per-token schemas). Local backends → 0.
- Source is the authenticated user, else the `X-Source` header, else client IP (IP aliases give those friendly names; reverse-DNS is auto-resolved).
- Streaming calls record real tokens when the backend honors `stream_options.include_usage` (requested automatically), else `0`.
- The applied reasoning control is logged per call (LLM Calls tab column).
- Recent calls store the full request/response body (large/binary bodies on disk), viewable per-call, pruned with the same retention.
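The price normalisation and per-call cost arithmetic can be sketched as follows. The unit labels, function names, and the split into separate input/output prices are assumptions of this example; the prices in the usage lines are made up.

```python
def usd_per_million(price, unit):
    """Normalise a price to USD per million tokens."""
    if unit == "per_million":
        return float(price)
    if unit == "per_token":
        return float(price) * 1_000_000
    raise ValueError(f"unknown unit: {unit}")

def call_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Prices are USD per million tokens; a local backend would pass 0 for both."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000

# Made-up prices: 0.2 USD/M input, 0.6 USD/M output.
print(call_cost(1000, 500, usd_per_million("0.0000002", "per_token"),
                usd_per_million(0.6, "per_million")))
```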
| Method | Path | Notes |
|---|---|---|
| `GET` | `/v1/models` | catalog filtered by the caller's allow-list; `?type=chat\|image` |
| `GET` | `/v1/models/{id}` | single-model lookup |
| `POST` | `/v1/chat/completions` | chat; priority + failover; streaming; parking; `reasoning: off\|on\|auto` |
| `POST` | `/v1/completions` | completions; same routing |
| `POST` | `/v1/embeddings` | embeddings; same routing |
| `POST` | `/v1/responses` | Responses API ↔ chat bridge; streaming; parking; `background:true` (async) |
| `GET` | `/v1/responses/{id}` | poll a background response (queued→…→completed/failed/cancelled) |
| `POST` | `/v1/responses/{id}/cancel` | cancel a background response |
| `POST` | `/v1/images/generations` | text→image (sync); may return a video/audio URL for such aliases |
| `POST` | `/v1/images/edits` | multipart image+mask edit (sync) |

| Method | Path | Notes |
|---|---|---|
| `POST` | `/v1/generations` | run a generation alias (sync or `mode:"async"`); per-field reference images via `images: {param: base64\|URL}` |
| `GET` | `/v1/generations/{alias}/loras` | LoRAs valid for an alias |
| `GET` | `/v1/jobs/{id}` | job status + results |
| `GET` | `/v1/jobs/{id}/result/{n}` | a result artifact (owner-gated) |
| `GET` | `/v1/jobs/{id}/input/{n}` | a stored reference image (owner-gated) |
| `POST` | `/v1/jobs/{id}/cancel` | cancel a queued/running job (interrupts ComfyUI) |

| Method | Path | Notes |
|---|---|---|
| `GET` | `/health` | per-backend health/model/priority + busy/inflight + conflicts |
| `*` | `/ui/**` | the management console |
Every proxied LLM response carries x-gateway-backend (which backend served
the call) and, when a reasoning switch was requested, x-reasoning-control
(what was actually applied).
Clients on LangChain.js (N8N's AI Agent, …) call /v1/responses; most backends
only speak /v1/chat/completions. The gateway translates request
(input/instructions/tools → messages/system/tool schema) and response
(choices[0].message → output[…], token field renames) transparently, and
routes through the same dispatch/parking path as chat. stream: true is
supported (chat SSE → Responses SSE events). background: true runs the
request asynchronously per the official OpenAI pattern: it returns immediately
with a queued response object; poll GET /v1/responses/{id} until a terminal
state and cancel via POST /v1/responses/{id}/cancel. The background worker
parks in the shared queue (longer async window) — so long-running or busy
requests never time out the client connection. Thinking-model output arrives as
Responses-API reasoning items/events (see Reasoning control).
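The request and response translation described above can be sketched for the simplest case: string or role/content `input`, no tools, no streaming. The function names are hypothetical and the real bridge covers far more (tool schemas, SSE events, reasoning items).

```python
def responses_to_chat(req):
    """Translate a minimal Responses-API request into a chat-completions request."""
    messages = []
    if req.get("instructions"):
        messages.append({"role": "system", "content": req["instructions"]})
    items = req.get("input", [])
    if isinstance(items, str):
        items = [{"role": "user", "content": items}]   # a bare string is one user turn
    for item in items:
        messages.append({"role": item["role"], "content": item["content"]})
    return {"model": req["model"], "messages": messages}

def chat_to_responses(chat_resp):
    """Wrap choices[0].message as one output item and rename the token fields."""
    msg = chat_resp["choices"][0]["message"]
    usage = chat_resp.get("usage", {})
    return {
        "object": "response",
        "status": "completed",
        "output": [{"type": "message", "role": "assistant",
                    "content": [{"type": "output_text", "text": msg.get("content") or ""}]}],
        "usage": {"input_tokens": usage.get("prompt_tokens", 0),
                  "output_tokens": usage.get("completion_tokens", 0)},
    }

chat = responses_to_chat({"model": "fast", "instructions": "Be brief.", "input": "hi"})
print(chat["messages"])
```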
```bash
KEY=sk-change-me ; B=http://localhost:4000

# List models (filtered by your key's allow-list)
curl $B/v1/models -H "Authorization: Bearer $KEY"

# Chat through an alias
curl $B/v1/chat/completions -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"fast","messages":[{"role":"user","content":"hi"}]}'

# Embeddings
curl $B/v1/embeddings -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"embedding","input":["hallo welt"]}'

# Text→image (sync, inline base64)
curl $B/v1/images/generations -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"flux","prompt":"a red apple","size":"1024x1024","response_format":"b64_json"}'

# LoRAs valid for an alias
curl $B/v1/generations/flux/loras -H "Authorization: Bearer $KEY"

# Backend health snapshot
curl $B/health
```

`llm-gateway.service` is an example systemd unit (assumes `/opt/llm-gateway` with
`venv/` next to `main.py`):

```bash
sudo install -m 0644 llm-gateway.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llm-gateway
journalctl -u llm-gateway -f
```

`deploy.sh` is an rsync-over-SSH helper (`DEPLOY_HOST=root@host ./deploy.sh`):
syncs code (excluding `config.yaml`, `venv/`), installs requirements in a remote
venv, syncs the systemd unit, restarts.

Secrets & data never to commit: `config.yaml`, `store.db` (+ `secret.key`; they travel together, keys encrypted at rest), `stats.db*`, `jobs.db*`, `jobs/`, `*.key`. All gitignored.
MIT — see LICENSE.
Example request body for the reasoning switch (see Reasoning control):

```jsonc
{ "model": "tool", "reasoning": "off", "messages": [...] }   // "off" | "on" | "auto"
```