llm-gateway

An OpenAI-compatible reverse proxy that fans one endpoint out across many backends — local LLM servers (llama.cpp / llama-swap / vLLM / Ollama …), cloud APIs (together.ai, OpenAI, OpenRouter …), and ComfyUI image-generation servers. Callers see a single OpenAI endpoint; the gateway handles discovery, priority routing, failover, virtual aliases, per-backend concurrency, an optional multi-user layer, call parking, and a built-in management console.

It sits between OpenAI-compatible clients (N8N, LibreChat, Open WebUI, LangChain code, image clients like anima-verse, …) and a fleet of backends.


Why

  • One endpoint for many backends. Point your tools at one URL; add/remove backends without touching clients. Chat, embeddings, the Responses API, and image generation all go through the same gateway.
  • Auto-discovery. Each backend's catalog is polled — /v1/models for LLMs, /object_info for ComfyUI (models + installed LoRAs). No manual registry.
  • Strict priority routing + failover. priority is a first-class deployment ordering: alias fast routes to a local box first, a cloud provider as fallback — exactly that, every time, no routing-strategy ceremony.
  • Virtual aliases. fast, vision, translator map to different real model IDs per backend; an alias can even override a backend's priority for itself.
  • Cloud-as-backend. A per-backend api_key wires in any OpenAI-compatible provider as just another prioritised backend.
  • Multi-user. Optional per-user API keys with model/alias/backend allow-lists (which also filter what each key sees in /v1/models) and monthly cost quotas.
  • Call parking (default). When every matching backend is busy, the call queues until one frees instead of returning 503 — no client change needed. Park time is per-alias; async is the standard Responses background mode.
  • Media generation. ComfyUI workflows exposed as OpenAI image endpoints + a native job API — image, video, and audio outputs — with a convention-free node mapping, dynamic LoRAs, and LoRA-aware backend routing.
  • Built-in console at /ui. Manage backends, aliases, workflow mappings, users, server settings; run a chat/media playground; watch jobs, stats, parked calls, and the live routing map. Server-rendered, zero JS framework.
  • Hot config reload. config.yaml changes apply live; most management also lives in a writable store edited from the console.

Quick start

git clone https://github.com/KaletoAI/llm-gateway.git
cd llm-gateway
python3 -m venv venv && venv/bin/pip install -r requirements.txt
cp config.example.yaml config.yaml
$EDITOR config.yaml                    # set backends + api_key
venv/bin/uvicorn main:app --host 0.0.0.0 --port 4000   # add --reload for dev

Point any OpenAI-compatible client at http://<host>:4000/v1 with the api_key you set. Open http://<host>:4000/ui for the management console.

requirements.txt omits watchfiles; it ships transitively with uvicorn[standard] and powers the hot-reload of config.yaml. Keep that extra.


Configuration

config.example.yaml is the documented template. Copy it to config.yaml (gitignored, hot-reloaded on save). Two settings are read only at startup: stats.enabled and the jobs/stats DB paths.

Config vs. store. config.yaml is the bootstrap source. Once the console is used, most state (backends, chat aliases, generation aliases + mappings, users, server settings) lives in a writable SQLite store (store.db) which then becomes the source of truth. The store is seeded once from config and merged over it; you can run almost entirely from config or almost entirely from the console — both work.

api_key: "sk-change-me"                # master/admin key (see Authentication)
health_check_interval: 30              # seconds between backend liveness polls
log_per_call: true                     # one log line per forwarded request
model_prefix: true                     # list models as <backend>/<model>
# max_concurrent: 1                    # global default in-flight cap per backend

backends:
  - name: local-gpu
    url: http://192.168.1.10:8080      # llama-swap / llama.cpp / vLLM / …
    priority: 1                        # 1 = preferred
    max_concurrent: 1                  # single-slot llama.cpp → one at a time
    # local: true                      # ALSO list its models bare (see below)
  - name: together                     # cloud fallback (OpenAI-compatible)
    url: https://api.together.xyz
    priority: 99
    api_key: "tgp_v1_…"                # sent as Bearer to this backend
    chat_only: true                    # drop non-chat models at discovery
    serverless_only: true              # drop dedicated-endpoint-only models

virtual_models:                        # chat aliases
  "translator": "Aya-Expanse-8B"       # same model on every backend
  "fast":                              # per-backend mapping
    local-gpu: "Qwen3.5-9B"
    together:  "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
  "cheap":                             # per-alias priority override
    local-cpu: { model: "gemma-3-9b-it", priority: 1 }
    local-gpu: { model: "Qwen3.5-9B",    priority: 2 }

A backend's value under an alias is normally just the model name; make it {model, priority} to override that backend's priority for this alias only.
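
The three alias shapes above can be flattened into one candidate list. The following is a minimal sketch, not the gateway's actual code: `expand_alias` and the `backend_priority` dict are hypothetical names, and the `local-cpu` priority of 5 is an assumed value for illustration only.

```python
# Hypothetical helper: expand one virtual_models entry into
# (backend, model, priority) candidates, lowest priority number first.
def expand_alias(entry, backend_priority):
    candidates = []
    if isinstance(entry, str):
        # Plain string: the same model name on every backend.
        for name, prio in backend_priority.items():
            candidates.append((name, entry, prio))
    else:
        for name, value in entry.items():
            if isinstance(value, dict):
                # {model, priority}: overrides the backend priority for this alias only.
                model = value["model"]
                prio = value.get("priority", backend_priority[name])
            else:
                model, prio = value, backend_priority[name]
            candidates.append((name, model, prio))
    return sorted(candidates, key=lambda c: c[2])


# Assumed backend priorities (local-cpu is not in the example config above).
backends = {"local-gpu": 1, "local-cpu": 5, "together": 99}
cheap = {
    "local-cpu": {"model": "gemma-3-9b-it", "priority": 1},
    "local-gpu": {"model": "Qwen3.5-9B", "priority": 2},
}
print(expand_alias("Aya-Expanse-8B", backends))
print(expand_alias(cheap, backends))
```

With the per-alias override, `cheap` tries local-cpu before local-gpu even though local-gpu has the better backend-level priority.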


Authentication & multi-user

Two layers, both optional:

  • Master key — the top-level api_key. Clients send Authorization: Bearer <key>. Also unlocks the /ui console (sign in with an admin key). Leave empty to run fully open (bootstrap mode).
  • Per-user keys — created in the Users tab. Each user has its own API key (generate one in the form, or paste your own), a role (user / admin), an enabled flag, an optional model allow-list, and an optional monthly cost quota. Calls are attributed to the user (stats source, job owner).

Bootstrap-open → locked. With no users and no master key, the gateway and console are fully open. Add an admin user (or set a master key) to lock it down.

Allow-list (what a key may use — and see)

A user's allow-list can contain any mix of:

| Entry kind | Grants |
|---|---|
| chat alias (fast) | that chat alias |
| image alias (Qwen) | that generation alias |
| backend name (together) | all of that backend's models |
| model id (together/llama-3… or bare) | that specific model |

An empty allow-list = everything allowed (the default). A non-empty list both restricts usage (a disallowed model → 403) and filters /v1/models so the key only sees what it's allowed. This is how you point an image client at the gateway and have it see just the image aliases instead of the whole 400-model catalog: give it a key whose allow-list is the image alias(es) (or the ComfyUI backend), and GET /v1/models returns only those. ?type=image / ?type=chat narrows by namespace too.
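
As a rough illustration of the filtering rule, the sketch below matches each catalog entry against the allow-list by alias name, backend name, prefixed id, or bare model id. The catalog shape (`id`, `backend`, `model` keys) and the function name are assumptions made for this example, not the gateway's internal data model.

```python
# Hypothetical catalog entries: aliases carry only an id; real models also
# carry the backend that serves them and their bare model id.
CATALOG = [
    {"id": "fast"},
    {"id": "flux"},
    {"id": "together/llama-3-8b", "backend": "together", "model": "llama-3-8b"},
]


def visible_models(catalog, allow_list):
    """Return the entries a key may see (and use) under its allow-list."""
    if not allow_list:
        return list(catalog)          # empty allow-list = everything allowed
    allowed = set(allow_list)
    out = []
    for item in catalog:
        # An entry matches by alias/prefixed id, by backend name, or by bare model id.
        keys = {item["id"], item.get("backend"), item.get("model")} - {None}
        if keys & allowed:
            out.append(item)
    return out


print([m["id"] for m in visible_models(CATALOG, ["flux"])])
print([m["id"] for m in visible_models(CATALOG, ["together"])])
```

The same predicate can back both the /v1/models listing and the 403 check, which keeps what a key sees and what it may call in sync.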

Quotas

  • quota_req_day — requests per day (in-memory counter) → 429 when exceeded.
  • quota_cost_month — summed USD cost for the month (from the stats log) → blocked when exceeded. Needs stats enabled and priced backends; a streaming call counts as 0 when the backend returns no usage in its stream chunks.

Routing

For each request the gateway walks backends in priority order (1 = first) and takes the first that is (1) enabled, (2) healthy (last discovery poll ok), (3) not busy (below its max_concurrent in-flight cap), (4) mapped for the alias (or exposes the bare/real model), and (5) actually has the resolved model. If that backend errors on the forward, the remaining matching backends are tried in order. When every match is busy the call parks (queues) by default until a backend frees, and only 503s if the park time runs out (below).
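
The selection rule can be sketched as a pure function. This is an illustrative model under stated assumptions, not the gateway's implementation: the `Backend` dataclass, the `mapping` dict (backend name to resolved model), and the three outcome labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    priority: int
    enabled: bool = True
    healthy: bool = True
    in_flight: int = 0
    max_concurrent: int = 0            # 0 = unlimited
    models: set = field(default_factory=set)

    def busy(self):
        return self.max_concurrent > 0 and self.in_flight >= self.max_concurrent


def route(backends, mapping):
    """mapping: backend name -> resolved model id for this request."""
    matched = [
        b for b in sorted(backends, key=lambda b: b.priority)
        if b.enabled and b.healthy and mapping.get(b.name) in b.models
    ]
    if not matched:
        return "no_backend", []        # genuine 503: nothing can serve this
    free = [b for b in matched if not b.busy()]
    if not free:
        return "park", []              # every match is busy: queue the call
    return "dispatch", free            # try in order; fail over on forward errors


gpu = Backend("local-gpu", 1, max_concurrent=1, in_flight=1, models={"Qwen3.5-9B"})
cloud = Backend("together", 99, models={"llama"})
action, order = route([cloud, gpu], {"local-gpu": "Qwen3.5-9B", "together": "llama"})
print(action, [b.name for b in order])
```

Separating "no match at all" from "all matches busy" is what lets the caller choose between an immediate 503 and parking.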

Provider-prefixed model names

With model_prefix: true (default), /v1/models lists every model as <backend>/<model> so the provider is visible. Input is liberal: a prefixed id routes to exactly that backend; a bare id or an alias routes by priority. Backend names never collide with vendor prefixes (moonshotai/…), so the leading segment disambiguates. model_prefix: false → legacy bare, de-duplicated listing.

local: true on a backend additionally lists its models bare (alongside the prefixed id). A bare request then routes by priority across every local backend that serves it — same failover/busy-spill as a virtual alias; shared ids collapse to one entry. Independent of model_prefix.
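
One way to picture the liberal input parsing: only a leading segment that equals a configured backend name is treated as a provider prefix; anything else stays part of the model id. The helper name and the sample ids below are hypothetical.

```python
def split_model_id(model_id, backend_names):
    """Return (backend or None, model). None means: route by priority."""
    head, sep, rest = model_id.partition("/")
    if sep and head in backend_names:
        return head, rest              # prefixed id: exactly that backend
    return None, model_id              # bare id, alias, or vendor-prefixed id


names = {"local-gpu", "together"}
print(split_model_id("together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", names))
print(split_model_id("vendor/some-model", names))   # vendor prefix, not a backend
print(split_model_id("fast", names))
```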

Per-backend concurrency cap (max_concurrent)

A live per-backend in-flight counter; at/above the cap the backend is busy and skipped, so the request spills to the next backend instead of overloading a slow one. Match it to real parallelism (1 for llama.cpp --parallel 1; unset for a cloud API). Missing/0 = unlimited. The counter is released when the response completes — including when a streamed response finishes, not when headers are sent. Busy state shows in /health and the Routing tab.
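
The release-on-completion behaviour is the subtle part, so here is a small sketch of it. It assumes a synchronous generator for brevity (a real proxy would more likely be async), and `InFlight` and `forward_stream` are hypothetical names.

```python
class InFlight:
    def __init__(self, max_concurrent=0):
        self.max_concurrent = max_concurrent   # 0 = unlimited
        self.count = 0

    @property
    def busy(self):
        return self.max_concurrent > 0 and self.count >= self.max_concurrent


def forward_stream(slot, chunks):
    """Take a slot now; give it back only when the stream ends or is closed."""
    slot.count += 1                    # acquired before any chunk is sent

    def relay():
        try:
            yield from chunks
        finally:
            slot.count -= 1            # released on completion, error, or close

    return relay()


slot = InFlight(max_concurrent=1)
stream = forward_stream(slot, iter(["Hel", "lo"]))
print(slot.busy)                       # slot is held while the stream is open
print("".join(stream), slot.busy)
```

Releasing in `finally` rather than after the headers are sent is what keeps a single-slot backend from receiving a second request while it is still generating.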

Per-backend model filters

| Flag | Effect (at discovery) |
|---|---|
| chat_only | keep only type == "chat" models (drops image/video/embedding). Understands Together's type and OpenRouter's architecture.output_modalities. Backends without those fields (llama-swap, vLLM) are unaffected. Don't set it on a backend whose embedding models you want routable. |
| serverless_only | keep only models with non-zero pricing (Together's dedicated-only models are 0/0; on OpenRouter this also drops :free). |

Alias / model-name collisions

Naming an alias the same as a real model id shadows that model. /health's alias_model_conflicts and the Routing tab flag every collision, split into covered (in the mapping → still routable) and shadowed (hosts the model but isn't mapped → unreachable by that name) — the latter is the actionable case.


Call parking

When all backends that map an alias are busy (at their in-flight cap), the call is held in a FIFO queue until a mapping backend frees (then dispatched) instead of returning 503. This is the default — no client field needed, so callers stay plain-OpenAI. A standard request just sees a slightly slower 200, or a 503 (with Retry-After) if the wait runs out.

  • Park time is per-alias (park_s in the chat-alias editor, or config alias_park): blank = the global default (park_timeout_s, 60 s, Server tab), 0 = parking off for that alias (immediate 503 when busy). max_parked caps the queue.
  • Fair: when a slot frees, the oldest waiting call whose alias can use that backend is dispatched first (no head-of-line blocking across aliases). Live queue is visible in the Parked calls panel on the Dashboard.
  • On timeout the call leaves the queue with a 503 + Retry-After.
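
The fairness rule in the list above can be sketched with a plain FIFO. This is a simplified model: timeouts are left out, and rejecting a call when the queue is full is an assumption of this sketch (the text only says max_parked caps the queue). All names are hypothetical.

```python
from collections import deque


class ParkQueue:
    def __init__(self, max_parked=100):
        self.max_parked = max_parked
        self.waiting = deque()             # oldest call on the left

    def park(self, call_id, usable_backends):
        if len(self.waiting) >= self.max_parked:
            return False                   # assumed: caller answers 503 right away
        self.waiting.append((call_id, set(usable_backends)))
        return True

    def on_slot_free(self, backend):
        """Dispatch the oldest waiting call whose alias can use this backend."""
        for i, (call_id, usable) in enumerate(self.waiting):
            if backend in usable:
                del self.waiting[i]
                return call_id
        return None


q = ParkQueue()
q.park("call-A", {"local-gpu"})            # oldest, but can only use local-gpu
q.park("call-B", {"together"})
print(q.on_slot_free("together"))          # call-B is not blocked behind call-A
print(q.on_slot_free("local-gpu"))
```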

Applies to /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/responses, and the generation path (a busy ComfyUI backend queues the job rather than 503-ing). The distinction "all busy" vs. "no backend at all" is explicit — a genuine no-backend still 503s.

Async LLM requests don't use a custom field: they follow the official OpenAI Responses background mode. POST /v1/responses with background:true returns immediately with {id, status:"queued"}; poll GET /v1/responses/{id} (→ in_progress/completed/failed/cancelled) and cancel via POST /v1/responses/{id}/cancel. The background worker parks in the same queue.
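
A client-side polling loop for background mode might look like the sketch below. It uses only the standard library; the helper names, the 2-second interval, and the injectable `call`/`sleep` parameters are choices made for this example. The endpoint paths and status values are the ones listed above.

```python
import json
import time
import urllib.request

TERMINAL = {"completed", "failed", "cancelled"}


def http_json(method, url, key, body=None):
    """Small JSON-over-HTTP helper with Bearer auth."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_background(base, key, payload, call=http_json, interval=2.0, sleep=time.sleep):
    """Submit with background:true, then poll until a terminal status."""
    resp = call("POST", f"{base}/v1/responses", key, {**payload, "background": True})
    while resp["status"] not in TERMINAL:
        sleep(interval)
        resp = call("GET", f"{base}/v1/responses/{resp['id']}", key)
    return resp


# Usage against a running gateway (not executed here):
# result = run_background("http://localhost:4000", "sk-change-me",
#                         {"model": "fast", "input": "hi"})
```

A production client would likely add an overall deadline and call the cancel endpoint when it gives up.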


Reasoning control

One normalized switch turns a thinking model's reasoning off/on — regardless of which mechanism the model actually needs. Clients send a single field:

{ "model": "tool", "reasoning": "off", "messages": [...] }   // "off" | "on" | "auto"

"auto" (or omitting the field) leaves the request untouched. The OpenAI reasoning_effort field works as an alias (minimal → off, anything else → on), and /v1/responses also accepts the reasoning: {effort} object shape.

Rules decide the mechanism. The switch is translated per (model × backend) by an ordered rule list (UI → Reasoning tab; stored, hot). The first enabled rule whose model glob matches the real model and whose backend set contains the serving backend wins; its adapter does the work:

| Adapter | What it does |
|---|---|
| enable_thinking | sets chat_template_kwargs: {enable_thinking: bool} (vLLM; llama.cpp with --jinja) |
| reasoning_effort | sets reasoning_effort (off → minimal, on → high; overridable per rule) |
| nothink_token | appends a token (default /nothink) to the last user message |
| prefill | appends a closed <think>…</think> assistant turn |
| none | no mechanism; reported as unsupported |

No matching rule → the request is forwarded unchanged and the control is reported as unsupported; it never fails a call. What was actually applied comes back in the x-reasoning-control response header (e.g. off:prefill, on:noop, unsupported) and is logged per call in the LLM Calls tab.
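
To make the rule lookup concrete, here is a rough sketch of first-match resolution plus three of the adapters. The rule field names (`model_glob`, `backends`, `adapter`, `token`, `enabled`) and the exact header strings for each adapter are assumptions of this example; the prefill adapter is left out, and message content is assumed to be a plain string.

```python
from fnmatch import fnmatchcase


def pick_rule(rules, model, backend):
    """First enabled rule whose glob matches the real model and whose backend set fits."""
    for rule in rules:
        if (rule.get("enabled", True)
                and fnmatchcase(model, rule["model_glob"])
                and backend in rule["backends"]):
            return rule
    return None


def apply_reasoning(body, switch, rule):
    """switch is 'off' or 'on'. Returns (new_body, x-reasoning-control value)."""
    if rule is None or rule["adapter"] == "none":
        return body, "unsupported"     # forwarded unchanged, never an error
    body = dict(body)
    adapter = rule["adapter"]
    if adapter == "enable_thinking":
        body["chat_template_kwargs"] = {"enable_thinking": switch == "on"}
    elif adapter == "reasoning_effort":
        body["reasoning_effort"] = "minimal" if switch == "off" else "high"
    elif adapter == "nothink_token":
        if switch == "on":
            return body, "on:noop"     # nothing to append when thinking stays on
        msgs = [dict(m) for m in body["messages"]]
        for m in reversed(msgs):
            if m["role"] == "user":
                m["content"] = f'{m["content"]} {rule.get("token", "/nothink")}'
                break
        body["messages"] = msgs
    else:
        raise ValueError(f"adapter not covered by this sketch: {adapter}")
    return body, f"{switch}:{adapter}"


rules = [{"model_glob": "Qwen3*", "backends": {"local-gpu"}, "adapter": "enable_thinking"}]
req = {"model": "Qwen3.5-9B", "messages": [{"role": "user", "content": "hi"}]}
print(apply_reasoning(req, "off", pick_rule(rules, "Qwen3.5-9B", "local-gpu")))
print(apply_reasoning(req, "off", pick_rule(rules, "Qwen3.5-9B", "together")))
```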

Per-alias default. A chat alias can carry a reasoning default (chat-alias editor → reasoning: auto|on|off), applied when the client sends nothing — so tool (off) and tool-thinking (auto) can point at the same backend and model. An explicit client reasoning field always wins.
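
The precedence between the client's fields and the alias default can be summarised in a few lines. This sketch assumes that an explicit `reasoning` string (including "auto") counts as "the client sent something", and that `reasoning` takes precedence over `reasoning_effort`; neither detail is stated above. The function name is hypothetical.

```python
def effective_switch(body, alias_default="auto"):
    """Resolve the normalized switch: 'off', 'on', or 'auto'."""
    r = body.get("reasoning")
    if r in ("off", "on", "auto"):
        return r                                   # explicit client field wins
    # Responses-style object shape, else the OpenAI reasoning_effort alias.
    effort = r.get("effort") if isinstance(r, dict) else body.get("reasoning_effort")
    if effort is None:
        return alias_default                       # client sent nothing
    return "off" if effort == "minimal" else "on"  # minimal -> off, anything else -> on


print(effective_switch({"reasoning": "on"}, alias_default="off"))
print(effective_switch({}, alias_default="off"))
print(effective_switch({"reasoning_effort": "minimal"}))
print(effective_switch({"reasoning": {"effort": "high"}}))
```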

Thinking output on /v1/responses. Models that stream their thinking in the reasoning delta channel are translated to Responses-API reasoning events (response.reasoning_summary_text.delta) and a reasoning output item — output_text stays answer-only; clients that don't know reasoning events simply ignore them.


Media generation

A ComfyUI backend speaks a different protocol, so it declares type: comfyui. Discovery is via /object_info (checkpoints/UNETs/VAEs and installed LoRAs); dispatch submits a parametrised workflow, polls /history, and fetches whatever artifacts it produced — image, video (e.g. SaveVideo), or audio — with each artifact's kind/mime carried through the API and rendered in the console.

backends:
  - name: gpu-3090
    type: comfyui
    url: http://192.168.1.20:8188
    priority: 1
    max_concurrent: 1        # one generation at a time on this GPU
    # poll_interval: 1.0     # seconds between /history polls
    # max_wait: 600          # hard cap for a single generation
    # read_timeout: 60       # per-HTTP-request read timeout (hung read → failover)
    # disconnect_grace: 30   # tolerated unreachability before failing over

Generation aliases + mapping

A generation alias (the model of a generation request) maps to an ordered list of candidate backends. Each candidate carries the workflow (a ComfyUI API-format JSON) and a mapping that binds logical params to concrete workflow nodes+fields:

image_models:
  "flux":
    - backend: gpu-3090
      task: text2img
      workflow_json: { … }          # the ComfyUI API JSON (owned by the gateway)
      mapping:
        prompt:          { node: "6", field: "text" }
        width:           { node: "5", field: "width" }
        seed:            { node: "3", field: "seed" }
      fixed:                          # pinned node values (models, switches, …)
        - { node: "4", field: "unet_name", value: "flux1-dev.safetensors" }

The mapping is convention-free — it works with any workflow regardless of node naming. (An auto-detect heuristic pre-fills it for templated workflows; the explicit mapping always wins.) In practice you author all this in the Mapping tab of the console rather than by hand: paste the ComfyUI API JSON, the gateway owns it, auto-suggests the mapping, and gives you discovery-fed dropdowns.

Key mapping concepts:

  • Workflow + mapping are backend-independent (shared across an alias's candidates). Only Pinned values are per-backend (one tab per backend), so the same alias can use a different checkpoint on each GPU while looking identical from outside. A request param that targets a pinned node/field is ignored — a pin is authoritative; the API can't override it.
  • Image input slots (a LoadImage / LoadImageMask node) become file-upload request fields. By default an unfilled slot gets an 8×8 placeholder; mark a slot required (Mapping checkbox) to leave it empty instead so ComfyUI errors clearly when a needed image/mask is missing (inpaint).
  • Numeric fields (strength, steps, cfg) render with min/max/step pulled live from /object_info.
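
Applying a mapping comes down to writing request params into node inputs of the ComfyUI API-format workflow, with pins applied last. The sketch below is illustrative only: the function name is hypothetical, and silently skipping unmapped params is an assumption of this example.

```python
import copy


def build_workflow(workflow, mapping, fixed, params):
    """Return a copy of the workflow with params and pinned values filled in."""
    wf = copy.deepcopy(workflow)       # never mutate the stored workflow
    pinned = {(p["node"], p["field"]) for p in fixed}
    for name, value in params.items():
        target = mapping.get(name)
        if target is None:
            continue                   # assumed: unmapped params are skipped here
        if (target["node"], target["field"]) in pinned:
            continue                   # a pin is authoritative
        wf[target["node"]]["inputs"][target["field"]] = value
    for p in fixed:
        wf[p["node"]]["inputs"][p["field"]] = p["value"]
    return wf


workflow = {
    "4": {"class_type": "UNETLoader", "inputs": {"unet_name": ""}},
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}
mapping = {
    "prompt": {"node": "6", "field": "text"},
    "model": {"node": "4", "field": "unet_name"},
}
fixed = [{"node": "4", "field": "unet_name", "value": "flux1-dev.safetensors"}]
out = build_workflow(workflow, mapping, fixed, {"prompt": "a red apple", "model": "other"})
print(out["6"]["inputs"]["text"], out["4"]["inputs"]["unet_name"])
```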

LoRAs

LoRAs are first-class:

  • Pinned LoRA — a fixed binding on a LoRA-loader slot; the API can't change it.
  • Dynamic LoRAs — the client sends lora_1, lora_2, … (+ optional strength_N). The gateway cascades them into the next free slots of the workflow's LoRA stack, never overwriting a pinned/occupied slot — so a client needn't know which slot is reserved.
  • LoRA-aware routing — a backend that lacks a requested LoRA is dropped from the candidate set (decided over all candidates incl. busy, so the request parks for the backend that has it rather than spilling to one that doesn't). A LoRA installed on no backend is ignored (priority decides). An explicit backend pin is never overridden.
  • GET /v1/generations/{alias}/loras returns the LoRA filenames valid for an alias (the union installed across its backends) — for building a correct picker.
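
Two of the rules above, slot cascading and LoRA-aware candidate filtering, are easy to model as pure functions. This is a sketch with hypothetical names; what the gateway does with dynamic LoRAs that do not fit into any free slot is not stated above, so the example simply returns them as unplaced.

```python
def place_loras(slots, requested):
    """slots: list of filenames, None = free. Returns (new_slots, unplaced)."""
    out = list(slots)
    free = [i for i, s in enumerate(out) if s is None]
    for i, lora in zip(free, requested):
        out[i] = lora                  # pinned/occupied slots are never touched
    return out, list(requested[len(free):])


def lora_candidates(installed, wanted):
    """installed: backend name -> set of LoRA filenames on that backend."""
    # A LoRA installed on no backend is ignored rather than excluding everyone.
    known = {l for l in wanted if any(l in s for s in installed.values())}
    return [b for b, have in installed.items() if known <= have]


print(place_loras(["style.safetensors", None, None], ["a.safetensors", "b.safetensors"]))
print(lora_candidates({"gpu-a": {"a.safetensors"}, "gpu-b": set()}, ["a.safetensors"]))
```

Because the candidate filter runs before the busy check, a request for a LoRA that only one GPU has will wait for that GPU instead of landing on one without it.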

Jobs & TTL

Every generation is a job: SQLite metadata + on-disk artifacts under jobs/<id>/<n>.<ext> (image, video, or audio — the artifact's kind/mime flow through the API and the console), lifecycle queued → running → done|failed, retrievable by id until its TTL (default 24 h), then pruned. The job also keeps its inputs (prompt, params, reference images) so it stays inspectable in the Media Jobs tab. A running job can be cancelled (POST /v1/jobs/{id}/cancel or the ✕ button), which interrupts the ComfyUI prompt to free the GPU. On a restart, any job left running/queued is reconciled to failed.

Two ways to call it

  • OpenAI Images API (for OpenAI image clients):
    • POST /v1/images/generations — JSON, text→image. Bonus: LocalAI-style ref_images (base64/URL list). Extra keys pass through as workflow params.
    • POST /v1/images/edits — multipart; image file(s) + the OpenAI mask field map positionally onto the workflow's image slots. response_format = url (job-result URL, needs the Bearer key to fetch) or b64_json (inline). These are synchronous and block until the image is ready (and park if the backend is busy rather than 503-ing).
  • Native job API: POST /v1/generations with {model, prompt, mode, params}. mode: "async" returns 202 {job_id}; poll GET /v1/jobs/{id} and fetch GET /v1/jobs/{id}/result/{n}. mode: "sync" blocks and returns inline.

See docs/anima-versa-integration.md for a full client-integration walkthrough.


The /ui console

A server-rendered console mounted at /ui (sign in with an admin key once locked). Tabs:

| Tab | What |
|---|---|
| Dashboard | live per-backend status + in-flight, parked calls, media-job counts/recent, recent LLM calls |
| Server | runtime + restart-required settings (API key, caps, park time/queue, stats/jobs, TTL/prune) |
| Backends | add/edit/remove backends (LLM + ComfyUI) |
| Input | what clients can call: chat aliases, generation models, endpoints |
| Routing Overview | the live alias→backend map + collisions (searchable) |
| Mapping | register a ComfyUI workflow, wire its node mapping, pin values; chat-alias editor (per-alias park_s + reasoning default) |
| Reasoning | the normalized-thinking rule list (model glob × backend set → adapter) + test resolver |
| Chat Playground | send a chat completion as a real API client through /v1/chat/completions (auth, routing, parking, stats all apply) |
| Media Playground | run a generation via POST /v1/generations (pick alias/backend, upload refs, set params); image/video/audio |
| Media Jobs | list + detail of generation jobs (inputs + outputs, within TTL) |
| LLM Calls | per-call history with stored request/response bodies |
| Statistic | the call-stats dashboard (search, aggregates, drilldown) |
| Users | multi-user keys, allow-lists, quotas, IP aliases |

Stats & routing dashboard

Opt-in SQLite call log, surfaced in the Statistic and Routing tabs of the console (no separate port — the old standalone dashboard was folded into /ui). Every call records timestamp, duration, backend, source, alias, model, endpoint, HTTP status, tokens, and USD cost.

stats:
  enabled: false        # read at startup only — toggling needs a restart
  db_path: stats.db
  retention_days: 0     # 0 = keep forever; else prune older rows hourly
  • Cost comes from each backend's pricing (cached at discovery, normalised to USD/million tokens — Together's per-million and OpenRouter's per-token schemas). Local backends → 0.
  • Source is the authenticated user, else the X-Source header, else client IP (IP aliases give those friendly names; reverse-DNS is auto-resolved).
  • Streaming calls record real tokens when the backend honors stream_options.include_usage (requested automatically), else 0.
  • The applied reasoning control is logged per call (LLM Calls tab column).
  • Recent calls store the full request/response body (large/binary bodies on disk), viewable per-call, pruned with the same retention.
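
The cost arithmetic can be sketched in two small steps: bring each provider's price to USD per million tokens, then multiply by the token counts. Separate input and output prices and the function names are assumptions of this example, and the prices shown are made-up illustration values, not real provider rates.

```python
def usd_per_million(price, unit):
    """Normalise a provider price to USD per million tokens."""
    price = float(price)               # some providers send prices as strings
    return price * 1_000_000 if unit == "per_token" else price


def call_cost(prompt_tokens, completion_tokens, in_price, out_price):
    """Prices are USD per million tokens; local backends use 0 and cost 0."""
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000


in_price = usd_per_million("0.0000002", "per_token")   # per-token schema
out_price = usd_per_million(0.6, "per_million")        # per-million schema
print(call_cost(1000, 500, in_price, out_price))
print(call_cost(1000, 500, 0, 0))
```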

Endpoint reference

OpenAI-compatible

| Method | Path | Notes |
|---|---|---|
| GET | /v1/models | catalog filtered by the caller's allow-list; ?type=chat\|image |
| GET | /v1/models/{id} | single-model lookup |
| POST | /v1/chat/completions | chat; priority + failover; streaming; parking; reasoning: off\|on\|auto |
| POST | /v1/completions | completions; same routing |
| POST | /v1/embeddings | embeddings; same routing |
| POST | /v1/responses | Responses API ↔ chat bridge; streaming; parking; background:true (async) |
| GET | /v1/responses/{id} | poll a background response (queued→…→completed/failed/cancelled) |
| POST | /v1/responses/{id}/cancel | cancel a background response |
| POST | /v1/images/generations | text→image (sync); may return a video/audio URL for such aliases |
| POST | /v1/images/edits | multipart image+mask edit (sync) |

Native generation + jobs

| Method | Path | Notes |
|---|---|---|
| POST | /v1/generations | run a generation alias (sync or mode:"async"); per-field reference images via images: {param: base64\|URL} |
| GET | /v1/generations/{alias}/loras | LoRAs valid for an alias |
| GET | /v1/jobs/{id} | job status + results |
| GET | /v1/jobs/{id}/result/{n} | a result artifact (owner-gated) |
| GET | /v1/jobs/{id}/input/{n} | a stored reference image (owner-gated) |
| POST | /v1/jobs/{id}/cancel | cancel a queued/running job (interrupts ComfyUI) |

Other

| Method | Path | Notes |
|---|---|---|
| GET | /health | per-backend health/model/priority + busy/inflight + conflicts |
| * | /ui/** | the management console |

Every proxied LLM response carries x-gateway-backend (which backend served the call) and, when a reasoning switch was requested, x-reasoning-control (what was actually applied).

Responses API bridge

Clients on LangChain.js (N8N's AI Agent, …) call /v1/responses; most backends only speak /v1/chat/completions. The gateway translates request (input/instructions/tools → messages/system/tool schema) and response (choices[0].message → output[…], token field renames) transparently, and routes through the same dispatch/parking path as chat. stream: true is supported (chat SSE → Responses SSE events). background: true runs the request asynchronously per the official OpenAI pattern: it returns immediately with a queued response object; poll GET /v1/responses/{id} until a terminal state and cancel via POST /v1/responses/{id}/cancel. The background worker parks in the shared queue (longer async window), so long-running or busy requests never time out the client connection. Thinking-model output arrives as Responses-API reasoning items/events (see Reasoning control).
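
The request-side translation can be pictured with a deliberately small sketch. It covers only a string `input` or a list of role/content message items plus `instructions`; tools, non-message input items, and structured content parts are left out. The function names are hypothetical, and the usage field renames follow the public OpenAI schemas (prompt_tokens/completion_tokens versus input_tokens/output_tokens).

```python
def responses_to_chat(req):
    """Translate a minimal Responses request into a chat-completions request."""
    messages = []
    if req.get("instructions"):
        messages.append({"role": "system", "content": req["instructions"]})
    inp = req.get("input", [])
    if isinstance(inp, str):
        messages.append({"role": "user", "content": inp})
    else:
        for item in inp:               # assumed: plain {role, content} items only
            messages.append({"role": item["role"], "content": item["content"]})
    chat = {"model": req["model"], "messages": messages}
    if req.get("stream"):
        chat["stream"] = True
    return chat


def chat_usage_to_responses(usage):
    """Rename chat-completions token fields to their Responses equivalents."""
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }


print(responses_to_chat({"model": "fast", "instructions": "Be brief.", "input": "hi"}))
print(chat_usage_to_responses({"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}))
```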


Try it

KEY=sk-change-me ; B=http://localhost:4000

# List models (filtered by your key's allow-list)
curl $B/v1/models -H "Authorization: Bearer $KEY"

# Chat through an alias
curl $B/v1/chat/completions -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"fast","messages":[{"role":"user","content":"hi"}]}'

# Embeddings
curl $B/v1/embeddings -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"embedding","input":["hallo welt"]}'

# Text→image (sync, inline base64)
curl $B/v1/images/generations -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"flux","prompt":"a red apple","size":"1024x1024","response_format":"b64_json"}'

# LoRAs valid for an alias
curl $B/v1/generations/flux/loras -H "Authorization: Bearer $KEY"

# Backend health snapshot
curl $B/health

Running & deploying

llm-gateway.service is an example systemd unit (assumes /opt/llm-gateway with venv/ next to main.py):

sudo install -m 0644 llm-gateway.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llm-gateway
journalctl -u llm-gateway -f

deploy.sh is an rsync-over-SSH helper (DEPLOY_HOST=root@host ./deploy.sh): syncs code (excluding config.yaml, venv/), installs requirements in a remote venv, syncs the systemd unit, restarts.

Secrets & data never to commit: config.yaml, store.db (+ secret.key — they travel together, keys encrypted at rest), stats.db*, jobs.db*, jobs/, *.key. All gitignored.

License

MIT — see LICENSE.
