`llm-gateway` — OpenAI-compatible API gateway in front of vLLM

A thin FastAPI service that sits in front of vllm/vllm-openai (vendor Docker container) on a self-hosted GPU host. Owns the cross-cutting gateway concerns so the engine container stays as opaque vendor code:

Bearer auth + per-key token-bucket rate-limit
Multi-model registry + routing
OpenAI-compat POST /v1/chat/completions + GET /v1/models
Streaming SSE forwarding
Tool-call parser registry per model family
Circuit breaker + graceful shutdown against a dead upstream
Prometheus /metrics (matched-route labels — bounded cardinality)
/health (process) + /ready (vLLM upstream reachable + model loaded)

vLLM stays opaque — best-of-breed CUDA-level work, no rewrite. This package is the seam where gateway cross-cutting concerns live.

Topology

SSH tunnel: ssh -L 8000:127.0.0.1:8000 user@host
   ↓
[llm_gateway gateway,         127.0.0.1:8000]   ← this package
   ↓ httpx proxy
[vllm/vllm-openai container,  127.0.0.1:18000]  ← vendor Docker
   ↓
GPU + Qwen2.5-7B-Instruct-AWQ

Both bound to 127.0.0.1. The SSH tunnel is the auth boundary; no public LLM endpoint.

Folder map

llm_gateway/
├── app.py             # FastAPI factory (create_app(settings, backend))
├── config.py          # pydantic-settings — env-driven config
├── api/               # HTTP endpoints (/health, /ready, /v1/*)
├── inference/         # InferenceBackend Protocol + NoopBackend stub
│                      # + VLLMHTTPBackend + CircuitBreakerBackend
├── middleware/        # auth / rate-limit / body-limit / logging / metrics
├── observability/     # Prometheus metrics + /metrics router
├── tool_calling/      # parser enum + pre-flight validator
└── models/            # ModelDefinition registry (no weights — metadata only)

Local development

poetry install
poetry run pytest tests/unit/ -v        # unit tests
poetry run pyright llm_gateway/         # type check
poetry run ruff check llm_gateway/      # lint
poetry run black --check llm_gateway/   # format

# Run with NoopBackend (useful for smoking /health and /ready):
poetry run python -m llm_gateway
curl -s http://127.0.0.1:8000/health
curl -s -i http://127.0.0.1:8000/ready

See deploy/README.md for the production deploy.

Configuration

Settings load from environment variables (or .env next to the gateway). The most-relevant keys:

Variable	Required	Default	Purpose
`BEARER_TOKEN`	yes (prod)	`""` (auth disabled in dev)	Bearer header clients send
`VLLM_UPSTREAM_URL`	no	`http://127.0.0.1:18000`	vLLM container URL
`RATE_LIMIT_RPM`	no	`60`	Per-key requests/minute cap
`MAX_REQUEST_BODY_BYTES`	no	`1048576` (1 MiB)	Reject larger bodies with 413
`CIRCUIT_BREAKER_ENABLED`	no	`true`	Open after 5 consecutive upstream failures
`LOG_LEVEL`	no	`INFO`	stdlib logging level

Full set: llm_gateway/config.py.

Security defaults

What the gateway protects against out of the box:

Loopback bind (127.0.0.1); never set 0.0.0.0 in production.
Bearer-token auth with constant-time comparison.
Per-key token-bucket rate-limit with LRU + TTL eviction (bounded memory).
Request body size cap (1 MiB default; 413 above).
Cloud metadata-service IPs (IMDS, ECS task metadata, Alibaba) rejected at config load — gateway cannot be misconfigured into an SSRF relay.
Container hardening: cap_drop: ALL, no-new-privileges, read_only rootfs, non-root UID 1001 (see deploy/).
Access logger redacts authorization / bearer / api_key / token / secret / password extra={…} fields automatically.

Ops scripts

PowerShell helpers for running the gateway on a single AWS EC2 GPU host with idle-shutdown cost guardrails live in scripts/ops/ — setup-ssh.ps1, fix-and-start.ps1, restore-idle-protection.ps1, teardown-ssh.ps1. Tag-based instance discovery (tag:application=vllm-serving + tag:environment=<env>) means zero hardcoded IDs. See scripts/ops/README.md for the operator workflow + IAM permissions list.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.github/workflows		.github/workflows
deploy		deploy
llm_gateway		llm_gateway
scripts/ops		scripts/ops
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
SECURITY.md		SECURITY.md
SOLID_AUDIT.md		SOLID_AUDIT.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`llm-gateway` — OpenAI-compatible API gateway in front of vLLM

Topology

Folder map

Local development

Configuration

Security defaults

Ops scripts

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llm-gateway — OpenAI-compatible API gateway in front of vLLM

Topology

Folder map

Local development

Configuration

Security defaults

Ops scripts

License

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`llm-gateway` — OpenAI-compatible API gateway in front of vLLM

Packages