Releases: TadMSTR/ollama-queue-proxy

v0.3.0 — OpenAI-compat embeddings endpoint

28 May 12:19

What's new

Added

  • POST /v1/embeddings — OpenAI-compat endpoint. Clients using the OpenAI Embeddings API (Graphiti, LlamaIndex, LangChain, etc.) can now route through OQP without changing their provider configuration — only the port changes. The request is translated to /api/embed internally; the response is wrapped back into OpenAI format. Auth, priority ceiling, and the embedding cache all apply identically to native /api/embed requests.

    Graphiti example:

    # Before — direct to Ollama
    api_url: http://localhost:11434/v1
    # After — through OQP (gains cache + queuing)
    api_url: http://localhost:11435/v1
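
The request and response translation described for POST /v1/embeddings can be pictured roughly as below. This is a minimal sketch, not OQP's actual code: the function names to_ollama_embed and to_openai_embeddings are hypothetical, and the field mapping assumes Ollama's /api/embed shape (model, input, embeddings, prompt_eval_count) and the OpenAI embeddings list format.

```python
from typing import Any


def to_ollama_embed(openai_body: dict[str, Any]) -> dict[str, Any]:
    """Map an OpenAI-style /v1/embeddings body onto an /api/embed body."""
    # Both APIs accept a single string or a list of strings as "input".
    return {"model": openai_body["model"], "input": openai_body["input"]}


def to_openai_embeddings(ollama_resp: dict[str, Any]) -> dict[str, Any]:
    """Wrap an /api/embed response in the OpenAI embeddings list format."""
    # Assumption: the prompt token count is reported as prompt_eval_count.
    tokens = ollama_resp.get("prompt_eval_count", 0)
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(ollama_resp.get("embeddings", []))
        ],
        "model": ollama_resp.get("model", ""),
        "usage": {"prompt_tokens": tokens, "total_tokens": tokens},
    }
```

Keeping both directions as pure functions means auth, queuing, and caching can wrap them without knowing which API dialect the client spoke.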

Fixed

  • Replaced asyncio.get_event_loop() with asyncio.get_running_loop(), which eliminates the DeprecationWarning in Python 3.10+ and the RuntimeError in 3.12+.
  • Streaming response generator now closes the underlying httpx connection in a finally block — prevents connection leaks when a client disconnects mid-stream.
  • _apply_env_overrides no longer crashes with TypeError when an env var contains a numeric path component (e.g. OQP_OLLAMA__HOSTS__0__URL); the key is silently skipped with a comment in config.
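
As an illustration of the env-override fix, the sketch below skips any variable whose path contains a numeric component instead of trying to index into a list. It is not the project's _apply_env_overrides: the function name, the OQP_ prefix, the "__" separator (both inferred from the example variable above), and the returned list of skipped names are assumptions made for the example.

```python
from typing import Any, Mapping


def apply_env_overrides(
    config: dict[str, Any], environ: Mapping[str, str], prefix: str = "OQP_"
) -> list[str]:
    """Apply PREFIX_A__B=value overrides to nested dicts; return skipped names."""
    skipped: list[str] = []
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        # A numeric component would index into a list, which this sketch
        # does not support, so the variable is skipped rather than raising.
        if any(part.isdigit() for part in path):
            skipped.append(name)
            continue
        node = config
        for part in path[:-1]:
            child = node.get(part)
            if not isinstance(child, dict):
                child = {}
                node[part] = child
            node = child
        node[path[-1]] = value
    return skipped
```

Returning the skipped names is a design choice of this sketch; it lets a caller log them once at startup.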

Docker

ghcr.io/tadmstr/ollama-queue-proxy:v0.3.0
ghcr.io/tadmstr/ollama-queue-proxy:latest

Full changelog

See CHANGELOG.md.

v0.2.0 — client injection, model-aware routing, embedding cache

23 Apr 01:37

Highlights

  • Client injection — port-based auth bypass for clients that can't send Bearer headers; loopback by default, non-loopback bind requires allow_public_injection: true.
  • Model-aware routing — weighted round-robin across Ollama hosts that already have the requested model loaded, with fast-path invalidation on miss.
  • Embedding response cache — SHA256-keyed Valkey/Dragonfly cache for /api/embed and /api/embeddings; hits bypass the queue and upstream entirely.
  • keep_alive defaulting — proxy-level body injection so Ollama doesn't unload models between bursty requests.
  • Per-client concurrency caps: max_concurrent on auth.keys[], enforced via a per-client async semaphore with a fairness bound.
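
To make the SHA256-keyed cache concrete, here is one way such a key could be derived. This is a sketch under stated assumptions rather than OQP's implementation: the function name, the "oqp:embed:" key prefix, and the choice of which body fields take part in the key are all hypothetical.

```python
import hashlib
import json
from typing import Any


def embedding_cache_key(path: str, body: dict[str, Any]) -> str:
    """Derive a deterministic SHA-256 cache key for an embedding request."""
    # Assumption: only these fields affect the returned vectors.
    # "input" is used by /api/embed, "prompt" by /api/embeddings.
    relevant = {k: body[k] for k in ("model", "input", "prompt") if k in body}
    # Sorted keys and fixed separators make the JSON byte-for-byte stable.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{path}\n{canonical}".encode("utf-8")).hexdigest()
    return f"oqp:embed:{digest}"
```

Because fields such as keep_alive are left out of the key in this sketch, two requests that differ only in those fields share a cache entry.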

Security

Pre-release audit clean (0 critical / high / medium). Two low findings and one informational note remediated before tag:

  • L1: allow_public_injection now fails config validation on non-loopback bind; warning also fires when auth is enabled (injection bypasses Bearer auth).
  • L2: /metrics escapes Prometheus label values, preventing label-injection via client-supplied model names.
  • N1: Dockerfile CMD switched to ollama-queue-proxy console script so main:run() orchestrates the N+1 server gather in containerized deployments.
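
The label-escaping fix (L2) follows the Prometheus text exposition format, where backslash, double quote, and newline must be escaped inside label values. A minimal sketch, with hypothetical function names and a hypothetical metric name in the usage note:

```python
def escape_label_value(value: str) -> str:
    """Escape a string for use inside a quoted Prometheus label value."""
    # Backslash must be handled first so later escapes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""
    rendered = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
    return f"{name}{{{rendered}}} {value}"
```

With this in place, a model name containing a quote or newline stays inside its own label value, for example when calling format_sample("oqp_requests_total", {"model": name}, 1), instead of terminating the label and starting a forged sample line.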

Compatibility

All v0.1.x configs continue to work unchanged. New fields default to v0.1.x-equivalent behavior.

Full changelog in CHANGELOG.md.

v0.1.2

21 Apr 18:39

What's Changed

Patch release fixing two bugs discovered during claudebox deployment.

Bug fixes

  • Streaming response detection now handles application/x-ndjson content-type — Ollama uses this for /api/generate and /api/chat streaming responses; the previous check only matched text/event-stream and chunked application/json
  • Webhook SSRF check now supports an allowed_hosts list in config — enables webhook delivery to internal hostnames (e.g., ntfy on a LAN IP) without disabling the SSRF guard entirely
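
The streaming-detection fix can be sketched as a header check like the one below. The function name and the exact rule for chunked application/json are assumptions based on the description above, not the proxy's code.

```python
STREAMING_CONTENT_TYPES = {"text/event-stream", "application/x-ndjson"}


def is_streaming_response(headers: dict[str, str]) -> bool:
    """Return True when upstream response headers indicate a streamed body."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if media_type in STREAMING_CONTENT_TYPES:
        return True
    # Plain JSON only counts as a stream when it is sent chunked.
    chunked = "chunked" in lowered.get("transfer-encoding", "").lower()
    return media_type == "application/json" and chunked
```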

Full Changelog: v0.1.1...v0.1.2

v0.1.1

21 Apr 12:11

What's Changed

Security fixes

  • SSRF validation bypass via hostnames: validate_webhook_url() previously only checked raw IP literals; hostnames (e.g., http://localhost/hook) bypassed the blocklist. It now resolves hostnames to IP addresses via socket.getaddrinfo() before the blocklist comparison. Added 169.254.0.0/16 (link-local / cloud metadata) and fe80::/10 to _PRIVATE_NETWORKS.
  • Dockerfile missing USER instruction — container now runs as appuser (non-root) by default, consistent with the compose user: 1000:1000 override.
  • Queue management tier parameter now validated: ?tier=bogus returns HTTP 400 instead of an unhandled KeyError → 500. Accepts high, normal, low.
  • CI action versions updated: actions/checkout → v6.0.2, actions/setup-python → v6.2.0 with correct SHA pins.
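
The hostname-resolution step of the SSRF fix might look roughly like this. It is a sketch, not the project's validate_webhook_url(): the function names are hypothetical, the network list is an assumed blocklist that includes the two ranges named above, and the resolver is injectable only so the check can be exercised without DNS.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit

# Assumed blocklist; only 169.254.0.0/16 and fe80::/10 are named in the notes.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
        "169.254.0.0/16", "::1/128", "fc00::/7", "fe80::/10",
    )
]


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname (or IP literal) to its addresses via getaddrinfo."""
    infos = socket.getaddrinfo(host, None)
    # sockaddr[0] is the address; strip any IPv6 zone id such as "%eth0".
    return [info[4][0].split("%")[0] for info in infos]


def is_webhook_url_allowed(
    url: str, resolve: Callable[[str], Iterable[str]] = resolve_host
) -> bool:
    """Reject URLs whose host resolves to any blocked address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = list(resolve(parts.hostname))
    except OSError:
        return False  # unresolvable hosts are rejected in this sketch
    # Every resolved address must fall outside every blocked network.
    return bool(addresses) and not any(
        ipaddress.ip_address(addr) in net
        for addr in addresses
        for net in PRIVATE_NETWORKS
    )
```

Checking every resolved address, not just the first, matters because a hostname can map to both a public and a private address.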

Full Changelog: https://github.com/TadMSTR/ollama-queue-proxy/blob/main/CHANGELOG.md