A lightweight, provider-agnostic Python library for calling LLMs — one API for every provider.
Switch between OpenAI, Gemini, Anthropic, Groq, Mistral, Cohere, Azure, Bedrock, or Ollama without changing your application code. Same function, same response shape, same error types — every time.
from llmgate import completion
# OpenAI
resp = completion("gpt-4o-mini", messages)
# Switch to Groq — literally one word changes
resp = completion("groq/llama-3.1-8b-instant", messages)
# Switch to Gemini
resp = completion("gemini-2.5-flash-lite", messages)
print(resp.text) # always the same

pip install llmgate

Optional provider extras:
pip install llmgate[mistral] # Mistral
pip install llmgate[cohere] # Cohere
pip install llmgate[bedrock] # AWS Bedrock (boto3)
pip install llmgate[ollama] # Ollama (local models)
pip install llmgate[all] # everything

import os
from llmgate import completion
# Set your key (or put it in a .env file)
os.environ["GROQ_API_KEY"] = "gsk_..."
response = completion(
model="groq/llama-3.1-8b-instant",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.text)
# → "Hello! How can I help you today?"| Provider | Core / Optional | Model prefix | Install |
|---|---|---|---|
| OpenAI | Core | gpt-4o, o1-, o3- |
included |
| Anthropic | Core | claude- |
included |
| Google Gemini | Core | gemini- |
included |
| Groq | Core | groq/ |
included |
| Mistral | Optional | mistral/ |
llmgate[mistral] |
| Cohere | Optional | cohere/ |
llmgate[cohere] |
| Azure OpenAI | Optional | azure/ |
included (uses openai) |
| AWS Bedrock | Optional | bedrock/ |
llmgate[bedrock] |
| Ollama (local) | Optional | ollama/ |
llmgate[ollama] |
Provider is auto-detected from the model string. Use provider= to override.
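For example, a model name with no recognizable prefix would otherwise route to OpenAI; here is a minimal sketch assuming `provider="ollama"` is an accepted override value (the model name is just an illustration):

```python
from llmgate import completion

# "llama3.2" has no provider prefix, so auto-detection would pick OpenAI.
# Passing provider= routes the call to the local Ollama server instead.
resp = completion(
    "llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    provider="ollama",
)
```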
from llmgate import completion, acompletion
# Sync
resp = completion(
model="gpt-4o-mini",
messages=[...],
provider=None, # auto-detected; override with "openai", "groq", etc.
api_key=None, # overrides env var for this call
max_tokens=None,
temperature=None,
top_p=None,
stream=False, # True → returns Iterator[StreamChunk]
tools=[...], # tool / function definitions
tool_choice=None, # "auto" | "none" | specific tool name
response_format=None, # Pydantic model class → enables structured output
middleware=[...], # list of BaseMiddleware instances
)
# Async — identical signature
resp = await acompletion("gemini-2.5-flash-lite", messages)

resp.text # str — first choice content
resp.parsed # BaseModel | None — populated when response_format set
resp.id # str — provider response ID
resp.model # str
resp.provider # str — "openai" | "gemini" | "anthropic" | ...
resp.choices # list[Choice]
resp.choices[0].message.role # "assistant"
resp.choices[0].message.content
resp.choices[0].message.tool_calls # list[ToolCall] | None
resp.usage.prompt_tokens # int
resp.usage.completion_tokens # int
resp.usage.total_tokens # int
resp.raw # raw SDK response (escape hatch)

from llmgate import completion, acompletion
for chunk in completion("gpt-4o-mini", messages, stream=True):
print(chunk.delta, end="", flush=True)
# Async
async for chunk in await acompletion("groq/llama-3.1-8b-instant", messages, stream=True):
print(chunk.delta, end="", flush=True)from llmgate import completion
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
},
"required": ["city"],
},
},
}]
resp = completion("gpt-4o-mini", messages, tools=tools, tool_choice="auto")
if resp.tool_calls:
for tc in resp.tool_calls:
print(tc.function, tc.arguments)

Pass any Pydantic `BaseModel` as `response_format` to get a validated, typed instance in `resp.parsed`.
from pydantic import BaseModel
from llmgate import completion, parse, aparse
class Movie(BaseModel):
title: str
year: int
rating: float
# Full response
resp = completion(
"groq/llama-3.1-8b-instant",
[{"role": "user", "content": "Name a great sci-fi film."}],
response_format=Movie,
)
movie: Movie = resp.parsed
# Shorthand — returns the Pydantic instance directly
movie = parse("gemini-2.5-flash-lite", messages, response_format=Movie)
# Async
movie = await aparse("claude-haiku-4-5-20251001", messages, response_format=Movie)

Provider strategies:
| Provider | Strategy |
|---|---|
| OpenAI / Azure | Native json_schema (schema-constrained) |
| Gemini | response_schema + response_mime_type (native) |
| Groq / Mistral / Cohere / Ollama | json_object mode + Pydantic validation |
| Anthropic / Bedrock | Schema injected into system prompt + extraction |
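For providers in the `json_object` row, the last step is plain Pydantic validation of the returned JSON text. A conceptual sketch (not the library's actual internals; `validate_json_output` is just an illustration):

```python
from pydantic import BaseModel, ValidationError

class Movie(BaseModel):
    title: str
    year: int
    rating: float

def validate_json_output(raw_text: str) -> Movie:
    # Providers without native schema support return free-form JSON text;
    # validating it against the Pydantic model is what populates resp.parsed.
    try:
        return Movie.model_validate_json(raw_text)
    except ValidationError as exc:
        raise ValueError(f"Response did not match the Movie schema: {exc}") from exc
```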
Note: `stream=True` and `response_format` cannot be used together.
Model Prefixes & Auto-Routing: If a model string does not contain a recognized prefix (like `gemini/`, `cohere/`, `mistral/`, `bedrock/`, `ollama/`, or `azure/`), it defaults to OpenAI. Since OpenAI's embedding models (e.g. `text-embedding-3-small`) lack a distinct provider prefix like `gpt-`, they are routed to OpenAI by default. For all other providers, include the explicit provider prefix to avoid accidental routing to OpenAI.
from llmgate import embed, aembed
# Single text → OpenAI (auto-detected)
resp = embed("text-embedding-3-small", "Hello world")
vector: list[float] = resp.embeddings[0]
# Batch
resp = embed("text-embedding-3-small", ["Hello", "world"])
vectors: list[list[float]] = resp.embeddings
# Other providers
resp = embed("gemini/gemini-embedding-2", "Hello")
resp = embed("cohere/embed-english-v3.0", "Hello")
resp = embed("mistral/mistral-embed", "Hello")
resp = embed("ollama/nomic-embed-text", "Hello")
resp = embed("bedrock/amazon.titan-embed-text-v2:0", "Hello")
resp = embed("azure/my-embedding-deployment", "Hello")
# Control dimensions (OpenAI / Gemini / Azure)
resp = embed("text-embedding-3-small", "Hello", dimensions=256)
# Async
resp = await aembed("text-embedding-3-small", "Hello")

EmbeddingResponse:
resp.embeddings # list[list[float]] — one vector per input
resp.model # str
resp.provider # str
resp.usage # TokenUsage

Anthropic and Groq do not offer embedding APIs — they raise `EmbeddingsNotSupported`.
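For example (a minimal sketch — the exact error message is provider-specific):

```python
from llmgate import embed
from llmgate.exceptions import EmbeddingsNotSupported

try:
    embed("groq/llama-3.1-8b-instant", "Hello")
except EmbeddingsNotSupported as e:
    print(f"This provider has no embeddings API: {e}")
```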
Provider-Specific Parameters:
llmgate natively supports advanced provider-specific optimization parameters:

| Param | Providers | Description |
|---|---|---|
| `task_type` | Gemini | Optimization hint (e.g. `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`) |
| `title` | Gemini | Document title (improves quality when `task_type="RETRIEVAL_DOCUMENT"`) |
| `input_type` | Cohere, Bedrock | Purpose hint: `search_document`, `search_query`, `classification`, `clustering` |
| `truncate` | Cohere, Ollama | Overflow strategy — Cohere: `NONE`/`START`/`END`; Ollama: `true`/`false` |
| `encoding_format` | OpenAI, Azure, Mistral | Output encoding: `float` or `base64` |
| `user` | OpenAI, Azure | End-user identifier for abuse monitoring |
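A minimal sketch, assuming these options are passed as keyword arguments to `embed()` (names taken from the table above):

```python
from llmgate import embed

# Gemini: retrieval-optimized document embedding with a title hint
resp = embed(
    "gemini/gemini-embedding-2",
    "A long document about space travel.",
    task_type="RETRIEVAL_DOCUMENT",
    title="Space Travel Guide",
)

# Cohere: query-side embedding with end-truncation on overflow
resp = embed(
    "cohere/embed-english-v3.0",
    "What is the capital of France?",
    input_type="search_query",
    truncate="END",
)
```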
Pass images alongside text using TextPart and ImagePart content blocks. Works with any vision-capable model — the right wire format is applied per provider automatically.
from llmgate import completion
from llmgate.types import ImageURL, ImageBytes, TextPart, ImagePart, Message
import base64
# ── URL-based image (OpenAI, Anthropic, Groq, Mistral, Azure) ──────────────
resp = completion(
"gpt-4o-mini",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
],
}],
)
print(resp.text)
# ── Base64 image (all providers) ───────────────────────────────────────────
with open("photo.jpg", "rb") as f:
b64 = base64.b64encode(f.read()).decode()
resp = completion(
"claude-opus-4-7",
messages=[Message(
role="user",
content=[
ImagePart(type="image_bytes", image_bytes=ImageBytes(data=b64, mime_type="image/jpeg")),
TextPart(text="Describe this image."),
],
)],
)
print(resp.text)
# ── detail hint (OpenAI / Azure only, silently ignored by others) ──────────
resp = completion(
"gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Any text in this image?"},
{"type": "image_url", "image_url": {"url": "https://...", "detail": "high"}},
],
}],
)

Vision support by provider:
| Provider | URL images | Base64 / bytes | Notes |
|---|---|---|---|
| OpenAI | ✅ | ✅ | detail param (auto/low/high) supported |
| Azure OpenAI | ✅ | ✅ | Identical to OpenAI |
| Anthropic | ✅ | ✅ | Up to 100 images per request |
| Gemini | ✅* | ✅ | *URL images fetched client-side and sent inline |
| Groq | ✅ | ✅ | Model must be llama-4-scout-17b (vision preview) |
| Mistral | ✅ | ✅ | image_url sent as plain string (handled automatically) |
| Bedrock | ✅* | ✅ | *URL images fetched client-side; raw bytes sent to Converse API |
| Ollama | ❌ | ✅ | base64 only; URL images fetched client-side automatically |
| Cohere | — | — | Raises VisionNotSupported (API not yet stable) |
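If you do send image parts to Cohere, the call fails fast — a minimal sketch of catching that (the model name is just illustrative):

```python
from llmgate import completion
from llmgate.exceptions import VisionNotSupported

try:
    completion(
        "cohere/command-r",  # illustrative Cohere model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }],
    )
except VisionNotSupported as e:
    print(f"No vision support: {e}")
```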
Note: `image_url` content parts accept both `https://` URLs and `data:image/jpeg;base64,...` data URIs.
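A minimal sketch of the data-URI form, built from a local file and passed through the same `image_url` part used for `https://` URLs:

```python
import base64
from llmgate import completion

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

data_uri = f"data:image/jpeg;base64,{b64}"
resp = completion(
    "gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }],
)
```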
Execute multiple completion requests in parallel with configurable concurrency control.
from llmgate import batch, abatch
from llmgate.types import CompletionRequest, Message
# Simple usage with dicts
results = batch([
{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "What is 1+1?"}]},
{"model": "groq/llama-3.1-8b-instant", "messages": [{"role": "user", "content": "What is 2+2?"}]},
{"model": "gemini-2.5-flash-lite", "messages": [{"role": "user", "content": "What is 3+3?"}]},
], max_concurrency=3)
# Iterate successful responses (same order as input)
for resp in results.results:
if resp is not None:
print(resp.text)
# Check aggregate stats
print(f"{results.successful}/{results.successful + results.failed} succeeded")
print(f"Total tokens: {results.total_tokens}")
print(f"Success rate: {results.success_rate:.1%}")
# Inspect failures
for err in results.errors:
print(f"Request {err.index} failed ({err.error_type}): {err.error}")
# Async version
results = await abatch(requests, max_concurrency=10)
# Fail immediately on first error instead of collecting
results = batch(requests, fail_fast=True)
# Apply middleware per-request
from llmgate.middleware import RetryMiddleware
results = batch(requests, middleware=[RetryMiddleware(max_retries=2)])

BatchResult fields:
results.results # list[CompletionResponse | None] — same order as input
results.errors # list[BatchError] — details for each failed request
results.successful # int — count of successful requests
results.failed # int — count of failed requests
results.total_tokens # int — aggregate token usage across successes
results.success_rate # float — fraction that succeeded (0.0 – 1.0)

LLMGate batch methods use the gate's configured middleware automatically:
from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware, LoggingMiddleware
gate = LLMGate(middleware=[RetryMiddleware(max_retries=3), LoggingMiddleware()])
results = gate.batch(requests, max_concurrency=5)
results = await gate.abatch(requests, max_concurrency=5)

Pass a list of model strings to automatically try each one in order. If a model fails with a rate-limit, provider error, or auth error, the next model in the chain is tried transparently.
from llmgate import completion
# Try gpt-4o-mini → fall back to groq → fall back to gemini
resp = completion(
model=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.5-flash-lite"],
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.text)
print(resp.provider) # → whichever model succeeded
print(resp.fallback_attempts) # → ["gpt-4o-mini"] if the first model failed

from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware
gate = LLMGate(
fallback_chain=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.5-flash-lite"],
middleware=[RetryMiddleware(max_retries=2)], # retries each model before fallback
)
# model arg is optional when fallback_chain is set
resp = gate.completion(messages=[{"role": "user", "content": "Hello!"}])
resp = await gate.acompletion(messages=[{"role": "user", "content": "Hello!"}])

Note: When `fallback_chain` is set on the gate, `RetryMiddleware` (and other middleware) applies to each individual model in the chain — so each candidate is retried before the next one is tried.
from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware, FallbackMiddleware
gate = LLMGate(middleware=[
RetryMiddleware(max_retries=2),
FallbackMiddleware(
models=["groq/llama-3.1-8b-instant", "gemini-2.5-flash-lite"],
),
])
resp = gate.completion("gpt-4o-mini", messages)from llmgate import completion
from llmgate.exceptions import RateLimitError
# Only fall back on rate limits — auth errors propagate immediately
resp = completion(
model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
messages=messages,
fallback_on=(RateLimitError,),
)

Default `fallback_on`: `(RateLimitError, ProviderAPIError, AuthError)`
from llmgate.exceptions import AllProvidersFailedError
try:
resp = completion(
model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
messages=messages,
)
except AllProvidersFailedError as e:
for model, exc in e.errors:
print(f" {model}: {exc}")Note:
stream=Trueworks seamlessly with a model list. See the streaming documentation for details on fallback strategies (stream_fallback_mode).
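A minimal sketch of streaming over a fallback chain (this omits `stream_fallback_mode`, whose values are covered in the streaming documentation):

```python
from llmgate import completion

# If the first model fails, the chain moves on to the next one;
# chunks arrive from whichever model ends up serving the stream.
for chunk in completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
):
    print(chunk.delta, end="", flush=True)
```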
Apply logging, retry, caching, and rate-limiting as composable middleware:
from llmgate import LLMGate
from llmgate.middleware import (
RetryMiddleware,
LoggingMiddleware,
CacheMiddleware,
RateLimitMiddleware,
)
gate = LLMGate(middleware=[
RetryMiddleware(max_retries=3, backoff_factor=0.5),
LoggingMiddleware(level="INFO"),
CacheMiddleware(ttl=300),
RateLimitMiddleware(rpm=60),
])
resp = gate.completion("gpt-4o-mini", messages)
resp = await gate.acompletion("gemini-2.5-flash-lite", messages)
# Streaming through middleware
for chunk in gate.stream("groq/llama-3.1-8b-instant", messages):
print(chunk.delta, end="", flush=True)
# Embeddings through middleware
resp = gate.embed("text-embedding-3-small", "Hello")

from llmgate.exceptions import (
AuthError, # 401 / bad API key
RateLimitError, # 429 / quota exceeded
ProviderAPIError, # other provider errors
ModelNotFoundError, # unknown model / no provider matched
EmbeddingsNotSupported, # provider doesn't have an embeddings API
VisionNotSupported, # provider doesn't support image inputs
)
try:
resp = completion("gpt-4o-mini", messages)
except AuthError as e:
print(f"Bad API key for {e.provider}")
except RateLimitError as e:
print(f"Rate limited by {e.provider} — back off and retry")
except ProviderAPIError as e:
print(f"Provider error: {e}")
except ModelNotFoundError as e:
print(f"Unknown model: {e.model}")| Variable | Provider |
|---|---|
OPENAI_API_KEY |
OpenAI |
ANTHROPIC_API_KEY |
Anthropic |
GEMINI_API_KEY / GOOGLE_API_KEY |
Gemini |
GROQ_API_KEY |
Groq |
MISTRAL_API_KEY |
Mistral |
COHERE_API_KEY |
Cohere |
AZURE_OPENAI_API_KEY |
Azure OpenAI |
AZURE_OPENAI_ENDPOINT |
Azure OpenAI |
AZURE_OPENAI_API_VERSION |
Azure OpenAI (default: 2024-02-01) |
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_DEFAULT_REGION |
Bedrock |
OLLAMA_HOST |
Ollama (default: http://localhost:11434) |
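For example, Azure OpenAI needs all three of its variables set before the call — a minimal sketch with placeholder values and a hypothetical deployment name:

```python
import os
from llmgate import completion

os.environ["AZURE_OPENAI_API_KEY"] = "..."
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://my-resource.openai.azure.com"
os.environ["AZURE_OPENAI_API_VERSION"] = "2024-02-01"  # optional; this is the default

resp = completion(
    "azure/my-gpt-4o-deployment",  # hypothetical deployment name
    messages=[{"role": "user", "content": "Hello!"}],
)
```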
These features are shipped ✅ or planned 🗓️:
| Feature | Status |
|---|---|
| Multi-provider completion (`completion()`, `acompletion()`) | ✅ v0.1 |
| Streaming (`stream=True`) | ✅ v0.2 |
| Tool / function calling | ✅ v0.2 |
| Composable middleware (logging, cache, retry, rate-limit) | ✅ v0.2 |
| 5 additional providers (Mistral, Cohere, Azure, Bedrock, Ollama) | ✅ v0.2 |
| Structured outputs (Pydantic `response_format`) | ✅ v0.3 |
| Embeddings API (`embed()`, `aembed()`) | ✅ v0.3 |
| Batch completions — parallel requests with concurrency control | ✅ v0.4 |
| Vision / multimodal — image inputs (8 providers: URL + base64) | ✅ v0.5 |
| Fallback / routing — multi-model chains, `AllProvidersFailedError` | ✅ v0.6 |
| Embedding Batching — parallel/native provider batching | ✅ v0.7 |
| Streaming Fallback — seamless mid-stream recovery (`stream_fallback_mode`) | ✅ v0.8 |
| Embedding Middleware — fully supported middleware chain for embeddings | ✅ v0.8.1 |
| Strict Dependencies & Fixes — google-genai 1.74.0 batching fix & exact dependency pinning | ✅ v0.8.2 |
| Gemini multi-turn fix — corrected `assistant_parts` → `parts` key so conversation history is sent correctly | ✅ v0.8.3 |
| Automatic tool-call loop — orchestrate multi-step tool use | 🗓️ planned |
| Token counting — local tokenisation before sending | 🗓️ planned |
| Prompt templates — reusable, parameterised prompt builders | 🗓️ planned |
git clone https://github.com/thatAverageGuy/llm-gate
cd llm-gate
uv sync # install deps + dev deps
uv run pytest # run full test suite (all mocked — no API keys needed)
uv build # build wheel + sdist

See CONTRIBUTING.md for contribution guidelines.
MIT — see LICENSE.