Providers and Models

Providers are configured via YAML fragments in litellm/config/providers/. Running make run assembles them into litellm/config.yaml (auto-generated, gitignored). Free-tier providers are tried first in fallback chains. Each provider is opt-in: set its flag to 1 in .env (e.g. GROQ=1) and fill in its API key. The flag is what activates the provider; a key without the flag does nothing.
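
The opt-in rule can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's actual assembly logic: the `<PROVIDER>_API_KEY` variable naming and the helper names are assumptions, since the source only shows the `GROQ=1` flag.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines; blank lines and # comments are skipped."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def provider_status(env: dict, provider: str) -> str:
    """Return "active", "inactive", or "missing-key" for a provider flag.

    Assumption: the key lives in <PROVIDER>_API_KEY (hypothetical naming).
    """
    if env.get(provider) != "1":
        return "inactive"  # the flag decides; a key alone does nothing
    if not env.get(f"{provider}_API_KEY"):
        return "missing-key"
    return "active"


sample = """
# .env
GROQ=1
GROQ_API_KEY=example-key
COHERE_API_KEY=example-key
MISTRAL=1
"""
env = parse_env(sample)
for name in ("GROQ", "COHERE", "MISTRAL"):
    print(name, provider_status(env, name))
```

In this sample GROQ is active, COHERE stays inactive despite having a key, and MISTRAL is flagged but has no key yet.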

Free-tier reality check

"Free" never means unlimited. Every cloud provider on this gateway has a hard cap somewhere — RPM, RPD, TPM, TPD, monthly tokens, monthly request count, or a tiny dollar-denominated credit. Cross the cap and you get 429s, blocked accounts, or pay-as-you-go billing. The fallback chains in litellm/config/fallbacks.json hop to the next provider on 429, but if you've exhausted all of them you're either falling all the way to local models or getting an error.

The numbers below were correct at last check. Provider docs change, so click through to the official limits page for current values before relying on a tier.

Provider	CC required?	Per-minute	Per-day	Monthly cap	Notes	Official limits page
Groq	No	30 RPM, 6–12K TPM	1K–14.4K RPD, 100K–500K TPD	—	Per-model. `llama-3.3-70b`: 1K RPD / 100K TPD. `llama-3.1-8b`: 14.4K RPD / 500K TPD.	console.groq.com/docs/rate-limits
Cerebras	No	5 RPM, 30K TPM	1M TPH, 1M TPD	—	"Free Trial" eligible models only: `qwen-3-235b`, `gpt-oss-120b`, `zai-glm-4.7`, `llama3.1-8b`.	inference-docs.cerebras.ai/support/rate-limits
OpenRouter	No (for $0)	20 RPM on `:free`	50 RPD with $0 / 1000 RPD with $10+	—	Daily cap is per-account, not per-model.	openrouter.ai/docs/api-reference/limits
HuggingFace	No	varies per provider	varies per provider	$0.10 credits/mo (PRO: $2/mo)	Once credits run out you must purchase more — there is no "stays free forever" tier.	huggingface.co/docs/inference-providers/pricing
Mistral	No	not published	not published	not published	"Experiment" plan exists but Mistral doesn't publish numeric limits — see Admin → Limits in console after sign-up. Only `mistral-large`, `mistral-small`, `ministral-8b`, `mistral-embed` are free-tier.	docs.mistral.ai/admin/user-management-finops/tier
Cohere	No	20 RPM chat, 10 RPM rerank, 2K inputs/min embed	—	1,000 API calls/month (chat)	Hard monthly request cap is very low — runs out fast on any real workload.	docs.cohere.com/v2/docs/rate-limits
Claudebox	Subscription	depends on plan	depends on plan	—	Uses your Claude Pro/Max OAuth — no extra cost beyond the sub.	anthropic.com/pricing
Pibox-zai	Subscription	depends on plan	depends on plan	—	pi-coding-agent pointed at z.ai — uses your z.ai subscription.	z.ai
Anthropic	Yes	tiered	tiered	pay-per-token, no free tier	Not free. Standard API.	docs.anthropic.com/en/api/rate-limits
OpenAI	Yes	tiered	tiered	pay-per-token, no free tier	Not free. Standard API.	platform.openai.com/docs/guides/rate-limits
Local (CPU / CUDA)	N/A	unlimited	unlimited	unlimited	Only constrained by your hardware. Last-resort fallback when all cloud tiers fail.	—

What this means for the gateway:

Hammer Groq and you get a 429, then the fallback chain hops to the next provider. A single client sending more than 30 chat completions per minute is hitting Groq's RPM ceiling, not a limit of the gateway.
Cohere is a footgun: 1,000 calls/month at trial is enough for testing, not enough for any real workload. Don't put Cohere first in a custom fallback chain unless you've enabled production billing.
HuggingFace free is ~$0.10/month — designed for evaluation, not production. Use a custom provider key (your own HF Pro / direct Together / Fireworks / etc.) for sustained use.
OpenRouter with $0 loaded allows 50 requests/day in total across all :free models. Loading $10 or more in credits raises that to 1000 RPD.
Cerebras free tier is brutally rate-capped: 5 RPM (not per-second, not per-day — per minute) is the bottleneck long before the 1M TPD budget. And it's only 4 models — anything else needs the paid Developer plan.
Mistral doesn't publish free-tier numbers anywhere: the "Experiment" plan exists, but the exact RPS/TPM/monthly-token values live only in your account's Admin → Limits page. Plan accordingly and treat it as low-volume, eval-only until you've seen your own numbers.
Local models are the only true "no limit" — at the cost of your own VRAM / CPU / latency.
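
To make the caps above concrete, here is a rough arithmetic sketch. It uses only numbers from the table; the helper names are illustrative, and it assumes a client sending at exactly the per-minute ceiling, which real traffic rarely does.

```python
def min_interval_seconds(rpm: int) -> float:
    """Smallest steady gap between requests that stays under an RPM cap."""
    return 60 / rpm


def minutes_to_exhaust(request_cap: int, rpm: int) -> float:
    """Minutes of sending at full RPM before a request-count cap is spent."""
    return request_cap / rpm


# (label, RPM, request cap, cap period); values taken from the table above
TIERS = [
    ("Groq llama-3.3-70b", 30, 1_000, "day"),
    ("OpenRouter :free at $0", 20, 50, "day"),
    ("Cohere trial chat", 20, 1_000, "month"),
]

for label, rpm, cap, period in TIERS:
    print(
        f"{label}: one request every {min_interval_seconds(rpm):.0f}s; "
        f"{cap} requests/{period} gone in "
        f"{minutes_to_exhaust(cap, rpm):.1f} min at full rate"
    )
```

At full rate the OpenRouter $0 daily cap lasts 2.5 minutes and the Cohere trial month lasts 50 minutes, which is why neither belongs at the front of a busy chain.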

Groq (free tier — 30 RPM, 1K–14.4K RPD per model, no CC)

Model	Alias	Notes
llama-3.1-8b-instant	`groq-llama-3.1-8b`	fast
llama-3.3-70b-versatile	`groq-llama-3.3-70b`
llama-4-scout-17b-16e-instruct	`groq-llama-4-scout`	multimodal
moonshotai/kimi-k2-instruct	`groq-kimi-k2`
openai/gpt-oss-20b	`groq-gpt-oss-20b`
openai/gpt-oss-120b	`groq-gpt-oss-120b`
qwen/qwen3-32b	`groq-qwen3-32b`
compound-beta	`groq-compound`	tool use
compound-beta-mini	`groq-compound-mini`	tool use, fast
whisper-large-v3	`groq-whisper-large-v3`	transcription
whisper-large-v3-turbo	`groq-whisper-large-v3-turbo`	transcription, fast

Cerebras (free tier — 5 RPM / 30K TPM / 1M TPD, no CC)

Sign up: cloud.cerebras.ai (no credit card required). The "Free Trial" plan covers 4 models only (qwen-3-235b, gpt-oss-120b, zai-glm-4.7, llama3.1-8b) and caps each model at 5 requests per minute, 30K tokens per minute, 1M tokens per hour, and 1M tokens per day. Limits use token bucketing: quota replenishes continuously rather than on a fixed reset. On any real workload the 5 RPM ceiling is hit long before the 1M TPD budget. Limits page: inference-docs.cerebras.ai/support/rate-limits. Cerebras is among the fastest inference available (Llama 3.1 8B ~1,800 t/s, Qwen3 235B ~1,400 t/s).

Model	Alias	Notes
qwen-3-235b-a22b-instruct-2507	`cerebras-qwen3-235b`	flagship, very fast — free-tier eligible
gpt-oss-120b	`cerebras-gpt-oss-120b`	free-tier eligible
zai-glm-4.7	`cerebras-glm-4.7`	free-tier eligible
llama3.1-8b	`cerebras-llama-3.1-8b`	fastest, free-tier eligible
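
The token bucketing mentioned above can be modelled with a small sketch. It assumes a bucket of 5 requests refilled at 5 requests per minute; Cerebras's real implementation is not described in this file, so the class and its parameters are illustrative only.

```python
class TokenBucket:
    """Continuous-refill bucket: no fixed reset, quota trickles back in."""

    def __init__(self, capacity: float, refill_per_minute: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_minute = refill_per_minute
        self.tokens = capacity  # start full
        self.last = now         # timestamp (seconds) of the last update

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` if available."""
        elapsed = now - self.last
        refill = elapsed * self.refill_per_minute / 60.0
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would see a 429 here


bucket = TokenBucket(capacity=5, refill_per_minute=5)
burst = [bucket.try_acquire(now=0.0) for _ in range(6)]
print(burst)                         # five accepted, the sixth rejected
print(bucket.try_acquire(now=6.0))   # only half a request has refilled
print(bucket.try_acquire(now=12.0))  # one full request is available again
```

Under these assumptions a burst drains the bucket at once, and afterwards one request becomes available every 12 seconds.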

OpenRouter (free tier — 50 RPD at $0, 1000 RPD at $10+)

Sign up: openrouter.ai. With $0 loaded you get 50 req/day in total across all :free models; once you've loaded ≥$10 in credits (a lifetime total, not a monthly one) the cap rises to 1000 req/day. Limits page: openrouter.ai/docs/api-reference/limits.

Model	Alias
nousresearch/hermes-3-llama-3.1-405b	`or-hermes-3-405b`
qwen/qwen3-coder	`or-qwen3-coder`
qwen/qwen3-next-80b-a3b-instruct	`or-qwen3-80b`
nvidia/nemotron-3-super-120b-a12b	`or-nemotron-120b`
minimax/minimax-m2.5	`or-minimax-m2.5`
meta-llama/llama-3.3-70b-instruct	`or-llama-3.3-70b`
openai/gpt-oss-120b	`or-gpt-oss-120b`
openai/gpt-oss-20b	`or-gpt-oss-20b`

HuggingFace Inference Providers ($0.10/mo free credits — not really "free")

Sign up: huggingface.co. Free users get $0.10 in credits per month (PRO: $2/mo, Team/Enterprise: $2/seat/mo). Past that you're pay-as-you-go at the provider's rate — HF doesn't mark up. Treat this as a "try before you buy" tier, not sustained free inference. Pricing: huggingface.co/docs/inference-providers/pricing.

Model	Alias	Notes
meta-llama/Llama-3.1-8B-Instruct	`hf-llama-3.1-8b`
meta-llama/Llama-3.3-70B-Instruct	`hf-llama-3.3-70b`
meta-llama/Llama-4-Scout-17B-16E-Instruct	`hf-llama-4-scout`	multimodal
Qwen/Qwen3-8B	`hf-qwen3-8b`
Qwen/QwQ-32B	`hf-qwq-32b`	reasoning
deepseek-ai/DeepSeek-R1	`hf-deepseek-r1`	reasoning
Qwen/Qwen2.5-VL-72B-Instruct	`hf-qwen-vl-72b`	multimodal
Qwen/Qwen2.5-VL-7B-Instruct	`hf-qwen3-vl-8b`	multimodal
google/gemma-3-12b-it	`hf-gemma-3-12b`	multimodal
black-forest-labs/FLUX.1-schnell	`hf-flux-schnell`	image gen, fast

Mistral AI (free "Experiment" tier — exact limits not published, no CC)

Sign up: console.mistral.ai (no credit card required to start). Mistral has a free "Experiment" plan ("intended for evaluation and prototyping only") and a paid "Scale" plan (pay-as-you-go, auto-promoted Tier 1 → Tier 4 by cumulative billing). The free plan covers mistral-large, mistral-small, ministral-8b, and mistral-embed. Anything else (magistral, devstral, codestral, voxtral) requires the Scale plan.

Mistral does not publish numeric free-tier RPS/TPM/TPMonth values anywhere on their public docs site. The official tier page (docs.mistral.ai/admin/user-management-finops/tier) explicitly directs you to "Admin → Limits" inside your own console to see exact numbers. Treat the free tier as low-volume eval until you've signed in and checked yours.

Model	Alias	Tier	Notes
mistral-large-2512	`mistral-large`	free
mistral-small-2603	`mistral-small`	free	multimodal
ministral-3-8b-2512	`ministral-8b`	free	fast
magistral-medium-2509	`magistral-medium`	paid	reasoning
magistral-small-2509	`magistral-small`	paid	reasoning
devstral-2512	`devstral`	paid	coding agent
codestral-2508	`codestral`	paid	code completion
mistral-embed	`mistral-embed`	free	embeddings
voxtral-small-25-07	`voxtral-small`	paid	audio transcription

Cohere (trial — 20 RPM chat, 1K calls/month total cap, no CC)

Sign up: dashboard.cohere.com (no credit card required). A trial key gives access to all models, but the monthly chat cap is only 1,000 API calls, which runs out fast on any real workload. Rerank: 10 RPM. Embed: 2,000 inputs/min (text) or 5 inputs/min (images). Limits page: docs.cohere.com/v2/docs/rate-limits. For production, switch to a production key (500 RPM chat, contact sales).

Model	Alias	Notes
command-a-03-2025	`cohere-command-a`	flagship, 256K ctx, tool use
command-r-plus-08-2024	`cohere-command-r-plus`	strong, 128K ctx
command-r-08-2024	`cohere-command-r`	balanced
command-r7b-12-2024	`cohere-command-r7b`	fast, small
c4ai-aya-expanse-32b	`cohere-aya-32b`	multilingual (23 languages)
embed-v4.0	`cohere-embed`	embeddings
rerank-v3.5	`cohere-rerank`	reranking

Claudebox (requires Claude subscription or API key)

Full Claude Code CLI in API mode — not a standard LLM API. Each request runs Claude Code's full agentic loop with tool use, file I/O, shell access, and web browsing. Authentication: either an OAuth token from a Claude Pro/Max/Team subscription, or an Anthropic API key (pay-per-use).

Generate an OAuth token with claude setup-token, or create an API key at console.anthropic.com.

Alias	Underlying model	Best for
`claudebox-haiku`	Claude Haiku 4.5	Quick tasks, high-volume, minimal token use
`claudebox-sonnet`	Claude Sonnet 4.6	Daily coding, balanced speed/intelligence
`claudebox-opus`	Claude Opus 4.6	Complex reasoning, architecture, hard debugging

Pibox-zai — pi-coding-agent via z.ai (requires z.ai account)

z.ai provides an Anthropic-compatible API backed by GLM models. Routed through pibox — pi-coding-agent wrapped in an API server, pointed at z.ai. Same agentic capabilities (shell, files, tools, MCP) as claudebox. Why pibox over a second claudebox: pi speaks the Anthropic wire protocol natively, no Claude Code license/OAuth ceremony, and pibox adds a /files/* CRUD API plus optional Telegram + cron modes for free. The -zai suffix names the upstream — future PIBOX_* flags can run pi against OpenAI, OpenRouter, etc.

Alias	Underlying model
`pibox-zai-glm-4.5-air`	GLM-4.5-Air
`pibox-zai-glm-4.7`	GLM-4.7
`pibox-zai-glm-5.1`	GLM-5.1

Override the exposed list with PIBOX_ZAI_AVAILABLE_MODELS=glm-4.5-air,glm-4.7,glm-5.1 and the default model with PIBOX_ZAI_DEFAULT_MODEL=glm-4.7 in .env.
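
As a rough sketch of how these two variables could map to gateway aliases: the `pibox-zai-` prefix comes from the alias table, but the parsing rules, the function names, and the check that the default must be in the exposed list are assumptions, not pibox's documented behaviour.

```python
ALIAS_PREFIX = "pibox-zai-"  # prefix taken from the alias table above


def exposed_aliases(env: dict) -> list:
    """Map PIBOX_ZAI_AVAILABLE_MODELS (comma-separated) to gateway aliases."""
    raw = env.get("PIBOX_ZAI_AVAILABLE_MODELS", "")
    return [ALIAS_PREFIX + m.strip() for m in raw.split(",") if m.strip()]


def default_alias(env: dict):
    """Return the default alias, or None if unset or not exposed.

    Requiring the default to be in the exposed list is an assumption.
    """
    model = env.get("PIBOX_ZAI_DEFAULT_MODEL", "").strip()
    alias = ALIAS_PREFIX + model
    return alias if model and alias in exposed_aliases(env) else None


env = {
    "PIBOX_ZAI_AVAILABLE_MODELS": "glm-4.5-air, glm-4.7",
    "PIBOX_ZAI_DEFAULT_MODEL": "glm-4.7",
}
print(exposed_aliases(env))
print(default_alias(env))
```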

Anthropic (optional, API key required)

Standard Anthropic API — not agentic, just LLM inference. Sign up: console.anthropic.com.

Alias	Model	Notes
`anthropic-claude-opus-4`	claude-opus-4-6	multimodal
`anthropic-claude-sonnet-4`	claude-sonnet-4-6	multimodal
`anthropic-claude-haiku-4`	claude-haiku-4-5	multimodal

OpenAI (optional, API key required)

Alias	Model	Notes
`openai-gpt-4o`	gpt-4o	multimodal
`openai-gpt-4o-mini`	gpt-4o-mini	multimodal
`openai-o3`	o3	reasoning
`openai-o3-mini`	o3-mini	reasoning
`openai-dall-e-3`	dall-e-3	image gen
`openai-gpt-image-1`	gpt-image-1	image gen
`openai-whisper`	whisper-1	transcription
`openai-gpt-4o-transcribe`	gpt-4o-transcribe	transcription, lower WER than whisper, streaming
`openai-gpt-4o-mini-transcribe`	gpt-4o-mini-transcribe	transcription, cheaper variant of the gpt-4o transcriber
`openai-tts-1`	tts-1	text-to-speech
`openai-tts-1-hd`	tts-1-hd	text-to-speech

Ollama (local CPU — `OLLAMA=1`)

Models are downloaded on first start and cached in .data/ollama/. No GPU required.

Alias	Model	Notes
`local-ollama-cpu-llama3.2-3b`	llama3.2:3b	general chat, ~2GB RAM
`local-ollama-cpu-qwen3-4b`	qwen3:4b	general chat, thinking mode, ~2.6GB RAM
`local-ollama-cpu-smollm2-1.7b`	smollm2:1.7b	general chat, smallest, ~1GB RAM
`local-ollama-cpu-qwen2.5-coder-1.5b`	qwen2.5-coder:1.5b	code, ~1GB RAM
`local-ollama-cpu-qwen2.5-coder-3b`	qwen2.5-coder:3b	code, ~2GB RAM
`local-ollama-cpu-phi4-mini`	phi4-mini	general chat, 128K ctx, ~2.5GB RAM
`local-ollama-cpu-gemma4-e2b`	gemma4:e2b	general chat + vision (Gemma 4), ~7.2GB RAM
`local-ollama-cpu-gemma3-4b`	gemma3:4b	general chat + vision — lightweight, ~2.6GB RAM
`local-ollama-cpu-dolphin-phi`	dolphin-phi:latest	uncensored, ~1.6GB RAM
`local-ollama-cpu-nuextract-v1.5`	nuextract	structured extraction — unstructured text → JSON, ~2.3GB RAM
`local-ollama-cpu-bge-m3`	bge-m3	embeddings, multilingual, 8192 ctx, ~570MB RAM
`local-ollama-cpu-qwen3-embed-0.6b`	qwen3-embedding:0.6b	embeddings, ~500MB RAM

Ollama CUDA (local NVIDIA — `OLLAMA_CUDA=1`)

Requires nvidia-container-toolkit. Flash attention + quantized KV cache enabled. Resource manager unloads the CUDA LLM before any CUDA TTS/STT request.

Alias	Model	Notes
`local-ollama-cuda-qwen3-8b`	qwen3:8b	general chat, thinking mode, ~5GB VRAM
`local-ollama-cuda-llama3.1-8b`	llama3.1:8b	general chat, ~5GB VRAM
`local-ollama-cuda-gemma4-e2b`	gemma4:e2b	general chat + vision, ~7.2GB VRAM
`local-ollama-cuda-gemma4-e4b`	gemma4:e4b	general chat + vision, ~9.6GB VRAM
`local-ollama-cuda-qwen2.5-coder-7b`	qwen2.5-coder:7b	code, ~5GB VRAM
`local-ollama-cuda-deepseek-coder-v2-16b`	deepseek-coder-v2:16b	code, MoE 2.4B active, 160K ctx, ~8.9GB VRAM
`local-ollama-cuda-deepseek-r1-8b`	deepseek-r1:8b	reasoning, thinking mode, ~5.2GB VRAM
`local-ollama-cuda-qwen3-abliterated-16b`	huihui_ai/qwen3-abliterated:16b	uncensored, ~9.8GB VRAM
`local-ollama-cuda-gemma4-abliterated-e4b`	huihui_ai/gemma-4-abliterated:e4b	uncensored + vision, ~9.6GB VRAM
`local-ollama-cuda-dolphin-phi`	dolphin-phi:latest	uncensored, tiny, ~1.6GB VRAM
`local-ollama-cuda-llama3.2-3b`	llama3.2:3b	general chat, ~2.0GB VRAM
`local-ollama-cuda-qwen3-4b`	qwen3:4b	general chat, thinking mode, ~2.6GB VRAM
`local-ollama-cuda-smollm2-1.7b`	smollm2:1.7b	tiny general chat, ~1.0GB VRAM
`local-ollama-cuda-qwen2.5-coder-1.5b`	qwen2.5-coder:1.5b	code completion, tiny, ~1.0GB VRAM
`local-ollama-cuda-qwen2.5-coder-3b`	qwen2.5-coder:3b	code completion, small, ~2.0GB VRAM
`local-ollama-cuda-phi4-mini`	phi4-mini	general chat + reasoning, ~2.5GB VRAM
`local-ollama-cuda-gemma3-4b`	gemma3:4b	general chat + vision, lightweight, ~2.6GB VRAM
`local-ollama-cuda-nuextract-v1.5`	iodose/nuextract-v1.5	structured extraction — unstructured text → JSON, ~2.3GB VRAM
`local-ollama-cuda-bge-m3`	bge-m3	embeddings, multilingual, 8192 ctx, ~570MB VRAM
`local-ollama-cuda-qwen3-embed-0.6b`	qwen3-embedding:0.6b	embeddings, ~500MB VRAM

talkies CPU (local — `TALKIES=1`)

Unified OpenAI-compatible speech service via psyb0t/talkies:v0.3.0. One container exposes both /v1/audio/transcriptions (whisper + canary-180m) and /v1/audio/speech (Kokoro-82M TTS). Stereo channel-split diarization (diarization=true → segments tagged with "channel": "L"/"R"), VAD-chunked long audio, idle-unload TTL. Weights auto-downloaded into .data/talkies/ on first request. Loaded models auto-unload after TALKIES_MODEL_TTL (default 10m).

Alias	Model	Mode
`local-talkies-whisper-large-v3`	Systran/faster-whisper-large-v3	transcription (multilingual, highest accuracy)
`local-talkies-whisper-large-v3-turbo`	deepdml/faster-whisper-large-v3-turbo-ct2	transcription (multilingual, ~8x faster than large-v3)
`local-talkies-canary-180m-flash`	nvidia/canary-180m-flash	transcription (English, FastConformer encoder)
`local-talkies-kokoro-tts`	hexgrad/Kokoro-82M	TTS — ~41 voices across en/es/fr/hi/it/pt (`af_heart`, `bm_george`, `ef_dora`, …; discover via `GET /v1/audio/voices`)

talkies CUDA (local NVIDIA — `TALKIES_CUDA=1`)

CUDA-accelerated talkies (psyb0t/talkies:v0.3.0-cuda). Adds Parakeet TDT, Canary-1B-Flash, and Canary-Qwen-2.5B SALM on top of the CPU set. Kokoro TTS still runs on CPU inside the CUDA image (fast enough that it doesn't need a GPU). Shares .data/talkies/ with the CPU variant. The LiteLLM resource manager evicts these from VRAM whenever a competing CUDA job (LLM / image / TTS / other STT) arrives.

Alias	Model	Mode
`local-talkies-cuda-whisper-large-v3`	Systran/faster-whisper-large-v3	transcription (CUDA, multilingual)
`local-talkies-cuda-whisper-large-v3-turbo`	deepdml/faster-whisper-large-v3-turbo-ct2	transcription (CUDA, fastest Whisper at near-large WER)
`local-talkies-cuda-parakeet-tdt-0.6b-v3`	nvidia/parakeet-tdt-0.6b-v3	transcription (CUDA, 25 European languages)
`local-talkies-cuda-canary-180m-flash`	nvidia/canary-180m-flash	transcription (CUDA, English)
`local-talkies-cuda-canary-1b-flash`	nvidia/canary-1b-flash	transcription (CUDA, EN/DE/FR/ES + EN↔X translation)
`local-talkies-cuda-canary-qwen-2.5b`	nvidia/canary-qwen-2.5b	transcription (CUDA, English, NeMo SALM hybrid ASR+LLM)
`local-talkies-cuda-kokoro-tts`	hexgrad/Kokoro-82M	TTS (runs on CPU inside the CUDA image)
`local-talkies-cuda-qwen3-tts`	Qwen/Qwen3-TTS-12Hz-0.6B-Base	TTS — voice cloning via reference `.wav` files in `${DATA_DIR_TALKIES}/custom-voices/`; samples `alloy`/`echo`/`fable` baked in; supports 17 languages (en, zh, ja, ko, fr, de, es, it, pt, ru, vi, th, id, ar, tr, pl, nl)

sd.cpp CPU (local — `SDCPP=1`)

Local CPU image generation via stable-diffusion.cpp. Go wrapper with model hot-swap, idle auto-unload, OpenAI-compatible /v1/images/generations. Models cached in .data/sdcpp/models/.

Alias	Model	Notes
`local-sdcpp-cpu-sd-turbo`	stabilityai/sd-turbo	fastest, smallest (~1.7GB)
`local-sdcpp-cpu-sdxl-turbo`	stabilityai/sdxl-turbo	better quality (~2.5GB)

sd.cpp CUDA (local NVIDIA — `SDCPP_CUDA=1`)

CUDA-accelerated image generation. Same Go wrapper with CUDA backend. Non-blocking — rejects concurrent requests with 503 (resource manager handles scheduling via semaphore).

Alias	Model	Notes
`local-sdcpp-cuda-sd-turbo`	stabilityai/sd-turbo	fastest on GPU (~1.7GB VRAM)
`local-sdcpp-cuda-sdxl-turbo`	stabilityai/sdxl-turbo	fast, good quality (~2.5GB VRAM)
`local-sdcpp-cuda-sdxl-lightning`	ByteDance/SDXL-Lightning	fast, high quality (~2.5GB VRAM)
`local-sdcpp-cuda-flux-schnell`	black-forest-labs/FLUX.1-schnell	best quality, largest (~7GB VRAM)
`local-sdcpp-cuda-juggernaut-xi`	RunDiffusion/Juggernaut-XI-v11	photorealistic SDXL fine-tune (~2.5GB VRAM)
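
Because the CUDA wrapper rejects concurrent requests with 503, a client that talks to it directly (bypassing the resource manager) may want to retry. The following is a minimal sketch under that assumption; `send` is a hypothetical callable that performs the HTTP request and returns `(status, body)`, and the backoff values are arbitrary.

```python
import time
from typing import Callable, Tuple


def call_with_503_retry(
    send: Callable[[], Tuple[int, bytes]],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() up to `attempts` times, backing off while it returns 503."""
    status, body = send()
    for attempt in range(1, attempts):
        if status != 503:
            break
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        status, body = send()
    return status, body


# Demo with a fake backend that is busy twice, then succeeds.
replies = iter([(503, b""), (503, b""), (200, b"png-bytes")])
waits = []
print(call_with_503_retry(lambda: next(replies), sleep=waits.append))
print(waits)
```

The injected `sleep` keeps the demo instant; in real use the default `time.sleep` applies.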

Fallbacks

Every model has its own fallback chain. When a provider fails, is rate-limited, or returns an error, LiteLLM automatically tries the next model in the chain. Free providers are always tried first.

For example, groq-llama-3.3-70b falls back through cerebras-qwen3-235b → mistral-small → cohere-command-r → or-llama-3.3-70b → hf-llama-3.3-70b → claudebox-sonnet → pibox-zai-glm-4.7 → openai-gpt-4o. See litellm/config/fallbacks.json for all chains.
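
LiteLLM performs this hop itself; the sketch here only illustrates the order of attempts for the example chain. The `RateLimited` exception and the `call` function are stand-ins, not LiteLLM APIs, and the real chains live in litellm/config/fallbacks.json.

```python
from typing import Callable, Sequence, Tuple


class RateLimited(Exception):
    """Stands in for an HTTP 429 (or other provider failure)."""


# The example chain from the paragraph above, primary model first.
CHAIN = [
    "groq-llama-3.3-70b",
    "cerebras-qwen3-235b",
    "mistral-small",
    "cohere-command-r",
    "or-llama-3.3-70b",
    "hf-llama-3.3-70b",
    "claudebox-sonnet",
    "pibox-zai-glm-4.7",
    "openai-gpt-4o",
]


def complete_with_fallbacks(
    chain: Sequence[str], call: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each alias in order; return (alias, response) from the first success."""
    errors = []
    for alias in chain:
        try:
            return alias, call(alias)
        except RateLimited as exc:
            errors.append(f"{alias}: {exc}")
    raise RuntimeError("all providers exhausted: " + "; ".join(errors))


def fake_call(alias: str) -> str:
    """Pretend the first two providers are rate-limited."""
    if alias in {"groq-llama-3.3-70b", "cerebras-qwen3-235b"}:
        raise RateLimited("429")
    return f"response from {alias}"


print(complete_with_fallbacks(CHAIN, fake_call))
```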
