Intelligent LLM request router that classifies incoming requests and routes them to the optimal model based on task type.
- A classifier (Gemma 4 31B Turbo, with Mistral Nemo, DeepSeek V3.2, and GLM 5.1 fallbacks) analyzes incoming requests
- Requests are categorized into task types: general, math reasoning, general reasoning, programming, creative, vision
- Each task type routes to the best-suited model with automatic fallback on failure
- Self-answer optimization: For trivially simple questions (greetings, basic facts), the classifier answers directly — saving a round-trip to a second model
- Universal fallback: Kimi K2.6 serves as the last-resort fallback for all task types (with Kimi K2.5 as a secondary legacy fallback)
- Supports both OpenAI Chat Completions (/v1/chat/completions) and Anthropic Messages (/v1/messages) API formats
| Task Type | Primary Model | Fallbacks |
|---|---|---|
| General | DeepSeek V3.2 | Gemma 4 31B Turbo, Kimi K2.6, GLM 5.1, Qwen3 32B, Mistral Nemo, Qwen3.5 397B, Kimi K2.5 |
| Math Reasoning | Kimi K2.6 | GLM 5.1, Kimi K2.5 |
| General Reasoning | Kimi K2.6 | GLM 5.1, MiniMax M2.5, Kimi K2.5 |
| Programming | GLM 5.1 | Kimi K2.6, MiniMax M2.5, Kimi K2.5 |
| Creative | Kimi K2.6 | Qwen3.5 397B, Kimi K2.5 |
| Vision | Kimi K2.6 | Qwen3.6 27B, Gemma 4 31B Turbo, Kimi K2.5 |
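The routing table above can be pictured as a plain mapping from task type to an ordered list of candidates. The following is a minimal sketch, not the router's actual data structure: the `ROUTES` name, the `candidates` helper, and the choice to send unknown task types to the general chain are all assumptions made for illustration, and the strings are display names rather than upstream model IDs.

```python
# Illustrative routing table mirroring the task-type table above.
# Keys follow the task labels used in the request-flow diagram.
ROUTES: dict[str, list[str]] = {
    "general_text": [
        "DeepSeek V3.2", "Gemma 4 31B Turbo", "Kimi K2.6", "GLM 5.1",
        "Qwen3 32B", "Mistral Nemo", "Qwen3.5 397B", "Kimi K2.5",
    ],
    "math_reasoning": ["Kimi K2.6", "GLM 5.1", "Kimi K2.5"],
    "general_reasoning": ["Kimi K2.6", "GLM 5.1", "MiniMax M2.5", "Kimi K2.5"],
    "programming": ["GLM 5.1", "Kimi K2.6", "MiniMax M2.5", "Kimi K2.5"],
    "creative": ["Kimi K2.6", "Qwen3.5 397B", "Kimi K2.5"],
    "vision": ["Kimi K2.6", "Qwen3.6 27B", "Gemma 4 31B Turbo", "Kimi K2.5"],
}


def candidates(task_type: str) -> list[str]:
    """Return the primary model followed by its fallbacks, in try order."""
    # Assumption of this sketch: unknown task types use the general chain.
    return list(ROUTES.get(task_type, ROUTES["general_text"]))
```

The first element of each list is the primary model; everything after it is tried only when an earlier candidate fails.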
| Priority | Model | Role |
|---|---|---|
| Primary | Gemma 4 31B Turbo TEE | Task classification + self-answer |
| Fallback 1 | Mistral Nemo Instruct 2407 TEE | Classification only |
| Fallback 2 | DeepSeek V3.2 TEE | Classification only |
| Fallback 3 | GLM 5.1 | Classification only |
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /v1/models | List available models |
| GET | /v1/router/metrics | Routing metrics |
| POST | /v1/chat/completions | OpenAI-compatible chat completions |
| POST | /v1/messages | Anthropic Messages API |
All inference endpoints require an API key via Authorization: Bearer <key> or x-api-key: <key> header.
The router accepts keys matching either CHUTES_API_KEY or ROUTER_API_KEY environment variables.
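As a rough illustration of the rule above, the check could look like the sketch below. The function names `extract_key` and `is_authorized` are hypothetical, and the constant-time comparison is a design choice of this sketch rather than something the README confirms about the router.

```python
import hmac
import os
from typing import Mapping, Optional


def extract_key(headers: Mapping[str, str]) -> Optional[str]:
    """Pull the API key from either supported header (names are case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    auth = lowered.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[len("bearer "):].strip() or None
    return lowered.get("x-api-key") or None


def is_authorized(headers: Mapping[str, str], env: Mapping[str, str] = os.environ) -> bool:
    """Accept the request if its key matches CHUTES_API_KEY or ROUTER_API_KEY."""
    key = extract_key(headers)
    if key is None:
        return False
    allowed = [env.get("CHUTES_API_KEY"), env.get("ROUTER_API_KEY")]
    # Constant-time comparison avoids leaking key contents through timing.
    return any(
        bool(candidate) and hmac.compare_digest(key.encode(), candidate.encode())
        for candidate in allowed
    )
```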
| Variable | Required | Description |
|---|---|---|
| CHUTES_API_KEY | Yes | API key for upstream LLM provider (Chutes) |
| UPSTREAM_API_BASE | No | Override upstream API URL (default: https://llm.chutes.ai/v1) |
| ROUTER_API_KEY | No | Separate key for caller authentication (defaults to CHUTES_API_KEY) |
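The defaults in the table can be expressed as a small settings loader. This is a sketch under assumptions: the `Settings` dataclass and `load_settings` function are hypothetical names, and raising `RuntimeError` on a missing key is one possible way to enforce the "Required" column.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    chutes_api_key: str
    upstream_api_base: str
    router_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three variables from the table, applying the documented defaults."""
    chutes_key = env.get("CHUTES_API_KEY")
    if not chutes_key:
        raise RuntimeError("CHUTES_API_KEY is required")
    return Settings(
        chutes_api_key=chutes_key,
        upstream_api_base=env.get("UPSTREAM_API_BASE", "https://llm.chutes.ai/v1"),
        # ROUTER_API_KEY falls back to the upstream key when unset.
        router_api_key=env.get("ROUTER_API_KEY", chutes_key),
    )
```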
# Install dependencies
pip install -r requirements.txt
# Set your API key
export CHUTES_API_KEY="your-key"
# Run locally
uvicorn model_router.server:app --host 0.0.0.0 --port 8000

Deployed to the chutesai Vercel team.
# Deploy to production
cd model-router
vercel --prod

The Vercel deployment uses api/index.py as the serverless entrypoint. Set CHUTES_API_KEY in the Vercel project environment variables.
pip install -r requirements.txt
uvicorn model_router.server:app --host 0.0.0.0 --port 8000

from openai import OpenAI

client = OpenAI(
    base_url="https://model-router-ten.vercel.app/v1",
    api_key="your-chutes-api-key",
)
response = client.chat.completions.create(
    model="model-router",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
)

import anthropic

client = anthropic.Anthropic(
    base_url="https://model-router-ten.vercel.app",
    api_key="your-chutes-api-key",
)
message = client.messages.create(
    model="model-router",
    max_tokens=4096,
    messages=[{"role": "user", "content": "What's in this image?"}],
)

flowchart TD
A["Client Request"] --> B{"API Format?"}
B -->|"/v1/chat/completions"| C["OpenAI Handler"]
B -->|"/v1/messages"| D["Anthropic Handler"]
C --> E["Task Classifier"]
D --> E
E --> F{"Has images?"}
F -->|Yes| G["vision"]
F -->|No| L["LLM Classification<br/><i>Gemma 4 31B</i><br/>→ Mistral Nemo<br/>→ DeepSeek V3.2<br/>→ GLM 5.1"]
L --> M{"Task Type"}
M --> N["general_text"]
M --> O["math_reasoning"]
M --> K["general_reasoning"]
M --> J["programming"]
M --> P["creative"]
M --> G
N --> Q{"Self-answer<br/>available?"}
Q -->|"Yes (conf ≥ 0.95)"| R["Return directly<br/><i>No routing needed</i>"]
Q -->|No| S["Model Selector"]
O --> S
K --> S
J --> S
P --> S
G --> S
S --> T["Try Primary Model"]
T -->|"429 / 5xx"| U["Try Fallback 1"]
U -->|"429 / 5xx"| V["Try Fallback 2"]
V -->|"429 / 5xx"| W["Try Fallback N..."]
W -->|"All failed"| X["503 Error"]
T -->|"Success"| Y["Return Response"]
U -->|"Success"| Y
V -->|"Success"| Y
W -->|"Success"| Y
style R fill:#2d5a2d,stroke:#4a4,color:#fff
style Y fill:#2d5a2d,stroke:#4a4,color:#fff
style X fill:#5a2d2d,stroke:#a44,color:#fff
style E fill:#2d3a5a,stroke:#49a,color:#fff
style S fill:#2d3a5a,stroke:#49a,color:#fff
Each task type has a dedicated primary model and an ordered fallback chain. On upstream failure (429/5xx), models are tried left to right. Kimi K2.6 serves as the universal last-resort fallback for all task types (with Kimi K2.5 retained as a secondary legacy fallback).
flowchart LR
subgraph general["General"]
G1["DeepSeek V3.2"] --> G2["Gemma 4 31B Turbo"] --> G3["Kimi K2.6"] --> G4["GLM 5.1"] --> G5["Qwen3 32B"] --> G6["Mistral Nemo"] --> G7["Qwen3.5 397B"] --> G8["Kimi K2.5"]
end
subgraph math["Math Reasoning"]
M1["Kimi K2.6"] --> M2["GLM 5.1"] --> M3["Kimi K2.5"]
end
subgraph genreason["General Reasoning"]
GR1["Kimi K2.6"] --> GR2["GLM 5.1"] --> GR3["MiniMax M2.5"] --> GR4["Kimi K2.5"]
end
subgraph prog["Programming"]
P1["GLM 5.1"] --> P2["Kimi K2.6"] --> P3["MiniMax M2.5"] --> P4["Kimi K2.5"]
end
subgraph creative["Creative"]
C1["Kimi K2.6"] --> C2["Qwen3.5 397B"] --> C3["Kimi K2.5"]
end
subgraph vision["Vision"]
V1["Kimi K2.6"] --> V2["Qwen3.6 27B"] --> V3["Gemma 4 31B Turbo"] --> V4["Kimi K2.5"]
end
style G1 fill:#1a3a1a,stroke:#4a4,color:#fff
style M1 fill:#1a3a1a,stroke:#4a4,color:#fff
style GR1 fill:#1a3a1a,stroke:#4a4,color:#fff
style P1 fill:#1a3a1a,stroke:#4a4,color:#fff
style C1 fill:#1a3a1a,stroke:#4a4,color:#fff
style V1 fill:#1a3a1a,stroke:#4a4,color:#fff
Classifier chain: Gemma 4 31B Turbo TEE → Mistral Nemo Instruct 2407 TEE → DeepSeek V3.2 TEE → GLM 5.1 (used for classification only; not part of routing).
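The left-to-right fallback behaviour described above can be sketched as a loop over the chain. This is a simplified illustration, not the router's code: `UpstreamError`, `AllModelsFailed`, and `call_with_fallback` are hypothetical names, and re-raising non-retryable statuses immediately is an assumption of this sketch.

```python
from typing import Callable, Sequence


class UpstreamError(Exception):
    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


class AllModelsFailed(Exception):
    """Corresponds to the 503 returned when every candidate fails."""


def is_retryable(status: int) -> bool:
    # 429 (rate limit / capacity) and any 5xx trigger the next candidate.
    return status == 429 or 500 <= status <= 599


def call_with_fallback(
    chain: Sequence[str], call: Callable[[str], str]
) -> tuple[str, str]:
    """Try each model in order; return (model_used, response_text)."""
    for model in chain:
        try:
            return model, call(model)
        except UpstreamError as err:
            if not is_retryable(err.status):
                raise  # assumption: other client errors are not retried
    raise AllModelsFailed(f"all {len(chain)} candidates failed")
```

Returning the model name alongside the response is what would let a caller report which model actually answered.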
Most chat-tier ModelConfig entries default to DEFAULT_CHAT_MAX_TOKENS = 65_535, which sits just under the common 65k output limit among the models we route to. Lower-cap models override that value, and the router clamps caller-provided max_tokens to the selected model's configured cap before forwarding upstream.
This matters because our primary models — Kimi K2.6, DeepSeek R1, Qwen3-235B-Thinking, and others — are reasoning models. Their response stream looks like:
delta.reasoning_content: " The user wants ..." ← consumed first, off-screen
delta.reasoning_content: " So the answer is ..."
delta.content: "The square root of 144 is 12." ← only what the user sees
The reasoning portion still spends tokens against the budget. With a tight cap (the historic 4-8k defaults) a reasoning model can burn the entire budget on reasoning_content before any content is produced — the upstream then returns content: null with finish_reason: "length". That shape is indistinguishable from "model produced nothing", which the non-streaming empty-detection path treats as a failure and falls through to the next candidate. The result: K2.6 looks broken even though it was just thinking.
A 65k cap gives reasoning models comfortable headroom and isn't a hard ceiling on cost — most assistant turns finish well below it (finish_reason: "stop" long before token exhaustion). For models with lower limits, such as Qwen/Qwen3-32B-TEE at 40,960 output tokens, the router forwards the lower cap.
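The clamping rule can be sketched in a few lines. The `MODEL_CAPS` mapping and `effective_max_tokens` helper are hypothetical; the only values taken from the text are `DEFAULT_CHAT_MAX_TOKENS = 65_535` and the 40,960 cap for Qwen/Qwen3-32B-TEE. Using the cap when the caller sends no `max_tokens` is an assumption of this sketch.

```python
from typing import Optional

DEFAULT_CHAT_MAX_TOKENS = 65_535

# Per-model overrides for models with a lower output limit (illustrative).
MODEL_CAPS = {"Qwen/Qwen3-32B-TEE": 40_960}


def effective_max_tokens(model: str, requested: Optional[int]) -> int:
    """Clamp the caller's max_tokens to the selected model's configured cap."""
    cap = MODEL_CAPS.get(model, DEFAULT_CHAT_MAX_TOKENS)
    if requested is None:
        # Assumption: with no caller value, forward the model's cap.
        return cap
    return max(1, min(requested, cap))
```

For example, a caller asking for 65,535 tokens on Qwen/Qwen3-32B-TEE would have 40,960 forwarded upstream, while a request for 512 tokens passes through unchanged.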
| content | tool_calls | finish_reason | router's verdict |
|---|---|---|---|
| null | populated | (any) | Not empty: the model chose to call a tool. Pass through. |
| null | [] | length | Empty for this budget. Falling back is reasonable; the next model may fit a complete answer. |
| null | [] | stop | Genuinely empty. Falling back is correct. |
| non-empty | (any) | (any) | Not empty. Pass through. |
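The verdict table can be condensed into one predicate on the final assistant message. This is a sketch, not the router's implementation: `is_empty_response` is a hypothetical name, and treating a whitespace-only string as empty is an assumption the table does not cover.

```python
def is_empty_response(message: dict) -> bool:
    """Return True when the router should fall back to the next candidate."""
    if message.get("tool_calls"):
        # A populated tool_calls list is a real answer, even with null content.
        return False
    content = message.get("content")
    if isinstance(content, str):
        # Assumption: whitespace-only content counts as empty.
        return not content.strip()
    # None (null) or any other falsy value counts as empty.
    return not content
```

Note that `finish_reason` does not change the outcome here: both the `length` and `stop` rows with null content and no tool calls lead to a fallback.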
The streaming path (_chunk_has_useful_output) already counts reasoning_content as a useful chunk, so reasoning streams aren't classified as empty mid-flight. The non-streaming path doesn't have the same shortcut and relies on the final content field — bumping max_tokens is the cheapest way to make that path happy with reasoning models too.
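For comparison, a streaming-side check in the spirit of `_chunk_has_useful_output` might look like the sketch below. The real function's body is not shown in this README, so this is only a guess at its shape, written under a different name (`chunk_has_useful_output`) to make that clear; the chunk layout follows the OpenAI-style `choices[].delta` structure.

```python
def chunk_has_useful_output(chunk: dict) -> bool:
    """Return True if a streamed chunk carries content, reasoning, or tool calls."""
    for choice in chunk.get("choices", []):
        delta = choice.get("delta") or {}
        if (
            delta.get("content")
            or delta.get("reasoning_content")  # reasoning tokens count as useful
            or delta.get("tool_calls")
        ):
            return True
    return False
```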
When the primary is at upstream capacity (429 / "infrastructure at maximum capacity" / 5xx), the router demotes the request to the next model in the task chain rather than serving the user a 503. This is a feature, not a bug: an answer from a live fallback such as Gemma 4 31B Turbo, Kimi K2.6, or GLM 5.1 is strictly better than no answer. The frontend's response header (X-Router-Model) reports whichever model actually produced the answer so callers can persist the truth. If you see disproportionate fallback usage in the DB, upstream chute health is the most likely cause; check the errors_by_model and requests_by_model fields of /v1/router/metrics for the smoking gun.
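One way to read those two metrics fields together is to compute a per-model error rate. The sketch below assumes that `errors_by_model` and `requests_by_model` are each a mapping from model name to a count; the README does not document the payload shape, so treat both the shape and the `error_rates` helper as hypothetical.

```python
def error_rates(metrics: dict) -> dict[str, float]:
    """Return errors / requests per model, skipping models with no requests."""
    requests = metrics.get("requests_by_model", {})
    errors = metrics.get("errors_by_model", {})
    return {
        model: errors.get(model, 0) / count
        for model, count in requests.items()
        if count > 0
    }
```

A primary model with a high rate here would explain why its fallbacks are serving most of the traffic.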