Model Router

Intelligent LLM request router that classifies incoming requests and routes them to the optimal model based on task type.

How It Works

A classifier (Gemma 4 31B Turbo, with Mistral Nemo, DeepSeek V3.2, and GLM 5.1 fallbacks) analyzes incoming requests
Requests are categorized into task types: general, math reasoning, general reasoning, programming, creative, vision
Each task type routes to the best-suited model with automatic fallback on failure
Self-answer optimization: For trivially simple questions (greetings, basic facts), the classifier answers directly — saving a round-trip to a second model
Universal fallback: Kimi K2.6 serves as the last-resort fallback for all task types (with Kimi K2.5 as a secondary legacy fallback)
Supports both OpenAI Chat Completions (/v1/chat/completions) and Anthropic Messages (/v1/messages) API formats

Model Routing Table

Task Type	Primary Model	Fallbacks
General	DeepSeek V3.2	Gemma 4 31B Turbo, Kimi K2.6, GLM 5.1, Qwen3 32B, Mistral Nemo, Qwen3.5 397B, Kimi K2.5
Math Reasoning	Kimi K2.6	GLM 5.1, Kimi K2.5
General Reasoning	Kimi K2.6	GLM 5.1, MiniMax M2.5, Kimi K2.5
Programming	GLM 5.1	Kimi K2.6, MiniMax M2.5, Kimi K2.5
Creative	Kimi K2.6	Qwen3.5 397B, Kimi K2.5
Vision	Kimi K2.6	Qwen3.6 27B, Gemma 4 31B Turbo, Kimi K2.5

Classifier Models

Priority	Model	Role
Primary	Gemma 4 31B Turbo TEE	Task classification + self-answer
Fallback 1	Mistral Nemo Instruct 2407 TEE	Classification only
Fallback 2	DeepSeek V3.2 TEE	Classification only
Fallback 3	GLM 5.1	Classification only

API Endpoints

Method	Path	Description
GET	`/health`	Health check
GET	`/v1/models`	List available models
GET	`/v1/router/metrics`	Routing metrics
POST	`/v1/chat/completions`	OpenAI-compatible chat completions
POST	`/v1/messages`	Anthropic Messages API

Authentication

All inference endpoints require an API key via Authorization: Bearer <key> or x-api-key: <key> header.

The router accepts keys matching either CHUTES_API_KEY or ROUTER_API_KEY environment variables.

Environment Variables

Variable	Required	Description
`CHUTES_API_KEY`	Yes	API key for upstream LLM provider (Chutes)
`UPSTREAM_API_BASE`	No	Override upstream API URL (default: `https://llm.chutes.ai/v1`)
`ROUTER_API_KEY`	No	Separate key for caller authentication (defaults to `CHUTES_API_KEY`)

Local Development

# Install dependencies
pip install -r requirements.txt

# Set your API key
export CHUTES_API_KEY="your-key"

# Run locally
uvicorn model_router.server:app --host 0.0.0.0 --port 8000

Deployment

Vercel (current)

Deployed to the chutesai Vercel team.

# Deploy to production
cd model-router
vercel --prod

The Vercel deployment uses api/index.py as the serverless entrypoint. Set CHUTES_API_KEY in the Vercel project environment variables.

Docker / Self-hosted

pip install -r requirements.txt
uvicorn model_router.server:app --host 0.0.0.0 --port 8000

Usage Examples

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="https://model-router-ten.vercel.app/v1",
    api_key="your-chutes-api-key"
)

response = client.chat.completions.create(
    model="model-router",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}]
)

Anthropic SDK

import anthropic

client = anthropic.Anthropic(
    base_url="https://model-router-ten.vercel.app",
    api_key="your-chutes-api-key"
)

message = client.messages.create(
    model="model-router",
    max_tokens=4096,
    messages=[{"role": "user", "content": "What's in this image?"}]
)

Architecture

flowchart TD
    A["Client Request"] --> B{"API Format?"}
    B -->|"/v1/chat/completions"| C["OpenAI Handler"]
    B -->|"/v1/messages"| D["Anthropic Handler"]
    C --> E["Task Classifier"]
    D --> E

    E --> F{"Has images?"}
    F -->|Yes| G["vision"]
    F -->|No| L["LLM Classification<br/><i>Gemma 4 31B</i><br/>→ Mistral Nemo<br/>→ DeepSeek V3.2<br/>→ GLM 5.1"]

    L --> M{"Task Type"}
    M --> N["general_text"]
    M --> O["math_reasoning"]
    M --> K["general_reasoning"]
    M --> J["programming"]
    M --> P["creative"]
    M --> G

    N --> Q{"Self-answer<br/>available?"}
    Q -->|"Yes (conf ≥ 0.95)"| R["Return directly<br/><i>No routing needed</i>"]
    Q -->|No| S["Model Selector"]
    O --> S
    K --> S
    J --> S
    P --> S
    G --> S

    S --> T["Try Primary Model"]
    T -->|"429 / 5xx"| U["Try Fallback 1"]
    U -->|"429 / 5xx"| V["Try Fallback 2"]
    V -->|"429 / 5xx"| W["Try Fallback N..."]
    W -->|"All failed"| X["503 Error"]
    T -->|"Success"| Y["Return Response"]
    U -->|"Success"| Y
    V -->|"Success"| Y
    W -->|"Success"| Y

    style R fill:#2d5a2d,stroke:#4a4,color:#fff
    style Y fill:#2d5a2d,stroke:#4a4,color:#fff
    style X fill:#5a2d2d,stroke:#a44,color:#fff
    style E fill:#2d3a5a,stroke:#49a,color:#fff
    style S fill:#2d3a5a,stroke:#49a,color:#fff

Decision Graph & Fallback Chains

Each task type has a dedicated primary model and ordered fallback chain. On upstream failure (429/5xx), models are tried left-to-right. Kimi K2.6 serves as universal last-resort for all task types (with Kimi K2.5 retained as a secondary legacy fallback).

flowchart LR
    subgraph general["General"]
        G1["DeepSeek V3.2"] --> G2["Gemma 4 31B Turbo"] --> G3["Kimi K2.6"] --> G4["GLM 5.1"] --> G5["Qwen3 32B"] --> G6["Mistral Nemo"] --> G7["Qwen3.5 397B"] --> G8["Kimi K2.5"]
    end
    subgraph math["Math Reasoning"]
        M1["Kimi K2.6"] --> M2["GLM 5.1"] --> M3["Kimi K2.5"]
    end
    subgraph genreason["General Reasoning"]
        GR1["Kimi K2.6"] --> GR2["GLM 5.1"] --> GR3["MiniMax M2.5"] --> GR4["Kimi K2.5"]
    end
    subgraph prog["Programming"]
        P1["GLM 5.1"] --> P2["Kimi K2.6"] --> P3["MiniMax M2.5"] --> P4["Kimi K2.5"]
    end
    subgraph creative["Creative"]
        C1["Kimi K2.6"] --> C2["Qwen3.5 397B"] --> C3["Kimi K2.5"]
    end
    subgraph vision["Vision"]
        V1["Kimi K2.6"] --> V2["Qwen3.6 27B"] --> V3["Gemma 4 31B Turbo"] --> V4["Kimi K2.5"]
    end

    style G1 fill:#1a3a1a,stroke:#4a4,color:#fff
    style M1 fill:#1a3a1a,stroke:#4a4,color:#fff
    style GR1 fill:#1a3a1a,stroke:#4a4,color:#fff
    style P1 fill:#1a3a1a,stroke:#4a4,color:#fff
    style C1 fill:#1a3a1a,stroke:#4a4,color:#fff
    style V1 fill:#1a3a1a,stroke:#4a4,color:#fff

Classifier chain: Gemma 4 31B Turbo TEE → Mistral Nemo Instruct 2407 TEE → DeepSeek V3.2 TEE → GLM 5.1 (used for classification only; not part of routing).

max_tokens & capacity

Why we send a generous `max_tokens`

Most chat-tier ModelConfig entries default to DEFAULT_CHAT_MAX_TOKENS = 65_535, which sits just under the common 65k output limit among the models we route to. Lower-cap models override that value, and the router clamps caller-provided max_tokens to the selected model's configured cap before forwarding upstream.

This matters because our primary models — Kimi K2.6, DeepSeek R1, Qwen3-235B-Thinking, and others — are reasoning models. Their response stream looks like:

delta.reasoning_content: " The user wants ..."   ← consumed first, off-screen
delta.reasoning_content: " So the answer is ..."
delta.content: "The square root of 144 is 12."   ← only what the user sees

The reasoning portion still spends tokens against the budget. With a tight cap (the historic 4-8k defaults) a reasoning model can burn the entire budget on reasoning_content before any content is produced — the upstream then returns content: null with finish_reason: "length". That shape is indistinguishable from "model produced nothing", which the non-streaming empty-detection path treats as a failure and falls through to the next candidate. The result: K2.6 looks broken even though it was just thinking.

A 65k cap gives reasoning models comfortable headroom and isn't a hard ceiling on cost — most assistant turns finish well below it (finish_reason: "stop" long before token exhaustion). For models with lower limits, such as Qwen/Qwen3-32B-TEE at 40,960 output tokens, the router forwards the lower cap.

What "empty response" means by case

`content`	`tool_calls`	`finish_reason`	router's verdict
`null`	populated	(any)	Not empty — model chose to call a tool. Pass through.
`null`	`[]`	`length`	Empty for this budget. Falling back is reasonable; the next model may fit a complete answer.
`null`	`[]`	`stop`	Genuinely empty. Falling back is correct.
non-empty	(any)	(any)	Not empty. Pass through.

The streaming path (_chunk_has_useful_output) already counts reasoning_content as a useful chunk, so reasoning streams aren't classified as empty mid-flight. The non-streaming path doesn't have the same shortcut and relies on the final content field — bumping max_tokens is the cheapest way to make that path happy with reasoning models too.

Capacity-driven fallback is intentional

When the primary is at upstream capacity (429 / "infrastructure at maximum capacity" / 5xx), the router demotes the request to the next model in the task chain rather than serving the user a 503. This is a feature, not a bug: an answer from a live fallback such as Gemma 4 31B Turbo, Kimi K2.6, or GLM 5.1 is strictly better than no answer. The frontend's response header (X-Router-Model) reports whichever model actually produced the answer so callers can persist the truth. If you see disproportionate fallback usage in the DB, the upstream chute health is the most likely cause — check /v1/router/metrics's errors_by_model and requests_by_model for the smoking gun.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
api		api
model_router		model_router
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
vercel.json		vercel.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Model Router

How It Works

Model Routing Table

Classifier Models

API Endpoints

Authentication

Environment Variables

Local Development

Deployment

Vercel (current)

Docker / Self-hosted

Usage Examples

OpenAI SDK

Anthropic SDK

Architecture

Decision Graph & Fallback Chains

max_tokens & capacity

Why we send a generous `max_tokens`

What "empty response" means by case

Capacity-driven fallback is intentional

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Model Router

How It Works

Model Routing Table

Classifier Models

API Endpoints

Authentication

Environment Variables

Local Development

Deployment

Vercel (current)

Docker / Self-hosted

Usage Examples

OpenAI SDK

Anthropic SDK

Architecture

Decision Graph & Fallback Chains

max_tokens & capacity

Why we send a generous max_tokens

What "empty response" means by case

Capacity-driven fallback is intentional

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Why we send a generous `max_tokens`

Packages