# Routing
Nenya's routing system dynamically selects the optimal upstream provider for each request based on multiple factors including latency, cost, and model capabilities.
When a client sends a request with `model: "agent-name"`, Nenya resolves the agent and routes through its model list. Agents define:

- Strategy: `"fallback"` (try the first model, then the next on failure) or `"round-robin"` (distribute requests across targets)
- Model list: ordered list of models to try
- Circuit breaker: per target; protects against cascading failures
- Cooldown: seconds to skip a model after a retryable error
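A hypothetical agent definition along these lines might look like the following. The field names are illustrative only, not Nenya's actual schema; the `build` agent name and model identifiers are borrowed from the `/statsz` example later in this page.

```json
{
  "agents": {
    "build": {
      "strategy": "fallback",
      "models": ["gemini/gemini-3-flash", "deepseek/deepseek-reasoner"],
      "cooldown_seconds": 60
    }
  }
}
```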
Fallback:

```
Request → Model A → fails → Model B → fails → Model C → succeeds
                ↓
  circuit breaker trips A, B → tries next
```
Round-robin:

```
Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A (wraps)
```
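The two strategies can be sketched in a few lines. This is a minimal illustration, assuming a plain model list and a `send` callable per model; none of these names come from Nenya's actual API.

```python
from itertools import cycle

def fallback(models, send):
    """Try each model in order; return the first successful response."""
    last_err = None
    for model in models:
        try:
            return send(model)
        except RuntimeError as err:  # stand-in for any retryable error
            last_err = err
    raise last_err

def round_robin(models):
    """Rotate through the model list: A, B, C, A, ..."""
    return cycle(models)

def send(model):
    # Pretend the first two models are currently failing.
    if model in ("model-a", "model-b"):
        raise RuntimeError(f"{model} unavailable")
    return f"response from {model}"

assert fallback(["model-a", "model-b", "model-c"], send) == "response from model-c"

rr = round_robin(["model-a", "model-b", "model-c"])
assert [next(rr) for _ in range(4)] == ["model-a", "model-b", "model-c", "model-a"]
```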
When `auto_reorder_by_latency` is enabled and `routing_strategy` is `"balanced"`, targets are scored using a multi-dimensional formula:

```
score = (latency_normalized * latency_weight)
      - (cost_normalized * cost_weight)
      + capability_boost
      + score_bonus
```
- latency_normalized: `(maxLat - modelLatency) / (maxLat - minLat)` — higher = faster
- cost_normalized: `(modelCost - minCost) / (maxCost - minCost)` — lower = cheaper
- score_bonus: per-model override
- capability_boost: +0.1 per matching capability, -0.1 per mismatch
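Putting the formula and the normalizations together, the score computation can be sketched as follows. The weights match the defaults from the config example below; the latency and cost figures are made up for illustration.

```python
def score(lat, cost, lats, costs, latency_weight=1.0, cost_weight=0.0,
          capability_boost=0.0, score_bonus=0.0):
    """Balanced routing score: favor fast, cheap, capability-matched models."""
    lat_norm = (max(lats) - lat) / (max(lats) - min(lats))        # higher = faster
    cost_norm = (cost - min(costs)) / (max(costs) - min(costs))   # lower = cheaper
    return (lat_norm * latency_weight
            - cost_norm * cost_weight
            + capability_boost
            + score_bonus)

lats = [100.0, 300.0, 500.0]   # illustrative median latencies (ms)
costs = [0.5, 1.0, 2.0]        # illustrative per-token costs

# The fastest model normalizes to 1.0, the slowest to 0.0:
assert score(100.0, 0.5, lats, costs) == 1.0
assert score(500.0, 2.0, lats, costs) == 0.0
```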
When `governance.auto_reorder_by_latency` is enabled, targets are sorted by historical median latency, fastest first. A ±5% random jitter is applied to prevent a thundering herd, where every client would otherwise hit the fastest provider simultaneously.
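The jittered sort can be sketched like this. It is an illustration only, with made-up median latencies; the function name is not from Nenya's codebase.

```python
import random

def reorder_by_latency(medians, rng=random.Random(0)):
    """Sort models fastest-first after perturbing each median by up to ±5%."""
    jittered = {m: lat * rng.uniform(0.95, 1.05) for m, lat in medians.items()}
    return sorted(jittered, key=jittered.get)

medians = {"model-a": 120.0, "model-b": 450.0, "model-c": 125.0}
order = reorder_by_latency(medians)

assert set(order) == {"model-a", "model-b", "model-c"}
# a and c may swap under jitter, but 5% noise cannot promote b past either:
assert order[-1] == "model-b"
```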
```json
{
  "governance": {
    "auto_reorder_by_latency": true,
    "routing_strategy": "balanced",
    "routing_latency_weight": 1.0,
    "routing_cost_weight": 0.0
  }
}
```

The `LatencyTracker` maintains per-model sorted sample buffers (incremental binary-search insertion, O(n) per record) and computes the median latency. It keeps at most 100 samples per model, and stale entries (no updates for 1 hour) are evicted automatically.
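A minimal sketch of such a tracker is below, assuming the documented limits (100 samples, 1-hour staleness). The class and method names are illustrative, and the real eviction policy when the buffer is full may differ.

```python
import bisect

MAX_SAMPLES = 100
STALE_AFTER = 3600.0  # seconds without updates before a model is evicted

class LatencyTracker:
    """Per-model sorted sample buffers with median lookup."""

    def __init__(self):
        self.samples = {}  # model -> sorted list of latency samples
        self.updated = {}  # model -> timestamp of the last record

    def record(self, model, latency_ms, now):
        buf = self.samples.setdefault(model, [])
        bisect.insort(buf, latency_ms)  # binary search + O(n) shift
        if len(buf) > MAX_SAMPLES:
            buf.pop()  # cap the buffer (illustrative choice of victim)
        self.updated[model] = now

    def median(self, model):
        buf = self.samples.get(model)
        if not buf:
            return None
        mid = len(buf) // 2
        return buf[mid] if len(buf) % 2 else (buf[mid - 1] + buf[mid]) / 2

    def evict_stale(self, now):
        for model, ts in list(self.updated.items()):
            if now - ts > STALE_AFTER:
                del self.samples[model]
                del self.updated[model]

t = LatencyTracker()
for v in (300.0, 100.0, 200.0, 400.0):
    t.record("gemini-3-flash", v, now=0.0)
assert t.median("gemini-3-flash") == 250.0  # sorted: [100, 200, 300, 400]
t.evict_stale(now=7200.0)                   # two hours later, no updates
assert t.median("gemini-3-flash") is None
```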
Each agent+provider+model combination is tracked independently:
| State | Behavior |
|---|---|
| Closed | Normal operation. Tracks consecutive failures. Trips to Open after `failure_threshold` failures. |
| Open | All requests skipped. After `cooldown_seconds`, transitions to HalfOpen. |
| HalfOpen | Allows up to `half_open_max_requests` probe requests (configurable via `governance.half_open_max_requests`, default 3). All succeed → Closed. Any fail → Open. |
| ForceOpen | Immediately opened (used for HTTP 429). Extends the cooldown for quota exhaustion. |
The circuit breaker is checked twice per target: once during target list construction and again immediately before sending.
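The state machine in the table above can be sketched as follows. This is an illustration, not Nenya's implementation: the class and method names are invented, and the `failure_threshold` and `cooldown_seconds` defaults shown are assumptions (only the `half_open_max_requests` default of 3 is documented).

```python
class CircuitBreaker:
    """Per-target breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 half_open_max_requests=3):
        self.threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.half_open_max = half_open_max_requests
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes = 0
        self.probe_successes = 0

    def allow(self, now):
        if self.state in ("open", "force_open"):
            if now - self.opened_at >= self.cooldown:
                self.state, self.probes, self.probe_successes = "half_open", 0, 0
            else:
                return False  # still cooling down: skip this target
        if self.state == "half_open":
            if self.probes >= self.half_open_max:
                return False  # probe budget exhausted
            self.probes += 1
        return True

    def on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_max:
                self.state, self.failures = "closed", 0  # all probes succeeded
        else:
            self.failures = 0

    def on_failure(self, now):
        if self.state == "half_open":
            self.state, self.opened_at = "open", now  # any probe failure reopens
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.state, self.opened_at = "open", now

    def force_open(self, now, extra_cooldown=0.0):
        """HTTP 429: open immediately, optionally extending the cooldown."""
        self.state, self.opened_at = "force_open", now + extra_cooldown

cb = CircuitBreaker(failure_threshold=2, cooldown_seconds=10.0,
                    half_open_max_requests=1)
assert cb.allow(now=0.0)
cb.on_failure(now=0.0)
assert cb.allow(now=1.0)
cb.on_failure(now=1.0)                    # second failure trips the breaker
assert cb.state == "open" and not cb.allow(now=5.0)
assert cb.allow(now=12.0)                 # cooldown elapsed: half-open probe
cb.on_success()
assert cb.state == "closed"
```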
Check state via `/statsz`:

```json
{
  "circuit_breakers": {
    "build:gemini:gemini-3-flash": "closed",
    "build:deepseek:deepseek-reasoner": "open"
  }
}
```

See also:

- Configuration — Agent model selectors and regex patterns
- Providers — Provider capabilities
- Architecture — Request lifecycle and graceful degradation