Status: Closed
Labels: area:inference (Inference routing and configuration work), state:agent-ready (Approved for agent implementation), state:in-progress (Work is currently in progress)
Summary
Simplify the inference routing system by removing the implicit catch-all mechanism and replacing it with an explicit `inference.local` hostname addressable inside every sandbox. Inference configuration moves from per-route CRUD to cluster-level config backed by the existing provider system.
Context
The current inference routing has two paths:
- Direct allow — Network policy explicitly allows traffic to a specific endpoint (e.g., `api.anthropic.com`). Works for any endpoint; not inference-specific.
- Implicit catch-all — Requests that aren't directly allowed but are detected as inference calls get silently routed through the privacy router to a configured backend.
The catch-all is confusing. A typo in a policy (e.g., `api.entropics.com` instead of `api.anthropic.com`) silently reroutes inference to the local model instead of failing visibly. As John put it: "explicit policies for allowances and then we have this implicit secret inference catch-all which breaks the mental model."
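The mental-model change can be sketched as a before/after policy decision. This is an illustrative sketch only, not the real OPA policy: `decideOld`, `decideNew`, and the allowlist shape are all hypothetical names.

```go
package main

import "fmt"

type verdict string

const (
	allow    verdict = "allow"
	deny     verdict = "deny"
	catchAll verdict = "route-to-privacy-router" // the implicit path being removed
)

// Old behavior: tri-state. Hosts not on the allowlist that look like
// inference calls are silently rerouted instead of failing.
func decideOld(host string, allowlist map[string]bool, looksLikeInference bool) verdict {
	if allowlist[host] {
		return allow
	}
	if looksLikeInference {
		return catchAll // silent reroute: the confusing part
	}
	return deny
}

// New behavior: plain allow/deny. A typo'd hostname fails visibly.
func decideNew(host string, allowlist map[string]bool) verdict {
	if allowlist[host] {
		return allow
	}
	return deny
}

func main() {
	allowed := map[string]bool{"api.anthropic.com": true}
	// The typo from the paragraph above: old rules silently reroute it,
	// new rules deny it.
	fmt.Println(decideOld("api.entropics.com", allowed, true)) // route-to-privacy-router
	fmt.Println(decideNew("api.entropics.com", allowed))       // deny
}
```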
Decisions
- Remove the implicit catch-all — No more `inspect_for_inference` OPA action. If a request isn't explicitly allowed, it's denied.
- Introduce `inference.local` — An always-addressable hostname inside every sandbox that routes through our inference router. No credentials needed from the agent's perspective.
- `inference.local` defaults to managed NVIDIA inference — If no local model is deployed (e.g., on Brev/CPU), the router points to managed NVIDIA endpoints. When a local model is available, it switches over.
- Direct allow unchanged — Explicit network policy allows (e.g., Claude → Anthropic) continue as-is. The router is for "your custom agent" inference.
- Single model override — The router rewrites the model name from the client to whatever is configured. The client-specified model is ignored.
- Cluster-level inference config — How `inference.local` routes is configured at the cluster level, not per-sandbox. The config is: provider name + model name.
- Providers as the credential mechanism — Instead of routes carrying API keys, use the existing provider system for secure credential injection. New providers: `openai`, `anthropic`, `nvidia` (all API-key-only for now). Related: Inference route API keys stored in plain object store #21, Inference route API keys exposed via ListInferenceRoutes #20.
- Credential injection at supervisor level — Still planned, independently of router changes.
Router Flow
1. Request from the agent hits `inference.local`
2. Router detects the inference API format (OpenAI, Anthropic, etc.)
3. Router fetches cluster inference config via gRPC (cached, periodically refreshed) → gets provider name + model name
4. Router fetches provider credentials (API key) via the provider system
5. Router makes the upstream request with the correct API key and model override
6. API format translation (e.g., OpenAI ↔ Anthropic) is out of scope — handled in feat(router): add inference API translation between protocols #90
User Flow
```shell
# 1. Create a provider with credentials
nemoclaw provider create --name nvidia_build --type nvidia --from-existing

# 2. Configure cluster-level inference
nemoclaw cluster inference set --provider nvidia_build --model llama-3.1-8b

# 3. Inside any sandbox, the agent hits inference.local — it just works
curl http://inference.local/v1/chat/completions \
  -d '{"model": "anything", "messages": [...]}'
# model is overwritten to llama-3.1-8b, routed to nvidia_build with injected API key
```

Implementation
Remove
- Remove the `nemoclaw inference create/update/delete/list` CLI commands
- Remove the inference route gRPC RPCs (`CreateInferenceRoute`, `UpdateInferenceRoute`, `DeleteInferenceRoute`, `ListInferenceRoutes`, `GetInferenceRoute`)
- Remove the `InferenceRoute`/`InferenceRouteSpec` data model from proto (or deprecate it)
- Remove the `inspect_for_inference` OPA action and the implicit catch-all code path in the sandbox proxy
- Remove the `routing_hint` concept and route-level API key storage
Add
- Add `inference.local` DNS/hostname resolution inside the sandbox (resolves to the router)
- Add cluster-level inference configuration (proto fields, storage, gRPC endpoint)
- Add `nemoclaw cluster inference set/get` CLI commands (provider name + model name)
- Create `openai`, `anthropic`, and `nvidia` providers (note: the `nvidia` one already exists)
- Update the router to read cluster config (cached + refreshed) and fetch provider credentials
- Default route: managed NVIDIA inference when no local model is present
- Drop a skill/instructions file in the sandbox telling agents about `inference.local`
Update
- Update policy files (`dev-sandbox-policy.yaml`, `policy-local.yaml`, `policy-frontier.yaml`, etc.) to remove inference catch-all rules
- Update OPA rego rules — simplify to allow/deny (no tri-state)
- Update architecture docs (`architecture/security-policy.md`, etc.)
GTC Priorities
- Primary: Awesome Brev cloud experience with managed endpoints (most users won't have Spark hardware)
- Secondary: In-cluster local model on Spark (aim for it, don't let it block)
Related Issues
- Inference route API keys stored in plain object store #21 — Provider configuration
- Inference route API keys exposed via ListInferenceRoutes #20 — Provider secrets
- feat(router): add inference API translation between protocols #90 — API format translation (OpenAI ↔ Anthropic)