Cercano is a local-first AI development tool that runs open-source models on your own hardware — fast, private, and at zero cost. Currently powered by Ollama, with pluggable backend support planned.
Cercano works in two ways:
1. Local, in-agent Tool - Plug Cercano into cloud-based agents like Claude Code, Cursor, or Copilot via MCP. Instead of sending everything to the cloud, the agent can call a set of skills that run locally, such as research, summarization, extraction, classification, and code explanation. This reduces your cloud context window, usage, and costs, and can give the cloud model better context.
2. Standalone Agent - Use Cercano directly as your AI coding assistant. It routes tasks to local models first, falls back to cloud when needed, and runs an agentic loop that generates, validates, and self-corrects code automatically. It integrates with VS Code and other IDEs via gRPC. Note that the standalone agent is still relatively primitive and under rapid development.
- Local-First Architecture — Run powerful open-source models (qwen3-coder, GLM-4.7-Flash, etc.) locally via Ollama.
- Cloud Fallback — Seamless integration with Google Gemini and Anthropic Claude for tasks that exceed local model capabilities.
- Smart Router — Embedding-based classifier routes requests to local or cloud models. Ultra-fast, no LLM call needed for routing.
- Agentic Self-Correction — Iterative loop that generates code, validates it (e.g., via compilation), and self-corrects automatically.
- Remote Inference — Point Cercano at a remote Ollama instance (e.g., a Mac Studio on your LAN) for access to larger models. Runtime-configurable with automatic fallback if the remote goes down.
- Pluggable Engine Architecture - Inference backends are abstracted behind `InferenceEngine` and `EmbeddingService` interfaces. Ollama is the built-in engine; adding new backends (ONNX, vLLM, etc.) requires only implementing the interface and registering it, with no changes to the core agent, router, or MCP tools.
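The Smart Router bullet above describes routing by embedding similarity rather than by an LLM call. The following is a minimal sketch of that general idea (nearest labelled prototype by cosine similarity), not Cercano's actual router. The `prototype` type, the `route` function, and the toy three-dimensional vectors are all hypothetical; a real router would obtain its vectors from an embedding model such as nomic-embed-text.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// prototype is a hypothetical labelled example embedding.
type prototype struct {
	target string // "local" or "cloud"
	vec    []float64
}

// route returns the target of the most similar prototype and its score.
// With no prototypes it defaults to "local".
func route(query []float64, protos []prototype) (string, float64) {
	best, bestScore := "local", math.Inf(-1)
	for _, p := range protos {
		if s := cosine(query, p.vec); s > bestScore {
			best, bestScore = p.target, s
		}
	}
	return best, bestScore
}

func main() {
	// Toy vectors standing in for real embeddings.
	protos := []prototype{
		{"local", []float64{1, 0, 0}},
		{"cloud", []float64{0, 1, 0}},
	}
	target, score := route([]float64{0.9, 0.1, 0}, protos)
	fmt.Printf("route=%s score=%.2f\n", target, score)
}
```

Because routing reduces to a handful of dot products, the decision itself costs far less than a model call, which is consistent with the "no LLM call needed" claim above.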
When used as a co-processor inside cloud agents, Cercano provides specialized tools that keep work local:
| Tool | What it does | Why local? |
|---|---|---|
| `cercano_summarize` | Condense files, logs, or text into concise summaries | Keep large content out of cloud context windows |
| `cercano_extract` | Pull specific info (errors, signatures, config) from large text | Filter noise locally, send only what matters |
| `cercano_classify` | Triage errors, logs, or code with category + confidence | Quick local triage without cloud round-trip |
| `cercano_explain` | Explain what code does, its components and data flow | Understand code locally before deciding what to send to cloud |
| `cercano_local` | General-purpose prompt execution against local models | Offload any simple task to local inference |
| `cercano_fetch` | Fetch a URL and extract readable text (HTML stripped to plain text) | Read web pages without stuffing raw HTML into cloud context |
| `cercano_research` | Research a question via DuckDuckGo search + local model analysis | Get distilled, sourced answers without browsing the web yourself |
| `cercano_document` | Generate doc comments for exported Go symbols and write them to the file | Entire read-think-write cycle stays local; host never sees file contents |
| `cercano_deep_research` | Multi-source research with ranked findings, reference chasing, and gap analysis | Dozens of page fetches and analyses stay local; host gets only the compiled report |
Run cercano_init once per project to make all Cercano tools project-aware. It scans the repo, feeds key files through a local model, and writes .cercano/context.md — a concise reference document that gets automatically prepended to all tool calls. The host AI can optionally provide domain knowledge it already has.
If you use a Cercano tool without initializing first, it will suggest running init.
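To illustrate the "prepended to all tool calls" behavior, here is a minimal sketch of how a tool handler could attach the project context to a prompt. It is an assumption about the mechanism, not Cercano's implementation: `withProjectContext` is a hypothetical helper, and only the `.cercano/context.md` path comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// withProjectContext (hypothetical) prepends <root>/.cercano/context.md to
// prompt. The bool result reports whether a context file was found.
func withProjectContext(root, prompt string) (string, bool, error) {
	data, err := os.ReadFile(filepath.Join(root, ".cercano", "context.md"))
	if errors.Is(err, fs.ErrNotExist) {
		// Project not initialized: pass the prompt through unchanged so the
		// caller can suggest running cercano_init.
		return prompt, false, nil
	}
	if err != nil {
		return "", false, err
	}
	return string(data) + "\n\n" + prompt, true, nil
}

func main() {
	out, found, err := withProjectContext(".", "Summarize internal/agent/router.go")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !found {
		fmt.Println("hint: run cercano_init to make tools project-aware")
	}
	fmt.Println(out)
}
```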
Cercano tracks how much work stays local and how many cloud tokens you save:
- `cercano_stats` - MCP tool that returns usage summary, token savings, and breakdowns by tool, model, and day.
- `cercano_submit_usage` - Opt-in tool for host agents to submit their cloud token usage data, enabling accurate local-vs-cloud comparison. (Usually handled automatically by the PostToolUse hook.)
- `cercano stats` - CLI command for a quick terminal summary of cumulative usage.
- Cloud token capture - A PostToolUse hook parses Claude Code's transcript to automatically record cloud token usage alongside local metrics. Run `cercano setup` to configure the hook.
Data is stored locally in ~/.config/cercano/telemetry.db (SQLite). No prompt content, file paths, or credentials are ever recorded.
- MCP Server — Expose all tools to any MCP-compatible agent (Claude Code, Cursor, Copilot, etc.).
- IDE Integration — VS Code extension with gRPC-based architecture. Zed extension in progress.
- Model Discovery - Query available models on any Ollama instance via `cercano_models`.
- Runtime Configuration - Switch models, Ollama endpoints, and cloud providers on the fly via `cercano_config`.
Cercano can run as a standalone gRPC server (for IDE clients) or embedded inside an MCP host (for cloud agents like Claude Code). Both modes share the same core engine.
           Standalone Mode                 Co-Processor Mode
            (IDE clients)                   (Cloud agents)
            ┌───────────┐                   ┌──────────────┐
            │  VS Code  │                   │ Claude Code  │
            │ Zed, etc. │                   │ Cursor, etc. │
            └─────┬─────┘                   └──────┬───────┘
                  │ gRPC                           │ MCP (stdio)
                  │                                │
    ┌─────────────┴─────────────┐   ┌──────────────┴───────────────┐
    │      CERCANO SERVER       │   │     CERCANO (embedded)      │
    │                           │   │                             │
    │  ┌─────────────────────┐  │   │  ┌───────────────────────┐  │
    │  │        Agent        │  │   │  │   MCP Tool Handlers   │  │
    │  │ ┌───────┐ ┌───────┐ │  │   │  │  summarize, extract,  │  │
    │  │ │Router │ │ Loop  │ │  │   │  │   classify, explain   │  │
    │  │ └───────┘ └───────┘ │  │   │  └───────────┬───────────┘  │
    │  └──────────┬──────────┘  │   │              │              │
    │             │             │   │        ┌─────┴─────┐        │
    │             │             │   │        │   Agent   │        │
    └─────────────┼─────────────┘   │        └─────┬─────┘        │
                  │                 └──────────────┼───────────────┘
                  │                                │
         ┌────────┴────────┐              ┌────────┴────────┐
         │  Engine Layer   │              │  Engine Layer   │
         │ (Ollama, etc.)  │              │ (Ollama, etc.)  │
         └─────────────────┘              └─────────────────┘
- Core Agent (Go) — Handles model routing, agentic loops, conversation history, and provides a gRPC interface.
- Smart Router — Uses semantic classification (via embeddings) to route requests. Ultra-fast, no LLM call needed.
- Coordinator (LoopAgent) — Google ADK-backed iterative loop that generates code, validates it, and self-corrects with cloud escalation.
- Engine Layer - Pluggable inference backends behind `InferenceEngine` and `EmbeddingService` interfaces. Ollama is the default; new engines register via `EngineRegistry`.
- MCP Tool Handlers - Specialized prompt templates for summarize, extract, classify, explain, fetch, and research. Each tool wraps the core agent with task-specific prompting.
- Conversation Store — Server-side multi-turn history so the LLM can resolve references across requests.
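The Engine Layer item above names `InferenceEngine` and `EngineRegistry` but does not show their shape. The sketch below is one plausible arrangement, offered only as an illustration: the method sets (`Name`, `Generate`, `Register`, `Get`) are assumptions, and `echoEngine` is a stand-in backend invented for the example.

```go
package main

import (
	"context"
	"fmt"
)

// InferenceEngine is named in the README; this method set is an assumption.
type InferenceEngine interface {
	Name() string
	Generate(ctx context.Context, model, prompt string) (string, error)
}

// EngineRegistry maps backend names to engines (hypothetical shape).
type EngineRegistry struct {
	engines map[string]InferenceEngine
}

// NewEngineRegistry returns an empty registry.
func NewEngineRegistry() *EngineRegistry {
	return &EngineRegistry{engines: make(map[string]InferenceEngine)}
}

// Register adds an engine and rejects duplicate names.
func (r *EngineRegistry) Register(e InferenceEngine) error {
	if _, ok := r.engines[e.Name()]; ok {
		return fmt.Errorf("engine %q already registered", e.Name())
	}
	r.engines[e.Name()] = e
	return nil
}

// Get looks up an engine by name.
func (r *EngineRegistry) Get(name string) (InferenceEngine, bool) {
	e, ok := r.engines[name]
	return e, ok
}

// echoEngine is a stand-in backend used only for this illustration.
type echoEngine struct{}

func (echoEngine) Name() string { return "echo" }

func (echoEngine) Generate(_ context.Context, model, prompt string) (string, error) {
	return fmt.Sprintf("[%s] %s", model, prompt), nil
}

func main() {
	reg := NewEngineRegistry()
	if err := reg.Register(echoEngine{}); err != nil {
		panic(err)
	}
	if e, ok := reg.Get("echo"); ok {
		out, _ := e.Generate(context.Background(), "demo-model", "hello")
		fmt.Println(out)
	}
}
```

Under this arrangement, callers depend only on the interface, so a new backend means one new type plus one `Register` call. That matches the claim that the agent, router, and MCP tools need no changes.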
- `source/server/`: The core Go-based AI agent and gRPC server.
- `source/clients/`: IDE-specific extensions.
  - `vscode/`: VS Code extension (TypeScript).
  - `zed/`: Zed extension (Rust).
- `source/proto/`: Protocol Buffer definitions for gRPC.
- `test/`: Integration and sandbox tests.
- `conductor/`: Product definitions, tech stack, and project planning documents.
- Backend - Go (Golang)
- Local LLM Runtime - Ollama (qwen3-coder, nomic-embed-text)
- Cloud LLMs - Google Gemini, Anthropic Claude
- Communication - gRPC
- Frontend/Clients - TypeScript (VS Code), Rust (Zed)
brew tap bryancostanich/cercano
brew install cercano
cercano setup # detects/installs Ollama, pulls models, creates config
Requires Go 1.21+.
git clone https://github.com/bryancostanich/Cercano.git
cd Cercano/source/server
make build
bin/cercano setup # detects/installs Ollama, pulls models, creates config
bin/cercano # starts the gRPC server
`cercano setup` handles everything: if no AI engine backend is detected, it offers to install Ollama automatically (via Homebrew on macOS or the official installer on Linux), starts it, and pulls the required models. Use `--install-engine` to skip the interactive prompt for scripted/CI use.
claude mcp add --transport stdio cercano -- cercano --mcp
Or add to your project's `.mcp.json`. In `--mcp` mode, Cercano starts an embedded gRPC server, so no separate server is needed.
- Install the VS Code extension dependencies:
  cd source/clients/vscode && npm install
- Open `source/clients/vscode` in VS Code and press F5 to launch.
- In the Extension Development Host, open the Chat panel and type `@cercano` followed by your question.
cd source/server
make dev # build + restart in one command
System config at `~/.config/cercano/config.yaml` persists across restarts (Ollama URL, model, port, etc.).
Cercano is local-first — cloud providers are only used for escalation when local models can't handle a task.
- In the Chat panel, type `@cercano /config` to open the configuration menu.
- Set your API key (Google Gemini or Anthropic Claude).
- Select your preferred cloud provider for escalation.
The following settings are available under cercano.* in VS Code Settings:
| Setting | Default | Description |
|---|---|---|
| `cercano.localModel` | `qwen3-coder` | Ollama model for local inference (changeable at runtime via `@cercano /config`) |
| `cercano.server.autoLaunch` | `true` | Automatically start the server on activation |
| `cercano.server.binaryPath` | (empty) | Override path to the server binary |
| `cercano.server.port` | `50052` | gRPC server port |
| `cercano.ollama.url` | `http://localhost:11434` | Ollama server URL |
| `cercano.provider` | `local` | Cloud provider for escalation (`google` or `anthropic`) |
| `cercano.model` | (empty) | Override cloud model name |
Cercano can be used as an MCP (Model Context Protocol) server, allowing cloud-based agents like Claude Code and Cursor to delegate work to local models — faster, private, and at zero cost.
1. Build Cercano:

       cd source/server
       make build

2. Add to Claude Code (choose one):

   Via CLI:

       claude mcp add --transport stdio cercano -- /path/to/Cercano/source/server/bin/cercano --mcp

   Via `.mcp.json` (project scope):

       {
         "mcpServers": {
           "cercano": {
             "type": "stdio",
             "command": "/path/to/Cercano/source/server/bin/cercano",
             "args": ["--mcp"]
           }
         }
       }

   In `--mcp` mode, Cercano starts an embedded gRPC server automatically, so no separate server process is needed.
See the tool table in Key Features above for the full list. Additional utility tools:
| Tool | Description |
|---|---|
| `cercano_models` | List models available on the active Ollama instance. Useful for discovering models on a remote machine. |
| `cercano_config` | Switch models, Ollama endpoints, or cloud providers at runtime without restarting. |
Once the MCP server is connected, your agent can call Cercano tools directly:
Chat query (offload to local model):
cercano_local(prompt: "What is a goroutine in Go? Answer in one sentence.")
→ "A goroutine is a lightweight thread of execution managed by the Go runtime."
[Model: qwen3-coder, Confidence: 1.00, Escalated: false]
Switch local model at runtime:
cercano_config(action: "set", local_model: "GLM-4.7-Flash")
→ Configuration update success: updated: [local_model=GLM-4.7-Flash]
Agentic code generation (with validation loop):
cercano_local(
prompt: "Add a health check endpoint that returns JSON",
file_path: "internal/server/health.go",
work_dir: "/path/to/project/source/server"
)
→ Generated code with automatic build validation and self-correction.
Point at a remote Ollama instance:
cercano_config(action: "set", ollama_url: "http://mac-studio.local:11434")
→ Configuration update success: updated: [ollama_url=http://mac-studio.local:11434]
Discover available models:
cercano_models()
→ Available models (2):
- qwen3-coder:latest (4.7 GB)
- llama3:70b (39.1 GB)
Summarize a file locally (keep large content out of cloud context):
cercano_summarize(file_path: "internal/agent/router.go", max_length: "brief")
→ "This Go package implements a smart routing system that selects between local
and cloud AI models based on semantic similarity of user requests."
Extract specific info from large text:
cercano_extract(text: "<500 lines of logs>", query: "error and warning messages")
→ WARN Remote endpoint health check failed (attempt 1/3)
ERROR Remote endpoint unreachable after 3 attempts, falling back to local
Classify/triage an error locally:
cercano_classify(
text: "panic: runtime error: invalid memory address or nil pointer dereference",
categories: "bug, config issue, infra problem"
)
→ Category: bug
Confidence: high
Reasoning: Nil pointer dereference is a programming bug in the code logic.
Explain unfamiliar code before deciding what to send to cloud:
cercano_explain(file_path: "internal/agent/router.go")
→ This code implements a smart routing system for an AI agent that selects
between local and cloud models based on semantic similarity...
Research a question (search + fetch + local model analysis):
cercano_research(query: "How does the Ollama REST API list models?")
→ Ollama lists models via GET /api/tags, which returns a JSON array of
installed models with name, size, and modification date...
Sources:
- https://docs.ollama.com/api/introduction
- https://github.com/ollama/ollama/blob/main/docs/api.md
Multi-turn conversation:
cercano_local(prompt: "Explain the SmartRouter", conversation_id: "abc123")
cercano_local(prompt: "How does it handle escalation?", conversation_id: "abc123")
→ Second call has full context from the first.
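The multi-turn example relies on the server keeping history per `conversation_id`. The following is a minimal in-memory sketch of such a store, assuming a simple cap on the number of retained turns. `Turn`, `ConversationStore`, and the `maxTurns` limit are hypothetical names chosen for illustration, not Cercano's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// Turn is one message in a conversation (hypothetical shape).
type Turn struct {
	Role    string // "user" or "assistant"
	Content string
}

// ConversationStore keeps per-conversation history in memory.
type ConversationStore struct {
	mu       sync.Mutex
	maxTurns int
	history  map[string][]Turn
}

// NewConversationStore returns a store that retains at most maxTurns per ID.
func NewConversationStore(maxTurns int) *ConversationStore {
	return &ConversationStore{maxTurns: maxTurns, history: make(map[string][]Turn)}
}

// Append records a turn and drops the oldest turns beyond maxTurns.
func (s *ConversationStore) Append(id string, turn Turn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	h := append(s.history[id], turn)
	if len(h) > s.maxTurns {
		h = h[len(h)-s.maxTurns:]
	}
	s.history[id] = h
}

// History returns a copy of the stored turns for id, oldest first.
func (s *ConversationStore) History(id string) []Turn {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]Turn(nil), s.history[id]...)
}

func main() {
	store := NewConversationStore(4)
	store.Append("abc123", Turn{"user", "Explain the SmartRouter"})
	store.Append("abc123", Turn{"assistant", "It routes requests using embeddings."})
	store.Append("abc123", Turn{"user", "How does it handle escalation?"})
	for _, turn := range store.History("abc123") {
		fmt.Printf("%s: %s\n", turn.Role, turn.Content)
	}
}
```

On the second call, a handler would load `History("abc123")` and send it along with the new prompt, which is how a reference like "it" in the follow-up question can be resolved.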
| Agent | Status |
|---|---|
| Claude Code | Verified — tool discovery, chat queries, config updates, model switching |
| Cursor | Not yet tested |
| Flag | Default | Description |
|---|---|---|
| `--grpc-addr` | `localhost:50052` | Address of the Cercano gRPC server |
Cercano publishes its tools as Agent Skills — an open standard for packaging AI capabilities so they're discoverable by any compatible agent. Over 30 agents support this standard, including Claude Code, Cursor, Copilot, Gemini CLI, Codex, and more.
| Skill | Description |
|---|---|
| `cercano-local` | General-purpose local inference: chat queries and agentic code generation |
| `cercano-summarize` | Summarize text or files locally (brief, medium, or detailed) |
| `cercano-extract` | Pull specific information from text (errors, signatures, config values) |
| `cercano-classify` | Categorize/triage text with confidence scores and reasoning |
| `cercano-explain` | Explain code: what it does, key interfaces, and data flow |
| `cercano-fetch` | Fetch a URL and extract readable text (HTML stripped to plain text) |
| `cercano-research` | Research a question via DuckDuckGo search + local model analysis |
| `cercano-config` | View/change Cercano's runtime configuration |
| `cercano-models` | List available models on the connected Ollama instance |
| `cercano-init` | Initialize project context for project-aware responses |
| `cercano-document` | Generate doc comments for exported Go symbols and write directly to the file |
| `cercano-deep-research` | Multi-source research with ranked findings, reference chasing, and gap analysis |
| `cercano-stats` | View usage statistics and cloud token savings |
Each skill is a SKILL.md file that tells the agent what the tool does, its parameters, and how to invoke it via MCP.
Agents scan well-known directories for SKILL.md files at startup:
| Directory | Discovered by |
|---|---|
| `.agents/skills/<skill-name>/SKILL.md` | Any Agent Skills-compatible agent |
| `.claude/skills/<skill-name>/SKILL.md` | Claude Code (also appears as slash commands) |
Cercano ships its skill definitions in both locations.
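To make the discovery step concrete, the sketch below scans the two directories from the table for `SKILL.md` files. It shows roughly what an agent might do at startup rather than how any particular agent implements it. `discoverSkills` and `skillDirs` are hypothetical names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// skillDirs are the two well-known locations listed in the table above.
var skillDirs = []string{".agents/skills", ".claude/skills"}

// discoverSkills (hypothetical) maps each skill name to the SKILL.md paths
// found for it under root.
func discoverSkills(root string) (map[string][]string, error) {
	found := make(map[string][]string)
	for _, dir := range skillDirs {
		// Matches <root>/<dir>/<skill-name>/SKILL.md.
		matches, err := filepath.Glob(filepath.Join(root, dir, "*", "SKILL.md"))
		if err != nil {
			return nil, err
		}
		for _, m := range matches {
			name := filepath.Base(filepath.Dir(m))
			found[name] = append(found[name], m)
		}
	}
	return found, nil
}

func main() {
	skills, err := discoverSkills(".")
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(skills))
	for name := range skills {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order
	for _, name := range names {
		fmt.Println(name, skills[name])
	}
}
```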
To make Cercano's skills available to your agent, copy the skill files into your project:
# For any Agent Skills-compatible agent
cp -r /path/to/Cercano/.agents/skills/* .agents/skills/
# For Claude Code specifically (enables /cercano-* slash commands)
cp -r /path/to/Cercano/.claude/skills/* .claude/skills/
The `cercano_skills` MCP tool also provides programmatic access to skill definitions:
cercano_skills(action: "list") → catalog of all skills
cercano_skills(action: "get", name: "cercano-local") → full SKILL.md content
For a detailed guide on writing custom SKILL.md files, see docs/agent-skills-guide.md.
Cercano can delegate inference to a remote Ollama instance — for example, another machine on your LAN with more GPU memory and larger models. The remote endpoint is runtime-configurable with automatic fallback to local Ollama if the remote goes down.
1. Ensure Ollama is running on the remote machine and accessible over the network:

       # On the remote machine (e.g., mac-studio.local)
       OLLAMA_HOST=0.0.0.0 ollama serve

2. Point Cercano at the remote instance:

   Via environment variable (at startup):

       OLLAMA_URL=http://mac-studio.local:11434 bin/agent

   Via MCP at runtime (no restart needed):

       cercano_config(action: "set", ollama_url: "http://mac-studio.local:11434")

3. Discover available models on the remote machine:

       cercano_models()
       → Available models (3):
         - qwen3-coder:latest (4.7 GB, modified: 2026-03-15T10:30:00Z)
         - llama3:70b (39.1 GB, modified: 2026-03-14T09:00:00Z)
         - deepseek-coder-v2:latest (8.9 GB, modified: 2026-03-13T14:00:00Z)

4. Switch to a model that's only available on the remote:

       cercano_config(action: "set", local_model: "llama3:70b")
When a remote endpoint is configured, Cercano monitors it with periodic health checks:
- Pings the remote every 30 seconds via `GET /api/tags`
- After 3 consecutive failures, automatically switches to local Ollama
- When the remote recovers, automatically switches back
- Response metadata includes `[Endpoint: url]` or `[Endpoint: url (fallback)]` so you always know which instance served the request
No configuration is needed — fallback is automatic whenever a remote URL is set.
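The fallback rules above amount to a small state machine. The sketch below models only the switching logic, under two assumptions: a single successful check counts as recovery, and the health check itself (the periodic `GET /api/tags` request) happens elsewhere and is passed in as a boolean. `endpointMonitor` and `Observe` are hypothetical names.

```go
package main

import "fmt"

// failureThreshold is the number of consecutive failures before fallback.
const failureThreshold = 3

// endpointMonitor (hypothetical) tracks which Ollama URL should serve requests.
type endpointMonitor struct {
	remote, local string
	failures      int // consecutive failed health checks, capped at the threshold
}

// Observe records one health-check result and returns the active endpoint
// plus whether that endpoint is the local fallback.
func (m *endpointMonitor) Observe(healthy bool) (string, bool) {
	if healthy {
		m.failures = 0 // assumption: one success switches back to the remote
	} else if m.failures < failureThreshold {
		m.failures++
	}
	if m.failures >= failureThreshold {
		return m.local, true
	}
	return m.remote, false
}

func main() {
	m := &endpointMonitor{
		remote: "http://mac-studio.local:11434",
		local:  "http://localhost:11434",
	}
	// Simulated health-check results: up, three failures, then recovery.
	for _, healthy := range []bool{true, false, false, false, true} {
		url, fallback := m.Observe(healthy)
		label := ""
		if fallback {
			label = " (fallback)"
		}
		fmt.Printf("[Endpoint: %s%s]\n", url, label)
	}
}
```

In a long-running server, something like a `time.Ticker` firing every 30 seconds would drive `Observe`; that wiring is omitted here to keep the sketch deterministic.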
Cercano is in active development. For detailed information on the project's goals and technical decisions, refer to the documents in the conductor/ directory.
cd source/server
make all # Build both agent and MCP server
make test # Run all tests
- Competitive Audit: Agent Features Landscape - Feature matrix across 12+ open-source and commercial agents (Codex, Aider, Continue, Cody, OpenHands, SWE-Agent, Claude Code, Cursor, Windsurf, GitHub Copilot, JetBrains AI, Amazon Q) to inform Cercano's tool design and roadmap.
- Semantic Codebase Search - Embedding-based code search by intent ("find auth-related code"), not just string matching. Requires indexing pipeline, storage, and nearest-neighbor retrieval.
- User-Friendly Distribution - Setup/launch scripts, Docker containerization, and CI/CD pipeline with GitHub Actions for automated cross-platform releases.
- AI Engine Agnosticism - Abstract the local inference layer to support pluggable backends (ONNX Runtime, Enso, etc.) beyond Ollama.
- Web Research Tool - Fetch URLs, search the web via DuckDuckGo, and use local models to analyze and distill results. Keeps raw web content out of the cloud context window.
- Standalone CLI - Create a standalone command-line interface (CLI) for Cercano that doesn't rely on other CLI integrations.
- PDF Parsing - Extract text from local and remote PDFs for use with summarize, extract, explain, and research tools.
- Documentation Site Indexing - Crawl a documentation site once, index it persistently, and make it searchable across sessions (similar to Cursor's @Docs).
- Better VS Code Agent Window Integration - Make Cercano available as a model dropdown in the VS Code agent window alongside Gemini, Claude, etc.
- LLM-Based Conversation Compaction - Replace simple truncation-based compaction with LLM-powered summarization for better context retention in long conversations.
- Per-Model Configuration - Configurable per-model settings (context window, classification thresholds, history depth, compaction limits) instead of hardcoded constants.
- Simplify Provider Routing - Evaluate removing the SmartRouter's embedding-based local/cloud routing in favor of always-local with coordinator-driven cloud escalation.
- Zed Extension - Build out the Rust-based Zed extension (`source/clients/zed/`) with feature parity to the VS Code extension.
This is not an officially supported Google product.
Canonical Repo :: https://github.com/bryancostanich/Cercano
Google Mirror :: https://github.com/GoogleDevRelExplorations/cercano