Providers

Aar is provider-agnostic — swap between Anthropic, OpenAI, Ollama, Gemini, or any OpenAI-compatible endpoint by changing one config field. No agent code changes required.

Runtime provider switching

You can switch between providers mid-session without losing conversation history.

Configuration

Define named providers in config.json:

{
  "provider": "claude",
  "providers": {
    "claude": {
      "name": "anthropic", "model": "claude-sonnet-4-6",
      "context_window": 1000000, "token_budget": 500000, "cost_limit": 5.0
    },
    "gpt4": {
      "name": "openai", "model": "gpt-4o",
      "context_window": 128000, "token_budget": 500000, "cost_limit": 5.0
    },
    "local": {
      "name": "ollama", "model": "llama3", "base_url": "http://localhost:11434",
      "context_window": 32768, "token_budget": 0, "cost_limit": 0.0
    }
  }
}

Each provider profile can override context_window, token_budget, and cost_limit. These are model-coupled settings — context windows differ across models, and local models have no API cost. When a provider does not set these fields, the global values from AgentConfig apply as fallback.

The provider field can be a string key referencing providers, or an inline object (backward compatible).
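
The fallback rule for `context_window`, `token_budget`, and `cost_limit` can be sketched as below. This is only an illustration, not Aar's actual resolution code: `resolve_limits` and the plain-dict representation are hypothetical, and the sketch assumes a field counts as "not set" only when its key is absent from the profile.

```python
from typing import Any

# Fields that a provider profile may override (named in the text above).
LIMIT_FIELDS = ("context_window", "token_budget", "cost_limit")


def resolve_limits(profile: dict[str, Any], global_cfg: dict[str, Any]) -> dict[str, Any]:
    """Return effective limits: the profile value if present, else the global one.

    Hypothetical helper. "Not set" is assumed to mean "key missing", so an
    explicit 0 in the profile is kept rather than replaced by the global value.
    """
    return {
        name: profile[name] if name in profile else global_cfg[name]
        for name in LIMIT_FIELDS
    }


# Example values only; the global defaults here are made up for the sketch.
global_cfg = {"context_window": 200000, "token_budget": 500000, "cost_limit": 5.0}
local_profile = {"name": "ollama", "model": "llama3", "context_window": 32768, "token_budget": 0}

effective = resolve_limits(local_profile, global_cfg)
print(effective)
```

Here the profile omits `cost_limit`, so only that field falls back to the global value.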

Slash command

All interactive transports (CLI, TUI, TUI Fixed) support the /model command:

| Command | Effect |
| --- | --- |
| `/model` | Show active provider and list available keys |
| `/model gpt4` | Switch to a named provider key |
| `/model openai/gpt-4o` | Ad-hoc switch by provider/model |

Switching is instant — the next turn uses the new provider. Conversation history is preserved because the internal event model is provider-agnostic.
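
One way to tell the two argument forms of `/model` apart is sketched below. The function `parse_model_arg` is hypothetical, and checking named keys before the `provider/model` form is an assumption about precedence rather than documented behaviour.

```python
def parse_model_arg(arg: str, known_keys: set[str]) -> dict[str, str]:
    """Classify a /model argument as a named key or an ad-hoc provider/model pair."""
    arg = arg.strip()
    if arg in known_keys:
        # Assumed precedence: a configured key wins over the ad-hoc form.
        return {"kind": "key", "key": arg}
    provider, sep, model = arg.partition("/")
    if sep and provider and model:
        return {"kind": "adhoc", "provider": provider, "model": model}
    raise ValueError(f"unknown provider key or malformed provider/model: {arg!r}")


keys = {"claude", "gpt4", "local"}
print(parse_model_arg("gpt4", keys))
print(parse_model_arg("openai/gpt-4o", keys))
```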

ACP

ACP stdio already supports set_session_model — it now also resolves named provider keys from the config. ACP HTTP accepts provider in POST /runs to select a named key.

Web API

Pass "provider": "gpt4" in the request body of POST /chat or POST /chat/stream to use a named provider for that request.
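
A request of this shape could be built with the standard library as follows. The sketch only constructs the request and does not send it. The base URL and the `message` field name are assumptions; only the `provider` field and the `/chat` path come from the text above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: wherever the Aar web server listens


def build_chat_request(message: str, provider_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request selecting a named provider."""
    body = {
        "message": message,        # assumed field name for the user input
        "provider": provider_key,  # named provider key, as described above
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Summarise this repo", "gpt4")
print(req.get_method(), req.full_url)
```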

Programmatic

import asyncio

from agent.core.agent import Agent
from agent.core.config import AgentConfig, ProviderConfig

config = AgentConfig(
    provider="claude",
    providers={
        "claude": ProviderConfig(name="anthropic", model="claude-sonnet-4-6"),
        "gpt4": ProviderConfig(name="openai", model="gpt-4o"),
    },
)


async def main() -> None:
    agent = Agent(config=config)
    session = await agent.run("Hello from Claude", session=None)

    # Switch mid-session; passing the same session keeps the history
    agent.switch_provider("gpt4")
    session = await agent.run("Now using GPT-4o", session=session)

    # Ad-hoc switch (no registry key needed)
    agent.switch_provider("ollama/llama3")


# agent.run() is a coroutine, so it needs an event loop
asyncio.run(main())

Anthropic

from agent import AgentConfig, ProviderConfig

config = AgentConfig(provider=ProviderConfig(
    name="anthropic",
    model="claude-sonnet-4-20250514",
    api_key="sk-ant-...",         # or ANTHROPIC_API_KEY env var
))

Supports: tools, streaming, extended thinking (reasoning blocks), prompt caching.

Prompt caching

Anthropic’s prompt caching avoids re-processing the static prefix (system prompt + tool definitions) on every API call. After the first turn, the cached prefix is billed at one tenth of the normal input price (0.1×), which matters because the prefix is re-sent with every step.

Enable — add "prompt_caching": true to the provider’s extra block:

{
  "providers": {
    "claude": {
      "name": "anthropic",
      "model": "claude-sonnet-4-6",
      "extra": {
        "prompt_caching": true
      }
    }
  }
}

How it works — when enabled, Aar adds cache_control: {"type": "ephemeral"} breakpoints to the last system-prompt content block and the last tool definition. Anthropic caches everything from the start of the request up to these breakpoints. On turn 2+ the API returns cache_read_input_tokens instead of re-processing the prefix.
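
The breakpoint placement described above can be sketched as below. This is not Aar's implementation: `add_cache_breakpoints` is a hypothetical helper and the tool names are made up. The block and tool shapes follow the Anthropic Messages API request format.

```python
import copy
from typing import Any

EPHEMERAL = {"type": "ephemeral"}


def add_cache_breakpoints(
    system: list[dict[str, Any]], tools: list[dict[str, Any]]
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return copies with cache_control on the last system block and the last tool."""
    system = copy.deepcopy(system)
    tools = copy.deepcopy(tools)
    if system:
        system[-1]["cache_control"] = dict(EPHEMERAL)
    if tools:
        tools[-1]["cache_control"] = dict(EPHEMERAL)
    return system, tools


system_blocks = [{"type": "text", "text": "You are a coding agent."}]
tool_defs = [  # hypothetical tools, for illustration only
    {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object", "properties": {}}},
    {"name": "write_file", "description": "Write a file", "input_schema": {"type": "object", "properties": {}}},
]

cached_system, cached_tools = add_cache_breakpoints(system_blocks, tool_defs)
print(cached_system[-1]["cache_control"], cached_tools[-1]["name"])
```

Only the final item in each list carries a breakpoint; earlier blocks are covered because caching applies to everything up to the marker.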

Cost implications:

| Turn | Without caching | With caching |
| --- | --- | --- |
| Turn 1 (cold) | 2,400 tok at full price | 2,400 tok at 1.25× (cache write premium) |
| Turns 2–N (warm) | 2,400 tok at full price each | 2,400 tok at 0.1× each (cache read) |
| 6-turn session | 14,400 full-price tokens | 3,000 + 12,000 × 0.1 = 4,200 tokens effective |

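For the 6-turn session above, prompt caching cuts the effective cost of the static prefix by about 71% (4,200 instead of 14,400 full-price token equivalents). Longer sessions save more, approaching 90% as the one-off cache write is amortised over more cache reads.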
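
The arithmetic behind the cost table can be reproduced with a few lines. The multipliers (1.25× for the cache write, 0.1× for cache reads) and the 2,400-token prefix are the figures from the table; `prefix_cost` is a hypothetical helper, and the sketch assumes every turn after the first is a cache hit.

```python
def prefix_cost(
    prefix_tokens: int,
    turns: int,
    write_mult: float = 1.25,
    read_mult: float = 0.1,
) -> tuple[float, float]:
    """Return (uncached, cached) prefix cost in full-price token equivalents."""
    uncached = float(prefix_tokens * turns)
    # One cache write on turn 1, then a cache read on each remaining turn.
    cached = prefix_tokens * (write_mult + read_mult * (turns - 1))
    return uncached, cached


uncached, cached = prefix_cost(2400, 6)
saving = 1 - cached / uncached
print(f"uncached={uncached:.0f} cached={cached:.0f} saving={saving:.1%}")
```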

Metrics — when caching is active, ProviderMeta.usage includes two extra keys:

| Key | Meaning |
| --- | --- |
| `cache_read_tokens` | Tokens served from cache (cheap) |
| `cache_write_tokens` | Tokens written to cache on the first call |

These are already captured by Aar and used in cost estimation (see Tokens §6). The /inspect command and session JSONL files include them when present.

Requirements:

- Anthropic API (direct or via a proxy that preserves cache_control fields)
- anthropic Python SDK ≥ 0.40
- The cached prefix must be ≥ 1,024 tokens (Anthropic minimum); a typical Aar system prompt + 7 built-in tools comfortably exceeds this

When to leave it off:

- Corporate API proxies that strip unknown fields from the request body
- Single-turn aar run invocations (no second turn to benefit from the cache)
- Providers other than Anthropic (the flag is ignored for OpenAI, Ollama, etc.)

OpenAI

from agent import AgentConfig, ProviderConfig

config = AgentConfig(provider=ProviderConfig(
    name="openai",
    model="gpt-4o",
    api_key="sk-...",             # or OPENAI_API_KEY env var
))

Compatible with any OpenAI-compatible API (Azure, Together, etc.) via base_url.

Ollama

from agent import AgentConfig, ProviderConfig

config = AgentConfig(provider=ProviderConfig(
    name="ollama",
    model="llama3.2",
    base_url="http://localhost:11434",   # default
    extra={"keep_alive": "10m"},         # keep the model loaded between calls
))

Enable reasoning extraction for models like deepseek-r1:

ProviderConfig(name="ollama", model="deepseek-r1", extra={"supports_reasoning": True})

Enable vision for models with a vision encoder (see Multimodal input):

ProviderConfig(name="ollama", model="qwen2.5vl:7b", extra={"supports_vision": True})

Audio note: Gemma 4 supports audio at the model level, but Ollama's API does not yet expose audio input (as of v0.20). Audio blocks attached via @file will be dropped with a warning. The framework types are ready for when Ollama adds support.

Gemini

See docs/providers_gemini.md for the full Gemini setup guide — SDK mode, HTTP mode, thinking/reasoning, and all extra keys.

Quick start:

from agent import AgentConfig, ProviderConfig

config = AgentConfig(provider=ProviderConfig(
    name="gemini",
    model="gemini-2.5-flash",   # or "gemini-2.5-pro"
    api_key="...",               # or GEMINI_API_KEY env var
))

Install: pip install aar-agent[gemini] (pulls google-genai; HTTP mode only needs httpx, already included).

Supports: tools, streaming, thinking/reasoning (Flash optional, Pro default), vision.

Generic (OpenAI-compatible)

Any OpenAI-compatible HTTP endpoint, using a custom api-key header for authentication.

config = AgentConfig(provider=ProviderConfig(
    name="generic",
    model="gpt-4o-2024-08-06",
    api_key="...",           # or GENERIC_API_KEY env var
    extra={
        "endpoint": "https://api.provider.com/gpt/gpt-5.1",
        # Optional overrides:
        # "extra_headers": {"X-Trace-Id": "abc123"},
        # "timeout": 120.0,
        # "response_format": "json_object",  # "text" | "json_object" | "json_schema"
    },
))

The endpoint URL can also be set via the GENERIC_ENDPOINT environment variable. Supports: tools, streaming, structured output (json_object / json_schema).

Install: pip install aar-agent[generic] (uses httpx, already included in the base install).

Token reporting

Each provider reports token counts differently. Aar normalises them into a single usage dict {"input_tokens": N, "output_tokens": M} on the ProviderMeta event. See Tokens, costs, and budgets for how the counts flow through the system.

| Provider | Non-streaming | Streaming |
| --- | --- | --- |
| Anthropic | `usage` block in response body, always present | Collected from the `message_stop` SSE event; attached to the final `StreamDelta(done=True)` |
| OpenAI | `usage` in response body, always present | Requested via `stream_options: {include_usage: true}`; trailing usage chunk attached to final done-delta |
| Ollama | `prompt_eval_count` / `eval_count` in response body | Same fields on the final `done: true` NDJSON chunk; attached to final done-delta |
| Gemini | `usageMetadata` (HTTP) / `usage_metadata` (SDK), always present | Same field on the final SSE chunk; attached to final done-delta. Thought tokens are billed separately but not currently surfaced in `ProviderMeta`. |
| Generic | `usage` in SSE chunks if the upstream emits it | Same; presence depends on the upstream endpoint |
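
The normalisation step might look roughly like the sketch below. `normalise_usage` is a hypothetical helper, not Aar's code. The per-provider field names are those each API uses in its usage object; for Gemini the HTTP (`usageMetadata`) spelling is shown, and treating Generic like OpenAI is an assumption.

```python
from typing import Any

# (input field, output field) inside each provider's raw usage object.
FIELD_MAP = {
    "anthropic": ("input_tokens", "output_tokens"),
    "openai": ("prompt_tokens", "completion_tokens"),
    "ollama": ("prompt_eval_count", "eval_count"),
    "gemini": ("promptTokenCount", "candidatesTokenCount"),
    "generic": ("prompt_tokens", "completion_tokens"),  # assumed OpenAI-shaped
}


def normalise_usage(provider: str, raw: dict[str, Any]) -> dict[str, int]:
    """Map provider-specific fields to {"input_tokens": N, "output_tokens": M}.

    Returns {} when the provider reported neither field.
    """
    in_key, out_key = FIELD_MAP[provider]
    if in_key not in raw and out_key not in raw:
        return {}
    return {
        "input_tokens": int(raw.get(in_key, 0)),
        "output_tokens": int(raw.get(out_key, 0)),
    }


print(normalise_usage("ollama", {"prompt_eval_count": 26, "eval_count": 298}))
print(normalise_usage("ollama", {}))
```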

Ollama token availability

Ollama includes prompt_eval_count and eval_count in its final streaming chunk for most models and versions. However, if a prompt hits the KV cache entirely, or if the model runtime omits these fields, the usage dict may arrive empty ({}). When the dict is empty:

- The tui --fixed header still shows 0in / 0out (the counter starts at zero and simply doesn't increment).
- The tui body token line prints 0in / 0out (if token_usage.visible is true).
- The chat transport suppresses the line entirely (it only prints when usage is non-empty).

No error is raised; cost is recorded as $0.00 for that step.
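
The difference between the transports comes down to whether an empty dict is rendered as zeros or skipped. A minimal sketch, with a hypothetical `format_usage_line` helper that is not part of Aar:

```python
from typing import Optional


def format_usage_line(usage: dict[str, int], suppress_if_empty: bool) -> Optional[str]:
    """Render a token line, or return None when an empty dict should be hidden."""
    if not usage and suppress_if_empty:
        return None
    # Missing keys fall back to 0, which is what an empty dict produces.
    return f"{usage.get('input_tokens', 0)}in / {usage.get('output_tokens', 0)}out"


print(format_usage_line({}, suppress_if_empty=False))  # tui-style: zeros are shown
print(format_usage_line({}, suppress_if_empty=True))   # chat-style: nothing to print
```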

Streaming is required for real-time counts

Token counts become available only when the provider's final chunk arrives. With streaming: false (the default), the complete response is returned in one shot and the counts arrive with it. With streaming: true, the tui --fixed header shows streaming… in the state field while the model generates, then switches to the actual counts once the final chunk arrives. To enable streaming, set it in your config:

{
  "streaming": true
}

Writing a new provider

Subclass Provider in agent/providers/base.py and implement complete(). stream() has a default fallback so adapters without native streaming still work — it calls complete() and replays the response as a short sequence of deltas: one per text chunk, one per reasoning block, one per tool call, then a terminal StreamDelta(done=True, meta=response.meta).

The fallback is faithful: text, tool calls, reasoning, and ProviderMeta all reach the stream consumer. Providers that implement stream() natively should preserve the same invariants — exactly one done=True delta at the end, and meta attached to that final delta so the loop can record usage.
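
The shape of the fallback can be illustrated with stand-in types. Everything below is a self-contained sketch, not the code in agent/providers/base.py: the `Response` class, the `StreamDelta` fields other than `done` and `meta`, the method signatures, and `EchoProvider` are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Optional


@dataclass
class Response:  # hypothetical stand-in for a provider response
    text: str = ""
    reasoning: list[str] = field(default_factory=list)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    meta: dict[str, Any] = field(default_factory=dict)


@dataclass
class StreamDelta:  # only done/meta are named in the text; the rest is assumed
    text: str = ""
    reasoning: str = ""
    tool_call: Optional[dict[str, Any]] = None
    done: bool = False
    meta: Optional[dict[str, Any]] = None


class Provider:
    async def complete(self, prompt: str) -> Response:
        raise NotImplementedError

    async def stream(self, prompt: str) -> AsyncIterator[StreamDelta]:
        """Fallback: call complete() and replay the result as deltas."""
        response = await self.complete(prompt)
        if response.text:
            yield StreamDelta(text=response.text)
        for block in response.reasoning:
            yield StreamDelta(reasoning=block)
        for call in response.tool_calls:
            yield StreamDelta(tool_call=call)
        # Exactly one terminal delta, carrying meta so usage can be recorded.
        yield StreamDelta(done=True, meta=response.meta)


class EchoProvider(Provider):
    """Toy adapter that implements only complete()."""

    async def complete(self, prompt: str) -> Response:
        usage = {"input_tokens": 1, "output_tokens": 1}
        return Response(text=prompt.upper(), meta={"usage": usage})


async def collect(provider: Provider, prompt: str) -> list[StreamDelta]:
    return [delta async for delta in provider.stream(prompt)]


deltas = asyncio.run(collect(EchoProvider(), "hi"))
print([(d.text, d.done) for d in deltas])
```

A native `stream()` would yield deltas as they arrive from the wire, but should end the same way: a single `done=True` delta with `meta` attached.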
