FerroCache

A Distributed Semantic Cache Service for LLM Applications

FerroCache is a standalone service that sits in front of your LLM calls and returns cached responses for semantically similar queries. LLM API calls are expensive, so semantically similar queries should reuse a cached answer instead of paying for a new completion. Because FerroCache is a compiled Rust binary with an HTTP API, any language that can make an HTTP request can use it: Python, Go, Node.js, Java, Ruby, and so on. Unlike GPTCache, FerroCache is a service, not an in-process library: deploy it once, share the cache across your entire fleet, and keep the cache across application restarts.

Documentation · PyPI · Docker · Changelog · Contributing

Features

Cache core

Semantic similarity search via HNSW (approximate nearest neighbor)
Exact-match pre-filter — verbatim queries return in <0.4ms
Configurable cosine similarity threshold (default: 0.92)
Embedding-model agnostic — bring your own vectors
Per-entry TTL with background expiry reaper
LRU eviction with configurable max entries per namespace
DELETE /entry/:uuid — targeted cache invalidation
POST /admin/invalidate — semantic radius invalidation
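
To make the interplay of the exact-match pre-filter and the cosine threshold listed above concrete, here is a minimal in-memory sketch. It uses a linear scan instead of HNSW, and `TinySemanticCache`, its method names, and the SHA-256 text key are hypothetical choices for this illustration, not FerroCache internals.

```python
import hashlib
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySemanticCache:
    """Hypothetical toy cache: exact-match table plus a brute-force vector scan."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self.exact: dict[str, str] = {}
        self.entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def insert(self, query_text: str, embedding: list[float], response: str) -> None:
        self.exact[self._key(query_text)] = response
        self.entries.append((embedding, response))

    def query(self, query_text: str, embedding: list[float]) -> Optional[str]:
        # Step 1: verbatim queries skip the vector search entirely.
        exact_hit = self.exact.get(self._key(query_text))
        if exact_hit is not None:
            return exact_hit
        # Step 2: nearest neighbour by cosine similarity, accepted only
        # if it clears the configured threshold.
        best_score, best_response = -1.0, None
        for vector, response in self.entries:
            score = cosine_similarity(embedding, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

Raising the threshold trades hit rate for precision: fewer paraphrases match, but a match is less likely to answer a different question.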

Namespace isolation

Model namespace partitioning — vectors from different models never compare
Tenant isolation via cache_scope — one cache, many tenants
Conversation scoping with two-level fallback (conversation → global)
Auto-TTL on conversation namespaces
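
The two-level fallback above can be pictured as two lookups in order. The sketch below is an assumption-laden illustration: the namespace string layout, the `conv:` prefix, and the choice to keep the fallback inside the same model and `cache_scope` are all hypothetical, and `lookup` stands in for whatever performs the actual semantic search.

```python
from typing import Callable, Optional, Tuple

# Hypothetical lookup signature: namespace name -> cached response or None.
Lookup = Callable[[str], Optional[str]]


def namespace(model_id: str, cache_scope: str,
              conversation_id: Optional[str] = None) -> str:
    """Build a namespace name; the layout is invented for this example."""
    parts = [model_id, cache_scope]
    if conversation_id is not None:
        parts.append(f"conv:{conversation_id}")
    return "/".join(parts)


def query_with_fallback(lookup: Lookup, model_id: str, cache_scope: str,
                        conversation_id: Optional[str]) -> Tuple[Optional[str], str]:
    """Try the conversation namespace first, then the shared one."""
    if conversation_id is not None:
        hit = lookup(namespace(model_id, cache_scope, conversation_id))
        if hit is not None:
            return hit, "conversation"
    hit = lookup(namespace(model_id, cache_scope))
    if hit is not None:
        return hit, "global"
    return None, "miss"
```

Because the model ID and scope are part of every namespace name, a lookup for one tenant or model can never land in another's entries.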

Durability & operations

Write-ahead log (WAL) with fsync — survives process crashes
Atomic snapshots with WAL compaction
Group-commit WAL batching — 2,600+ inserts/sec at concurrency 50
Prometheus /metrics endpoint
Grafana dashboard (docker-compose overlay)
/admin/entry-stats — per-namespace access analytics
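
For readers unfamiliar with write-ahead logging, the sketch below shows the general idea behind the WAL bullet above: append a record, fsync before acknowledging, and rebuild state by replaying the file after a crash. The JSON-lines record format and the `op`/`id`/`response` fields are assumptions made for this example; FerroCache's actual on-disk format is not described here.

```python
import json
import os


def wal_append(path: str, record: dict) -> None:
    """Append one record and force it to disk before returning."""
    line = json.dumps(record, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # the write is durable once this returns


def wal_replay(path: str) -> dict:
    """Rebuild an id -> response map by replaying the log in order."""
    state: dict = {}
    if not os.path.exists(path):
        return state
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final write from a crash: stop replaying here
            if record["op"] == "insert":
                state[record["id"]] = record["response"]
            elif record["op"] == "delete":
                state.pop(record["id"], None)
    return state
```

A snapshot plus compaction, as listed above, bounds replay time: once the state is snapshotted, log records older than the snapshot no longer need to be replayed.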

Distribution

Multi-node cluster via consistent hashing + chitchat gossip
Synchronous write replication (configurable replication factor)
Phi accrual failure detection (Cassandra-style)
Automatic ring reassignment on node failure (zero data movement)
Read repair — stale nodes heal through traffic
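
Phi accrual detection, mentioned above, replaces a fixed heartbeat timeout with a suspicion score that grows the longer a node stays silent relative to its usual heartbeat rhythm. The sketch below uses the simplified exponential model (phi = elapsed / mean interval × log10 e); `PhiAccrualDetector` and its window size are hypothetical, and FerroCache's actual threshold and distribution model are not stated in this README.

```python
import math
from collections import deque
from typing import Optional


class PhiAccrualDetector:
    """Toy phi accrual detector for a single peer (exponential model)."""

    def __init__(self, window: int = 100) -> None:
        self.intervals: deque = deque(maxlen=window)  # recent heartbeat gaps
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the chance the peer is merely late."""
        if self.last_heartbeat is None or not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0.0:
            return 0.0
        elapsed = now - self.last_heartbeat
        return (elapsed / mean) * math.log10(math.e)
```

A caller would mark the peer as failed once `phi(now)` crosses a chosen threshold; a value of 8 is the default in Cassandra, though whether FerroCache uses the same value is an assumption.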

Security

Bearer token auth on HTTP API — opt-in via FERROCACHE_AUTH_TOKEN
Mutual TLS between cluster nodes — opt-in via cluster.tls.enabled
Constant-time token comparison (timing-attack safe)
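
The constant-time comparison above matters because an ordinary `==` on strings can return early at the first differing byte, which leaks timing information. In Python the same idea looks like the sketch below; `token_matches` and `bearer_token` are hypothetical helper names, and this illustrates the technique rather than the Rust server's code.

```python
import hmac
from typing import Optional


def bearer_token(header_value: str) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' value."""
    scheme, _, token = header_value.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def token_matches(presented: str, expected: str) -> bool:
    """Compare tokens without short-circuiting on the first mismatch."""
    return hmac.compare_digest(presented.encode("utf-8"),
                               expected.encode("utf-8"))
```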

Integrations

Python client (zero dependencies, stdlib only)
OpenAI SDK drop-in wrapper (wrap_openai)
Anthropic SDK drop-in wrapper (wrap_anthropic)
LangChain cache backend (FerrocacheCache)
LlamaIndex LLM wrapper (FerrocacheLLM)
MCP server for Claude Desktop / Claude Code
Any language via HTTP — Go, Node.js, Java, Ruby, etc.

Quick Install

# Docker (recommended)
docker run -p 3000:3000 ghcr.io/nickleodoen/ferrocache:latest

# Python client
pip install ferrocache
pip install ferrocache[openai]    # + OpenAI middleware
pip install ferrocache[all]       # everything

# Build from source (Rust required)
git clone https://github.com/nickleodoen/ferrocache
cd ferrocache && cargo build --release
./target/release/ferrocache

Example Usage

Example 1 — Python (universal pattern, no framework):

from ferrocache import FerrocacheClient
import openai

client = FerrocacheClient("http://localhost:3000")
your_openai = openai.OpenAI()

def ask(question: str, embedding: list[float]) -> str:
    # Check cache first
    hit = client.query(embedding=embedding, threshold=0.92, model_id="gpt-4o-mini::1536")
    if hit["hit"]:
        return hit["response"]  # no LLM call needed

    # Cache miss — call the LLM
    answer = your_openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    client.insert(
        embedding=embedding,
        response=answer,
        query_text=question,
        model_id="gpt-4o-mini::1536",
    )
    return answer

Example 2 — Drop-in OpenAI wrapper (one line change):

from openai import OpenAI
from ferrocache.middleware import wrap_openai

client = wrap_openai(OpenAI())  # that's it

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the refund policy?"}],
)
print(response._ferrocache_hit)  # True on cache hit

Example 3 — Tenant isolation (multi-tenant SaaS):

# Different tenants never share cache entries
client.insert(
    embedding=emb,
    response=answer,
    query_text="...",
    model_id="gpt-4o-mini::1536",
    cache_scope="tenant_abc",
)
result = client.query(embedding=emb, threshold=0.92, model_id="gpt-4o-mini::1536", cache_scope="tenant_abc")  # hits
result = client.query(embedding=emb, threshold=0.92, model_id="gpt-4o-mini::1536", cache_scope="tenant_xyz")  # miss

→ Full documentation with examples for all integrations: nickleodoen.github.io/ferrocache

Architecture

Your app hits any node. Queries route to the correct shard via consistent hashing on the embedding vector. Writes replicate synchronously to N nodes.
Nodes discover each other via gossip (chitchat). Ring membership updates propagate in ~2 seconds. No Zookeeper, no etcd, no coordinator.
Node failures are detected by phi accrual (Cassandra-style). Failed nodes' ring arcs fold to their replica neighbor automatically.
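
The routing and failover behavior described above can be sketched with a small consistent-hash ring. Everything here is illustrative: `HashRing`, the virtual-node count, SHA-256 as the ring hash, and `embedding_key` (which serializes the vector to bytes) are assumptions, not FerroCache's implementation.

```python
import bisect
import hashlib
import struct


def embedding_key(embedding: list[float]) -> str:
    """Hypothetical: derive a stable ring key from the raw vector bytes."""
    return struct.pack(f"<{len(embedding)}f", *embedding).hex()


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owners(self, key: str, replication_factor: int = 2) -> list[str]:
        """Walk clockwise from the key and collect distinct nodes."""
        if not self._ring:
            return []
        start = bisect.bisect_right(self._points, self._hash(key))
        found: list[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replication_factor:
                    break
        return found
```

Removing a failed node from the ring makes the next node clockwise the new primary for its keys. With a replication factor of 2 or more, that node already holds a replica, which is one way to read the "zero data movement" claim in the feature list.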

Benchmarks

The right comparison for FerroCache is Redis, not GPTCache.

GPTCache is a Python library — it runs inside your process and its "latency" is a function call, not a network call. FerroCache is a service — like Redis, it has a network boundary by design, which is what lets it be shared across your entire fleet.

FerroCache performance (Apple M4 Pro, release build):

Operation	p50	p95	p99
Query hit (HTTP round-trip)	0.44ms	0.51ms	0.54ms
Query miss (HTTP round-trip)	0.42ms	0.48ms	0.50ms
Insert (includes WAL fsync)	7.95ms	8.36ms	8.71ms
Exact-match pre-filter	0.38ms	—	—

Insert throughput: 2,600+ ops/sec at concurrency 50 (group-commit WAL).
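
Group commit explains why throughput scales with concurrency even though a single insert pays for an fsync: concurrent writers share one fsync per batch. The sketch below shows the general technique with a background writer thread. `GroupCommitWal` is a hypothetical class, and its `batch_size` and `batch_timeout_s` parameters only loosely mirror the `wal_batch_size` and `wal_batch_timeout_ms` settings from the Configuration section.

```python
import os
import queue
import threading


class GroupCommitWal:
    """Toy group-commit log: many appenders, one fsync per batch."""

    def __init__(self, path: str, batch_size: int = 256,
                 batch_timeout_s: float = 0.001) -> None:
        self._file = open(path, "ab")
        self._queue: queue.Queue = queue.Queue()
        self.batch_size = batch_size
        self.batch_timeout_s = batch_timeout_s
        self.fsync_count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, payload: bytes) -> None:
        """Block until the payload has been fsynced as part of a batch."""
        done = threading.Event()
        self._queue.put((payload, done))
        done.wait()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                return
            batch = [item]
            # Keep collecting until the batch is full or no further
            # record arrives within the timeout.
            while len(batch) < self.batch_size:
                try:
                    nxt = self._queue.get(timeout=self.batch_timeout_s)
                except queue.Empty:
                    break
                if nxt is None:
                    self._commit(batch)
                    return
                batch.append(nxt)
            self._commit(batch)

    def _commit(self, batch: list) -> None:
        for payload, _ in batch:
            self._file.write(payload + b"\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # one fsync covers the whole batch
        self.fsync_count += 1
        for _, done in batch:
            done.set()

    def close(self) -> None:
        self._queue.put(None)
        self._thread.join()
        self._file.close()
```

Each caller still waits for durability, so single-request latency stays near one fsync, while the fsync cost is amortized over every record in the batch.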

Feature comparison vs GPTCache:

	FerroCache	GPTCache
Architecture	Service (HTTP)	Library (in-process)
Multi-node cluster	✅	❌
Shared across fleet	✅	❌ (per-process)
WAL durability	✅ (fsync)	❌ (in-memory)
Survives app restart	✅	❌
Tenant isolation	✅ `cache_scope`	❌
Conversation scoping	✅	❌
Exact-match pre-filter	✅	❌
TTL per entry	✅	⚠️ partial
LRU eviction	✅	✅
Any language client	✅	❌ (Python only)
Prometheus metrics	✅	❌
Memory (data path)	~7 MB / 50 entries	~7 MB / 50 entries

GPTCache query p50 is 0.082ms because it's an in-process function call. FerroCache query p50 is 0.44ms because it's an HTTP request — the same reason Redis is "slower" than a Python dict.

Configuration

All keys default to single-node mode. Override via ferrocache.toml in the working directory or FERROCACHE_* env vars (env wins). Nested keys use __ as a section separator; lists are comma-separated.
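
The naming rule above is mechanical, so it can be expressed in a few lines. The helpers below (`env_var_for`, `parse_list`, `resolve`) are hypothetical and only illustrate the documented convention: the `FERROCACHE_` prefix, `__` between sections, comma-separated lists, and environment variables taking precedence over the file. Type coercion of environment values is left out.

```python
from typing import Any, Mapping


def env_var_for(key: str, prefix: str = "FERROCACHE") -> str:
    """Map a dotted config key to its environment variable name."""
    return f"{prefix}_" + "__".join(part.upper() for part in key.split("."))


def parse_list(value: str) -> list[str]:
    """Split a comma-separated environment value into items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve(key: str, file_config: Mapping[str, Any],
            env: Mapping[str, str], default: Any = None) -> Any:
    """Environment wins; otherwise walk the nested file config."""
    name = env_var_for(key)
    if name in env:
        return env[name]  # raw string; the caller converts the type
    node: Any = file_config
    for part in key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return default
        node = node[part]
    return node
```

For example, `env_var_for("cluster.tls.enabled")` yields `FERROCACHE_CLUSTER__TLS__ENABLED`, matching the Security table below.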

Core

Key	Type	Default	Env var
`port`	u16	`3000`	`FERROCACHE_PORT`
`node_id`	string?	random UUID	`FERROCACHE_NODE_ID`
`wal_path`	string	`./ferrocache.wal`	`FERROCACHE_WAL_PATH`

HNSW

Key	Type	Default	Env var
`hnsw.max_nb_connection`	usize	`16`	`FERROCACHE_HNSW__MAX_NB_CONNECTION`
`hnsw.ef_construction`	usize	`200`	`FERROCACHE_HNSW__EF_CONSTRUCTION`
`hnsw.ef_search`	usize	`32`	`FERROCACHE_HNSW__EF_SEARCH`
`hnsw.default_threshold`	f32	`0.92`	`FERROCACHE_HNSW__DEFAULT_THRESHOLD`
`hnsw.max_entries_per_namespace`	usize?	`None` (unlimited)	`FERROCACHE_HNSW__MAX_ENTRIES_PER_NAMESPACE`

Eviction & TTL

Key	Type	Default	Env var
`expire_scan_interval_secs`	u64	`60`	`FERROCACHE_EXPIRE_SCAN_INTERVAL_SECS`
`conversation_ttl_seconds`	u64?	`None`	`FERROCACHE_CONVERSATION_TTL_SECONDS`
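
To show how per-entry TTL, a periodic expiry scan, and LRU eviction can coexist in one store, here is a small sketch. `LruTtlStore` and its methods are hypothetical; `reap_expired` plays the role of a scan that would run every `expire_scan_interval_secs`, and `max_entries` stands in for `hnsw.max_entries_per_namespace`.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class LruTtlStore:
    """Toy store with lazy expiry on read, a reaper, and LRU eviction."""

    def __init__(self, max_entries: Optional[int] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self._clock = clock
        self._data: OrderedDict = OrderedDict()  # key -> (value, expires_at)

    def put(self, key: str, value: str,
            ttl_seconds: Optional[float] = None) -> None:
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)
        self._data.move_to_end(key)  # newest entry is most recently used
        while self.max_entries is not None and len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expired but not yet reaped
            return None
        self._data.move_to_end(key)  # a read refreshes recency
        return value

    def reap_expired(self) -> int:
        """Remove all expired entries; returns how many were dropped."""
        now = self._clock()
        dead = [k for k, (_, exp) in self._data.items()
                if exp is not None and now >= exp]
        for k in dead:
            del self._data[k]
        return len(dead)
```

Checking expiry on read as well as in the scan means a stale entry is never served, even if the scan interval is long.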

Cluster

Key	Type	Default	Env var
`cluster.enabled`	bool	`false`	`FERROCACHE_CLUSTER__ENABLED`
`cluster.seed_nodes`	list	`[]`	`FERROCACHE_CLUSTER__SEED_NODES`
`cluster.replication_factor`	usize	`2`	`FERROCACHE_CLUSTER__REPLICATION_FACTOR`
`cluster.read_repair_enabled`	bool	`true`	`FERROCACHE_CLUSTER__READ_REPAIR_ENABLED`
`cluster.dead_node_removal_enabled`	bool	`true`	`FERROCACHE_CLUSTER__DEAD_NODE_REMOVAL_ENABLED`

Security

Key	Type	Default	Env var
`auth_token`	string?	`None` (auth off)	`FERROCACHE_AUTH_TOKEN`
`cluster.tls.enabled`	bool	`false`	`FERROCACHE_CLUSTER__TLS__ENABLED`

Performance

Key	Type	Default	Env var
`wal_batch_size`	usize	`256`	`FERROCACHE_WAL_BATCH_SIZE`
`wal_batch_timeout_ms`	u64	`1`	`FERROCACHE_WAL_BATCH_TIMEOUT_MS`

Full reference at Getting Started Docs.

Production Cluster

Run a 3-node cluster when you need the cache to survive a node failure without any application-side changes.

docker compose up -d --build
sleep 5
./tests/cluster_integration.sh   # 44 assertions over the live cluster
docker compose down -v

External ports 3001/3002/3003 map to the three nodes. An insert sent to any node is replicated to replication_factor owners along the ring; a query sent to any node is forwarded to the owning shard.

SDK Integrations

OpenAI — drop-in wrapper proxies attribute access; only the chat-completion method is intercepted.

from openai import OpenAI
from ferrocache.middleware import wrap_openai

client = wrap_openai(OpenAI(), cache_scope="tenant_abc")
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content, resp._ferrocache_hit)

→ OpenAI Integration Docs
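
The "proxies attribute access, intercepts one method" design described for the OpenAI wrapper can be sketched generically. The code below is not the `wrap_openai` source: `CachingProxy`, `make_cached_create`, and `wrap` are hypothetical names, and an exact-key dict stands in for the semantic lookup against a FerroCache server.

```python
import copy
from typing import Any, Callable


class CachingProxy:
    """Forward every attribute to `inner` except the intercepted path."""

    def __init__(self, inner: Any, intercept: dict) -> None:
        self._inner = inner
        self._intercept = intercept

    def __getattr__(self, name: str) -> Any:
        target = getattr(self._inner, name)
        replacement = self._intercept.get(name)
        if callable(replacement):
            # Leaf of the intercepted path: call the replacement with
            # the original method as its first argument.
            return lambda *args, **kwargs: replacement(target, *args, **kwargs)
        if isinstance(replacement, dict):
            # Still on the path: keep proxying one level deeper.
            return CachingProxy(target, replacement)
        return target  # everything else passes through untouched


def make_cached_create(store: dict) -> Callable[..., Any]:
    def create(original_create: Callable[..., Any], **kwargs: Any) -> Any:
        key = repr((kwargs.get("model"), kwargs.get("messages")))
        cached = store.get(key)
        if cached is not None:
            response = copy.copy(cached)
            response._ferrocache_hit = True
            return response
        response = original_create(**kwargs)
        store[key] = response
        response._ferrocache_hit = False
        return response
    return create


def wrap(client: Any, store: dict) -> CachingProxy:
    """Intercept only client.chat.completions.create."""
    path = {"chat": {"completions": {"create": make_cached_create(store)}}}
    return CachingProxy(client, path)
```

Because unknown attributes fall through to the wrapped client, the rest of the SDK surface keeps working without the wrapper having to know about it.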

Anthropic — same pattern for the Anthropic SDK.

from anthropic import Anthropic
from ferrocache.middleware import wrap_anthropic

client = wrap_anthropic(Anthropic())
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    messages=[{"role": "user", "content": "Briefly: what is HNSW?"}],
)

→ Anthropic Integration Docs

LangChain — register as the global LLM cache.

from langchain.globals import set_llm_cache
from ferrocache.langchain import FerrocacheCache

set_llm_cache(FerrocacheCache(cache_scope="tenant_abc"))

→ LangChain Integration Docs

LlamaIndex — wrap any LlamaIndex-compatible LLM.

from llama_index.llms.openai import OpenAI
from ferrocache.llamaindex import FerrocacheLLM

llm = FerrocacheLLM(inner=OpenAI(model="gpt-4o-mini"), cache_scope="tenant_abc")

→ LlamaIndex Integration Docs

MCP server — exposes semantic caching as tools for Claude Desktop / Claude Code.

pip install -r clients/python/mcp_requirements.txt
python3 -m ferrocache.mcp_server      # speaks JSON-RPC over stdio

→ MCP Server Docs

Contributing

FerroCache is actively developed and welcomes contributions. Three areas where contributions would have the most impact:

1. Embedding model integrations. FerroCache is embedding-agnostic by design: the client computes the vector. But most users want a default that just works. Adding first-class support for Voyage AI, Cohere, and local Ollama models to the Python client's auto-embed path would lower the barrier to adoption significantly. Good first issue: add ferrocache[voyage] extra with a Voyage AI embed_fn.

2. Async Python client. The Python client and all middleware wrappers are synchronous. Modern Python LLM applications are async-native (LangChain LCEL, the async Anthropic client, FastAPI). An AsyncFerrocacheClient built on httpx.AsyncClient would unblock this entire class of users. Good first issue: implement AsyncFerrocacheClient mirroring the sync client's API.

3. Load testing and real-world benchmarks. The current benchmarks run on synthetic FAQ workloads. Real-world hit rate data on production query distributions (MS MARCO, customer support logs, coding assistant queries) would help users calibrate their threshold and make the project more credible to evaluators. Good first issue: publish a benchmark notebook using the MS MARCO dataset.

→ See CONTRIBUTING.md for setup instructions, code style, and the PR process.
→ Open issues are labeled good first issue.

Security

# Bearer token auth on the public HTTP API
export FERROCACHE_AUTH_TOKEN="$(openssl rand -hex 32)"

# Mutual TLS between cluster nodes
export FERROCACHE_CLUSTER__TLS__ENABLED=true

With auth on, /health and /metrics stay open; all data routes require Authorization: Bearer <token>. With mTLS on, FerroCache binds a second listener on internal_port (default port + 1000) requiring a client cert chained to the cluster CA. Public-port TLS is expected to be terminated by a reverse proxy. See docs/security.md for the full threat model.
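
From a client's point of view, enabling auth only means adding one header to data-route requests. The sketch below builds (but does not send) an authenticated request for the `DELETE /entry/:uuid` route from the feature list, using only the standard library. `build_delete_entry_request` is a hypothetical helper, and the response format is not specified in this README, so sending and parsing are left to the caller.

```python
import urllib.request
from typing import Optional


def build_delete_entry_request(base_url: str, entry_uuid: str,
                               token: Optional[str] = None) -> urllib.request.Request:
    """Build a DELETE /entry/<uuid> request, with a bearer token if given."""
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"{base_url.rstrip('/')}/entry/{entry_uuid}"
    return urllib.request.Request(url, method="DELETE", headers=headers)
```

Passing the result to `urllib.request.urlopen` sends it. Requests to /health and /metrics need no token, as noted above.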

Development

cargo test                        # unit tests (~222 pass)
cargo clippy --all-targets -- -D warnings
make cluster-test                 # docker compose + integration script (44 assertions)
make benchmark-vs-gptcache        # FerroCache vs GPTCache

CI runs check/test/clippy/fmt plus the docker-compose cluster integration on every push (.github/workflows/ci.yml).
