Point your agent at one URL and cut your frontier LLM bill ~65% — a small local model answers most turns from its own RAM-resident memory and escalates only what it genuinely can't.
Quickstart · How it works · Benchmarks · Docker · RILEY engine
Outpost is a local-first memory + cascade layer for AI agents, packaged as a
drop-in OpenAI-compatible proxy: existing agents change one base_url and nothing
else. A frontier model is exactly that — the frontier. Outpost is the settlement at
its edge that handles everyday traffic locally and sends back only what truly needs
the capital.
Outpost was formerly MothershipMemory. The PyPI package (
mothership-memory), console scripts, andMOTHERSHIP_*env vars keep their names until v0.2, so nothing breaks.
Agent workloads have a shape that frontier pricing punishes:
- Most turns are recall, not reasoning. The agent is asking about things it has already seen — and paying frontier rates to remember them.
- Memory SaaS moves your data off-box. Hosted memory layers ship your corpus to someone else's cloud to answer questions about your own records.
- Cost scales with every turn, whether the turn needed a frontier model or not.
Outpost inverts the default: local first, frontier on demand.
- A small local model (
qwen3:4b, fits a 6 GB GPU) answers from sovereign, RAM-resident memory. Recall and the local answer never leave the box. - The local model's own abstention is the gate — it answers when the memory supports an answer and escalates when it doesn't. No second judge model on the hot path.
- The proxy is stateless and dependency-free — the core install runs on the Python standard library alone.
Two pieces compose:
- The cascade proxy (
tool/cascade_server.py) — a stateless OpenAI-compatible/v1/chat/completionsendpoint. It answers over the records already in the prompt and escalates what it can't support. This is the drop-in piece. - The memory (
tool/mothership.py) — a Python class that ingests a corpus, encodes it with SPLADE, stores it RAM-resident through the RILEY sparse codec (via the engine daemon), and servesrecall()/ask(). Use it directly as a library, or through the Hermes memory provider which wires recall + the proxy into an agent runtime.
Measured on the LoCoMo long-term-memory benchmark (full 117 questions, neutral
Gemini judge — the deployed abstain-gated default; see
experiments/verify_ab.py):
- ~65% less frontier spend at 70.9% blended accuracy — 94% of the all-frontier ceiling (75.2%) at roughly one-third the escalation rate.
- Drop-in. Point any agent at the proxy — no framework change.
- Runs local on modest hardware. The
qwen3:4breasoning answerer fits 100% on a 6 GB GPU; nothing in the retrieval path needs a GPU at all. - Validated end-to-end through the real Hermes agent runtime, not just an
offline harness (
experiments/run_hermes_turn.py).
The numbers above are LoCoMo-specific — a deliberately hard long-term-memory stress test; extractive/factual agent workloads escalate less and cut more.
The cascade has two modes — the spend↔accuracy dial:
| mode | flag | blended acc | spend cut | when |
|---|---|---|---|---|
| abstain-gated (default) | (none) | 70.9% | 65% | the spend thesis — 94% of all-frontier accuracy at ~⅓ the escalation |
| max-accuracy | --verify |
75.2% | 22% | when accuracy is the objective (≈ the all-frontier ceiling) |
For a reasoning answerer like qwen3:4b the verifier barely discriminates, so
--verify mostly just escalates more — the default (trust the model's own
abstention) is the intended point for a spend product.
Outpost run on OpenViking's own datasets — the identical seed-42 questions (sampler verified bit-identical to theirs), their verbatim prompts and 0–4 accuracy rubric, scored by a neutral Gemini judge — against their published numbers. Best config (k = 10; the k-sweep shows accuracy genuinely peaks there, not a cherry-pick):
| Dataset | accuracy — OpenViking | accuracy — Outpost | recall — OpenViking | recall — Outpost |
|---|---|---|---|---|
| Qasper | 52.9% | 63.3% | 61.4% | 73.6% |
| SyllabusQA | 63.6% | 55.6% | 67.5% | 69.1% |
| FinanceBench | 62.5% | 62.5% | 69.4% | 75.0% |
| mean | 59.7% | 60.5% | 66.1% | 72.6% |
Parity-or-better on their own benchmark — and the cost asymmetry is the story:
| OpenViking | Outpost | |
|---|---|---|
| indexing LLM tokens | 8.67M (L0/L1/L2 summarization on write) | 0 (SPLADE encode, no LLM on write) |
| frontier prompt tokens / QA | 3,060 | 2,097 (−31%) |
| generator | doubao (frontier-class) | local 7B-Q3 on a 6 GB GPU |
Full methodology, fairness controls, k-sweep, and per-dataset diagnosis:
ablation.md.
On a constructed 118-question private corpus (substring grading — see
ablation.md for caveats), with escalation going to a real
gemini-2.5-pro:
| arm | accuracy | frontier calls | answered locally |
|---|---|---|---|
| all-frontier RAG ceiling (measured, single-shot retrieval) | 79.7% | 118 | — |
| local 7B alone, no memory | 23.7% | — | — |
| the cascade (decomposition + verifier gate) | 99.2% | 30 | 88 |
→ 74.6% fewer frontier calls, with 88 of 118 questions answered correctly without ever leaving the box. The win is retrieval discipline + routing, not the 7B out-reasoning the frontier: the cascade's grounded decomposition recovers the multi-hop records single-shot retrieval misses, and escalates the hard 30 to the same frontier model.
Prerequisites
- Python 3.10+ with a recent
pip(≥ 21.3, for PEP 660 editable installs) - Ollama serving the local answerer:
ollama pull qwen3:4b - (Optional, for escalation) a frontier API key —
GEMINI_API_KEYin the environment or a local.env. Without one, the proxy runs fully offline and escalations return a flag instead of a frontier answer. - (For memory recall) the RILEY engine daemon from the
riley-c repo — build
mothership_servethere and either setRILEY_DAEMON=/path/to/mothership_serveor keep ariley-ccheckout beside this one (../riley-c/build/mothership_serve). The proxy alone does not need it; the memory library does.
Install
pip install -e . # the drop-in proxy — no ML deps, runs on the standard library
pip install -e .[memory] # + the memory library (SPLADE encode / recall): numpy, torch, transformersThe proxy needs nothing heavy; the [memory] extra pulls the ML stack (~2–3 GB)
only when you want to build and query a corpus locally. Multilingual corpora:
pip install -e ".[memory,multilingual]". Either install exposes two console
scripts — mothership-server (the proxy) and mothership (the CLI over the
memory library).
Run the proxy
mothership-server --port 8000
# or: python tool/cascade_server.py --port 8000Point your agent at it — anywhere you'd set an OpenAI base URL:
base_url = http://localhost:8000/v1
model = mothership-cascade
Endpoints: POST /v1/chat/completions, GET /health, GET /stats (live
escalation-rate / spend-reduction metrics), GET /v1/models.
The proxy image is small (stdlib-only — no ML stack). Ollama stays on the host:
docker build -t outpost .
docker run --rm -p 8000:8000 \
--add-host=host.docker.internal:host-gateway \
-e GEMINI_API_KEY=your-key \
outposthost.docker.internal reaches the host's Ollama; the --add-host flag is needed
on Linux (Docker Desktop on macOS/Windows provides it automatically). Point the
container elsewhere with -e OLLAMA_HOST=http://some-host:11434.
| Env var | Purpose |
|---|---|
OLLAMA_HOST |
base URL of the Ollama serving the local answerer (default http://localhost:11434) |
RILEY_DAEMON |
path to the built mothership_serve engine binary (else a sibling riley-c/build checkout is auto-detected) |
GEMINI_API_KEY / GOOGLE_API_KEY |
frontier key for escalation (default provider: Gemini) |
MOTHERSHIP_MODEL |
local answerer (default qwen3:4b) |
MOTHERSHIP_ENCODER |
splade (default, English) or bge-m3 (multilingual) |
MOTHERSHIP_DEVICE |
auto | cuda | cpu for SPLADE encoding |
MOTHERSHIP_CORPUS |
env-lobe corpus JSON for the library / Hermes provider |
MOTHERSHIP_HOT_PATH |
opt-in on-disk persistence for the hot tier (else RAM-only) |
Key proxy flags: --model, --verify, --accept-conf (higher = more escalation =
more accuracy, more spend), --frontier-provider {gemini,openai,openrouter},
--num-ctx, --num-predict.
The memory layer stores each passage as a compact learned-sparse vector through
RILEY — a lossless, parameter-free, CPU-native codec for sparse vectors — kept
RAM-resident and served by the mothership_serve daemon. The engine, its
benchmarks, and the technical paper live in the separate
riley-c repository. This repo
depends on the daemon it produces; nothing in the retrieval path needs a GPU.
tool/ the product
mothership.py memory: ingest + SPLADE encode + recall/ask cascade
cascade_server.py OpenAI-compatible cascade proxy
frontier.py escalation provider (Gemini / OpenAI / OpenRouter)
encoders.py pluggable sparse encoder (SPLADE | BGE-M3)
chunkers.py pluggable chunker
evaluate.py A/B/C evaluation harness
*_adapter.py LoCoMo / LongMemEval / RAG dataset adapters
integration/hermes/ Hermes memory-provider plugin (recall + proxy wiring)
experiments/ measurement scripts + frozen design/result docs
Dockerfile the proxy, containerized
pyproject.toml pip-installable; mothership + mothership-server scripts
The evaluation harness and dataset adapters that produce the numbers above live in the separate memory-bench repository (adapter-pluggable; Outpost is one registered system under test).
Issues and PRs are welcome — especially benchmark reproductions, new frontier providers, and agent-runtime integrations. If the spend-cut thesis is useful to you, a star helps others find the project. ⭐
- Nikolas Bielski — Diffraction Logic Labs
- Rose Wu — University of British Columbia
- Matthew Ireland — Independent
MIT — see LICENSE.
