GitHub - NikolasBielski/OutpostTokenProxy: Point your agent at one URL and cut your frontier LLM bill ~65% — a small local model answers most turns from its own RAM-resident memory and escalates only what it genuinely can't.

The outpost that keeps most of your queries off the frontier.

Point your agent at one URL and cut your frontier LLM bill ~65% — a small local model answers most turns from its own RAM-resident memory and escalates only what it genuinely can't.

Quickstart · How it works · Benchmarks · Docker · RILEY engine

Outpost is a local-first memory + cascade layer for AI agents, packaged as a drop-in OpenAI-compatible proxy: existing agents change one base_url and nothing else. A frontier model is exactly that — the frontier. Outpost is the settlement at its edge that handles everyday traffic locally and sends back only what truly needs the capital.

Outpost was formerly MothershipMemory. The PyPI package (mothership-memory), console scripts, and MOTHERSHIP_* env vars keep their names until v0.2, so nothing breaks.

Why Outpost

Agent workloads have a shape that frontier pricing punishes:

Most turns are recall, not reasoning. The agent is asking about things it has already seen — and paying frontier rates to remember them.
Memory SaaS moves your data off-box. Hosted memory layers ship your corpus to someone else's cloud to answer questions about your own records.
Cost scales with every turn, whether the turn needed a frontier model or not.

Outpost inverts the default: local first, frontier on demand.

A small local model (qwen3:4b, fits a 6 GB GPU) answers from sovereign, RAM-resident memory. Recall and the local answer never leave the box.
The local model's own abstention is the gate — it answers when the memory supports an answer and escalates when it doesn't. No second judge model on the hot path.
The proxy is stateless and dependency-free — the core install runs on the Python standard library alone.

How it works

Agent turn → memory recall → cascade proxy → answered locally or escalated to the frontier

Two pieces compose:

The cascade proxy (tool/cascade_server.py) — a stateless OpenAI-compatible /v1/chat/completions endpoint. It answers over the records already in the prompt and escalates what it can't support. This is the drop-in piece.
The memory (tool/mothership.py) — a Python class that ingests a corpus, encodes it with SPLADE, stores it RAM-resident through the RILEY sparse codec (via the engine daemon), and serves recall() / ask(). Use it directly as a library, or through the Hermes memory provider which wires recall + the proxy into an agent runtime.

What's proven

Measured on the LoCoMo long-term-memory benchmark (full 117 questions, neutral Gemini judge — the deployed abstain-gated default; see experiments/verify_ab.py):

~65% less frontier spend at 70.9% blended accuracy — 94% of the all-frontier ceiling (75.2%) at roughly one-third the escalation rate.
Drop-in. Point any agent at the proxy — no framework change.
Runs local on modest hardware. The qwen3:4b reasoning answerer fits 100% on a 6 GB GPU; nothing in the retrieval path needs a GPU at all.
Validated end-to-end through the real Hermes agent runtime, not just an offline harness (experiments/run_hermes_turn.py).

The numbers above are LoCoMo-specific — a deliberately hard long-term-memory stress test; extractive/factual agent workloads escalate less and cut more.

Operating points

The cascade has two modes — the spend↔accuracy dial:

mode	flag	blended acc	spend cut	when
abstain-gated (default)	(none)	70.9%	65%	the spend thesis — 94% of all-frontier accuracy at ~⅓ the escalation
max-accuracy	`--verify`	75.2%	22%	when accuracy is the objective (≈ the all-frontier ceiling)

For a reasoning answerer like qwen3:4b the verifier barely discriminates, so --verify mostly just escalates more — the default (trust the model's own abstention) is the intended point for a spend product.

Head-to-head: OpenViking's own RAG benchmark

Outpost run on OpenViking's own datasets — the identical seed-42 questions (sampler verified bit-identical to theirs), their verbatim prompts and 0–4 accuracy rubric, scored by a neutral Gemini judge — against their published numbers. Best config (k = 10; the k-sweep shows accuracy genuinely peaks there, not a cherry-pick):

Dataset	accuracy — OpenViking	accuracy — Outpost	recall — OpenViking	recall — Outpost
Qasper	52.9%	63.3%	61.4%	73.6%
SyllabusQA	63.6%	55.6%	67.5%	69.1%
FinanceBench	62.5%	62.5%	69.4%	75.0%
mean	59.7%	60.5%	66.1%	72.6%

Parity-or-better on their own benchmark — and the cost asymmetry is the story:

	OpenViking	Outpost
indexing LLM tokens	8.67M (L0/L1/L2 summarization on write)	0 (SPLADE encode, no LLM on write)
frontier prompt tokens / QA	3,060	2,097 (−31%)
generator	doubao (frontier-class)	local 7B-Q3 on a 6 GB GPU

Full methodology, fairness controls, k-sweep, and per-dataset diagnosis: ablation.md.

Spend routing: the cascade in numbers

On a constructed 118-question private corpus (substring grading — see ablation.md for caveats), with escalation going to a real gemini-2.5-pro:

arm	accuracy	frontier calls	answered locally
all-frontier RAG ceiling (measured, single-shot retrieval)	79.7%	118	—
local 7B alone, no memory	23.7%	—	—
the cascade (decomposition + verifier gate)	99.2%	30	88

→ 74.6% fewer frontier calls, with 88 of 118 questions answered correctly without ever leaving the box. The win is retrieval discipline + routing, not the 7B out-reasoning the frontier: the cascade's grounded decomposition recovers the multi-hop records single-shot retrieval misses, and escalates the hard 30 to the same frontier model.

Quickstart

Prerequisites

Python 3.10+ with a recent pip (≥ 21.3, for PEP 660 editable installs)
Ollama serving the local answerer: ollama pull qwen3:4b
(Optional, for escalation) a frontier API key — GEMINI_API_KEY in the environment or a local .env. Without one, the proxy runs fully offline and escalations return a flag instead of a frontier answer.
(For memory recall) the RILEY engine daemon from the riley-c repo — build mothership_serve there and either set RILEY_DAEMON=/path/to/mothership_serve or keep a riley-c checkout beside this one (../riley-c/build/mothership_serve). The proxy alone does not need it; the memory library does.

Install

pip install -e .            # the drop-in proxy — no ML deps, runs on the standard library
pip install -e .[memory]    # + the memory library (SPLADE encode / recall): numpy, torch, transformers

The proxy needs nothing heavy; the [memory] extra pulls the ML stack (~2–3 GB) only when you want to build and query a corpus locally. Multilingual corpora: pip install -e ".[memory,multilingual]". Either install exposes two console scripts — mothership-server (the proxy) and mothership (the CLI over the memory library).

Run the proxy

mothership-server --port 8000
# or: python tool/cascade_server.py --port 8000

Point your agent at it — anywhere you'd set an OpenAI base URL:

base_url = http://localhost:8000/v1
model    = mothership-cascade

Endpoints: POST /v1/chat/completions, GET /health, GET /stats (live escalation-rate / spend-reduction metrics), GET /v1/models.

Run with Docker

The proxy image is small (stdlib-only — no ML stack). Ollama stays on the host:

docker build -t outpost .
docker run --rm -p 8000:8000 \
  --add-host=host.docker.internal:host-gateway \
  -e GEMINI_API_KEY=your-key \
  outpost

host.docker.internal reaches the host's Ollama; the --add-host flag is needed on Linux (Docker Desktop on macOS/Windows provides it automatically). Point the container elsewhere with -e OLLAMA_HOST=http://some-host:11434.

Configuration

Env var	Purpose
`OLLAMA_HOST`	base URL of the Ollama serving the local answerer (default `http://localhost:11434`)
`RILEY_DAEMON`	path to the built `mothership_serve` engine binary (else a sibling `riley-c/build` checkout is auto-detected)
`GEMINI_API_KEY` / `GOOGLE_API_KEY`	frontier key for escalation (default provider: Gemini)
`MOTHERSHIP_MODEL`	local answerer (default `qwen3:4b`)
`MOTHERSHIP_ENCODER`	`splade` (default, English) or `bge-m3` (multilingual)
`MOTHERSHIP_DEVICE`	`auto` \| `cuda` \| `cpu` for SPLADE encoding
`MOTHERSHIP_CORPUS`	env-lobe corpus JSON for the library / Hermes provider
`MOTHERSHIP_HOT_PATH`	opt-in on-disk persistence for the hot tier (else RAM-only)

Key proxy flags: --model, --verify, --accept-conf (higher = more escalation = more accuracy, more spend), --frontier-provider {gemini,openai,openrouter}, --num-ctx, --num-predict.

The RILEY engine

The memory layer stores each passage as a compact learned-sparse vector through RILEY — a lossless, parameter-free, CPU-native codec for sparse vectors — kept RAM-resident and served by the mothership_serve daemon. The engine, its benchmarks, and the technical paper live in the separate riley-c repository. This repo depends on the daemon it produces; nothing in the retrieval path needs a GPU.

Repository structure

tool/                 the product
  mothership.py         memory: ingest + SPLADE encode + recall/ask cascade
  cascade_server.py     OpenAI-compatible cascade proxy
  frontier.py           escalation provider (Gemini / OpenAI / OpenRouter)
  encoders.py           pluggable sparse encoder (SPLADE | BGE-M3)
  chunkers.py           pluggable chunker
  evaluate.py           A/B/C evaluation harness
  *_adapter.py          LoCoMo / LongMemEval / RAG dataset adapters
integration/hermes/   Hermes memory-provider plugin (recall + proxy wiring)
experiments/          measurement scripts + frozen design/result docs
Dockerfile            the proxy, containerized
pyproject.toml        pip-installable; mothership + mothership-server scripts

The evaluation harness and dataset adapters that produce the numbers above live in the separate memory-bench repository (adapter-pluggable; Outpost is one registered system under test).

Contributing

Issues and PRs are welcome — especially benchmark reproductions, new frontier providers, and agent-runtime integrations. If the spend-cut thesis is useful to you, a star helps others find the project. ⭐

Authors

Nikolas Bielski — Diffraction Logic Labs
Rose Wu — University of British Columbia
Matthew Ireland — Independent

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
assets		assets
experiments		experiments
integration/hermes		integration/hermes
tool		tool
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
ablation.md		ablation.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The outpost that keeps most of your queries off the frontier.

Why Outpost

How it works

What's proven

Operating points

Head-to-head: OpenViking's own RAG benchmark

Spend routing: the cascade in numbers

Quickstart

Run with Docker

Configuration

The RILEY engine

Repository structure

Contributing

Authors

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

The outpost that keeps most of your queries off the frontier.

Why Outpost

How it works

What's proven

Operating points

Head-to-head: OpenViking's own RAG benchmark

Spend routing: the cascade in numbers

Quickstart

Run with Docker

Configuration

The RILEY engine

Repository structure

Contributing

Authors

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages