Skip to content

NikolasBielski/OutpostTokenProxy

Repository files navigation

Outpost

The outpost that keeps most of your queries off the frontier.

Point your agent at one URL and cut your frontier LLM bill ~65% — a small local model answers most turns from its own RAM-resident memory and escalates only what it genuinely can't.

License: MIT Python OpenAI-compatible Docker PRs Welcome

Quickstart · How it works · Benchmarks · Docker · RILEY engine


Outpost is a local-first memory + cascade layer for AI agents, packaged as a drop-in OpenAI-compatible proxy: existing agents change one base_url and nothing else. A frontier model is exactly that — the frontier. Outpost is the settlement at its edge that handles everyday traffic locally and sends back only what truly needs the capital.

Outpost was formerly MothershipMemory. The PyPI package (mothership-memory), console scripts, and MOTHERSHIP_* env vars keep their names until v0.2, so nothing breaks.

Why Outpost

Agent workloads have a shape that frontier pricing punishes:

  • Most turns are recall, not reasoning. The agent is asking about things it has already seen — and paying frontier rates to remember them.
  • Memory SaaS moves your data off-box. Hosted memory layers ship your corpus to someone else's cloud to answer questions about your own records.
  • Cost scales with every turn, whether the turn needed a frontier model or not.

Outpost inverts the default: local first, frontier on demand.

  • A small local model (qwen3:4b, fits a 6 GB GPU) answers from sovereign, RAM-resident memory. Recall and the local answer never leave the box.
  • The local model's own abstention is the gate — it answers when the memory supports an answer and escalates when it doesn't. No second judge model on the hot path.
  • The proxy is stateless and dependency-free — the core install runs on the Python standard library alone.

How it works

Agent turn → memory recall → cascade proxy → answered locally or escalated to the frontier

Two pieces compose:

  • The cascade proxy (tool/cascade_server.py) — a stateless OpenAI-compatible /v1/chat/completions endpoint. It answers over the records already in the prompt and escalates what it can't support. This is the drop-in piece.
  • The memory (tool/mothership.py) — a Python class that ingests a corpus, encodes it with SPLADE, stores it RAM-resident through the RILEY sparse codec (via the engine daemon), and serves recall() / ask(). Use it directly as a library, or through the Hermes memory provider which wires recall + the proxy into an agent runtime.

What's proven

Measured on the LoCoMo long-term-memory benchmark (full 117 questions, neutral Gemini judge — the deployed abstain-gated default; see experiments/verify_ab.py):

  • ~65% less frontier spend at 70.9% blended accuracy94% of the all-frontier ceiling (75.2%) at roughly one-third the escalation rate.
  • Drop-in. Point any agent at the proxy — no framework change.
  • Runs local on modest hardware. The qwen3:4b reasoning answerer fits 100% on a 6 GB GPU; nothing in the retrieval path needs a GPU at all.
  • Validated end-to-end through the real Hermes agent runtime, not just an offline harness (experiments/run_hermes_turn.py).

The numbers above are LoCoMo-specific — a deliberately hard long-term-memory stress test; extractive/factual agent workloads escalate less and cut more.

Operating points

The cascade has two modes — the spend↔accuracy dial:

mode flag blended acc spend cut when
abstain-gated (default) (none) 70.9% 65% the spend thesis — 94% of all-frontier accuracy at ~⅓ the escalation
max-accuracy --verify 75.2% 22% when accuracy is the objective (≈ the all-frontier ceiling)

For a reasoning answerer like qwen3:4b the verifier barely discriminates, so --verify mostly just escalates more — the default (trust the model's own abstention) is the intended point for a spend product.

Head-to-head: OpenViking's own RAG benchmark

Outpost run on OpenViking's own datasets — the identical seed-42 questions (sampler verified bit-identical to theirs), their verbatim prompts and 0–4 accuracy rubric, scored by a neutral Gemini judge — against their published numbers. Best config (k = 10; the k-sweep shows accuracy genuinely peaks there, not a cherry-pick):

Dataset accuracy — OpenViking accuracy — Outpost recall — OpenViking recall — Outpost
Qasper 52.9% 63.3% 61.4% 73.6%
SyllabusQA 63.6% 55.6% 67.5% 69.1%
FinanceBench 62.5% 62.5% 69.4% 75.0%
mean 59.7% 60.5% 66.1% 72.6%

Parity-or-better on their own benchmark — and the cost asymmetry is the story:

OpenViking Outpost
indexing LLM tokens 8.67M (L0/L1/L2 summarization on write) 0 (SPLADE encode, no LLM on write)
frontier prompt tokens / QA 3,060 2,097 (−31%)
generator doubao (frontier-class) local 7B-Q3 on a 6 GB GPU

Full methodology, fairness controls, k-sweep, and per-dataset diagnosis: ablation.md.

Spend routing: the cascade in numbers

On a constructed 118-question private corpus (substring grading — see ablation.md for caveats), with escalation going to a real gemini-2.5-pro:

arm accuracy frontier calls answered locally
all-frontier RAG ceiling (measured, single-shot retrieval) 79.7% 118
local 7B alone, no memory 23.7%
the cascade (decomposition + verifier gate) 99.2% 30 88

74.6% fewer frontier calls, with 88 of 118 questions answered correctly without ever leaving the box. The win is retrieval discipline + routing, not the 7B out-reasoning the frontier: the cascade's grounded decomposition recovers the multi-hop records single-shot retrieval misses, and escalates the hard 30 to the same frontier model.

Quickstart

Prerequisites

  • Python 3.10+ with a recent pip (≥ 21.3, for PEP 660 editable installs)
  • Ollama serving the local answerer: ollama pull qwen3:4b
  • (Optional, for escalation) a frontier API key — GEMINI_API_KEY in the environment or a local .env. Without one, the proxy runs fully offline and escalations return a flag instead of a frontier answer.
  • (For memory recall) the RILEY engine daemon from the riley-c repo — build mothership_serve there and either set RILEY_DAEMON=/path/to/mothership_serve or keep a riley-c checkout beside this one (../riley-c/build/mothership_serve). The proxy alone does not need it; the memory library does.

Install

pip install -e .            # the drop-in proxy — no ML deps, runs on the standard library
pip install -e .[memory]    # + the memory library (SPLADE encode / recall): numpy, torch, transformers

The proxy needs nothing heavy; the [memory] extra pulls the ML stack (~2–3 GB) only when you want to build and query a corpus locally. Multilingual corpora: pip install -e ".[memory,multilingual]". Either install exposes two console scripts — mothership-server (the proxy) and mothership (the CLI over the memory library).

Run the proxy

mothership-server --port 8000
# or: python tool/cascade_server.py --port 8000

Point your agent at it — anywhere you'd set an OpenAI base URL:

base_url = http://localhost:8000/v1
model    = mothership-cascade

Endpoints: POST /v1/chat/completions, GET /health, GET /stats (live escalation-rate / spend-reduction metrics), GET /v1/models.

Run with Docker

The proxy image is small (stdlib-only — no ML stack). Ollama stays on the host:

docker build -t outpost .
docker run --rm -p 8000:8000 \
  --add-host=host.docker.internal:host-gateway \
  -e GEMINI_API_KEY=your-key \
  outpost

host.docker.internal reaches the host's Ollama; the --add-host flag is needed on Linux (Docker Desktop on macOS/Windows provides it automatically). Point the container elsewhere with -e OLLAMA_HOST=http://some-host:11434.

Configuration

Env var Purpose
OLLAMA_HOST base URL of the Ollama serving the local answerer (default http://localhost:11434)
RILEY_DAEMON path to the built mothership_serve engine binary (else a sibling riley-c/build checkout is auto-detected)
GEMINI_API_KEY / GOOGLE_API_KEY frontier key for escalation (default provider: Gemini)
MOTHERSHIP_MODEL local answerer (default qwen3:4b)
MOTHERSHIP_ENCODER splade (default, English) or bge-m3 (multilingual)
MOTHERSHIP_DEVICE auto | cuda | cpu for SPLADE encoding
MOTHERSHIP_CORPUS env-lobe corpus JSON for the library / Hermes provider
MOTHERSHIP_HOT_PATH opt-in on-disk persistence for the hot tier (else RAM-only)

Key proxy flags: --model, --verify, --accept-conf (higher = more escalation = more accuracy, more spend), --frontier-provider {gemini,openai,openrouter}, --num-ctx, --num-predict.

The RILEY engine

RILEY

The memory layer stores each passage as a compact learned-sparse vector through RILEY — a lossless, parameter-free, CPU-native codec for sparse vectors — kept RAM-resident and served by the mothership_serve daemon. The engine, its benchmarks, and the technical paper live in the separate riley-c repository. This repo depends on the daemon it produces; nothing in the retrieval path needs a GPU.


Repository structure

tool/                 the product
  mothership.py         memory: ingest + SPLADE encode + recall/ask cascade
  cascade_server.py     OpenAI-compatible cascade proxy
  frontier.py           escalation provider (Gemini / OpenAI / OpenRouter)
  encoders.py           pluggable sparse encoder (SPLADE | BGE-M3)
  chunkers.py           pluggable chunker
  evaluate.py           A/B/C evaluation harness
  *_adapter.py          LoCoMo / LongMemEval / RAG dataset adapters
integration/hermes/   Hermes memory-provider plugin (recall + proxy wiring)
experiments/          measurement scripts + frozen design/result docs
Dockerfile            the proxy, containerized
pyproject.toml        pip-installable; mothership + mothership-server scripts

The evaluation harness and dataset adapters that produce the numbers above live in the separate memory-bench repository (adapter-pluggable; Outpost is one registered system under test).

Contributing

Issues and PRs are welcome — especially benchmark reproductions, new frontier providers, and agent-runtime integrations. If the spend-cut thesis is useful to you, a star helps others find the project. ⭐

Authors

  • Nikolas Bielski — Diffraction Logic Labs
  • Rose Wu — University of British Columbia
  • Matthew Ireland — Independent

License

MIT — see LICENSE.

About

Point your agent at one URL and cut your frontier LLM bill ~65% — a small local model answers most turns from its own RAM-resident memory and escalates only what it genuinely can't.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors