The Context Optimization Layer for LLM Applications
Tool outputs are 70-95% redundant boilerplate. Headroom compresses that away.
```bash
pip install "headroom-ai[all]"
headroom proxy --port 8787

# Claude Code — just set the base URL
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Cursor, Continue, any OpenAI-compatible tool
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
```

Works with any language, any tool, any framework. One env var. Proxy docs
```python
from headroom import compress

result = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(model="claude-sonnet-4-5-20250929", messages=result.messages)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
```

Works with any Python LLM client — Anthropic, OpenAI, LiteLLM, httpx, anything.
You don't need to rewrite anything. Drop Headroom into your existing stack:
| Your setup | Add Headroom | One-liner |
|---|---|---|
| LiteLLM | Callback | `litellm.callbacks = [HeadroomCallback()]` |
| Any Python proxy | ASGI middleware | `app.add_middleware(CompressionMiddleware)` |
| Any Python app | `compress()` | `result = compress(messages, model="gpt-4o")` |
| Agno agents | Wrap model | `HeadroomAgnoModel(your_model)` |
| LangChain | Wrap model | `HeadroomChatModel(your_llm)` (experimental) |
Full Integration Guide — detailed setup for LiteLLM, ASGI middleware, compress(), and every framework.
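Whichever entry point you pick, one pattern worth layering on top is a savings guard: only swap in the compressed messages when the reduction clears a threshold. This is a sketch of application-level glue, not part of Headroom's API; `maybe_compress` and `min_savings` are invented names, and `compress_fn` stands in for `headroom.compress`.

```python
def maybe_compress(messages, compress_fn, min_savings=500):
    """Use the compressed form only when the savings clear a threshold.
    min_savings is an application choice, not a Headroom setting."""
    result = compress_fn(messages)
    if result.tokens_saved >= min_savings:
        return result.messages
    return messages  # below threshold: send the originals unchanged
```

The threshold keeps tiny payloads (where compression overhead outweighs the savings) on the fast path.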
100 production log entries. One critical error buried at position 67.
| | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
Both responses: "payment-gateway, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected."
87.6% fewer tokens. Same answer. Run it: `python examples/needle_in_haystack_test.py`
What Headroom kept
From 100 log entries, SmartCrusher kept 6: first 3 (boundary), the FATAL error at position 67 (anomaly detection), and last 2 (recency). The error was automatically preserved — not by keyword matching, but by statistical analysis of field variance.
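As an illustration of how variance-based selection can surface an anomaly without keyword matching, here is a self-contained toy (not SmartCrusher's actual code; the 5% rarity threshold and field-frequency scoring are invented for the example):

```python
from collections import Counter

def select_entries(entries, head=3, tail=2):
    """Toy boundary + anomaly + recency selection over an array of dicts.
    An entry is anomalous if any of its field values is rare across the array."""
    n = len(entries)
    keep = set(range(head)) | set(range(n - tail, n))
    # Frequency of every (field, value) pair across all entries
    freq = Counter((k, str(v)) for e in entries for k, v in e.items())
    for i, e in enumerate(entries):
        if any(freq[(k, str(v))] < 0.05 * n for k, v in e.items()):
            keep.add(i)  # statistically rare value -> preserve the entry
    return sorted(keep)

logs = [{"level": "INFO", "msg": "ok"} for _ in range(100)]
logs[67] = {"level": "FATAL", "msg": "PG-5523 connection pool exhausted"}
print(select_entries(logs))  # → [0, 1, 2, 67, 98, 99]
```

The lone FATAL entry survives because its field values occur once in 100 rows, not because any rule mentions the word "FATAL".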
Headroom is evaluated on real OSS benchmarks — compression preserves accuracy.
Standard Benchmarks — Baseline (direct to API) vs Headroom (through proxy):
| Benchmark | Category | N | Baseline | Headroom | Delta |
|---|---|---|---|---|---|
| GSM8K | Math | 100 | 0.870 | 0.870 | 0.000 |
| TruthfulQA | Factual | 100 | 0.530 | 0.560 | +0.030 |
Compression Benchmarks — Accuracy after compression + CCR (full stack):
| Benchmark | Category | N | Accuracy | Compression | Method |
|---|---|---|---|---|---|
| SQuAD v2 | QA | 100 | 97% | 19% | Before/After |
| BFCL | Tool/Function | 100 | 97% | 32% | LLM-as-Judge |
| Tool Outputs (built-in) | Agent | 8 | 100% | 20% | Before/After |
| CCR Needle Retention | Lossless | 50 | 100% | 77% | Exact Match |
Run it yourself:
```bash
# Quick smoke test (8 cases, ~10s)
python -m headroom.evals quick -n 8 --provider openai --model gpt-4o-mini

# Full Tier 1 suite (~$3, ~15 min)
python -m headroom.evals suite --tier 1 -o eval_results/

# CI mode (exit 1 on regression)
python -m headroom.evals suite --tier 1 --ci
```

Full methodology: Benchmarks | Evals Framework
```mermaid
flowchart LR
    App["Your App"] --> H["Headroom"] --> LLM["LLM Provider"]
    LLM --> Resp["Response"]
```

```mermaid
flowchart TB
    subgraph Pipeline["Transform Pipeline"]
        CA["1. CacheAligner\nStabilizes prefix for KV cache"]
        CR["2. ContentRouter\nDetects content type, picks compressor"]
        IC["3. IntelligentContext\nScore-based token fitting"]
        QE["4. Query Echo\nRe-injects user question"]
        CA --> CR --> IC --> QE
    end
    subgraph Compressors["ContentRouter dispatches to"]
        SC["SmartCrusher\nAny JSON type"]
        CC["CodeCompressor\nAST-aware code"]
        LL["LLMLingua\nML-based text"]
    end
    subgraph CCR["CCR: Compress-Cache-Retrieve"]
        Store[("Compressed\nStore")]
        Tool["headroom_retrieve"]
        Tool <--> Store
    end
    CR --> Compressors
    SC -. "stores originals +\nsummary of what's omitted" .-> Store
    QE --> LLM["LLM Provider"]
    LLM -. "retrieves when\nit needs more" .-> Tool
```
Headroom never throws data away. It compresses aggressively and retrieves precisely. When it compresses 500 items to 20, it tells the LLM what was omitted ("87 passed, 2 failed, 1 error") so the LLM knows when to ask for more.
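The compress-and-summarize contract can be sketched in a few lines. This is a toy model of the CCR idea, not Headroom's implementation; `compress_with_summary`, `retrieve`, and the in-memory `STORE` are invented for illustration:

```python
STORE = {}

def compress_with_summary(items, batch_id, keep=3, key="status"):
    """Keep a few items verbatim, store all originals, and summarize
    what was omitted so the model knows when to ask for more."""
    STORE[batch_id] = items  # nothing is thrown away
    kept, omitted = items[:keep], items[keep:]
    counts = {}
    for item in omitted:
        counts[item[key]] = counts.get(item[key], 0) + 1
    summary = ", ".join(f"{n} {status}" for status, n in counts.items())
    return {"items": kept,
            "omitted": f"{len(omitted)} omitted ({summary}); retrieve('{batch_id}') returns the originals"}

def retrieve(batch_id):
    """Stand-in for the headroom_retrieve tool call."""
    return STORE[batch_id]

results = [{"status": "passed"}] * 87 + [{"status": "failed"}] * 2 + [{"status": "error"}]
view = compress_with_summary(results, "tests-1")
print(view["omitted"])
# → 97 omitted (84 passed, 2 failed, 1 error); retrieve('tests-1') returns the originals
```

The omission summary is what makes the compression safe: the model sees the shape of what's missing and can retrieve the originals on demand.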
| Scenario | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| Codebase exploration | 78,502 | 41,254 | 47% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
Overhead: 15-200ms compression latency (net positive for Sonnet/Opus). Full data: Latency Benchmarks
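For a rough sense of why the latency is usually a good trade, compare per-request dollar savings for the SRE scenario above (the per-token price is a placeholder, not any provider's current rate):

```python
# Back-of-envelope cost check for the SRE incident-debugging scenario.
tokens_saved = 65_694 - 5_118       # before minus after, from the table above
price_per_million = 3.00            # assumed $/1M input tokens (placeholder)
savings_per_request = tokens_saved * price_per_million / 1_000_000
print(f"${savings_per_request:.4f} saved per request")  # → $0.1817 saved per request
```

Fewer input tokens also shorten prefill time, which is what offsets the 15-200ms compression cost on larger models.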
| Integration | Status | Docs |
|---|---|---|
| `compress()` — one function | Stable | Integration Guide |
| LiteLLM callback | Stable | Integration Guide |
| ASGI middleware | Stable | Integration Guide |
| Proxy server | Stable | Proxy Docs |
| Agno | Stable | Agno Guide |
| MCP (Claude Code) | Stable | MCP Guide |
| Strands | Stable | Strands Guide |
| LangChain | Experimental | LangChain Guide |
| Feature | What it does |
|---|---|
| Content Router | Auto-detects content type, routes to optimal compressor |
| SmartCrusher | Universal JSON compression — arrays of dicts, strings, numbers, mixed types, nested objects |
| CodeCompressor | AST-aware compression for Python, JS, Go, Rust, Java, C++ |
| LLMLingua-2 | ML-based 20x text compression |
| CCR | Reversible compression — LLM retrieves originals when needed |
| Compression Summaries | Tells the LLM what was omitted ("3 errors, 12 failures") |
| Query Echo | Re-injects user question after compressed data for better attention |
| CacheAligner | Stabilizes prefixes for provider KV cache hits |
| IntelligentContext | Score-based context management with learned importance |
| Image Compression | 40-90% token reduction via trained ML router |
| Memory | Persistent memory across conversations |
| Compression Hooks | Customize compression with pre/post hooks |
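Query Echo from the table above is simple enough to sketch end to end. This toy (not Headroom's implementation) re-appends the latest user question after a trailing tool result, so it sits near the end of the prompt where models attend most reliably:

```python
def query_echo(messages):
    """If the conversation ends on a tool result, repeat the most recent
    user question after it. Message shape follows the common role/content
    chat format; the '(reminder)' prefix is an invented convention."""
    question = next((m["content"] for m in reversed(messages) if m["role"] == "user"), None)
    out = list(messages)
    if question and out and out[-1]["role"] == "tool":
        out.append({"role": "user", "content": f"(reminder) {question}"})
    return out

msgs = [
    {"role": "user", "content": "Which service failed?"},
    {"role": "tool", "content": "<compressed log output>"},
]
print(query_echo(msgs)[-1]["content"])  # → (reminder) Which service failed?
```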
```bash
headroom proxy --backend bedrock --region us-east-1      # AWS Bedrock
headroom proxy --backend vertex_ai --region us-central1  # Google Vertex
headroom proxy --backend azure                           # Azure OpenAI
headroom proxy --backend openrouter                      # OpenRouter (400+ models)
```

```bash
pip install headroom-ai               # Core library
pip install "headroom-ai[all]"        # Everything including evals (recommended)
pip install "headroom-ai[proxy]"      # Proxy server
pip install "headroom-ai[mcp]"        # MCP for Claude Code
pip install "headroom-ai[agno]"       # Agno integration
pip install "headroom-ai[langchain]"  # LangChain (experimental)
pip install "headroom-ai[evals]"      # Evaluation framework only
```

Requires Python 3.10+.
| Doc | Covers |
|---|---|
| Integration Guide | LiteLLM, ASGI, compress(), proxy |
| Proxy Docs | Proxy server configuration |
| Architecture | How the pipeline works |
| CCR Guide | Reversible compression |
| Benchmarks | Accuracy validation |
| Latency Benchmarks | Compression overhead & cost-benefit analysis |
| Limitations | When compression helps, when it doesn't |
| Evals Framework | Prove compression preserves accuracy |
| Memory | Persistent memory |
| Agno | Agno agent framework |
| MCP | Claude Code subscriptions |
| Configuration | All options |
```bash
git clone https://github.com/chopratejas/headroom.git && cd headroom
pip install -e ".[dev]" && pytest
```

Apache License 2.0 — see LICENSE.
