Experimental context extension for local LLMs via hierarchical retrieval.
Status: Early-Stage Research Prototype / Under Active Development
This project explores extending LLM effective context using the Hierarchical Attention Tree (HAT) for retrieval-augmented memory. Current benchmarks measure retrieval recall on synthetic data — not end-to-end task accuracy. The "100% retrieval" figures below mean HAT finds the right chunks in controlled tests, not that the LLM produces correct answers 100% of the time. Real-world performance depends on many factors (query quality, chunk boundaries, model capability) that have not been rigorously evaluated.
This is a research prototype, not production-ready software. Use at your own risk. Rigorous benchmarking is in progress.
# Docker (one command, works everywhere)
docker run -it --rm --network host andrewmang/infinite-context
# Or with docker-compose for full stack
curl -O https://raw.githubusercontent.com/Lumi-node/infinite-context/main/docker-compose.yml
docker-compose up -d
Try it on Hugging Face Spaces - See HAT in action right in your browser!
# Linux/macOS - installs everything automatically
curl -sSL https://raw.githubusercontent.com/Lumi-node/infinite-context/main/install.sh | bash
# Clone the repo
git clone https://github.com/Lumi-node/infinite-context
cd infinite-context
# Install Python package (recommended - full HAT support)
pip install maturin sentence-transformers
maturin develop --release
# Or build Rust CLI (benchmarks only)
cargo build --release
| Model | Native Context | Addressable via HAT | Extension (retrieval only) |
|---|---|---|---|
| gemma3:1b | 8K | 11.3M+ | 1,413x |
| phi4 | 16K | 11.3M+ | 706x |
| llama3.2 | 8K | 11.3M+ | 1,413x |
These figures represent the amount of stored text HAT can search through — not that the model "understands" all 11M tokens simultaneously. Retrieved chunks are injected into the model's native context window. End-to-end task accuracy (does the model answer correctly?) has not been formally benchmarked.
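To make the table concrete, here is the arithmetic behind the "extension" column, plus a rough injection budget. The chunk size and reserved-token figures are illustrative assumptions, not HAT's actual parameters:

```python
# Illustrative arithmetic only; figures taken from the table above.
# "Extension" = stored, searchable text / native window -- not text
# the model attends to simultaneously.

NATIVE_CONTEXT = 8_000       # gemma3:1b native window, in tokens
ADDRESSABLE = 11_300_000     # text HAT can index and search, in tokens

extension = ADDRESSABLE / NATIVE_CONTEXT
print(f"extension factor: ~{extension:.0f}x")

# At answer time only a few retrieved chunks are injected. Hypothetical
# budget: reserve 1,000 tokens for the question and the model's reply.
CHUNK_TOKENS = 512
max_chunks = (NATIVE_CONTEXT - 1_000) // CHUNK_TOKENS
print(f"retrieved chunks that fit per prompt: {max_chunks}")
```

However large the indexed corpus grows, the model only ever sees the handful of chunks that fit in its native window.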
Local models like Gemma 3 (8K) and Phi 4 (16K) are capable, but they forget everything outside their small context windows. Conventional flat RAG pipelines help, yet single-level similarity search often misses relevant chunks, losing critical information.
The Hierarchical Attention Tree (HAT) exploits the natural hierarchy of conversations: sessions → documents → chunks.
Instead of scoring every chunk (O(n)), HAT runs an O(log n) beam search down the hierarchy, achieving high retrieval recall in synthetic benchmarks. Real-world accuracy depends on data structure, embedding quality, and query characteristics.
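The beam-search idea can be sketched in a few lines. This is not the actual HAT implementation (node layout, beam width, and scoring are assumptions for illustration); it only shows how scoring parents first prunes most chunks from consideration:

```python
# Minimal sketch of beam search over a two-level hierarchy
# (sessions -> chunks), using cosine similarity on unit vectors.
import numpy as np

def beam_search(query, session_embs, chunk_embs_per_session,
                beam_width=2, k=3):
    """Score sessions first, then score chunks only inside the best
    `beam_width` sessions -- instead of scoring every chunk (O(n))."""
    session_scores = session_embs @ query            # one dot per session
    beam = np.argsort(session_scores)[::-1][:beam_width]

    candidates = []                                  # (score, session, chunk)
    for s in beam:
        chunk_scores = chunk_embs_per_session[s] @ query
        for c in np.argsort(chunk_scores)[::-1][:k]:
            candidates.append((float(chunk_scores[c]), int(s), int(c)))
    candidates.sort(reverse=True)
    return candidates[:k]

# Toy data: 3 sessions x 4 chunks of 8-dim unit vectors.
rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)
sessions = unit(rng.normal(size=(3, 8)))
chunks = [unit(rng.normal(size=(4, 8))) for _ in range(3)]
top = beam_search(unit(rng.normal(size=8)), sessions, chunks)
print(top)
```

With a deeper tree the same pruning repeats at every level, which is where the logarithmic cost comes from.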
# Pull and run immediately
docker run -it --rm --network host andrewmang/infinite-context
# Run benchmark
docker run -it --rm andrewmang/infinite-context infinite-context bench --chunks 100000
# Full stack with Ollama
docker-compose up -d
docker-compose exec infinite-context infinite-context chat --model gemma3:1b
The Python API uses real embeddings + HAT retrieval + Ollama. Note: this is experimental research software, not a production-ready system.
# From the repo (after cloning)
pip install maturin sentence-transformers
maturin develop --release
from infinite_context import InfiniteContext
# Initialize - connects to Ollama
ctx = InfiniteContext(model="gemma3:1b")
# Add information (automatically embedded with sentence-transformers and indexed in HAT)
ctx.add("My name is Alex and I work on quantum computing.")
ctx.add("The latest experiment showed 47% improvement in coherence.")
# Chat - HAT retrieves relevant context, injects it into prompt, queries Ollama
response = ctx.chat("What were the quantum experiment results?")
print(response) # References the 47% improvement
# Save memory to disk
ctx.save("my_memory.hat")
# Load later
ctx = InfiniteContext.load("my_memory.hat", model="gemma3:1b")
from infinite_context import HatIndex
from sentence_transformers import SentenceTransformer
# Setup
embedder = SentenceTransformer('all-MiniLM-L6-v2')
index = HatIndex.cosine(384)
# Add embeddings
embedding = embedder.encode("Important info", normalize_embeddings=True)
index.add(embedding.tolist())
# Query
query_emb = embedder.encode("What's important?", normalize_embeddings=True)
results = index.near(query_emb.tolist(), k=10)
# Persist
index.save("index.hat")
The Rust CLI is useful for benchmarking HAT performance and testing Ollama connectivity.
Note: For actual chat with HAT memory retrieval, use the Python API above.
# Build the CLI
cargo build --release
# Run HAT performance benchmark
./target/release/infinite-context bench --chunks 100000
# Test Ollama connection
./target/release/infinite-context test --model gemma3:1b
# List available models
./target/release/infinite-context models
- Rust: 1.70+ (for CLI)
- Python: 3.9+ (for Python API)
- Ollama: Any version
- RAM: 4GB minimum
git clone https://github.com/Lumi-node/infinite-context
cd infinite-context
# Rust CLI
cargo build --release
./target/release/infinite-context --help
# Python wheel
pip install maturin
maturin develop --release
We're exploring whether local, hierarchical retrieval can meaningfully extend context for small LLMs — without sending data to cloud APIs.
Design goals:
- Local: Runs on your hardware, data stays on your machine
- Free: No API costs
- Fast retrieval: Sub-millisecond HAT queries in synthetic benchmarks
- High retrieval recall: 100% on synthetic hierarchical test data (real-world accuracy not yet validated)
Note: This is a research project exploring an idea, not a finished product. The retrieval layer works well in controlled tests, but end-to-end quality (does the LLM actually give better answers?) needs rigorous evaluation. We are actively working on this.
Based on the Hierarchical Attention Tree (HAT) algorithm. Key hypothesis: conversations naturally form hierarchies (sessions → documents → chunks), and exploiting this structure may enable O(log n) retrieval with high recall. Validating this hypothesis rigorously is ongoing work.
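A back-of-envelope calculation shows why the hypothesis is attractive. With branching factor b and n chunks, the tree is about log_b(n) deep, and a beam of width w scores roughly w·b nodes per level versus n scores for a flat scan. The numbers below (b=32, w=4) are illustrative assumptions, not HAT's actual parameters:

```python
# Rough comparison counts: hierarchical beam search vs. a flat scan.
import math

n, b, w = 100_000, 32, 4
depth = math.ceil(math.log(n, b))        # levels in the tree
beam_cost = w * b * depth                # comparisons for beam search
flat_cost = n                            # comparisons for a flat scan
print(f"depth={depth}, beam={beam_cost}, flat={flat_cost}")
```

Under these assumptions the beam search does a few hundred comparisons where a flat scan does a hundred thousand; whether real conversation data forms hierarchies clean enough to preserve recall at that cost is exactly the open question stated above.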
MIT
| Method | Command | Notes |
|---|---|---|
| Docker | docker run -it --rm --network host andrewmang/infinite-context | Full setup |
| Browser | Hugging Face Spaces | Try HAT live |
| Source | git clone ... && maturin develop --release | Python API (recommended) |
An experiment in local, hierarchical AI memory. Contributions and feedback welcome.