The first quantitative benchmark for measuring how well AI agents turn messy real-world data into structured knowledge.
There's no standardized way to measure personal AI agent ingestion quality. Agents claim to "remember everything" — but how much do they actually extract? How accurately do they connect the dots between people, projects, and timelines scattered across emails and documents?
This benchmark answers that with numbers.
Head-to-head: OpenClaw agent (with persistent brain) vs raw Claude API (no memory):
```
-------------------------------------------------------------------
BRAIN INGESTION BENCHMARK RESULTS
-------------------------------------------------------------------
Agent       Ent Recall   Ent Prec    Rel F1   Friction     Time
-------------------------------------------------------------------
openclaw       100.00%     26.42%    27.03%          3   629.9s
baseline        92.86%     28.26%    12.90%          6    36.9s
-------------------------------------------------------------------
```
Ground truth: 14 entities, 13 relationships
Fixture: 3 emails + 1 technical spec (Project Aurora @ Nexus Corp)
Key findings:
- The brain-backed agent finds 100% of entities vs 93% for raw Claude
- 2x better relationship extraction (27% vs 13% F1) — cross-document context matters
- Half the friction (3 interactions vs 6) — the brain handles state management
- Raw Claude is faster but misses entities and struggles to connect information across documents
| Metric | What It Tests | Best |
|---|---|---|
| Entity Recall | Did the agent find all the people, projects, dates, orgs, and technologies? | Higher |
| Entity Precision | How much noise did it extract beyond the ground truth? | Higher |
| Relationship F1 | Can it connect the dots? (e.g., "Alice manages Project Aurora") | Higher |
| Setup Friction | How many commands/API calls to complete the workflow? | Lower |
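Entity recall, precision, and F1 follow the standard set-overlap definitions. A minimal sketch of the computation (the repo's `scoring.py` may differ in details such as name normalization):

```python
def precision_recall_f1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Standard set-overlap metrics: what fraction of predictions are correct
    (precision), what fraction of the ground truth was found (recall), and
    their harmonic mean (F1)."""
    tp = len(predicted & truth)  # true positives: items in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, an agent that finds 13 of 14 ground-truth entities scores 13/14 ≈ 92.86% recall, matching the baseline row in the results table.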
Relationship scoring includes predicate synonym matching — "leads" matches "manages", "built_with" matches "uses", "target_launch" matches "deadline_is" — so agents aren't penalized for using different but semantically equivalent language.
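One way to implement that matching, sketched with the three synonym pairs named above (the actual table in `scoring.py` may be larger):

```python
# Assumed synonym table built from the examples above; the real one may differ.
PREDICATE_SYNONYMS = [
    {"leads", "manages"},
    {"built_with", "uses"},
    {"target_launch", "deadline_is"},
]

def predicates_equivalent(a: str, b: str) -> bool:
    """True on an exact match, or when both predicates fall in the same synonym group."""
    return a == b or any(a in group and b in group for group in PREDICATE_SYNONYMS)
```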
Synthetic enterprise data designed to test cross-document entity resolution — the hard part of ingestion:
- 3 emails between team members discussing Project Aurora (a platform migration at Nexus Corp)
- 1 technical specification that cross-references all email entities and adds new ones
Ground truth: 14 entities (4 people, 1 project, 4 technologies, 2 organizations, 3 dates) and 13 relationships (manages, works_on, reports_to, uses, sponsors, deadline_is).
The challenge: entities and relationships are scattered across documents. "Alice Chen" appears in all 4 documents with different context. "PostgreSQL" is discussed in 2 emails and the spec. An agent needs cross-document reasoning to build the full picture.
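To make the shape of the task concrete, a ground-truth fragment might look like this. This is illustrative only — the real entity and relationship lists live in `core/fixtures.py`, and only the names already mentioned in this README appear below:

```python
# Illustrative fragment of the ground truth, not the actual fixture data.
GROUND_TRUTH = {
    "entities": [
        "Alice Chen",      # one of the 4 people
        "Project Aurora",  # the project
        "Nexus Corp",      # one of the 2 organizations
        "PostgreSQL",      # one of the 4 technologies
    ],
    "relationships": [
        # (subject, predicate, object) triples
        ("Alice Chen", "manages", "Project Aurora"),
        ("Project Aurora", "uses", "PostgreSQL"),
        ("Nexus Corp", "sponsors", "Project Aurora"),
    ],
}
```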
```bash
# Clone
git clone https://github.com/toddegray/BrainIngestionBenchmark.git
cd BrainIngestionBenchmark

# Set up
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY

# Run
source .env && export ANTHROPIC_API_KEY
python benchmark.py --agent baseline   # Claude API only
python benchmark.py                    # All agents (needs openclaw CLI)
python benchmark.py --json             # JSON output for programmatic use
```

Run the test suite with:

```bash
python -m unittest discover tests/
```

53 unit tests covering scoring metrics, fixture determinism, and LLM response parsing. No external dependencies or API keys needed.
```
BrainIngestionBenchmark/
├── core/
│   ├── fixtures.py          # Deterministic synthetic data + ground truth
│   ├── evaluator.py         # Abstract agent interface (3 methods)
│   └── scoring.py           # Pure metric functions (zero dependencies)
├── agents/
│   ├── openclaw.py          # OpenClaw agent via CLI subprocess
│   ├── baseline.py          # Claude API multi-turn conversation
│   └── response_parser.py   # Robust LLM output parsing
├── tests/                   # 53 unit tests
├── benchmark.py             # CLI runner
└── requirements.txt         # Just the anthropic SDK
```
Design principles:

- Pure scoring functions with zero dependencies — fully unit-testable, following gbrain's eval pattern
- Agent-agnostic interface — implement 3 methods (`ingest`, `query_entities`, `query_relationships`) to add any agent
- Deterministic fixtures — no randomness, no file I/O, same data every run
- Minimal dependencies — only the `anthropic` SDK; everything else is Python stdlib
- Create `agents/your_agent.py` implementing `AgentEvaluator` from `core/evaluator.py`:

```python
class YourAgentEvaluator(AgentEvaluator):
    def ingest(self, fixture_bundle: dict) -> None: ...
    def query_entities(self) -> list[str]: ...
    def query_relationships(self) -> list[tuple[str, str, str]]: ...
```

- Add discovery logic in `benchmark.py:discover_agents()`
- Run: `python benchmark.py --agent your_agent`
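To make the three-method contract concrete, here's a self-contained toy evaluator. The `AgentEvaluator` stand-in below mirrors the interface described above so the sketch runs on its own (a real agent would import it from `core/evaluator.py`), and the `"documents"` key in the fixture bundle is an assumption, not taken from the repo:

```python
from abc import ABC, abstractmethod

# Stand-in for core.evaluator.AgentEvaluator so this sketch is self-contained.
class AgentEvaluator(ABC):
    @abstractmethod
    def ingest(self, fixture_bundle: dict) -> None: ...
    @abstractmethod
    def query_entities(self) -> list[str]: ...
    @abstractmethod
    def query_relationships(self) -> list[tuple[str, str, str]]: ...

class KeywordAgentEvaluator(AgentEvaluator):
    """Toy agent: 'extracts' entities by spotting known names in the fixture text."""
    KNOWN = ("Alice Chen", "Project Aurora", "Nexus Corp", "PostgreSQL")

    def __init__(self) -> None:
        self.found: list[str] = []

    def ingest(self, fixture_bundle: dict) -> None:
        # "documents" is a hypothetical bundle key used for this sketch.
        text = " ".join(fixture_bundle.get("documents", []))
        self.found = [name for name in self.KNOWN if name in text]

    def query_entities(self) -> list[str]:
        return self.found

    def query_relationships(self) -> list[tuple[str, str, str]]:
        return []  # a toy agent extracts no relationships
```

A trivial agent like this would score near-zero on relationship F1 — which is exactly the axis where the benchmark separates real ingestion pipelines from keyword matching.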
Every personal AI agent — whether it's a second brain, a knowledge base, or a memory layer — needs to solve the same fundamental problem: turning messy, unstructured data into something it can reason about later.
Without a benchmark, we're comparing marketing claims. With one, we can measure what actually works.
MIT