The first quantitative benchmark for measuring how well AI agents turn messy real-world data into structured knowledge.
There's no standardized way to measure personal AI agent ingestion quality. Agents claim to "remember everything" — but how much do they actually extract? How accurately do they connect the dots between people, projects, and timelines scattered across emails and documents?
This benchmark answers that with numbers.
Head-to-head: OpenClaw agent (with persistent brain) vs raw Claude API (no memory):
```
-------------------------------------------------------------------
BRAIN INGESTION BENCHMARK RESULTS
-------------------------------------------------------------------
Agent       Ent Recall   Ent Prec    Rel F1   Friction     Time
-------------------------------------------------------------------
openclaw       100.00%     26.42%    27.03%          3   629.9s
baseline        92.86%     28.26%    12.90%          6    36.9s
-------------------------------------------------------------------
```
Ground truth: 14 entities, 13 relationships
Fixture: 3 emails + 1 technical spec (Project Aurora @ Nexus Corp)
Key findings:
- The brain-backed agent finds 100% of entities vs 93% for raw Claude
- 2x better relationship extraction (27% vs 13% F1) — cross-document context matters
- Half the friction (3 interactions vs 6) — the brain handles state management
- Raw Claude is faster but misses entities and struggles to connect information across documents
| Metric | What It Tests | Best |
|---|---|---|
| Entity Recall | Did the agent find all the people, projects, dates, orgs, and technologies? | Higher |
| Entity Precision | How much noise did it extract beyond the ground truth? | Higher |
| Relationship F1 | Can it connect the dots? (e.g., "Alice manages Project Aurora") | Higher |
| Setup Friction | How many commands/API calls to complete the workflow? | Lower |
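Entity recall, precision, and F1 follow the standard set-overlap definitions. A minimal sketch of the computation (the repo's `scoring.py` may differ in details such as name normalization):

```python
def precision_recall_f1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Standard set-overlap metrics: what fraction of predictions are correct
    (precision), what fraction of the ground truth was found (recall), and
    their harmonic mean (F1)."""
    tp = len(predicted & truth)  # true positives: items in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, an agent that finds 13 of 14 ground-truth entities scores 13/14 ≈ 92.86% recall, matching the baseline row in the results table.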
Relationship scoring includes predicate synonym matching — "leads" matches "manages", "built_with" matches "uses", "target_launch" matches "deadline_is" — so agents aren't penalized for using different but semantically equivalent language.
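One way to implement that matching, sketched with the three synonym pairs named above (the actual table in `scoring.py` may be larger):

```python
# Assumed synonym table built from the examples above; the real one may differ.
PREDICATE_SYNONYMS = [
    {"leads", "manages"},
    {"built_with", "uses"},
    {"target_launch", "deadline_is"},
]

def predicates_equivalent(a: str, b: str) -> bool:
    """True on an exact match, or when both predicates fall in the same synonym group."""
    return a == b or any(a in group and b in group for group in PREDICATE_SYNONYMS)
```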
Synthetic enterprise data designed to test cross-document entity resolution — the hard part of ingestion:
- 3 emails between team members discussing Project Aurora (a platform migration at Nexus Corp)
- 1 technical specification that cross-references all email entities and adds new ones
Ground truth: 14 entities (4 people, 1 project, 4 technologies, 2 organizations, 3 dates) and 13 relationships (manages, works_on, reports_to, uses, sponsors, deadline_is).
The challenge: entities and relationships are scattered across documents. "Alice Chen" appears in all 4 documents with different context. "PostgreSQL" is discussed in 2 emails and the spec. An agent needs cross-document reasoning to build the full picture.
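To make the shape of the task concrete, a ground-truth fragment might look like this. This is illustrative only — the real entity and relationship lists live in `core/fixtures.py`, and only the names already mentioned in this README appear below:

```python
# Illustrative fragment of the ground truth, not the actual fixture data.
GROUND_TRUTH = {
    "entities": [
        "Alice Chen",      # one of the 4 people
        "Project Aurora",  # the project
        "Nexus Corp",      # one of the 2 organizations
        "PostgreSQL",      # one of the 4 technologies
    ],
    "relationships": [
        # (subject, predicate, object) triples
        ("Alice Chen", "manages", "Project Aurora"),
        ("Project Aurora", "uses", "PostgreSQL"),
        ("Nexus Corp", "sponsors", "Project Aurora"),
    ],
}
```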
```bash
# Clone
git clone https://github.com/toddegray/BrainIngestionBenchmark.git
cd BrainIngestionBenchmark

# Set up
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY

# Run
source .env && export ANTHROPIC_API_KEY
python benchmark.py --agent baseline   # Claude API only
python benchmark.py                    # All agents (needs openclaw CLI)
python benchmark.py --json             # JSON output for programmatic use
```

Run the test suite with:

```bash
python -m unittest discover tests/
```

53 unit tests covering scoring metrics, fixture determinism, and LLM response parsing. No external dependencies or API keys needed.
```
BrainIngestionBenchmark/
├── core/
│   ├── fixtures.py          # Deterministic synthetic data + ground truth
│   ├── evaluator.py         # Abstract agent interface (3 methods)
│   └── scoring.py           # Pure metric functions (zero dependencies)
├── agents/
│   ├── openclaw.py          # OpenClaw agent via CLI subprocess
│   ├── baseline.py          # Claude API multi-turn conversation
│   └── response_parser.py   # Robust LLM output parsing
├── tests/                   # 53 unit tests
├── benchmark.py             # CLI runner
└── requirements.txt         # Just the anthropic SDK
```
Design principles:

- Pure scoring functions with zero dependencies — fully unit-testable, following gbrain's eval pattern
- Agent-agnostic interface — implement 3 methods (`ingest`, `query_entities`, `query_relationships`) to add any agent
- Deterministic fixtures — no randomness, no file I/O, same data every run
- Minimal dependencies — only the `anthropic` SDK; everything else is Python stdlib
- Create `agents/your_agent.py` implementing `AgentEvaluator` from `core/evaluator.py`:

```python
class YourAgentEvaluator(AgentEvaluator):
    def ingest(self, fixture_bundle: dict) -> None: ...
    def query_entities(self) -> list[str]: ...
    def query_relationships(self) -> list[tuple[str, str, str]]: ...
```

- Add discovery logic in `benchmark.py:discover_agents()`
- Run: `python benchmark.py --agent your_agent`
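To make the three-method contract concrete, here's a self-contained toy evaluator. The `AgentEvaluator` stand-in below mirrors the interface described above so the sketch runs on its own (a real agent would import it from `core/evaluator.py`), and the `"documents"` key in the fixture bundle is an assumption, not taken from the repo:

```python
from abc import ABC, abstractmethod

# Stand-in for core.evaluator.AgentEvaluator so this sketch is self-contained.
class AgentEvaluator(ABC):
    @abstractmethod
    def ingest(self, fixture_bundle: dict) -> None: ...
    @abstractmethod
    def query_entities(self) -> list[str]: ...
    @abstractmethod
    def query_relationships(self) -> list[tuple[str, str, str]]: ...

class KeywordAgentEvaluator(AgentEvaluator):
    """Toy agent: 'extracts' entities by spotting known names in the fixture text."""
    KNOWN = ("Alice Chen", "Project Aurora", "Nexus Corp", "PostgreSQL")

    def __init__(self) -> None:
        self.found: list[str] = []

    def ingest(self, fixture_bundle: dict) -> None:
        # "documents" is a hypothetical bundle key used for this sketch.
        text = " ".join(fixture_bundle.get("documents", []))
        self.found = [name for name in self.KNOWN if name in text]

    def query_entities(self) -> list[str]:
        return self.found

    def query_relationships(self) -> list[tuple[str, str, str]]:
        return []  # a toy agent extracts no relationships
```

A trivial agent like this would score near-zero on relationship F1 — which is exactly the axis where the benchmark separates real ingestion pipelines from keyword matching.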
Every personal AI agent — whether it's a second brain, a knowledge base, or a memory layer — needs to solve the same fundamental problem: turning messy, unstructured data into something it can reason about later.
Without a benchmark, we're comparing marketing claims. With one, we can measure what actually works.
MIT