Brain Ingestion Benchmark

The first quantitative benchmark for measuring how well AI agents turn messy real-world data into structured knowledge.

There's no standardized way to measure personal AI agent ingestion quality. Agents claim to "remember everything" — but how much do they actually extract? How accurately do they connect the dots between people, projects, and timelines scattered across emails and documents?

This benchmark answers that with numbers.

Results

Head-to-head: OpenClaw agent (with persistent brain) vs raw Claude API (no memory):

-------------------------------------------------------------------
BRAIN INGESTION BENCHMARK RESULTS
-------------------------------------------------------------------
Agent          Ent Recall   Ent Prec     Rel F1   Friction     Time
-------------------------------------------------------------------
openclaw          100.00%     26.42%     27.03%          3   629.9s
baseline           92.86%     28.26%     12.90%          6    36.9s
-------------------------------------------------------------------

Ground truth: 14 entities, 13 relationships
Fixture: 3 emails + 1 technical spec (Project Aurora @ Nexus Corp)

Key findings:

  • The brain-backed agent finds 100% of entities vs 93% for raw Claude
  • 2x better relationship extraction (27% vs 13% F1) — cross-document context matters
  • Half the friction (3 interactions vs 6) — the brain handles state management
  • Raw Claude is faster but misses entities and struggles to connect information across documents

What It Measures

Metric             What It Tests                                                                    Best
---------------------------------------------------------------------------------------------------------
Entity Recall      Did the agent find all the people, projects, dates, orgs, and technologies?     Higher
Entity Precision   How much noise did it extract beyond the ground truth?                          Higher
Relationship F1    Can it connect the dots? (e.g., "Alice manages Project Aurora")                 Higher
Setup Friction     How many commands/API calls to complete the workflow?                           Lower
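
For intuition, entity recall and precision reduce to set overlap between what the agent reports and the ground truth. A minimal sketch follows; case-normalized matching is an assumption here, and core/scoring.py may normalize differently:

def entity_scores(predicted: list[str], truth: list[str]) -> tuple[float, float]:
    """Illustrative recall and precision over case-normalized entity names."""
    pred = {p.strip().lower() for p in predicted}
    gold = {t.strip().lower() for t in truth}
    hits = pred & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(pred) if pred else 0.0
    return recall, precision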

Relationship scoring includes predicate synonym matching — "leads" matches "manages", "built_with" matches "uses", "target_launch" matches "deadline_is" — so agents aren't penalized for using different but semantically equivalent language.
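
A sketch of how that scoring could work; the synonym groups below are only the examples mentioned above, not the full map in core/scoring.py:

# Illustrative predicate synonym groups (examples from this README only).
PREDICATE_SYNONYMS = {
    "manages": {"manages", "leads"},
    "uses": {"uses", "built_with"},
    "deadline_is": {"deadline_is", "target_launch"},
}

def canonical_predicate(pred: str) -> str:
    for canon, group in PREDICATE_SYNONYMS.items():
        if pred in group:
            return canon
    return pred

def relationship_f1(predicted: list[tuple[str, str, str]],
                    truth: list[tuple[str, str, str]]) -> float:
    """F1 over (subject, predicate, object) triples with synonymous predicates collapsed."""
    def normalize(s, p, o):
        return (s.lower(), canonical_predicate(p.lower()), o.lower())
    pred = {normalize(*t) for t in predicted}
    gold = {normalize(*t) for t in truth}
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)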

Test Data

Synthetic enterprise data designed to test cross-document entity resolution — the hard part of ingestion:

  • 3 emails between team members discussing Project Aurora (a platform migration at Nexus Corp)
  • 1 technical specification that cross-references all email entities and adds new ones

Ground truth: 14 entities (4 people, 1 project, 4 technologies, 2 organizations, 3 dates) and 13 relationships (manages, works_on, reports_to, uses, sponsors, deadline_is).
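
Roughly, the ground truth looks like the sketch below; only names already mentioned in this README are shown, and the exact structure in core/fixtures.py may differ:

# Hypothetical shape; the elided entries are not listed in this README.
GROUND_TRUTH = {
    "entities": [
        "Alice Chen",       # one of the 4 people
        "Project Aurora",   # the single project
        "PostgreSQL",       # one of the 4 technologies
        "Nexus Corp",       # one of the 2 organizations
        # ...plus the remaining people, technologies, organization, and 3 dates
    ],
    "relationships": [
        ("Alice Chen", "manages", "Project Aurora"),
        ("Project Aurora", "uses", "PostgreSQL"),   # illustrative triple
        # ...plus the remaining works_on, reports_to, sponsors, deadline_is triples
    ],
}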

The challenge: entities and relationships are scattered across documents. "Alice Chen" appears in all 4 documents with different context. "PostgreSQL" is discussed in 2 emails and the spec. An agent needs cross-document reasoning to build the full picture.

Quick Start

# Clone
git clone https://github.com/toddegray/BrainIngestionBenchmark.git
cd BrainIngestionBenchmark

# Set up
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY

# Run
source .env && export ANTHROPIC_API_KEY
python benchmark.py --agent baseline     # Claude API only
python benchmark.py                       # All agents (needs openclaw CLI)
python benchmark.py --json                # JSON output for programmatic use
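
If you want to consume the --json output from another script, something like the sketch below works, assuming the JSON report is written to stdout; its structure isn't documented here, so inspect it before relying on specific fields:

import json
import subprocess

# Run the baseline agent and capture its JSON report (shape is benchmark.py's to define).
out = subprocess.run(
    ["python", "benchmark.py", "--agent", "baseline", "--json"],
    capture_output=True, text=True, check=True,
).stdout
report = json.loads(out)
print(json.dumps(report, indent=2))  # inspect whatever structure benchmark.py emits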

Running Tests

python -m unittest discover tests/

53 unit tests covering scoring metrics, fixture determinism, and LLM response parsing. No external dependencies or API keys needed.

Architecture

BrainIngestionBenchmark/
├── core/
│   ├── fixtures.py        # Deterministic synthetic data + ground truth
│   ├── evaluator.py       # Abstract agent interface (3 methods)
│   └── scoring.py         # Pure metric functions (zero dependencies)
├── agents/
│   ├── openclaw.py        # OpenClaw agent via CLI subprocess
│   ├── baseline.py        # Claude API multi-turn conversation
│   └── response_parser.py # Robust LLM output parsing
├── tests/                 # 53 unit tests
├── benchmark.py           # CLI runner
└── requirements.txt       # Just anthropic SDK

Design principles:

  • Pure scoring functions with zero dependencies — fully unit-testable, following gbrain's eval pattern
  • Agent-agnostic interface — implement 3 methods (ingest, query_entities, query_relationships) to add any agent
  • Deterministic fixtures — no randomness, no file I/O, same data every run
  • Minimal dependencies — only the anthropic SDK; everything else is Python stdlib

Adding Your Own Agent

  1. Create agents/your_agent.py implementing AgentEvaluator from core/evaluator.py:
class YourAgentEvaluator(AgentEvaluator):
    def ingest(self, fixture_bundle: dict) -> None: ...
    def query_entities(self) -> list[str]: ...
    def query_relationships(self) -> list[tuple[str, str, str]]: ...
  2. Add discovery logic in benchmark.py:discover_agents()
  3. Run: python benchmark.py --agent your_agent
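
As a rough illustration of step 1, a do-nothing evaluator could look like this; the internal storage and hard-coded answers are placeholders, and the assumption that fixture_bundle values are document texts is not guaranteed by the interface:

from core.evaluator import AgentEvaluator

class EchoAgentEvaluator(AgentEvaluator):
    """Toy evaluator: stores raw fixture text and returns canned answers.
    Useful only for checking that the benchmark plumbing runs end to end."""

    def __init__(self):
        self._documents: list[str] = []

    def ingest(self, fixture_bundle: dict) -> None:
        # Assumption: fixture_bundle values are the document texts.
        self._documents.extend(str(v) for v in fixture_bundle.values())

    def query_entities(self) -> list[str]:
        # A real agent would extract entities from self._documents.
        return ["Alice Chen", "Project Aurora", "Nexus Corp"]

    def query_relationships(self) -> list[tuple[str, str, str]]:
        return [("Alice Chen", "manages", "Project Aurora")]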

Why This Matters

Every personal AI agent — whether it's a second brain, a knowledge base, or a memory layer — needs to solve the same fundamental problem: turning messy, unstructured data into something it can reason about later.

Without a benchmark, we're comparing marketing claims. With one, we can measure what actually works.

License

MIT
