GitHub - campfirein/brv-bench: Benchmark suite for evaluating retrieval quality and latency of AI agent context systems

Benchmark suite for evaluating retrieval quality, latency, and diversity of AI agent context systems. Powered for ByteRover, engineered by ByteRover.

Blog Posts

LongMemEval-S Deep Dive: Finding the Needle: ByteRover 2.1.5 Takes On LongMemEval-S
Benchmarking Breakdown: Benchmarking AI agent memory: ByteRover 2.0 Scores 92.2% and Rewrites the LoCoMo Leaderboard
Architecture Analysis: Architecture Deep Dive: ByteRover CLI 2.0 - Memory For Autonomous Agents

Overall Accuracy

Setup

source scripts/source_env.sh
python -m brv_bench --help

Supported Datasets

Dataset	Description	Corpus	Queries	Download	Context Tree
LoCoMo	Long-term conversation memory QA (10 conversations, 272 sessions)	272 docs	1982	locomo10.json	download
LongMemEval-S	Long-term interactive memory benchmark (ICLR 2025, 6 memory abilities, ~48 sessions/question)	23,867 docs	500	HuggingFace	download

Usage

1. Transform dataset

Pre-transformed datasets are provided in assets/ (locomo_sample.json, longmemeval_s.json) — you can skip this step and pass those files directly to curate/evaluate.

To transform from raw sources:

# LoCoMo → produces assets/sample_data/locomo.json (already provided)
python scripts/transform_locomo.py locomo10.json assets/sample_data/locomo.json

# LongMemEval (three variants: oracle / s_cleaned ~40 sessions / m_cleaned ~500 sessions)
# → produces assets/longmemeval_s.json (already provided)
python scripts/transform_longmemeval.py longmemeval_oracle.json assets/longmemeval_s.json

2. Curate (populate context tree)

python -m brv_bench curate --ground-truth assets/sample_data/locomo.json

3. Evaluate

export GEMINI_API_KEY="your-api-key"

python -m brv_bench evaluate \
  --ground-truth assets/sample_data/locomo.json \
  --judge \
  --judge-cache report/judge_cache_locomo_gemini.json

The justifier is automatically enabled for LoCoMo and LongMemEval (no extra flag needed). See LLM-as-Judge and Justifier below for detailed configuration options.

Results are saved to report/{yyyymmdd}_{dataset}_{memory_system}.json/.txt. Per-query results are written incrementally (crash-safe).

LLM-as-Judge

Install deps and set an API key, then pass --judge:

pip install 'brv-bench[judge]'
export GEMINI_API_KEY="your-api-key"   # or ANTHROPIC_API_KEY / OPENAI_API_KEY

python -m brv_bench evaluate \
  --ground-truth assets/sample_data/locomo.json \
  --judge --judge-cache report/judge_cache_locomo_gemini.json

Flag	Default	Description
`--judge`	off	Enable LLM-as-Judge metric
`--judge-backend`	`gemini`	`gemini`, `anthropic`, or `openai`
`--judge-model`	`gemini-2.5-flash` / `claude-sonnet-4-6` / `gpt-4o-2024-08-06`	Model name override (default varies by backend)
`--judge-concurrency`	`5`	Max parallel judge API calls
`--judge-cache`	none	Path to JSON cache file
`--context-tree-source`	none	Path to pre-curated context tree for isolated mode

Isolated Mode

Scopes the context tree to one question at a time to prevent cross-question contamination. Requires a pre-curated source directory.

python -m brv_bench evaluate \
  --ground-truth assets/longmemeval_s.json \
  --context-tree-source path/to/full-context-tree \
  --judge --judge-cache report/judge_cache_longmemeval_gemini.json

Source layout: {context-tree-source}/{question_id}/{session_id}/key_facts.md

Justifier

Automatically enabled for datasets with a justifier_template (LoCoMo and LongMemEval). Uses the same API key as the judge.

Flag	Default	Description
`--justifier-backend`	`gemini`	`gemini`, `anthropic`, or `openai`
`--justifier-model`	`gemini-2.5-flash` / `claude-sonnet-4-6` / `gpt-4o-2024-08-06`	Model name override (default varies by backend)

Ground Truth Format

{
  "name": "locomo",
  "corpus": [{ "doc_id": "session_1", "content": "...", "source": "conv-26" }],
  "entries": [{
    "query": "What career path has Caroline decided to pursue?",
    "expected_doc_ids": ["session_1", "session_4"],
    "expected_answer": "counseling or mental health for transgender people",
    "category": "multi-hop"
  }]
}

Metrics

Metric	What It Measures
LLM Judge	LLM-as-Judge binary correctness (requires `--judge`)
Precision@K	Fraction of top-K results that are relevant
Recall@K	Fraction of relevant documents found in top-K
NDCG@K	Ranking quality of top-K
MRR	Reciprocal rank of the first relevant result
Cold Latency	Query time with no cache (p50/p95/p99)

Results (LLM Judge Accuracy %)

LongMemEval-S

Category	Queries	ByteRover 2.1.5 Run 1 (Gemini 3 Flash)	ByteRover 2.1.5 Run 2 (Gemini 3.1 Pro)	Hindsight (Gemini 3 Pro)	HonCho
Knowledge Update	78	98.7% (77/78)	94.9% (74/78)	94.9%	94.9%
Single-Session User	70	98.6% (69/70)	100% (70/70)	97.1%	94.3%
Single-Session Assistant	56	98.2% (55/56)	94.6% (53/56)	96.4%	96.4%
Single-Session Preference	30	96.7% (29/30)	86.7% (26/30)	80.0%	90.0%
Temporal Reasoning	133	91.7% (122/133)	94.0% (125/133)	91.0%	88.7%
Multi-Session	133	84.2% (112/133)	85.0% (113/133)	87.2%	85%
Overall	500	92.8% (464/500)	92.2% (461/500)	91.4%	90.4%

LoCoMo

Category	ByteRover 2.1.5	ByteRover 2.0 (Run 2)	Hindsight	HonCho	Memobase v0.0.37	Zep	Mem0	OpenAI Memory
Single-Hop	97.5% (820/841)	95.4%	86.2%	93.2%	70.9%	74.1%	67.1%	63.8%
Multi-Hop	93.3% (263/282)	85.1%	70.8%	84.0%	46.9%	66.0%	51.2%	42.9%
Open Domain	85.9% (79/92)	77.2%	95.1%	77.1%	77.2%	67.7%	72.9%	62.3%
Temporal	97.8% (314/321)	94.4%	83.8%	88.2%	85.1%	79.8%	55.5%	21.7%
Overall	96.1%	92.2%	89.6%	89.9%	75.8%	75.1%	66.9%	52.9%

Reproduction

To reproduce our best runs, we highly recommend using the pre-curated context trees from Supported Datasets for consistent performance.

LoCoMo: Place the curated context tree inside .brv/context-tree/.
LongMemEval-S: Place the curated context tree outside .brv/context-tree/ and pass it via --context-tree-source (isolated mode).

# For LongMemEval-S (ByteRover 2.1.5 Run 1)
python -m brv_bench evaluate \
  --ground-truth output/longmemeval_s_benchmark.json \
  --judge \
  --judge-model "gemini-3-flash" \
  --justifier-model "gemini-3.1-pro-preview" \
  --context-tree-source LME-S-context-tree \
  --limit 32

# For LoCoMo (ByteRover 2.1.5)
python -m brv_bench evaluate \
  --ground-truth output/locomo.json \
  --judge \
  --judge-model "gemini-3-flash-preview" \
  --justifier-model "gemini-3.1-pro-preview"

Requirements

byterover-cli >= 2.1.5
Python >= 3.12
A project with brv initialized (.brv/ directory exists)

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
assets		assets
brv_bench		brv_bench
scripts		scripts
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Blog Posts

Overall Accuracy

Setup

Supported Datasets

Usage

1. Transform dataset

2. Curate (populate context tree)

3. Evaluate

LLM-as-Judge

Isolated Mode

Justifier

Ground Truth Format

Metrics

Results (LLM Judge Accuracy %)

LongMemEval-S

LoCoMo

Reproduction

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Blog Posts

Overall Accuracy

Setup

Supported Datasets

Usage

1. Transform dataset

2. Curate (populate context tree)

3. Evaluate

LLM-as-Judge

Isolated Mode

Justifier

Ground Truth Format

Metrics

Results (LLM Judge Accuracy %)

LongMemEval-S

LoCoMo

Reproduction

Requirements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages