cognitive-memory-benchmarks

Benchmark suite for cognitive-memory, a biologically inspired agent memory system. The headline results in the current paper combine the Phase 5 full-LoCoMo v0.5 tuned-default run with the May 2026 current_sdk_20260505 refresh, which covers the LongMemEval-S, LTI-Bench, oracle, retrieval, and ablation artifacts. Legacy v6/ artifacts remain on disk for provenance.

Headline Results

Benchmark	Status	Our Result	Comparison
LoCoMo (10 conv, 1540 QA)	Complete	46.2% overall F1, 51.3% multi-hop F1	Mem0 28.4% multi-hop; v0.4 paper-faithful row 44.4% overall / 48.5% multi-hop; overall F1 reaches 72% of the LoCoMo oracle evidence context condition (63.9% F1)
LongMemEval-S (500 questions)	Complete	71.6% task-averaged accuracy, 72.6% overall accuracy	ENGRAM 71.4% (concurrent) · Full-context 56.2% · TiMem 76.88% / EverMemOS 83.0% (post-dating)
LTI-Bench v2 (controlled, 42 probes)	Complete	88.1% accuracy, 69.7% F1, 100% critical-fact retention	FadeMem 82.1% critical retention
MemoryBench (2025)	Scaffolded	—	Future work

Auxiliary current-refresh measurements: LoCoMo oracle evidence context condition 63.9% F1 (61.1% under Mem0 scoring), evidence Recall@60 35.6%, decay model power-law +3.2pp over exponential, rerank +1.9pp, hybrid search +1.7pp, judge agreement 94% (Cohen's kappa 0.879).
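Cohen's kappa corrects raw judge agreement for the agreement expected by chance. A minimal sketch of the computation follows; the function name and the toy labels are illustrative assumptions, not the repository's judge-reliability code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (hypothetical): two judges grading four answers.
judge_1 = ["correct", "correct", "wrong", "wrong"]
judge_2 = ["correct", "wrong", "wrong", "wrong"]
kappa = cohens_kappa(judge_1, judge_2)
print(f"kappa = {kappa:.3f}")
```

Read this way, a kappa of 0.879 alongside 94% raw agreement suggests that most of the agreement between judges is not attributable to chance.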

v0.5 Empirical Default Tuning (Phase 0g→5, 2026-05)

A systematic tuning campaign on LTI-Bench (~$30 / 12h) followed by validation on full LoCoMo (~$100 / 3.4h) shipped three new SDK defaults in cognitive-memory v0.5.0:

associative_boost: 0.03 → 0.05
base_decay_rates.semantic: 120 → 240 (days)
core_session_threshold: 3 → 2
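For quick reference, the three changes can be written as a plain mapping and diffed. This is only an illustrative sketch: the nested dictionary layout and the `changed_defaults` helper are assumptions and do not reflect how the SDK actually stores or loads its configuration.

```python
# Values are taken from the list above; the nested key layout is assumed.
V04_DEFAULTS = {
    "associative_boost": 0.03,
    "base_decay_rates": {"semantic": 120},  # days
    "core_session_threshold": 3,
}
V05_DEFAULTS = {
    "associative_boost": 0.05,
    "base_decay_rates": {"semantic": 240},  # days
    "core_session_threshold": 2,
}


def changed_defaults(old, new, prefix=""):
    """Return {dotted_key: (old_value, new_value)} for every changed leaf."""
    changes = {}
    for key, new_value in new.items():
        path = f"{prefix}{key}"
        old_value = old.get(key)
        if isinstance(new_value, dict) and isinstance(old_value, dict):
            # Recurse into nested sections such as base_decay_rates.
            changes.update(changed_defaults(old_value, new_value, path + "."))
        elif old_value != new_value:
            changes[path] = (old_value, new_value)
    return changes


for name, (before, after) in changed_defaults(V04_DEFAULTS, V05_DEFAULTS).items():
    print(f"{name}: {before} -> {after}")
```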

Bench (production flags)	v0.4 (paper defaults)	v0.5 (tuned)	Δ
LoCoMo full (1540 QA) F1	0.4437	0.4624	+1.87pp
LoCoMo full (1540 QA) LLM accuracy	0.5857	0.6130	+2.73pp
LoCoMo conv0 F1	0.4310	0.4601	+2.92pp
LongMemEval-S 500 QA accuracy	attempted; OpenAI billing-cap blocked at 30%	inconclusive	n/a
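The Δ column is a difference in percentage points between fractional scores. A small sketch of that arithmetic, using the four-decimal values shown in the table, is below; the helper name `delta_pp` is an assumption for illustration. From these rounded inputs the conv0 row works out to +2.91pp, so the reported +2.92pp presumably reflects unrounded scores.

```python
def delta_pp(before, after):
    """Difference between two fractional scores, in percentage points."""
    return round((after - before) * 100, 2)


# (v0.4, v0.5) scores copied from the table above.
rows = {
    "LoCoMo full F1": (0.4437, 0.4624),
    "LoCoMo full LLM accuracy": (0.5857, 0.6130),
    "LoCoMo conv0 F1": (0.4310, 0.4601),
}
deltas = {name: delta_pp(v04, v05) for name, (v04, v05) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}pp")
```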

Methodology, per-phase milestones, full provenance: experimentlog_v2.md, tuning/runs/runs.jsonl, and docs/milestones/phase-{0-harness-extension,1-sensitivity-analysis,2-optuna-tuning,4-locomo-reality-check,5-full-locomo,7-longmemeval-validation}.md. Single-author campaign; ~$245 spend / ~28h compute total. Phase 7 (LongMemEval-S validation) hit an account billing cap twice at 30% completion; partial data is inconclusive but consistent across both attempts. Phase 5 (full LoCoMo) is the load-bearing v0.5 validation.

Full per-run details, parameters, and per-category breakdowns: experimentlog_v2.md and experimentlog.md. Operator notes: docs/. Paper: paper/cognitive-memory-arxiv-paper-v2.pdf.

Setup

# Recreate venv (uv recommended)
uv venv --python 3.10 .venv  # any Python >=3.10 works
uv pip install -e . -e ../cognitive-memory-sdk/sdks/python

# Set API key
export OPENAI_API_KEY=your-key

Running benchmarks

The current_sdk_20260505 runs use the v6 retrieval pipeline with deep recall, LLM rerank (rerank-factor 3), and top-k 60 for LoCoMo (top-k 20 for LongMemEval). Hybrid search is off in the headline run and measured separately in ablations.
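One common way to combine a rerank factor with top-k is to over-fetch `top_k * rerank_factor` candidates from a first-stage retriever and keep the `top_k` best after reranking. The sketch below illustrates that pattern only. Whether `--rerank-factor` behaves this way in this codebase is an assumption, and `retrieve_then_rerank`, the two toy scorers, and the toy corpus are all hypothetical.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 60,
    rerank_factor: int = 3,
) -> List[str]:
    """Over-fetch top_k * rerank_factor candidates, rerank, keep top_k."""
    pool_size = top_k * rerank_factor
    # Stage 1: a cheap scorer selects a wide candidate pool.
    pool = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:pool_size]
    # Stage 2: a costlier scorer (an LLM in a real pipeline) reorders the pool.
    reranked = sorted(pool, key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def shorter_is_better(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: prefers shorter documents."""
    return -float(len(doc))


docs = [
    "alice moved to paris in may",
    "bob likes tea",
    "alice moved",
    "paris weather report for may and june",
]
top = retrieve_then_rerank(
    "where did alice move", docs, word_overlap, shorter_is_better,
    top_k=1, rerank_factor=2,
)
print(top)
```

Under this reading, top-k 60 with rerank-factor 3 would mean the reranker sees up to 180 candidates per query, which is where most of the extra LLM cost would come from.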

LoCoMo (Run CR-A reproduction)

.venv/bin/python -m locomo.locomo_eval \
  --data locomo/data/locomo10.json \
  --adapter cognitive_memory \
  --prompt-mode mem0 \
  --dual-perspective \
  --deep-recall \
  --rerank --rerank-factor 3 \
  --top-k 60 \
  --use-judge \
  --output locomo/results/current_sdk_20260505/primary.json

For parallel execution per conversation, see locomo/README.md.

LongMemEval-S (Run CR-B reproduction)

.venv/bin/python longmemeval/run_longmemeval.py \
  --data longmemeval/data/longmemeval_s_cleaned.json \
  --adapter cognitive_memory \
  --model gpt-4o-mini \
  --top-k 20 --deep-recall --rerank --rerank-factor 3 \
  --max-workers 53 \
  --output longmemeval/results/current_sdk_20260505/primary.json

LTI-Bench v2 (Run CR-C reproduction)

.venv/bin/python -m lti.lti_bench \
  --adapter cognitive_memory \
  --model gpt-4o-mini \
  --judge-model gpt-4o-2024-08-06 \
  --output lti/results/current_sdk_20260505/run_l_v2.json

Run Philosophy

For each benchmark, we run up to three configurations:

Apples-to-apples: Match competitor's exact model, k, embeddings, prompt
Benchmark pure: Follow official evaluation protocol exactly
Best tuned: Our optimal config (Mem0 prompt, k=60, deep recall, hybrid search, LLM rerank)

The LoCoMo headline above uses tuning/runs/phase5/v05_tuned/aggregate.json; LongMemEval-S, LTI-Bench, and auxiliary analyses use the current_sdk_20260505 configurations shown in each benchmark README. See experimentlog_v2.md, experimentlog.md, and docs/current-refresh-20260505.md for exact parameters, artifact paths, and caveats.

Directory Structure

shared/           # Adapter interface, metrics (token_f1, llm_judge)
locomo/           # LoCoMo benchmark (Runs A, CR-A, ablations, oracle)
longmemeval/      # LongMemEval-S (Run B, CR-B)
lti/              # LTI-Bench (Run L, CR-C) — controlled architectural test
memorybench/      # MemoryBench 2025 (scaffolded)
analysis/         # Post-processing scripts, ablation runner
simulations/      # Monte Carlo, boosting, cold storage sims
paper/            # arXiv paper.tex, references.bib, build artifacts
docs/             # Operator notes (architecture walkthrough, lessons, next steps)
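The `token_f1` metric listed under shared/ is presumably a token-overlap F1 between a predicted answer and a reference answer. A minimal sketch of that family of metric is below, assuming lowercase whitespace tokenization. The repository's implementation may normalize text differently (for example by stripping punctuation or articles), so the function name and behavior here are illustrative only.

```python
from collections import Counter


def token_f1_sketch(prediction: str, reference: str) -> float:
    """Token-overlap F1 with lowercase whitespace tokenization (assumed)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match exactly; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1_sketch("moved to Paris", "she moved to Paris in May")
print(f"{score:.3f}")
```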

SDK version / provenance

Runs	SDK	Provenance
Phase 5 LoCoMo v0.4/v0.5 head-to-head	Python SDK v0.5.0 tuned defaults, SDK commit `707758d`; benchmark commit `82f08c2`	`tuning/runs/phase5/v05_tuned/aggregate.json`, `tuning/runs/phase5/summary.json`, `experimentlog_v2.md`, and `docs/milestones/phase-5-full-locomo.md`
`current_sdk_20260505` (LoCoMo baseline, LongMemEval-S, LTI-Bench v2, oracle, ablations, decay, recall, judge reliability)	package version `0.3.0` from editable `../cognitive-memory-sdk/sdks/python`	See `experimentlog.md` and `docs/current-refresh-20260505.md` for exact commands, timestamps, output paths, and worktree state
LongMemEval-S 500-question headline	current-refresh completed artifact	`longmemeval/results/current_sdk_20260505/primary.json`
Historical March runs (`v6/` namespace)	v0.2.0 / v0.3.0 snapshots	Retained for provenance only

See docs/benchmarks-overview.md for full methodology and docs/lessons-and-gotchas.md for what we learned the hard way.

License

MIT
