Benchmark suite for cognitive-memory, a biologically inspired agent memory system. The active paper headline combines the Phase 5 full-LoCoMo v0.5 tuned-default run with the May 2026 current_sdk_20260505 refresh for the LongMemEval-S, LTI-Bench, oracle, retrieval, and ablation artifacts. Legacy v6/ artifacts remain on disk for provenance.
| Benchmark | Status | Our Result | Comparison |
|---|---|---|---|
| LoCoMo (10 conv, 1540 QA) | Complete | 46.2% overall F1, 51.3% multi-hop F1 | Mem0 28.4% multi-hop; v0.4 paper-faithful row 44.4% overall / 48.5% multi-hop; 72% of LoCoMo oracle evidence context condition (63.9% F1) |
| LongMemEval-S (500 questions) | Complete | 71.6% task-averaged accuracy, 72.6% overall accuracy | ENGRAM 71.4% (concurrent) · Full-context 56.2% · TiMem 76.88% / EverMemOS 83.0% (post-dating) |
| LTI-Bench v2 (controlled, 42 probes) | Complete | 88.1% accuracy, 69.7% F1, 100% critical-fact retention | FadeMem 82.1% critical retention |
| MemoryBench (2025) | Scaffolded | — | Future work |
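The LongMemEval-S row reports both a task-averaged and an overall accuracy. The sketch below shows how the two summaries can diverge when tasks have different sizes. It assumes each result is a `(task, is_correct)` pair; the function name and the toy task labels are hypothetical and are not taken from this repository.

```python
from collections import defaultdict


def accuracy_summaries(records):
    """Return (overall, task_averaged) accuracy for (task, is_correct) pairs."""
    per_task = defaultdict(list)
    for task, is_correct in records:
        per_task[task].append(bool(is_correct))
    total = sum(len(flags) for flags in per_task.values())
    if total == 0:
        raise ValueError("no records to summarize")
    # Overall accuracy: every question carries equal weight.
    overall = sum(sum(flags) for flags in per_task.values()) / total
    # Task-averaged accuracy: every task carries equal weight.
    task_averaged = sum(
        sum(flags) / len(flags) for flags in per_task.values()
    ) / len(per_task)
    return overall, task_averaged


# Toy data (hypothetical): task_a has 10 questions, task_b has 2.
records = (
    [("task_a", True)] * 9
    + [("task_a", False)]
    + [("task_b", True), ("task_b", False)]
)
overall, task_averaged = accuracy_summaries(records)
print(f"overall={overall:.3f} task_averaged={task_averaged:.3f}")
```

Small tasks move the task-averaged number more than the overall number, which is why the two figures in the table differ.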
Auxiliary current-refresh measurements: LoCoMo oracle evidence context condition 63.9% F1 (61.1% under Mem0 scoring), evidence Recall@60 35.6%, decay model power-law +3.2pp over exponential, rerank +1.9pp, hybrid search +1.7pp, judge agreement 94% (Cohen's kappa 0.879).
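The judge-agreement figure pairs raw agreement with Cohen's kappa, which corrects for agreement expected by chance. A minimal sketch of the standard two-rater formula follows; the function name and the toy labels are illustrative assumptions, and the repository's own reliability script may differ.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): 1 = judged correct, 0 = judged incorrect.
rater_a = [1, 1, 1, 0, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa={kappa:.3f}")
```

Under this formula, 94% raw agreement with kappa 0.879 corresponds to a chance-agreement rate of roughly 0.50, that is, close to balanced correct/incorrect labels.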
A systematic tuning campaign on LTI-Bench ($30 / 12h) followed by validation on full LoCoMo ($100 / 3.4h) shipped three new SDK defaults in cognitive-memory v0.5.0:
- associative_boost: 0.03 → 0.05
- base_decay_rates.semantic: 120 → 240 (days)
- core_session_threshold: 3 → 2
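To keep the v0.4 and v0.5 defaults side by side when reproducing either row, the three changes can be captured as plain dictionaries and diffed. The values come from the list above; the nested dict layout and the helper names are assumptions for illustration, not the SDK's actual configuration API.

```python
# Values copied from the README; the dict layout is a hypothetical stand-in
# for however the SDK actually stores its defaults.
V04_DEFAULTS = {
    "associative_boost": 0.03,
    "base_decay_rates": {"semantic": 120},  # days
    "core_session_threshold": 3,
}
V05_DEFAULTS = {
    "associative_boost": 0.05,
    "base_decay_rates": {"semantic": 240},  # days
    "core_session_threshold": 2,
}


def flatten(config, prefix=""):
    """Flatten nested dicts into dotted keys such as 'base_decay_rates.semantic'."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def changed_defaults(old, new):
    """Return {dotted_key: (old_value, new_value)} for every changed setting."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: (old_flat.get(key), new_flat[key])
        for key in new_flat
        if old_flat.get(key) != new_flat[key]
    }


changes = changed_defaults(V04_DEFAULTS, V05_DEFAULTS)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```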
| Bench (production flags) | v0.4 (paper defaults) | v0.5 (tuned) | Δ |
|---|---|---|---|
| LoCoMo full (1540 QA) F1 | 0.4437 | 0.4624 | +1.87pp |
| LoCoMo full (1540 QA) LLM accuracy | 0.5857 | 0.6130 | +2.73pp |
| LoCoMo conv0 F1 | 0.4310 | 0.4601 | +2.92pp |
| LongMemEval-S 500 QA accuracy | attempted; OpenAI billing-cap blocked at 30% | inconclusive | n/a |
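The Δ column is a difference in percentage points between two scores on a 0 to 1 scale. A short sketch of that arithmetic, using the four-decimal values from the table, is below; the helper name is hypothetical.

```python
def delta_pp(before, after):
    """Difference between two 0-1 scores, in percentage points (2 decimals)."""
    return round((after - before) * 100, 2)


# (v0.4, v0.5) pairs copied from the table above.
rows = {
    "LoCoMo full F1": (0.4437, 0.4624),
    "LoCoMo full LLM accuracy": (0.5857, 0.6130),
    "LoCoMo conv0 F1": (0.4310, 0.4601),
}
deltas = {name: delta_pp(v04, v05) for name, (v04, v05) in rows.items()}
for name, value in deltas.items():
    print(f"{name}: {value:+.2f}pp")
```

Recomputing from the rounded table values gives +2.91pp for the conv0 row rather than the listed +2.92pp; a 0.01pp gap of this kind is what one would expect if the table's deltas were computed from unrounded scores, though that is an assumption.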
Methodology, per-phase milestones, full provenance: experimentlog_v2.md, tuning/runs/runs.jsonl, and docs/milestones/phase-{0-harness-extension,1-sensitivity-analysis,2-optuna-tuning,4-locomo-reality-check,5-full-locomo,7-longmemeval-validation}.md. Single-author campaign; ~$245 spend / ~28h compute total. Phase 7 (LongMemEval-S validation) hit an account billing cap twice at 30% completion; partial data is inconclusive but consistent across both attempts. Phase 5 (full LoCoMo) is the load-bearing v0.5 validation.
Full per-run details, parameters, and per-category breakdowns: experimentlog_v2.md and experimentlog.md. Operator notes: docs/. Paper: paper/cognitive-memory-arxiv-paper-v2.pdf.
# Recreate venv (uv recommended)
uv venv --python 3.10 .venv # any Python >=3.10 works
uv pip install -e . -e ../cognitive-memory-sdk/sdks/python
# Set API key
export OPENAI_API_KEY=your-key

The current_sdk_20260505 runs use the v6 retrieval pipeline with deep recall, LLM rerank (rerank-factor 3), and top-k 60 for LoCoMo (top-k 20 for LongMemEval). Hybrid search is off in the headline run and measured separately in ablations.
.venv/bin/python -m locomo.locomo_eval \
--data locomo/data/locomo10.json \
--adapter cognitive_memory \
--prompt-mode mem0 \
--dual-perspective \
--deep-recall \
--rerank --rerank-factor 3 \
--top-k 60 \
--use-judge \
--output locomo/results/current_sdk_20260505/primary.json

For parallel execution per conversation, see locomo/README.md.
.venv/bin/python longmemeval/run_longmemeval.py \
--data longmemeval/data/longmemeval_s_cleaned.json \
--adapter cognitive_memory \
--model gpt-4o-mini \
--top-k 20 --deep-recall --rerank --rerank-factor 3 \
--max-workers 53 \
--output longmemeval/results/current_sdk_20260505/primary.json

.venv/bin/python -m lti.lti_bench \
--adapter cognitive_memory \
--model gpt-4o-mini \
--judge-model gpt-4o-2024-08-06 \
--output lti/results/current_sdk_20260505/run_l_v2.json

For each benchmark, we run up to three configurations:
- Apples-to-apples: Match the competitor's exact model, k, embeddings, and prompt
- Benchmark pure: Follow official evaluation protocol exactly
- Best tuned: Our optimal config (Mem0 prompt, k=60, deep recall, hybrid search, LLM rerank)
The LoCoMo headline above uses tuning/runs/phase5/v05_tuned/aggregate.json; LongMemEval-S, LTI-Bench, and auxiliary analyses use the current_sdk_20260505 configurations shown in each benchmark README. See experimentlog_v2.md, experimentlog.md, and docs/current-refresh-20260505.md for exact parameters, artifact paths, and caveats.
shared/ # Adapter interface, metrics (token_f1, llm_judge)
locomo/ # LoCoMo benchmark (Runs A, CR-A, ablations, oracle)
longmemeval/ # LongMemEval-S (Run B, CR-B)
lti/ # LTI-Bench (Run L, CR-C) — controlled architectural test
memorybench/ # MemoryBench 2025 (scaffolded)
analysis/ # Post-processing scripts, ablation runner
simulations/ # Monte Carlo, boosting, cold storage sims
paper/ # arXiv paper.tex, references.bib, build artifacts
docs/ # Operator notes (architecture walkthrough, lessons, next steps)
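The tree lists a `token_f1` metric under shared/. For readers unfamiliar with token-level F1, the sketch below shows the common SQuAD-style formulation. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption; the implementation in shared/ may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("The cat sat", "a cat sat down")
print(f"{score:.2f}")
```

In this toy pair the prediction keeps two tokens after normalization, both of which appear in the three-token reference, so precision is 1.0 and recall is 2/3.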
| Runs | SDK | Provenance |
|---|---|---|
| Phase 5 LoCoMo v0.4/v0.5 head-to-head | Python SDK v0.5.0 tuned defaults, SDK commit 707758d; benchmark commit 82f08c2 | tuning/runs/phase5/v05_tuned/aggregate.json, tuning/runs/phase5/summary.json, experimentlog_v2.md, and docs/milestones/phase-5-full-locomo.md |
| current_sdk_20260505 (LoCoMo baseline, LongMemEval-S, LTI-Bench v2, oracle, ablations, decay, recall, judge reliability) | package version 0.3.0 from editable ../cognitive-memory-sdk/sdks/python | See experimentlog.md and docs/current-refresh-20260505.md for exact commands, timestamps, output paths, and worktree state |
| LongMemEval-S 500-question headline | current-refresh completed artifact | longmemeval/results/current_sdk_20260505/primary.json |
| Historical March runs (v6/ namespace) | v0.2.0 / v0.3.0 snapshots | Retained for provenance only |
See docs/benchmarks-overview.md for full methodology and docs/lessons-and-gotchas.md for what we learned the hard way.
MIT