AMA-Bench — "Evaluating Long-Horizon Memory for Agentic Applications"
arXiv:2602.22769 | Yujie Zhao et al. | Feb 2026
Why it belongs here:
AMA-Bench addresses a gap in existing memory benchmarks: prior work focuses on dialogue-centric human-agent interactions, while deployed agents deal with continuous streams of agent-environment interactions made up largely of machine-generated representations. AMA-Bench is one of the first benchmarks to evaluate memory in this more realistic setting.
Key contributions:
- Two benchmark components: real-world agentic trajectories + expert QA; synthetic trajectories scaling to arbitrary horizons + rule-based QA
- Finds that existing memory systems underperform primarily due to lossy similarity-based retrieval and lack of causal/objective information
- Proposes AMA-Agent (causality graph + tool-augmented retrieval), which achieves 57.22% average accuracy on AMA-Bench, 11.16% above the strongest memory-system baselines
Suggested section: Evaluation & Benchmarks, or a dedicated Long-Horizon Memory subsection if one exists.
The synthetic trajectory generation component also has practical tooling implications — it's similar in spirit to test harnesses like simulate_boundary.py in compression-monitor.
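
To make the "synthetic trajectories + rule-based QA" idea concrete, here is a minimal sketch of how such a harness could look. It is not the AMA-Bench generator: the toy environment (items moved between rooms) and the names `Step`, `generate_trajectory`, and `rule_based_qa` are all hypothetical, chosen only to show that the horizon can be scaled freely while the answers are derived by rule from the known ground-truth state.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    index: int
    action: str
    observation: str


def generate_trajectory(horizon: int, seed: int = 0) -> list[Step]:
    """Simulate a toy agent moving items between rooms for `horizon` steps.

    Hypothetical environment, used only for illustration.
    """
    rng = random.Random(seed)  # seeded so the trajectory is reproducible
    items = ["key", "lamp", "book"]
    rooms = ["hall", "lab", "attic"]
    steps = []
    for i in range(horizon):
        item, room = rng.choice(items), rng.choice(rooms)
        steps.append(Step(i, f"move {item} to {room}", f"{item} is now in {room}"))
    return steps


def rule_based_qa(steps: list[Step]) -> list[tuple[str, str]]:
    """Derive (question, answer) pairs from the final state, with no human labels."""
    last_room: dict[str, str] = {}
    for step in steps:
        # Actions have the fixed form "move <item> to <room>".
        _, item, _, room = step.action.split()
        last_room[item] = room  # later moves overwrite earlier ones
    return [
        (f"Where is the {item} at the end?", room)
        for item, room in sorted(last_room.items())
    ]


if __name__ == "__main__":
    trajectory = generate_trajectory(horizon=200, seed=42)
    for question, answer in rule_based_qa(trajectory):
        print(question, "->", answer)
```

Because the answers come from replaying the trajectory, the same generator can produce arbitrarily long horizons without any extra annotation cost, which is the property that makes this style of harness useful for stress-testing memory systems.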