Add AMA-Bench (arXiv:2602.22769): long-horizon memory evaluation for agentic trajectories

**AMA-Bench** — "Evaluating Long-Horizon Memory for Agentic Applications"  
arXiv:2602.22769 | Yujie Zhao et al. | Feb 2026

**Why it belongs here:**

AMA-Bench addresses a gap in existing memory benchmarks: prior work focuses on dialogue-centric human-agent interactions, whereas deployed agents process continuous streams of agent-environment interactions, largely in machine-generated representations. AMA-Bench is one of the first benchmarks to evaluate memory in this more realistic setting.

**Key contributions:**
- Two benchmark components: real-world agentic trajectories + expert QA; synthetic trajectories scaling to arbitrary horizons + rule-based QA
- Finds that existing memory systems underperform primarily due to lossy similarity-based retrieval and lack of causal/objective information
- Proposes AMA-Agent (causality graph + tool-augmented retrieval) achieving 57.22%, +11.16% over baselines

**Suggested section:** Evaluation & Benchmarks, or a dedicated Long-Horizon Memory subsection if one exists.

The synthetic trajectory generation component also has practical tooling implications: it is similar in spirit to test harnesses like `simulate_boundary.py` in compression-monitor.
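
To make the "synthetic trajectories + rule-based QA" idea concrete, here is a minimal sketch of how such a harness could work. It is not AMA-Bench's generator and not the code in `simulate_boundary.py`; the toy key-value environment and every name in it (`Step`, `generate_trajectory`, `state_after`, `make_question`) are hypothetical, chosen only for illustration. The point is that when the environment is deterministic, the ground-truth answer can be computed by replaying the trajectory, so no human annotation is needed at any horizon.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One agent-environment interaction in a toy key-value environment (hypothetical)."""
    index: int
    action: str            # "set" or "delete"
    key: str
    value: Optional[int]   # None for "delete"


def generate_trajectory(horizon: int, seed: int = 0,
                        keys: Sequence[str] = ("a", "b", "c")) -> List[Step]:
    """Generate a reproducible trajectory of the requested length."""
    rng = random.Random(seed)
    steps = []
    for i in range(horizon):
        key = rng.choice(keys)
        if rng.random() < 0.8:
            steps.append(Step(i, "set", key, rng.randint(0, 99)))
        else:
            steps.append(Step(i, "delete", key, None))
    return steps


def state_after(steps: Sequence[Step], t: int) -> Dict[str, int]:
    """Replay steps 0..t (inclusive) and return the resulting environment state."""
    state: Dict[str, int] = {}
    for step in steps[: t + 1]:
        if step.action == "set":
            state[step.key] = step.value
        else:
            state.pop(step.key, None)
    return state


def make_question(steps: Sequence[Step], t: int, key: str) -> Tuple[str, Optional[int]]:
    """Build a question whose answer is derived by rule, i.e. by replaying the trajectory."""
    question = f"What is the value of key '{key}' after step {t}?"
    return question, state_after(steps, t).get(key)
```

A memory system under test would be fed the steps one at a time and then asked the question; its answer is scored by exact match against the replayed value. Increasing `horizon` lengthens the trajectory without changing how answers are checked.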

