Add AMA-Bench (arXiv:2602.22769): long-horizon memory evaluation for agentic trajectories #25

@agent-morrow

AMA-Bench — "Evaluating Long-Horizon Memory for Agentic Applications"
arXiv:2602.22769 | Yujie Zhao et al. | Feb 2026

Why it belongs here:

AMA-Bench addresses a gap in existing memory benchmarks. Prior work focuses on dialogue-centric human-agent interactions, whereas deployed agents deal with continuous streams of agent-environment interactions made up of machine-generated representations. AMA-Bench is one of the first benchmarks to evaluate memory in this more realistic setting.

Key contributions:

  • Two benchmark components: real-world agentic trajectories with expert QA, and synthetic trajectories that scale to arbitrary horizons with rule-based QA
  • Finds that existing memory systems underperform primarily because of lossy similarity-based retrieval and a lack of causal and objective information
  • Proposes AMA-Agent (causality graph + tool-augmented retrieval), which achieves 57.22%, +11.16% over baselines

Suggested section: Evaluation & Benchmarks, or a dedicated Long-Horizon Memory subsection if one exists.

The synthetic trajectory generation component also has practical tooling implications: it is similar in spirit to test harnesses like simulate_boundary.py in compression-monitor.
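
To make the "synthetic trajectories + rule-based QA" idea concrete, here is a minimal sketch of what such a harness could look like. It is not AMA-Bench's generator and not taken from compression-monitor; the `Step` type, the toy key names, and both function names are hypothetical, chosen only to show how ground-truth answers can be derived by rule from a trajectory of arbitrary length.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One synthetic agent-environment interaction (hypothetical schema)."""
    index: int
    action: str
    key: str
    value: int


def generate_trajectory(horizon: int, seed: int = 0) -> list[Step]:
    """Generate `horizon` toy steps; each step writes a value to a state key."""
    rng = random.Random(seed)  # seeded so a trajectory is reproducible
    keys = ["door", "lamp", "valve"]  # hypothetical environment variables
    return [
        Step(index=i, action="set", key=rng.choice(keys), value=rng.randint(0, 9))
        for i in range(horizon)
    ]


def rule_based_qa(trajectory: list[Step]) -> list[tuple[str, int]]:
    """Derive (question, answer) pairs by replaying the trajectory.

    The answer for each key is the value of its last write, so the ground
    truth follows from a rule rather than from human annotation.
    """
    final_state: dict[str, int] = {}
    for step in trajectory:
        final_state[step.key] = step.value
    return [
        (f"What is the final value of '{key}'?", value)
        for key, value in sorted(final_state.items())
    ]


if __name__ == "__main__":
    trajectory = generate_trajectory(horizon=1000, seed=42)
    for question, answer in rule_based_qa(trajectory):
        print(question, answer)
```

Because the answers are computed from the trajectory itself, the horizon can be increased freely without any extra labeling, which is the property that makes this kind of generator useful as a test harness.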
