Suggestion: Add reproducible memory evaluation benchmarks #106

@yubingz

Hi team, congrats on the open-source release! The layered L0→L3 architecture is really interesting.

I noticed your benchmarks show impressive improvements (48%→76% on PersonaMem accuracy). For the community to reproduce and build on these results, would you consider adding a standardized evaluation module?

I've been working on MemTest, a system for designing benchmark databases for AI memory evaluation. Unlike typical eval frameworks, which provide evaluation metrics, it provides test databases that stress-test different aspects of memory retrieval:

  • Storage integrity: Can the system store and retrieve all variants of a memory?
  • Retrieval precision: 5 query types (person/location/event/time/composite) with temporal reasoning
  • Memory clustering: Are related memories grouped together?
  • Forgetting directionality: High-frequency vs low-frequency recall
  • Reasoning: Multi-hop chain queries across memories
  • Deep retrieval: Recall decay over near/mid/far semantic distance
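
As a rough illustration of how a test database like this could be scored against a memory system, here is a minimal sketch that computes recall@k per query type. The `Query` record, the `retrieve` callable, and the field names are hypothetical placeholders for this example; they are not MemTest's actual schema or this project's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical test-database entry (not MemTest's real schema)."""
    text: str
    query_type: str          # e.g. "person", "location", "event", "time", "composite"
    relevant_ids: frozenset  # ids of the memories that should be retrieved


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant memories found in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def recall_by_query_type(queries, retrieve, k=5):
    """Average recall@k per query type.

    `retrieve` is an assumed adapter: it takes the query text and returns
    a ranked list of memory ids from the system under test.
    """
    scores = defaultdict(list)
    for q in queries:
        scores[q.query_type].append(recall_at_k(retrieve(q.text), q.relevant_ids, k))
    return {qtype: sum(vals) / len(vals) for qtype, vals in scores.items()}
```

The same loop could be grouped by any other label attached to a query (for example a near/mid/far distance bucket) to look at recall decay instead of query type.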

It also includes a corpus-driven builder that generates test databases from any text corpus. We used it with the Four Great Classical Novels of Chinese literature to generate 21,793 memories + 750 queries.

For a layered architecture like yours, this could be especially useful for evaluating whether each query is served from the correct layer (L0/L1/L2/L3) under different query patterns.
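
One possible way to check this, sketched under assumptions: label each test query with the layer expected to answer it, then measure how often the system agrees, per layer. The `answer_layer` callable and the case format below are hypothetical; I don't know whether the project exposes which layer served a result, so this is only a shape for the idea.

```python
from collections import Counter

LAYERS = ("L0", "L1", "L2", "L3")


def layer_routing_accuracy(cases, answer_layer):
    """Per-layer fraction of queries served from the expected layer.

    cases: iterable of (query_text, expected_layer) pairs (hypothetical format).
    answer_layer: assumed adapter returning the layer name that served the query.
    """
    total = Counter()
    correct = Counter()
    for text, expected in cases:
        total[expected] += 1
        if answer_layer(text) == expected:
            correct[expected] += 1
    # Only report layers that actually appear in the test cases.
    return {layer: correct[layer] / total[layer] for layer in LAYERS if total[layer]}
```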

Would this kind of standardized benchmarking be useful for your project? Happy to help draft a starter integration.

Thanks!
