Hi team, congrats on the open-source release! The layered L0→L3 architecture is really interesting.
I noticed your benchmarks show impressive improvements (48%→76% on PersonaMem accuracy). For the community to reproduce and build on these results, would you consider adding a standardized evaluation module?
I've been working on MemTest, a system for designing benchmark databases for AI memory evaluation. Unlike typical eval frameworks, which provide evaluation metrics, it provides test databases that stress-test different aspects of memory retrieval:
- Storage integrity: Can the system store and retrieve all variants of a memory?
- Retrieval precision: 5 query types (person/location/event/time/composite) with temporal reasoning
- Memory clustering: Are related memories grouped together?
- Forgetting directionality: High-frequency vs low-frequency recall
- Reasoning: Multi-hop chain queries across memories
- Deep retrieval: Recall decay over near/mid/far semantic distance
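To make the last item concrete, here is a rough sketch of how recall per distance bucket could be measured. This is not MemTest's actual API: `recall_by_distance`, the `(query, expected_id, bucket)` case format, and the `search(query, k)` callable are all names I made up for illustration, and I'm assuming your retrieval call can be wrapped to return a list of memory IDs.

```python
from collections import defaultdict


def recall_by_distance(cases, search, k=5):
    """Compute recall@k separately for each semantic-distance bucket.

    cases:  iterable of (query, expected_id, bucket) tuples, where bucket is
            a label such as "near", "mid", or "far" (hypothetical format).
    search: hypothetical callable, search(query, k) -> list of memory IDs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for query, expected_id, bucket in cases:
        totals[bucket] += 1
        # A case counts as a hit when the expected memory is in the top-k results.
        if expected_id in search(query, k):
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

Comparing the near, mid, and far values from a run like this would show how quickly recall falls off as queries move further from the stored wording.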
It also includes a corpus-driven builder that generates test databases from any text corpus. We used it with the Four Great Classical Novels of Chinese literature to generate 21,793 memories + 750 queries.
For a layered architecture like yours, this could be especially useful for evaluating whether each query is served from the correct layer (L0, L1, L2, or L3) under different query patterns.
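As a starting point, a layer-routing check might look something like the sketch below. Everything here is an assumption on my side: `LayerQuery`, `layer_routing_accuracy`, and the `retrieve` callable (query text in, layer label out) are hypothetical, and I don't know whether your system exposes which layer answered a query.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LayerQuery:
    text: str
    query_type: str      # e.g. "person", "time", "composite"
    expected_layer: str  # hypothetical label, "L0" through "L3"


def layer_routing_accuracy(queries, retrieve):
    """Fraction of queries answered from the expected layer, per query type.

    retrieve: hypothetical callable mapping query text to the label of the
              layer that served the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in queries:
        totals[q.query_type] += 1
        if retrieve(q.text) == q.expected_layer:
            hits[q.query_type] += 1
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
```

Breaking the score down by query type should make it easier to see whether, say, temporal queries tend to land in the wrong layer while person queries route correctly.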
Would this kind of standardized benchmarking be useful for your project? Happy to help draft a starter integration.
Thanks!