Run full-scale LongMemEval and LoCoMo benchmarks #13

@lightcone0

Context

db0's current benchmark scores are based on limited samples:

  • LongMemEval: 80.0% on 50 questions (of 500 total)
  • LoCoMo: 76.9% on 199 queries from 1 sample (of 10 samples, ~1986 queries total)

These results are promising, but they are not directly comparable to published scores from other systems, which are evaluated on the full datasets.
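To give a sense of how much uncertainty the small samples carry, here is a minimal sketch that computes a 95% Wilson score interval for each reported score. The correct-answer counts (40 of 50, 153 of 199) are back-computed from the reported percentages and are an assumption, as is treating each question as an independent trial. Under those assumptions the LongMemEval score of 80.0% spans roughly 67% to 89%.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Assumption: correct counts are back-computed from the reported percentages,
# 80.0% of 50 -> 40 and 76.9% of 199 -> 153.
SAMPLES = [("LongMemEval", 40, 50), ("LoCoMo", 153, 199)]

for name, correct, total in SAMPLES:
    lo, hi = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total} = {correct / total:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

The interval narrows roughly with the square root of the sample size, which is one reason the full 500-question and ~1986-query runs would give more comparable numbers.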

What's needed

  1. Full LongMemEval run — all 500 questions, multiple profiles (conversational, high-recall, knowledge-base), with Gemini and OpenAI embeddings
  2. Full LoCoMo run — all 10 samples (~1986 queries), multiple ingestion modes
  3. Category breakdown — especially temporal reasoning, knowledge-update, and multi-session categories where architecture differences matter most
  4. Profile comparison — demonstrate that the right profile matters: run the same benchmark with conversational vs. high-recall vs. knowledge-base profiles
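The LongMemEval part of this list implies a small run matrix. The sketch below enumerates it, assuming every profile is run with every embedding provider; the profile and provider names come from the list above, while the dictionary layout and `run_id` format are hypothetical and not part of db0's benchmark package.

```python
from itertools import product

# Names taken from the issue text; everything else here is illustrative.
PROFILES = ["conversational", "high-recall", "knowledge-base"]
EMBEDDINGS = ["gemini", "openai"]
LONGMEMEVAL_QUESTIONS = 500


def longmemeval_runs() -> list[dict[str, str]]:
    """Enumerate one full LongMemEval run per (profile, embeddings) pair."""
    return [
        {
            "benchmark": "longmemeval",
            "profile": profile,
            "embeddings": embeddings,
            "run_id": f"longmemeval-{profile}-{embeddings}",  # hypothetical naming scheme
        }
        for profile, embeddings in product(PROFILES, EMBEDDINGS)
    ]


runs = longmemeval_runs()
for run in runs:
    print(run["run_id"])
print(f"{len(runs)} runs x {LONGMEMEVAL_QUESTIONS} questions = {len(runs) * LONGMEMEVAL_QUESTIONS} judged answers")
```

If the full cross product is too expensive, one option is to run all three profiles with a single embedding provider first and add the second provider only for the best-performing profile.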

Expected outcome

Practical notes

  • Full LongMemEval with Gemini embeddings + judge takes ~4-8 hours
  • Full LoCoMo with 10 samples takes ~2-3 hours
  • Both require GEMINI_API_KEY or OPENAI_API_KEY
  • Results should be updated in packages/benchmark/README.md
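Because a full run takes hours, it may be worth checking for an API key before starting. This is a minimal preflight sketch; the function names are hypothetical and it only checks that one of the two environment variables named above is set, not that the key is valid.

```python
import os
from collections.abc import Mapping

# Variable names come from the notes above; either one is accepted.
ACCEPTED_KEYS = ("GEMINI_API_KEY", "OPENAI_API_KEY")


def available_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the accepted API key variables that are set to a non-empty value."""
    return [name for name in ACCEPTED_KEYS if env.get(name, "").strip()]


def preflight(env: Mapping[str, str] = os.environ) -> None:
    """Abort early instead of failing partway through a multi-hour run."""
    if not available_keys(env):
        raise SystemExit("Set GEMINI_API_KEY or OPENAI_API_KEY before starting a full run.")
```

A run that compares Gemini and OpenAI embeddings side by side would presumably need both keys, so a stricter check may be appropriate for that case.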
