Benchmarks: graphify as long-term memory and code intelligence #1677
safishamsi announced in Announcements
We just published graphify's benchmarks. The full writeup, with per-system tables, methodology, judge validation, and commands to reproduce everything, is in BENCHMARKS.md (also linked from the README).
Here is the short version.
We ran graphify on our own open harness against dedicated memory systems (mem0, supermemory) and the usual baselines (BM25, dense RAG, hybrid), plus a handful of code-intelligence tools on a ~1M-LOC production repo (ERPNext). Everything runs under the same rules: one model for every system (Kimi K2.6), the same budgets, and the same local embedder where a system allows it. Answers are graded on gold key-facts by a judge we blind-validated against a second independent judge (90.6% agreement, kappa 0.81).
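For readers unfamiliar with the two judge-validation numbers, here is a minimal sketch of how raw agreement and a kappa statistic relate. It assumes the reported kappa is Cohen's kappa over two judges' per-item verdicts, which this post does not state explicitly. The function name `agreement_and_kappa` and the toy verdict lists are hypothetical illustrations, not graphify's harness code or benchmark data.

```python
from collections import Counter


def agreement_and_kappa(judge_a, judge_b):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists.

    Assumption: each list holds one categorical verdict per graded item,
    e.g. 1 = key fact present, 0 = key fact missing.
    """
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(judge_a)

    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Chance agreement: probability that both judges pick the same label
    # if each labels independently according to their own label frequencies.
    counts_a = Counter(judge_a)
    counts_b = Counter(judge_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    # Degenerate case: both judges used a single identical label throughout.
    if expected == 1.0:
        return observed, 1.0

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical toy verdicts, NOT the benchmark's data.
    a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
    b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
    observed, kappa = agreement_and_kappa(a, b)
    print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Kappa discounts the agreement two judges would reach by chance, which is why it comes out lower than the raw agreement percentage when the labels are unevenly distributed.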
The parts we are happy with:
We also want to be straight about where it does not win. supermemory beats graphify on raw LOCOMO QA (49.7% vs 45.3%), though at about 11x the ingest cost. The recall comparison against supermemory is also not fully apples-to-apples: its self-hosted version forces its own embedder instead of the shared local one, so take that one number with a grain of salt. The QA numbers are the clean comparison.
Have a look at the full tables and run it yourself: BENCHMARKS.md. If something looks off, or you think we misconfigured a system, reply here or open an issue with the config and we will re-run it.