Benchmarks: graphify as long-term memory and code intelligence #1677

safishamsi · 2026-07-04T23:39:28Z

safishamsi
Jul 4, 2026
Maintainer

We just published graphify's benchmarks. The full writeup, with per-system tables, methodology, judge validation, and commands to reproduce everything, is in BENCHMARKS.md (also linked from the README).

Here is the short version.

We ran graphify on our own open harness against dedicated memory systems (mem0, supermemory) and the usual baselines (BM25, dense RAG, hybrid), plus a handful of code-intelligence tools on a ~1M-LOC production repo (ERPNext). Everything runs under the same rules: one model for every system (Kimi K2.6), the same budgets, and the same local embedder where a system allows it. Answers are graded on gold key-facts by a judge we blind-validated against a second independent judge (90.6% agreement, kappa 0.81).
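For readers unfamiliar with the two judge-validation figures, here is a minimal sketch of how raw agreement and kappa are commonly computed from two judges' verdicts. It assumes the reported kappa is Cohen's kappa and that each judge emits one label per answer. The function names and the toy labels are illustrative only and are not taken from the harness; BENCHMARKS.md has the actual procedure.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two judges gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both judges pick the same label
    # if each labels independently at their own marginal rates.
    p_chance = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy verdicts (1 = answer covers the gold key-facts, 0 = it does not).
# These values are made up for illustration, not benchmark data.
judge_a = [1, 1, 0, 0, 1, 0, 1, 1]
judge_b = [1, 1, 0, 1, 1, 0, 0, 1]

agreement = percent_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
```

Kappa comes out lower than raw agreement because it discounts the matches two judges would reach by chance alone, which is why both numbers are worth reporting together.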

The parts we are happy with:

On LOCOMO (n=300), graphify retrieves the right memory more often than anything else we tested (recall@10 0.497, roughly 10x mem0), answers more accurately per dollar (45.3%, +18 points over mem0), and ingests for about a tenth of what supermemory costs.
On LongMemEval-S (n=50) it hits 76%, tied for first place with a strong dense-RAG baseline.
On code, giving an agent a single graphify tool raises its key-fact coverage from 70.8% (plain grep and read) to 82.0%, at a fraction of the tokens.
And the graph itself builds AST-only with zero LLM credits.
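As a reference point for the recall@10 figure above, here is a minimal sketch of the usual recall@k definition: the share of gold items that appear in the top k retrieved results, averaged over questions. The harness may define it differently (BENCHMARKS.md is the authority), and the function names and ids below are hypothetical.

```python
def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Share of gold items found among the top-k retrieved ids."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(gold & top_k) / len(gold)


def mean_recall_at_k(runs, k=10):
    """Average recall@k over (retrieved_ids, gold_ids) pairs, one per question."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)


# Hypothetical memory ids, for illustration only.
runs = [
    (["m3", "m7", "m1"], ["m7"]),        # gold item retrieved
    (["m2", "m9", "m4"], ["m5", "m9"]),  # one of two gold items retrieved
]
score = mean_recall_at_k(runs, k=10)
```

Under this definition a score near 0.5 means that, on average, about half of the gold memories for a question show up in the top ten results.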

We also want to be straight about where it does not win. supermemory beats graphify on raw LOCOMO QA (49.7% vs 45.3%), though at about 11x the ingest cost. And the recall comparison against supermemory is not fully apples-to-apples: its self-hosted version forces its own embedder, so it did not run on the same local embedder as the other systems. Take that one number with a grain of salt. The QA numbers are the clean comparison.

Have a look at the full tables and run it yourself: BENCHMARKS.md. If something looks off, or you think we misconfigured a system, reply here or open an issue with the config and we will re-run it.
