Why we built this — what we measured at 100 entities #1
ho4040
announced in
Announcements
TL;DR. When we asked gpt-4o-mini to repair a 100-entity / 183-edge JSON knowledge graph on critic feedback, the prevailing "re-emit the entire object" pattern fixed 0 of 8 flagged defects in 73K tokens and 8.5 minutes of wall clock. The library, a critic loop with surgical RFC 6902 patching and sub-agent decomposition, fixes 8 of 8 in 17K tokens and 26 seconds.
The release notes are short; this thread is the long form: what we measured, why it matters, and the design choices that turned out to be load-bearing. Treat it as the design rationale doc.
What the library does
A composable gather → plan → execute loop with five domain-pluggable slots.
The library imports no LLM client, no persistence layer, no event sink. Storage backends and event sinks are Protocols you plug in.
pip install json-correction-loop (now live on PyPI), Apache-2.0, Python 3.11+.
What we measured
Target task: KG correction. Inject six classes of ground-truthed defects (relation_swap, entity_drop, entity_paraphrase, type_violation, dangling_ref, duplicate) into a clean KG, ask the system to repair, measure (a) how many flagged defects were fixed and (b) how many unflagged fields drifted along the way.
Six conditions, in order from least to most "ours":
oracle, B0, B1, O1, O2, O2N
All gpt-4o-mini via OpenRouter, temperature=0. Two fixtures: a small hand-curated 16-entity KG (real Korean cinema entities, useful for per-condition comparison) and a synthetic 100-entity KG with fictional labels (so LLM training prior cannot quietly "auto-correct" paraphrase defects).
Phase A through B-5
We staged the work to keep the diagnosis tight:
Per-condition results, small fixture
Finding 1: full-regen drifts
B0 hits 100% fix rate but mutates 4 fields per run the critic never asked to touch — label paraphrases, edge reorderings, ID rewrites. That's "drift" — small corruption hiding behind the success metric. On the small fixture it might not bite. With downstream consumers that depend on stable IDs (chapter-N referencing character-N elsewhere in a narrative system), drift silently breaks integrity.
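One way to make "drift" concrete is to count leaf fields that changed outside the pointers the critic flagged. The sketch below is an illustration only, not the library's metric: the helper names `flatten` and `drift` and the toy document are assumptions, and empty containers are ignored for brevity.

```python
from typing import Any

_MISSING = object()  # sentinel for "pointer absent on one side"


def flatten(doc: Any, prefix: str = "") -> dict[str, Any]:
    """Map every leaf value to its JSON Pointer (RFC 6901 escaping)."""
    if isinstance(doc, dict):
        items = list(doc.items())
    elif isinstance(doc, list):
        items = list(enumerate(doc))
    else:
        return {prefix: doc}
    out: dict[str, Any] = {}
    for key, value in items:
        token = str(key).replace("~", "~0").replace("/", "~1")
        out.update(flatten(value, f"{prefix}/{token}"))
    return out


def drift(before: Any, after: Any, flagged: set[str]) -> list[str]:
    """Pointers whose value changed although no flagged pointer covers them."""
    a, b = flatten(before), flatten(after)
    changed = {p for p in a.keys() | b.keys()
               if a.get(p, _MISSING) != b.get(p, _MISSING)}
    return sorted(
        p for p in changed
        if not any(p == f or p.startswith(f + "/") for f in flagged)
    )


# Hypothetical toy document: the critic flagged only Q1's type.
before = {"entities": {"Q1": {"label": "Film A", "type": "Person"},
                       "Q2": {"label": "Actor B", "type": "Person"}}}
after = {"entities": {"Q1": {"label": "Film A", "type": "Film"},       # requested fix
                      "Q2": {"label": "B, the actor", "type": "Person"}}}  # unrequested paraphrase

unflagged_changes = drift(before, after, {"/entities/Q1/type"})
print(unflagged_changes)
```

Here the requested fix is excluded from the count and only the paraphrased label is reported, which is the kind of change a fix-rate metric alone would not surface.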
Finding 2: per-seed variance, not just means
Means hide the picture. Oracle is always at (0 drift, 100% fix). B0 is fix-stable but drifts 0–7. B1 / O1 fix rates bounce 0–100%. Only O2N consistently lands in the right corner — that's the reliability sub-agents add.
Finding 3: a naive surgical patcher fixes less
Common intuition: "send the diff, not the whole document — cheaper and drift-free." JSON Whisperer (arXiv 2510.04717) reports 31% token savings at <5% quality loss. We expected the same. We measured something else: B1 fixes 55%, O1 fixes 64%.
Debugging one case (seed=1):
Two failure modes simultaneously:
remove /edges/2 shifts later indices. Eight removes in ascending order, and the fifth onward points at the wrong thing. Exactly what JSON Whisperer's EASE encoding addresses.
Prompt hardening mitigates both ("restoration > deletion", "remove descending"). But there is a deeper third failure mode that prompting cannot reach.
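The index-shift failure is easy to reproduce without any LLM. The sketch below applies a series of array removes one at a time, the way sequential RFC 6902 `remove` operations behave, first in ascending and then in descending index order. The helper `apply_removes` and the edge names are made up for illustration.

```python
def apply_removes(edges: list[str], indices: list[int]) -> list[str]:
    """Apply `remove /edges/<i>` one operation at a time, in the given order."""
    result = list(edges)
    for i in indices:
        del result[i]  # each remove shifts every later element left by one
    return result


edges = ["e0", "e1", "e2", "e3", "e4", "e5", "e6", "e7"]
targets = [1, 2, 4]  # intent: drop e1, e2 and e4

ascending = apply_removes(edges, sorted(targets))
descending = apply_removes(edges, sorted(targets, reverse=True))

print(ascending)   # e1 removed as intended, then e3 and e6 removed by mistake
print(descending)  # exactly e1, e2, e4 removed
```

Removing from the highest index down leaves the positions of the remaining targets untouched, which is why the "remove descending" instruction helps.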
Finding 4: symptom and root cause are different
The hardest defect is type_violation. Flip Q12's type from Film to Person, and Q12's type field itself is just "Person", so the critic can't catch it directly. The critic catches it on every edge that touches Q12. "Song Kang-ho starred_in Q12" violates starred_in's (Person, Film) requirement because the actual types are now (Person, Person).
The critic points at the symptom (/edges/N), not the root cause (/entities/Q12/type).
Given /edges/N, the LLM "fixes" it by changing the edge's predicate. Critic is satisfied, but Q12 is still wrong, the next iteration flags it on a different edge, and the LLM mutates that edge's predicate too. Eventually every edge touching Q12 has a silently corrupted predicate while the critic happily passes.
This is the motivation for the path_finder sub-agent.
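The symptom/root-cause split can be shown with a toy critic. Everything below is a hypothetical sketch: the `SIGNATURES` table, the edge field names, the entity IDs other than Q12, and the `knows` predicate are assumptions and not the library's schema.

```python
import copy

# Assumed schema: predicate -> (subject type, object type). Unlisted predicates are unconstrained.
SIGNATURES = {"starred_in": ("Person", "Film")}


def critic(kg: dict) -> list[str]:
    """Flag every edge whose endpoint types violate its predicate's signature."""
    flagged = []
    for i, edge in enumerate(kg["edges"]):
        expected = SIGNATURES.get(edge["predicate"])
        actual = (kg["entities"][edge["subject"]]["type"],
                  kg["entities"][edge["object"]]["type"])
        if expected is not None and actual != expected:
            flagged.append(f"/edges/{i}")
    return flagged


kg = {
    "entities": {
        "Q12": {"type": "Person"},  # injected defect: should be "Film"
        "A1": {"type": "Person"},   # hypothetical actor IDs
        "A2": {"type": "Person"},
    },
    "edges": [
        {"subject": "A1", "predicate": "starred_in", "object": "Q12"},
        {"subject": "A2", "predicate": "starred_in", "object": "Q12"},
    ],
}

# The critic only ever reports edge pointers, never /entities/Q12/type.
initial_flags = critic(kg)

# Symptom fix: rewrite the flagged edge's predicate. One flag goes away, Q12 stays wrong.
symptom_fixed = copy.deepcopy(kg)
symptom_fixed["edges"][0]["predicate"] = "knows"
after_symptom_fix = critic(symptom_fixed)

# Root-cause fix: one replace at /entities/Q12/type clears every flag at once.
root_fixed = copy.deepcopy(kg)
root_fixed["entities"]["Q12"]["type"] = "Film"
after_root_fix = critic(root_fixed)

print(initial_flags, after_symptom_fix, after_root_fix)
```

Repeating the symptom fix on the second edge would leave the critic with nothing to report while both predicates are corrupted and Q12 still carries the wrong type.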
Finding 5: path_finder + narrowing closes the gap
path_finder is one LLM call per gather batch that maps each flagged pointer to a (possibly redirected) root-cause pointer. Context narrowing shows both the sub-agent and the patcher only the affected slice: implicated entities/edges plus 1-hop neighbors.
Result on the small fixture:
At 100 entities — the breaking point
If you stop at the small fixture, O2N is 3.4× more expensive than B0 in tokens. Sub-agents add LLM calls. "Surgical patching is expensive" is a fair conclusion to draw at this scale.
Then we ran the same conditions at size=100:
B0 saturates gpt-4o-mini's 8K max_tokens ceiling mid-array, returns truncated JSON, the critic finds more dangling_refs (truncation introduces them), and the loop hardcaps after 5 iterations of 14K tokens each. Zero defects fixed. O2N's tokens barely changed (small fixture 18K → larger KG 17K) because a surgical patcher's cost is bounded by the edit footprint, not the KG size.
The exact breaking point is model-specific (Claude Opus's 200K context buys you a much bigger KG before regen fails). But that there is a breaking point is structural to autoregressive decoding: longer outputs mean both more wall time and more per-token error.
Ablation — narrowing is correctness, not cost
The most striking row is O2: path_finder + full context did not work at all. Its JSON output was unparseable. The full-context prompt at size=100 is ~15K input tokens; the model's structured-output reliability degrades there even though the model nominally supports much longer contexts.
Skip narrowing in production thinking you'll add it later, and your sub-agents will silently start emitting broken JSON at some point.
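A minimal sketch of 1-hop context narrowing, assuming a KG shaped as an `entities` mapping plus an `edges` list with `subject`/`object` fields. The function name `narrow` and the choice to key edges by their original pointer are assumptions for illustration, not the library's implementation.

```python
def narrow(kg: dict, implicated: set[str]) -> dict:
    """Keep implicated entities, every edge touching them, and their 1-hop neighbours."""
    # Key edges by their ORIGINAL pointer so a patch written against the slice
    # still addresses the right element in the full document.
    edges = {
        f"/edges/{i}": edge
        for i, edge in enumerate(kg["edges"])
        if edge["subject"] in implicated or edge["object"] in implicated
    }
    keep = set(implicated)
    for edge in edges.values():
        keep.update((edge["subject"], edge["object"]))
    entities = {k: v for k, v in kg["entities"].items() if k in keep}
    return {"entities": entities, "edges": edges}


# Hypothetical chain Q1 - Q2 - Q3 - Q4 - Q5, with Q2 implicated by the critic.
kg = {
    "entities": {f"Q{n}": {"type": "Person"} for n in range(1, 6)},
    "edges": [
        {"subject": f"Q{n}", "predicate": "knows", "object": f"Q{n + 1}"}
        for n in range(1, 5)
    ],
}

context = narrow(kg, {"Q2"})
print(sorted(context["entities"]), list(context["edges"]))
```

The slice grows with the neighbourhood of the implicated nodes and not with the size of the graph, which is consistent with the bounded prompt size described above.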
Summary
When repairing large JSON state through an LLM-driven critic loop:
Each piece is load-bearing. Drop any one, the whole thing falls over.
Limitations (honestly)
gpt-4o-mini). Larger Claude / GPT-4o variants push the breaking point later, but surgical's relative advantage shows up at any scale where the KG is too large for reliable single-shot regen.
Try it
examples/01_quickstart.py
examples/kg_correction/EXPERIMENTS.md
Discussion
The lesson we paid the most attention to: the breaking point is closer than the intuitions of the people writing the code suggest. Better to find out under controlled conditions than at 3am in production.
If you run an LLM system that produces large JSON state, this is worth measuring on your own pipeline. Open an issue or reply here with what you find — especially if your domain or model breaks the pattern in some way. We're particularly interested in: