Why we built this — what we measured at 100 entities #1

ho4040 (Maintainer) · 2026-05-08T15:43:39Z

TL;DR. When we asked gpt-4o-mini to repair a 100-entity / 183-edge JSON knowledge graph on critic feedback, the prevailing "re-emit the entire object" pattern fixed 0 of 8 flagged defects in 73K tokens and about 7 minutes (435 s) of wall clock. The library, a critic loop with surgical RFC 6902 patching and sub-agent decomposition, fixes 8 of 8 in 17K tokens and 26 seconds.

size=100 (100 ent / 183 edges, 8 defects, gpt-4o-mini, T=0):
  full-regen baseline:     fix=0%    tokens=73,740   wall=435s
  this stack (loop+subagents):  fix=100%  tokens=17,117   wall= 26s

The release notes are short; this thread is the long form — what we measured, why it matters, and the design choices that turned out to be load-bearing. Treat it as the design rationale doc.

What the library does

A composable gather → plan → execute loop with five domain-pluggable slots:

Critic (you supply). Reports defects against stable item IDs (JSON pointers, entity IDs).
path_finder. Maps each critic-flagged symptom pointer to its root-cause pointer.
Context narrowing. Scopes both the sub-agent and the patcher to the slice of state implicated by flagged paths. Turns out to be a correctness component, not just an optimization.
Surgical patcher. Emits RFC 6902 ops via tool calling, validated and applied with a standard JSON Patch library.
Convergence policy. Quality-stable + hardcap, composable as Protocols.

The library imports no LLM client, no persistence layer, no event sink. Storage backends and event sinks are Protocols you plug in.
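To make the "composable as Protocols" idea concrete, here is a minimal sketch of how a quality-stable policy and a hardcap could be combined. Every name in it (`ConvergencePolicy`, `QualityStable`, `HardCap`, `AnyOf`, `should_stop`) is hypothetical and chosen for illustration; it is not the library's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


class ConvergencePolicy(Protocol):
    """Hypothetical slot: decides whether the loop should stop."""

    def should_stop(self, defect_counts: list[int]) -> bool: ...


@dataclass
class HardCap:
    """Stop once a fixed number of iterations has been recorded."""
    max_iterations: int

    def should_stop(self, defect_counts: list[int]) -> bool:
        return len(defect_counts) >= self.max_iterations


@dataclass
class QualityStable:
    """Stop when nothing is left to fix, or when the last `patience`
    iterations did not beat the best earlier defect count."""
    patience: int

    def should_stop(self, defect_counts: list[int]) -> bool:
        if not defect_counts:
            return False
        if defect_counts[-1] == 0:
            return True
        if len(defect_counts) <= self.patience:
            return False
        recent = defect_counts[-self.patience:]
        best_before = min(defect_counts[:-self.patience])
        return min(recent) >= best_before


@dataclass
class AnyOf:
    """Compose policies: stop as soon as any member says stop."""
    policies: list[ConvergencePolicy]

    def should_stop(self, defect_counts: list[int]) -> bool:
        return any(p.should_stop(defect_counts) for p in self.policies)


# One entry per loop iteration: how many defects the critic still reports.
policy = AnyOf([QualityStable(patience=2), HardCap(max_iterations=5)])
print(policy.should_stop([8, 5, 3]))     # still improving
print(policy.should_stop([8, 5, 5, 5]))  # plateaued
```

Because the policies only share a method signature, a domain-specific policy can be dropped in without inheriting from anything.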

pip install json-correction-loop (now live on PyPI), Apache-2.0, Python 3.11+.

What we measured

Target task: KG correction. Inject six classes of ground-truthed defects (relation_swap, entity_drop, entity_paraphrase, type_violation, dangling_ref, duplicate) into a clean KG, ask the system to repair, measure (a) how many flagged defects were fixed and (b) how many unflagged fields drifted along the way.
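The two metrics can be sketched as a comparison between the clean KG and the repaired one. This is an assumption about how fix rate and drift might be computed, not the code behind the numbers below: the `flatten`, `covered`, and `score` helpers and the KG layout are illustrative, and comparing list items by index is only meaningful when array order is preserved.

```python
from typing import Any

_MISSING = object()


def flatten(doc: Any, prefix: str = "") -> dict[str, Any]:
    """Map each leaf value to its JSON Pointer. Empty containers yield no leaves."""
    if isinstance(doc, dict):
        pairs = [(str(k).replace("~", "~0").replace("/", "~1"), v) for k, v in doc.items()]
    elif isinstance(doc, list):
        pairs = [(str(i), v) for i, v in enumerate(doc)]
    else:
        return {prefix: doc}
    leaves: dict[str, Any] = {}
    for key, value in pairs:
        leaves.update(flatten(value, f"{prefix}/{key}"))
    return leaves


def covered(path: str, roots: list[str]) -> bool:
    """True if `path` is one of `roots` or sits underneath one."""
    return any(path == r or path.startswith(r + "/") for r in roots)


def score(clean: Any, repaired: Any, defect_roots: list[str]) -> tuple[float, int]:
    """Return (fix rate, drift) for one repair run.

    fix rate: share of injected defects whose subtree matches the clean KG again.
    drift:    number of leaves outside every injected defect that no longer match.
    """
    want, got = flatten(clean), flatten(repaired)
    wrong = {p for p in want.keys() | got.keys()
             if want.get(p, _MISSING) != got.get(p, _MISSING)}
    fixed = sum(1 for r in defect_roots if not any(covered(p, [r]) for p in wrong))
    drift = sum(1 for p in wrong if not covered(p, defect_roots))
    return fixed / len(defect_roots), drift


clean = {"entities": {"Q1": {"label": "Aria", "type": "Person"},
                      "Q2": {"label": "Borin", "type": "Film"}}}
# The repair restored Q2's type but also paraphrased Q1's label, which nobody flagged.
repaired = {"entities": {"Q1": {"label": "Aria V.", "type": "Person"},
                         "Q2": {"label": "Borin", "type": "Film"}}}
print(score(clean, repaired, ["/entities/Q2/type"]))
```

Here the single injected defect counts as fixed, and the unrequested label change shows up as one drifted field.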

Six conditions, in order from least to most "ours":

Condition	Mechanism
`oracle`	Receives the ground-truthed defect log, applies inverses deterministically. Upper bound. No LLM.
`B0`	"Re-emit the entire KG." The common production pattern.
`B1`	RFC 6902 patch ops in one shot, no loop. JSON Whisperer–style.
`O1`	B1 inside the critic loop, no sub-agents.
`O2`	O1 + path_finder, full-context prompts.
`O2N`	O2 + context narrowing. The full stack.

All runs use gpt-4o-mini via OpenRouter at temperature=0. Two fixtures: a small hand-curated 16-entity KG (real Korean cinema entities, useful for per-condition comparison) and a synthetic 100-entity KG with fictional labels (so the LLM's training prior cannot quietly "auto-correct" paraphrase defects).

Phase A through B-5

We staged the work to keep the diagnosis tight:

Phase	What it validates
A — oracle	Whether a deterministic patcher with the ground-truth log fixes 100%. Pipeline sanity check.
B-0 — B0	Quantify the drift of the standard full-regen pattern.
B-1 — B1	Limits of single-shot RFC 6902 patching.
B-2 — O1 / O2	Isolate the contribution of the loop and of path_finder.
B-3 — O2N	Add context narrowing.
B-4 — size sweep	Where does it break as the KG grows?
B-5 — ablation	Separate each component's contribution at scale.

Per-condition results, small fixture

Cond	Fix%	Drift	Tokens	Wall
oracle (UB)	100	0.0	0	0s
B0 full-regen	100	4.0	2,506	14.5s
B1 single-shot patch	55	5.5	1,774	3.1s
O1 loop+patch	64	4.8	6,218	9.6s
O2 +path_finder	84	5.5	13,244	19.7s
O2N +narrowing	97	4.5	8,482	17.2s

Finding 1: full-regen drifts

B0 hits 100% fix rate but mutates 4 fields per run the critic never asked to touch — label paraphrases, edge reorderings, ID rewrites. That's "drift" — small corruption hiding behind the success metric. On the small fixture it might not bite. With downstream consumers that depend on stable IDs (chapter-N referencing character-N elsewhere in a narrative system), drift silently breaks integrity.

Finding 2: per-seed variance, not just means

Means hide the picture. Across seeds, the oracle always sits at (0 drift, 100% fix). B0's fix rate is stable, but its drift ranges from 0 to 7 fields. B1 and O1 fix rates swing between 0% and 100%. Only O2N consistently lands in the high-fix-rate corner; that consistency is the reliability the sub-agents add.

Finding 3: a naive surgical patcher fixes less

Common intuition: "send the diff, not the whole document — cheaper and drift-free." JSON Whisperer (arXiv 2510.04717) reports 31% token savings at <5% quality loss. We expected the same. We measured something else: B1 fixes 55%, O1 fixes 64%.

Debugging one case (seed=1):

8 critic-flagged issues:
  /edges/2:  dangling_ref (Q2 not in entities)
  /edges/3:  dangling_ref (Q2 not in entities)
  /edges/4:  type_violation ('spouse_of' wants Person→Person, got Person→Film)
  /edges/5:  dangling_ref (Q15 not in entities)
  ...

8 patch ops the LLM emitted:
  {'op': 'remove', 'path': '/edges/2'}    # should restore Q2, not delete the edge
  {'op': 'remove', 'path': '/edges/3'}
  {'op': 'remove', 'path': '/edges/4'}
  ...
  {'op': 'remove', 'path': '/edges/9'}    # invalid: indices shifted

Two failure modes simultaneously:

Lazy remove. Instead of fixing a defect, the LLM deletes the offending item. Right fix for a missing-Q2 dangling_ref is to restore Q2; the LLM removes every edge that references Q2. Data loss.
Index drift. remove /edges/2 shifts every later index down by one. With eight removes emitted in ascending order, every remove after the first points at the wrong item, and the last ones can point past the end of the array. Exactly what JSON Whisperer's EASE encoding addresses.
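The index-drift failure is easy to reproduce without an LLM. The sketch below uses a plain Python list in place of a JSON Patch library (an illustrative simplification; `apply_removes` is a hypothetical helper) and relies only on the fact that RFC 6902 operations are applied sequentially, each against the result of the previous one.

```python
def apply_removes(items: list[str], indices: list[int]) -> list[str]:
    """Apply 'remove' ops one at a time, each against the already-modified list."""
    out = list(items)
    for i in indices:
        del out[i]
    return out


edges = [f"e{i}" for i in range(10)]
flagged = [2, 3, 4]  # the critic wants e2, e3, e4 gone

ascending = apply_removes(edges, flagged)
descending = apply_removes(edges, sorted(flagged, reverse=True))

print(ascending)   # e2, e4, e6 were removed: two of three hits are wrong
print(descending)  # e2, e3, e4 were removed, as intended
```

Removing in descending index order works because each removal only shifts positions that have already been handled.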

Prompt hardening mitigates both ("restoration > deletion", "remove descending"). But there is a deeper third failure mode that prompting cannot reach.

Finding 4: symptom and root cause are different

The hardest defect is type_violation. Flip Q12's type from Film to Person, and Q12's type field itself is just "Person" — the critic can't catch it directly. The critic catches it on every edge that touches Q12. "Song Kang-ho starred_in Q12" violates starred_in's (Person, Film) requirement because the actual types are now (Person, Person).

The critic points at the symptom (/edges/N), not the root cause (/entities/Q12/type).

Given /edges/N, the LLM "fixes" it by changing the edge's predicate. Critic is satisfied — but Q12 is still wrong, the next iteration flags it on a different edge, and the LLM mutates that edge's predicate too. Eventually every edge touching Q12 has a silently corrupted predicate while the critic happily passes.

This is the motivation for the path_finder sub-agent.
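A toy structural critic shows why the flag lands on the edges. The KG layout here (`entities` keyed by ID, edges with `src` / `predicate` / `dst`) and the `SIGNATURES` table are assumptions made for the sketch, not the library's schema.

```python
import copy

# Hypothetical predicate signatures: (source type, target type).
SIGNATURES = {"starred_in": ("Person", "Film"), "spouse_of": ("Person", "Person")}


def type_violations(kg: dict) -> list[str]:
    """Return a JSON Pointer for every edge whose endpoint types break its signature."""
    issues = []
    for i, edge in enumerate(kg["edges"]):
        want = SIGNATURES[edge["predicate"]]
        got = (kg["entities"][edge["src"]]["type"], kg["entities"][edge["dst"]]["type"])
        if got != want:
            issues.append(f"/edges/{i}")
    return issues


kg = {
    "entities": {"Q1": {"type": "Person"}, "Q3": {"type": "Person"}, "Q12": {"type": "Film"}},
    "edges": [
        {"src": "Q1", "predicate": "starred_in", "dst": "Q12"},
        {"src": "Q3", "predicate": "starred_in", "dst": "Q12"},
    ],
}

# Inject the defect at its root: Q12's type flips from Film to Person.
broken = copy.deepcopy(kg)
broken["entities"]["Q12"]["type"] = "Person"
symptoms = type_violations(broken)
print(symptoms)  # only edge pointers; /entities/Q12/type never appears

# A symptom-level "fix": rewrite each flagged edge's predicate until the critic is quiet.
masked = copy.deepcopy(broken)
for edge in masked["edges"]:
    edge["predicate"] = "spouse_of"
print(type_violations(masked))            # critic passes
print(masked["entities"]["Q12"]["type"])  # root cause is still there
```

The critic has nothing to say about `/entities/Q12/type` in isolation, so any repair driven only by its pointers ends up editing edges.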

Finding 5: path_finder + narrowing closes the gap

path_finder is one LLM call per gather batch that maps each flagged pointer to a (possibly redirected) root-cause pointer. Context narrowing shows both the sub-agent and the patcher only the affected slice — implicated entities/edges plus 1-hop neighbors.
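The narrowing step can be sketched as a small graph slice. This is a rough illustration under assumed names and layout (`narrow`, entities keyed by ID, edges with `src` / `dst`), not the library's implementation. Edges are returned keyed by their original index so that patch paths produced against the slice can still be mapped back to the full document.

```python
def narrow(kg: dict, flagged: list[str]) -> dict:
    """Keep entities and edges implicated by the flagged pointers, plus 1-hop neighbors."""
    seeds: set[str] = set()
    for pointer in flagged:
        parts = pointer.split("/")  # "/edges/3" -> ["", "edges", "3"]
        if parts[1] == "edges":
            edge = kg["edges"][int(parts[2])]
            seeds.update((edge["src"], edge["dst"]))
        elif parts[1] == "entities":
            seeds.add(parts[2])

    # Every edge touching a seed entity, keyed by its index in the full KG.
    edges = {i: e for i, e in enumerate(kg["edges"])
             if e["src"] in seeds or e["dst"] in seeds}

    # Seeds plus whatever sits at the other end of those edges.
    keep = set(seeds)
    for e in edges.values():
        keep.update((e["src"], e["dst"]))

    # IDs missing from the KG (dangling refs) simply do not appear in the slice.
    entities = {k: v for k, v in kg["entities"].items() if k in keep}
    return {"entities": entities, "edges": edges}


kg = {
    "entities": {q: {"type": "Person"} for q in ("Q1", "Q2", "Q3", "Q4", "Q5")},
    "edges": [
        {"src": "Q1", "predicate": "spouse_of", "dst": "Q2"},
        {"src": "Q2", "predicate": "spouse_of", "dst": "Q3"},
        {"src": "Q4", "predicate": "spouse_of", "dst": "Q5"},
    ],
}
view = narrow(kg, ["/edges/0"])
print(sorted(view["entities"]), sorted(view["edges"]))
```

The unrelated Q4/Q5 edge never reaches the prompt, which is the point: prompt size follows the flagged region rather than the whole KG.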

Result on the small fixture:

O1 (loop only): 64%
O2 (+ path_finder, full context): 84%
O2N (+ narrowing): 97% — within 3pp of the oracle.

At 100 entities — the breaking point

If you stop at the small fixture, O2N is 3.4× more expensive than B0 in tokens. Sub-agents add LLM calls. "Surgical patching is expensive" is a fair conclusion to draw at this scale.

Then we ran the same conditions at size=100:

size=100 (100 ent / 183 edges, 8 defects):
  B0:   fix=  0%   tokens=73,740   wall=435s
  O2N:  fix=100%   tokens=17,117   wall= 26s

B0 saturates gpt-4o-mini's 8K max_tokens ceiling mid-array, returns truncated JSON, the critic finds more dangling_refs (truncation introduces them), and the loop hardcaps after 5 iterations of roughly 14K tokens each. Zero defects fixed. O2N's tokens grew far less than the KG did (8.5K on the small fixture → 17K at size=100, about 2×, against roughly 29× for B0) because a surgical patcher's cost is bounded by the edit footprint, not the KG size.

The central empirical claim. Surgical patching's cost scales with edit footprint. Full-regen's scales with document size. For any non-trivial JSON state under multi-defect critique, surgical strictly dominates — both on tokens, and on whether the method works at all.
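The scaling gap can be read straight off the reported numbers. The snippet below only does arithmetic on the token counts and wall times from the small-fixture table and the size=100 run; nothing in it is newly measured.

```python
# Token counts and wall times as reported in the results above.
tokens = {
    "B0": {"small": 2_506, "size100": 73_740},
    "O2N": {"small": 8_482, "size100": 17_117},
}
wall_s = {"B0": 435, "O2N": 26}  # size=100 only

# How much each condition's token bill grew from the small fixture to size=100.
growth = {cond: t["size100"] / t["small"] for cond, t in tokens.items()}

# Head-to-head at size=100.
token_ratio = tokens["B0"]["size100"] / tokens["O2N"]["size100"]
wall_ratio = wall_s["B0"] / wall_s["O2N"]

print(f"B0 token growth:  {growth['B0']:.1f}x")
print(f"O2N token growth: {growth['O2N']:.1f}x")
print(f"B0 vs O2N tokens at size=100: {token_ratio:.1f}x")
print(f"B0 vs O2N wall at size=100:   {wall_ratio:.1f}x")
```

Full-regen's bill grows by roughly 29× between the two fixtures while the surgical stack's roughly doubles, which is the edit-footprint versus document-size argument in numbers.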

The exact breaking point is model-specific (Claude Opus's 200K context buys you a much bigger KG before regen fails). But that there is a breaking point is structural to autoregressive decoding — longer outputs mean both more wall time and more per-token error.

Ablation — narrowing is correctness, not cost

Cond	path_finder	narrowing	Fix%	Drift	Tokens
B0	n/a	n/a	0%	8	73,740
O1	no	no	35%	23	42,621
O1N	no	yes	57%	13	14,392
O2	yes	no	JSON parse failure	—	—
O2N	yes	yes	100%	6	17,117

The most striking row is O2: path_finder + full context did not work at all. Its JSON output was unparseable. The full-context prompt at size=100 is ~15K input tokens; the model's structured-output reliability degrades there even though the model nominally supports much longer contexts.

Narrowing is not "premature optimization." It is a prerequisite for sub-agent reliability.

Skip narrowing in production thinking you'll add it later, and your sub-agents will silently start emitting broken JSON at some point.
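One way to keep that failure from being silent is to validate every sub-agent response before anything is applied. The guard below is a sketch under assumptions: `parse_patch_ops` and `SubAgentOutputError` are hypothetical names, and the checks cover only the basic shape of an RFC 6902 patch (a JSON array of objects with a known `op` and a JSON Pointer `path`).

```python
import json

# The six operations defined by RFC 6902.
ALLOWED_OPS = {"add", "remove", "replace", "move", "copy", "test"}


class SubAgentOutputError(ValueError):
    """Raised when a sub-agent response cannot be used as a patch."""


def parse_patch_ops(raw: str) -> list[dict]:
    """Parse a model response into patch ops, failing loudly on anything malformed."""
    try:
        ops = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SubAgentOutputError(f"unparseable JSON at char {exc.pos}") from exc
    if not isinstance(ops, list):
        raise SubAgentOutputError("expected a JSON array of patch ops")
    for op in ops:
        if not isinstance(op, dict) or op.get("op") not in ALLOWED_OPS:
            raise SubAgentOutputError(f"unknown or missing op: {op!r}")
        path = op.get("path")
        # A JSON Pointer is either empty (whole document) or starts with "/".
        if not isinstance(path, str) or (path and not path.startswith("/")):
            raise SubAgentOutputError(f"invalid path: {path!r}")
    return ops


good = '[{"op": "remove", "path": "/edges/2"}]'
truncated = '[{"op": "remove", "path": "/edges/2"}, {"op": "rem'

print(parse_patch_ops(good))
try:
    parse_patch_ops(truncated)
except SubAgentOutputError as err:
    print("rejected:", err)
```

Counting these rejections per iteration gives an early signal that prompts have grown past the point where structured output stays reliable.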

Summary

When repairing large JSON state through an LLM-driven critic loop:

Full-regen breaks earlier than intuition suggests. 100 entities is production-realistic. Bigger models push it later but don't eliminate it — autoregressive decoding has structural failure at long outputs.
A simple surgical patcher (JSON Whisperer + critic loop) is not enough. Symptom-vs-root-cause keeps fix rate stuck at ~64%.
Sub-agents close the gap on the small fixture: path_finder alone adds 20pp (64% → 84%), and with narrowing the total gain is +33pp (64% → 97%).
Context narrowing is a correctness component, not a cost optimization. Drop it and the sub-agents stop working.

Each piece is load-bearing. Drop any one, the whole thing falls over.

Limitations (honestly)

Single model (gpt-4o-mini). Larger Claude / GPT-4o variants push the breaking point later, but surgical's relative advantage shows up at any scale where the KG is too large for reliable single-shot regen.
Single domain (KG correction). Generality is currently an architectural argument; empirical confirmation on a second domain (config refactor, narrative state) is the natural next step.
Synthetic perturbations. Real LLM-emitted defects have different distributions.
Ablation is one seed at size=100. Robustness with 5+ seeds is on the list.
No semantic critic yet — paraphrase defects (entity_paraphrase) are invisible to the structural critic; closing that bucket needs a semantic critic + RAG.

Try it

pip install json-correction-loop

Quickstart (no-LLM): examples/01_quickstart.py
KG-correction example: examples/kg_correction/
Full numbers + reproduction recipe: EXPERIMENTS.md

Discussion

The lesson we took most seriously: the breaking point arrives sooner than the people writing the code tend to expect. Better to find out under controlled conditions than at 3am in production.

If you run an LLM system that produces large JSON state, this is worth measuring on your own pipeline. Open an issue or reply here with what you find — especially if your domain or model breaks the pattern in some way. We're particularly interested in:

Does the size=100 wall move as expected when you swap in a larger model?
Does the symptom-vs-root-cause failure look the same in non-KG domains (config trees, narrative state)?
What sub-agents would your domain need that aren't path_finder / patcher / narrowing?
