feat: embedding-based dedup candidates + lint link suggestions #381

Open
kostadis wants to merge 10 commits into nashsu:main from kostadis:dedup-embeddings-turbovecdb

kostadis commented Jun 11, 2026

Contributor

Summary

Builds on v0.4.23. Two features plus the service and research behind them.

1. Embedding-based dedup candidate generation

  • dedup-embed.ts: generates duplicate-candidate groups for rich pages via embeddings + a bounded, parallelized per-batch LLM confirm pass, surfaced as a parallel "Scan (embeddings)" lane.
  • Tighter default τ threshold (0.15 → 0.10), a lexical lane for date-suffixed snapshot pages, and a strict confirm prompt, all to cut chaff.
  • Dedup of duplicate groups by normalized slug-set key (the LLM can emit the same pair twice or in slug-order variants).
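The slug-set dedup in the last bullet can be sketched as follows. This is a minimal Python illustration, not the code in dedup-embed.ts: the function names and the exact normalization (lowercasing and trimming whitespace) are assumptions made for the example.

```python
from typing import Iterable


def slug_set_key(slugs: Iterable[str]) -> str:
    """Order-insensitive key: normalize each slug, drop repeats, sort, join."""
    # Assumption: normalization means lowercase + trimmed whitespace.
    normalized = {s.strip().lower() for s in slugs if s.strip()}
    return "|".join(sorted(normalized))


def dedupe_groups(groups: list[list[str]]) -> list[list[str]]:
    """Keep the first group seen for each slug-set key."""
    seen: set[str] = set()
    unique: list[list[str]] = []
    for group in groups:
        key = slug_set_key(group)
        if key and key not in seen:
            seen.add(key)
            unique.append(group)
    return unique


# Hypothetical detector output: the same pair emitted three ways, plus one other pair.
groups = [
    ["giant-reward", "giant-reward-offered"],
    ["giant-reward-offered", "giant-reward"],   # same pair, reversed order
    ["Giant-Reward", "giant-reward-offered "],  # same pair, case/whitespace variant
    ["political-system", "political-structure"],
]
deduped = dedupe_groups(groups)
```

Because the key is built from a sorted set, a pair repeated verbatim and a pair emitted in the opposite slug order both collapse to one group.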

2. Lint link suggestions + broken-link stubs

  • lint.ts: runLinkSuggestions / resolveOrphansByEmbedding surface related-page links for orphan pages via embeddings, with a bestLexicalSlug fallback.
  • buildBrokenLinkStub + addRelatedLink apply fixes; LintResult gains brokenTarget / suggestedTarget / suggestedSource, threaded through the store and lint-view UI.
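The embedding-first, lexical-fallback behavior described above might look roughly like this. It is a sketch under stated assumptions: the PR does not show how bestLexicalSlug scores candidates, so the token-overlap (Jaccard) scoring, the function names, and the example slugs here are all hypothetical.

```python
from typing import Optional


def slug_tokens(slug: str) -> set[str]:
    """Split a slug into lowercase tokens on '-' and '_'."""
    return {t for t in slug.lower().replace("_", "-").split("-") if t}


def best_lexical_slug(orphan: str, candidates: list[str]) -> Optional[str]:
    """Candidate whose tokens overlap most with the orphan's (Jaccard); None if no overlap."""
    orphan_tokens = slug_tokens(orphan)
    best, best_score = None, 0.0
    for cand in candidates:
        if cand == orphan:
            continue
        tokens = slug_tokens(cand)
        union = orphan_tokens | tokens
        score = len(orphan_tokens & tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = cand, score
    return best


def suggest_link(orphan: str, embedding_neighbor: Optional[str],
                 candidates: list[str]) -> Optional[str]:
    """Prefer the embedding neighbor; fall back to the lexical match when there is none."""
    if embedding_neighbor is not None:
        return embedding_neighbor
    return best_lexical_slug(orphan, candidates)


# Hypothetical slugs; no embedding neighbor is available for the first call.
fallback = suggest_link("storm-king-keep", None, ["storm-king", "wayside-inn"])
preferred = suggest_link("storm-king-keep", "giant-hall", ["storm-king"])
```

A fallback of this kind keeps orphan suggestions available when no embedding neighbor is found, at the cost of only matching pages whose slugs share words.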

3. turbovecdb-service/ (vendored HTTP adapter)

The "Scan (embeddings)" lane POSTs to a small HTTP service (/clear, /upsert, /candidate_pairs). This PR vendors it under turbovecdb-service/ (service.py — stdlib-only http.server layer; route names match the dedup-embed.ts client exactly) so the feature is reproducible from a checkout.
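For orientation, a client for those three routes could be sketched in Python as below. Only the route names and the port (8077, from the run instructions) come from this PR; the JSON payload fields, the assumption that responses are JSON, and the helper names are illustrative. See turbovecdb-service/README.md for the actual API.

```python
import json
import urllib.request

ROUTES = ("/clear", "/upsert", "/candidate_pairs")  # route names as listed in the PR


def build_post(base_url: str, route: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one of the service routes."""
    if route not in ROUTES:
        raise ValueError(f"unknown route: {route}")
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def service_post(base_url: str, route: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request; assumes the service answers with a JSON body."""
    request = build_post(base_url, route, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# "threshold" is a hypothetical payload field used only for illustration.
req = build_post("http://localhost:8077/", "/candidate_pairs", {"threshold": 0.10})
```

Splitting request construction from sending keeps the route and payload handling checkable without a running service.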

4. Research artifacts

  • scripts/dedup_prototype/ — reproducible probes (turbovecdb contract probe, candidate-gen, e2e pipeline) and FINDINGS.md documenting the dedup taxonomy, two-index architecture, and the turbovecdb gaps (G1–G4) the service works around. Happy to drop these if you'd prefer to keep the repo lean.

Requirements / how to run the embedding lane

The embedding-dedup lane needs the turbovecdb-service running. It has one third-party dependency, turbovecdb, published on PyPI:

cd turbovecdb-service
pip install -r requirements.txt        # pulls turbovecdb + turbovec/numpy/filelock
python service.py --port 8077

The lane also needs an OpenAI-compatible embeddings endpoint and a chat endpoint (for the confirm pass), both configured in settings.

Without the service running, the rest of the lint/dedup features are unaffected; only the embedding candidate-gen lane is inert. See turbovecdb-service/README.md for the full API.

Tests

New/expanded vitest coverage: lint-links.test.ts, dedup-embed.test.ts, lint-view.test.ts. The affected suites pass locally (38 tests across the lint/dedup blast radius).

Notes

  • package-lock.json reconciled against v0.4.23 via npm install (no new runtime deps from this branch).

kostadis and others added 10 commits June 10, 2026 22:41
Exercise the existing (unchanged) turbovecdb against the dedup
candidate-generation contract — synthetic R1–R10 probe plus a real
~1000-page Storm King run via nomic-embed-text.

Outcome: existing turbovecdb is sufficient as-is. Every requirement
passes; candidate-gen over 986 real pages runs in ~2.0s (vs the 30-min
single-prompt LLM timeout). Four minor turbovecdb gaps recorded for a
separate plan (G1 no native clear, G2 opaque duplicate-id upsert error,
G3 no batch all-pairs, G4 can't filter primary id). The real bottleneck
is embedding-input quality (nomic collapses short/stub-page inputs),
which is an llm_wiki-side concern, not turbovecdb.

Adds scripts/dedup_prototype/{tvdb_contract_probe,skt_candidate_gen}.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs/ is gitignored (internal, not shipped), so keep the findings with
the reproducible scripts under scripts/dedup_prototype/FINDINGS.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Storm King: only 15 frontmatter-only stubs (prose=0), mostly distinct
(1 real typo dup); excluding them surfaces real content dupes but leaves
a residual collapse cluster of thin/templated pages. Stubs need lexical
(not embedding) dedup; thin pages need full-body embedding input.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The 0.000 'collapse' pairs are mostly REAL duplicate content (identical
bodies: curse/sacrifice/grandfather-tree, giant-reward-offered/giant-reward),
not noise — the embedding candidate-gen was working. Genuine noise is small
(15 prose=0 stubs + a few placeholders). Records the three-lane model and the
turbovecdb two-collection (rich/thin) architecture, validated via
db.list_collections().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…embeddings)")

Adds a scalable duplicate-detection path that avoids the one-giant-prompt scan's
timeout on large wikis. Rich (substantive-prose) entity/concept pages are embedded
once, indexed in the turbovecdb HTTP service, and reduced to candidate pairs by
cosine similarity; those are union-find clustered and confirmed with the EXISTING
LLM detector (same DETECTOR_SYSTEM_PROMPT, parsing, and notDuplicates whitelist)
over bounded batches — so the detector sees ~candidate pages, not the whole wiki.
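The candidate-pair step can be illustrated with a brute-force Python stand-in. In the PR this work is done by the turbovecdb service, not by code like this. The sketch reads τ as a ceiling on cosine distance, which is an inference from the findings (identical bodies score 0.000) rather than something the PR states outright; the vectors and function names are made up for the example.

```python
import math
from itertools import combinations


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat a zero vector as dissimilar to everything
    return 1.0 - dot / norm


def candidate_pairs(vectors: dict[str, list[float]], tau: float = 0.10):
    """All slug pairs whose cosine distance is at most tau (O(n^2) brute force)."""
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        distance = cosine_distance(vectors[a], vectors[b])
        if distance <= tau:
            pairs.append((a, b, distance))
    return pairs


# Toy 2-d "embeddings": two identical pages and one unrelated page.
vectors = {
    "giant-reward": [1.0, 0.0],
    "giant-reward-offered": [1.0, 0.0],
    "wayside-inn": [0.0, 1.0],
}
pairs = candidate_pairs(vectors)
```

Only pairs under the threshold go on to clustering and the LLM confirm pass, which is what keeps the detector's input proportional to the candidates rather than to the whole wiki.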

- src/lib/dedup-embed.ts: loadRichPageRecords / embed / service client /
  clusterPairs (union-find) / packClusters (bounded batches) /
  runEmbeddingDuplicateDetection -> DuplicateGroup[] (the shape the Maintenance UI
  + merge queue already consume). Stubs/placeholders excluded (they collapse under
  embeddings — see scripts/dedup_prototype/FINDINGS.md).
- maintenance-section.tsx: a parallel "Scan (embeddings, beta)" button next to the
  existing scan (unchanged), a turbovecdb-service URL field (localStorage), and a
  progress line. Reuses the store's embeddingConfig.
- Pure helpers unit-tested; full pipeline validated end-to-end on the ~1000-page
  Storm King wiki (208 rich pages -> 201 candidate pairs -> 23 real duplicate
  groups in ~70s vs the prior 30-min hang). e2e_pipeline.py reproduces it.
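As a rough picture of what a union-find clusterPairs does, here is a small Python sketch. It is not a port of the TypeScript helper: the function name, the path-halving detail, and the sorted output are choices made for this example.

```python
def cluster_pairs(pairs: list[tuple[str, str]]) -> list[list[str]]:
    """Group slugs into connected components using union-find."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    clusters: dict[str, list[str]] = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    # Sorted members and sorted clusters give a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical candidate pairs: a-b and b-c chain into one cluster; x-y stays separate.
clusters = cluster_pairs([("page-a", "page-b"), ("page-b", "page-c"), ("page-x", "page-y")])
```

Clustering transitively means pages linked only through a shared neighbor still land in the same batch, so the detector sees them together.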

The turbovecdb HTTP service lives standalone at ~/src/turbovecdb-service.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The confirm calls used the v0.4.21 dedup wrapper (temperature only), with no
reasoning-off and no max_tokens, so a batch could run unbounded (observed: 8+ min
on one batch), recreating the original runaway. Add a local buildConfirmLlmCall
that disables thinking and caps output (max_tokens=4096), shrink batches to 60
pages, and run batches with bounded concurrency (4) so total wall-clock ≈ one
batch instead of the sum.
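The batching and bounded concurrency described here can be sketched in Python as follows. The 60-page cap and the concurrency of 4 come from this commit message; the greedy packing strategy, the function names, and the use of a thread pool are assumptions for illustration, and `confirm` stands in for the real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def pack_clusters(clusters: list[list[str]], max_pages: int = 60) -> list[list[str]]:
    """Greedily pack whole clusters into batches of at most max_pages pages."""
    batches: list[list[str]] = []
    current: list[str] = []
    for cluster in clusters:
        if current and len(current) + len(cluster) > max_pages:
            batches.append(current)
            current = []
        # A cluster is never split; one larger than max_pages gets its own batch.
        current.extend(cluster)
    if current:
        batches.append(current)
    return batches


def run_batches(batches: list[list[str]], confirm: Callable[[list[str]], object],
                max_workers: int = 4) -> list[object]:
    """Run confirm() over batches with at most max_workers calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(confirm, batches))  # results keep batch order


# Hypothetical clusters of 40, 30, and 2 pages.
clusters = [
    [f"p{i}" for i in range(40)],
    [f"q{i}" for i in range(30)],
    ["r0", "r1"],
]
batches = pack_clusters(clusters)
sizes = run_batches(batches, len)  # len() stands in for the confirm call
```

With at most four batches the wall-clock time is roughly that of the slowest batch; with more, batches queue behind the four workers.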

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…hter τ

The embedding scan surfaced a lot of false groups ("chaff"). Two root causes,
both addressed:

1. Date-suffixed snapshot pages (X-YYYY-MM-DD) are templated stubs that embed
   alike, so embeddings mis-clustered unrelated entities (wayside-inn-<date> with
   whispering-woods-<date>). New LEXICAL lane groups them by stripped base slug
   instead — exact: a base page + its dated snapshots share a base name; unrelated
   stubs don't. They're excluded from the embedding lane entirely. (21 clean
   date-dup groups on the test wiki, zero cross-entity chaff.)

2. Embedding similarity ≠ duplication: topically-related pages cluster at
   ~0.08–0.15 cosine and the permissive detector grouped them as "medium"
   ("political-system" vs "political-structure"). Fixes: a STRICT confirm prompt
   (same-entity-only, optional systemPrompt arg on detectDuplicateGroups), drop
   low-confidence groups, and tighten the default τ 0.15 → 0.10 (now UI-tunable).
   On the test wiki this cut the embedding lane to only genuine dups.

Also dedupe basenames within a lexical group (entity/X + concept/X collision, F1).
Adds unit tests for the date-suffix helpers; 50 dedup tests green.
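The lexical date lane from point 1 can be pictured with a short Python sketch. The X-YYYY-MM-DD pattern is from the commit message; the regex, the helper names, and the example dates are assumptions, and the entity/X + concept/X basename handling is not modeled here.

```python
import re
from collections import defaultdict

# Matches a trailing "-YYYY-MM-DD" on a slug.
DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def strip_date_suffix(slug: str) -> str:
    """Return the base slug with any trailing date removed."""
    return DATE_SUFFIX.sub("", slug)


def is_dated(slug: str) -> bool:
    return DATE_SUFFIX.search(slug) is not None


def date_dup_groups(slugs: list[str]) -> dict[str, list[str]]:
    """Group a base page with its dated snapshots; unrelated stubs never share a base."""
    by_base: dict[str, set[str]] = defaultdict(set)
    for slug in slugs:
        by_base[strip_date_suffix(slug)].add(slug)
    return {
        base: sorted(members)
        for base, members in by_base.items()
        if len(members) > 1 and any(is_dated(m) for m in members)
    }


# Hypothetical slugs with made-up dates.
slugs = [
    "wayside-inn",
    "wayside-inn-2024-01-05",
    "whispering-woods-2024-01-05",
    "political-system",
]
groups = date_dup_groups(slugs)
```

Grouping by exact base slug is what rules out cross-entity matches: two dated stubs only meet if the text before the date is identical.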

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds a link-suggestion lane to the lint subsystem:
- runLinkSuggestions / resolveOrphansByEmbedding surface related-page
  links for orphans via embeddings, with bestLexicalSlug fallback
- buildBrokenLinkStub + addRelatedLink apply fixes; LintResult gains
  brokenTarget / suggestedTarget / suggestedSource, threaded through
  lint-store and the lint-view UI
- dedup-embed: dedup duplicate groups by normalized slug-set key
  (LLM can emit the same pair twice / in slug-order variants); export
  servicePost for reuse
- tests: new lint-links.test.ts; expanded lint-view.test.ts

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dedup-embed.ts's "Scan (embeddings)" lane POSTs to a turbovecdb HTTP
service (/clear, /upsert, /candidate_pairs) that previously lived only
as an uncommitted script outside the repo, making the feature
unrunnable from a checkout. Vendor it under turbovecdb-service/.

- service.py: stdlib-only http.server layer over the turbovecdb library;
  per-project db_path, single "pages" collection, works around turbovecdb
  gaps G1 (no native clear) and G2 (dup ids in a batch). Route names match
  the dedup-embed.ts client exactly.
- README: run instructions + explicit Requirements note that the
  turbovecdb library itself is a separate local (non-PyPI, non-vendored)
  dependency that must be importable.
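One plausible way to work around G2 (duplicate ids in a single upsert batch) is to collapse the batch before handing it to the library. The sketch below is a guess at the shape of such a workaround, not a copy of service.py: the record fields, the function name, and the last-record-wins policy are all assumptions.

```python
def dedupe_batch(records: list[dict]) -> list[dict]:
    """Collapse records sharing an id: the last one wins, first-seen id order is kept."""
    by_id: dict[str, dict] = {}
    for record in records:
        # Re-assigning an existing key keeps its original position in the dict.
        by_id[record["id"]] = record
    return list(by_id.values())


# Hypothetical batch in which "a" appears twice.
batch = [
    {"id": "a", "vector": [0.1, 0.2]},
    {"id": "b", "vector": [0.3, 0.4]},
    {"id": "a", "vector": [0.5, 0.6]},
]
clean = dedupe_batch(batch)
```

Filtering up front means the library never sees a repeated id, so the opaque duplicate-id error recorded as G2 is avoided rather than caught.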

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

turbovecdb is published on PyPI, so the service is installable from a
clean checkout. Add requirements.txt (turbovecdb>=0.1.0, which pulls
turbovec/numpy/filelock transitively) and update the README to
"pip install -r requirements.txt" instead of the bring-your-own-library
note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kostadis

Contributor Author

Using these fixes, I was able to clean up a large llm-wiki that was generated from 200+ files.
