feat: embedding-based dedup candidates + lint link suggestions by kostadis · Pull Request #381 · nashsu/llm_wiki

kostadis · 2026-06-11T05:46:35Z

Summary

Builds on v0.4.23. Two features plus the service and research behind them.

1. Embedding-based dedup candidate generation

dedup-embed.ts: generates duplicate-candidate groups for rich pages via embeddings + a bounded, parallelized per-batch LLM confirm pass, surfaced as a parallel "Scan (embeddings)" lane.
Tighter τ threshold, a lexical date lane, and a strict confirm prompt to cut chaff.
Dedup of duplicate groups by normalized slug-set key (the LLM can emit the same pair twice or in slug-order variants).
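The slug-set dedup above can be sketched in a few lines. This is a Python illustration, not the TypeScript in dedup-embed.ts; the normalization shown (lowercase plus whitespace trim) and the names `slug_set_key` / `dedupe_groups` are assumptions for the example.

```python
def slug_set_key(slugs):
    """Order-insensitive key for a group: normalize each slug, drop duplicates."""
    # Assumption: normalization means lowercase + trimmed whitespace.
    return frozenset(s.strip().lower() for s in slugs)


def dedupe_groups(groups):
    """Keep the first group seen for each normalized slug set."""
    seen = set()
    unique = []
    for group in groups:
        key = slug_set_key(group)
        if key not in seen:
            seen.add(key)
            unique.append(group)
    return unique


groups = [
    ["giant-reward", "giant-reward-offered"],
    ["giant-reward-offered", "giant-reward"],    # same pair, reversed order
    ["Giant-Reward", "giant-reward-offered "],   # same pair, case/whitespace noise
    ["political-system", "political-structure"],
]
print(dedupe_groups(groups))
```

A frozenset key makes the comparison independent of slug order, which is the failure mode described above (the LLM emitting the same pair twice or reversed).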

2. Lint link suggestions + broken-link stubs

lint.ts: runLinkSuggestions / resolveOrphansByEmbedding surface related-page links for orphan pages via embeddings, with a bestLexicalSlug fallback.
buildBrokenLinkStub + addRelatedLink apply fixes; LintResult gains brokenTarget / suggestedTarget / suggestedSource, threaded through the store and lint-view UI.
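For readers unfamiliar with a lexical fallback, here is one way it could work. The PR does not describe how bestLexicalSlug scores candidates, so the Jaccard token overlap, the `min_score` cutoff, and the slug `wayside-inn-history` below are all assumptions made for illustration.

```python
def slug_tokens(slug):
    """Split a slug such as 'entity/wayside-inn' into lowercase word tokens."""
    base = slug.rsplit("/", 1)[-1]
    return {t for t in base.lower().split("-") if t}


def best_lexical_slug(orphan, candidates, min_score=0.3):
    """Return the candidate with the highest Jaccard token overlap, or None.

    Hypothetical scoring: |shared tokens| / |all tokens| between two slugs.
    """
    orphan_tokens = slug_tokens(orphan)
    best, best_score = None, 0.0
    for cand in candidates:
        if cand == orphan:
            continue
        cand_tokens = slug_tokens(cand)
        union = orphan_tokens | cand_tokens
        score = len(orphan_tokens & cand_tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = cand, score
    # Below the cutoff, suggest nothing rather than a weak match.
    return best if best_score >= min_score else None


pages = ["wayside-inn", "whispering-woods", "storm-king"]
print(best_lexical_slug("wayside-inn-history", pages))
print(best_lexical_slug("grandfather-tree", pages))
```

The cutoff matters for a lint tool: a wrong link suggestion costs the user more than a missing one.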

3. `turbovecdb-service/` (vendored HTTP adapter)

The "Scan (embeddings)" lane POSTs to a small HTTP service (/clear, /upsert, /candidate_pairs). This PR vendors it under turbovecdb-service/ (service.py — stdlib-only http.server layer; route names match the dedup-embed.ts client exactly) so the feature is reproducible from a checkout.
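A minimal client sketch for the three routes, using only the Python standard library. The route names and port 8077 come from this PR; the JSON request/response encoding and the helper names `build_post` / `service_post` are assumptions, and the payload fields for each route are deliberately left to the caller (see turbovecdb-service/README.md for the real API).

```python
import json
import urllib.request

ROUTES = ("/clear", "/upsert", "/candidate_pairs")  # route names from the PR


def build_post(base_url, route, payload):
    """Build (but do not send) a JSON POST request for one service route."""
    if route not in ROUTES:
        raise ValueError(f"unknown route: {route}")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def service_post(base_url, route, payload, timeout=30):
    """Send the request and decode the response.

    Assumption: the service answers with a JSON body.
    """
    request = build_post(base_url, route, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the service running locally (payload shape is hypothetical):
#   service_post("http://localhost:8077", "/clear", {})
```

Splitting request construction from sending keeps the URL and body logic checkable without a running service.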

4. Research artifacts

scripts/dedup_prototype/ — reproducible probes (turbovecdb contract probe, candidate-gen, e2e pipeline) and FINDINGS.md documenting the dedup taxonomy, two-index architecture, and the turbovecdb gaps (G1–G4) the service works around. Happy to drop these if you'd prefer to keep the repo lean.

Requirements / how to run the embedding lane

The embedding-dedup lane needs the turbovecdb-service running. It has one third-party dependency, turbovecdb, published on PyPI:

cd turbovecdb-service
pip install -r requirements.txt        # pulls turbovecdb + turbovec/numpy/filelock
python service.py --port 8077

The lane also needs an OpenAI-compatible embeddings endpoint and a chat endpoint (for the confirm pass), both configured in settings.

Without the service running, the rest of the lint/dedup features are unaffected; only the embedding candidate-gen lane is inert. See turbovecdb-service/README.md for the full API.

Tests

New/expanded vitest coverage: lint-links.test.ts, dedup-embed.test.ts, lint-view.test.ts. The affected suites pass locally (38 tests across the lint/dedup blast radius).

Notes

package-lock.json reconciled against v0.4.23 via npm install (no new runtime deps from this branch).

Exercise the existing (unchanged) turbovecdb against the dedup candidate-generation contract — synthetic R1–R10 probe plus a real ~1000-page Storm King run via nomic-embed-text. Outcome: existing turbovecdb is sufficient as-is. Every requirement passes; candidate-gen over 986 real pages runs in ~2.0s (vs the 30-min single-prompt LLM timeout). Four minor turbovecdb gaps recorded for a separate plan (G1 no native clear, G2 opaque duplicate-id upsert error, G3 no batch all-pairs, G4 can't filter primary id). The real bottleneck is embedding-input quality (nomic collapses short/stub-page inputs), which is an llm_wiki-side concern, not turbovecdb. Adds scripts/dedup_prototype/{tvdb_contract_probe,skt_candidate_gen}.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs/ is gitignored (internal, not shipped), so keep the findings with the reproducible scripts under scripts/dedup_prototype/FINDINGS.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Storm King: only 15 frontmatter-only stubs (prose=0), mostly distinct (1 real typo dup); excluding them surfaces real content dupes but leaves a residual collapse cluster of thin/templated pages. Stubs need lexical (not embedding) dedup; thin pages need full-body embedding input. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The 0.000 'collapse' pairs are mostly REAL duplicate content (identical bodies: curse/sacrifice/grandfather-tree, giant-reward-offered/giant-reward), not noise — the embedding candidate-gen was working. Genuine noise is small (15 prose=0 stubs + a few placeholders). Records the three-lane model and the turbovecdb two-collection (rich/thin) architecture, validated via db.list_collections(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
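To make the 0.000 figure concrete: assuming the reported numbers are cosine distances (1 minus cosine similarity), identical bodies embed to the same vector and land at distance 0. The sketch below is a brute-force all-pairs scan over toy 2-D vectors, not how turbovecdb computes candidates; the page ids and vectors are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def candidate_pairs(vectors, tau):
    """Return (id_a, id_b, distance) for every pair within distance tau."""
    ids = sorted(vectors)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            d = cosine_distance(vectors[a], vectors[b])
            if d <= tau:
                pairs.append((a, b, d))
    return pairs


# Toy embeddings: page-a and page-b are identical, page-c is related, page-d is not.
vectors = {
    "page-a": [1.0, 0.0],
    "page-b": [1.0, 0.0],
    "page-c": [1.0, 0.5],
    "page-d": [0.0, 1.0],
}
print(len(candidate_pairs(vectors, tau=0.15)))  # identical pair plus two related pairs
print(len(candidate_pairs(vectors, tau=0.10)))  # identical pair only
```

In this toy data the related page sits at a distance of about 0.106 from the identical pair, so it is admitted at τ = 0.15 and dropped at τ = 0.10.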

…embeddings)")

Adds a scalable duplicate-detection path that avoids the one-giant-prompt scan's timeout on large wikis. Rich (substantive-prose) entity/concept pages are embedded once, indexed in the turbovecdb HTTP service, and reduced to candidate pairs by cosine similarity; those are union-find clustered and confirmed with the EXISTING LLM detector (same DETECTOR_SYSTEM_PROMPT, parsing, and notDuplicates whitelist) over bounded batches, so the detector sees ~candidate pages, not the whole wiki.

- src/lib/dedup-embed.ts: loadRichPageRecords / embed / service client / clusterPairs (union-find) / packClusters (bounded batches) / runEmbeddingDuplicateDetection -> DuplicateGroup[] (the shape the Maintenance UI + merge queue already consume). Stubs/placeholders excluded (they collapse under embeddings; see scripts/dedup_prototype/FINDINGS.md).
- maintenance-section.tsx: a parallel "Scan (embeddings, beta)" button next to the existing scan (unchanged), a turbovecdb-service URL field (localStorage), and a progress line. Reuses the store's embeddingConfig.
- Pure helpers unit-tested; full pipeline validated end-to-end on the ~1000-page Storm King wiki (208 rich pages -> 201 candidate pairs -> 23 real duplicate groups in ~70s vs the prior 30-min hang). e2e_pipeline.py reproduces it.

The turbovecdb HTTP service lives standalone at ~/src/turbovecdb-service.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
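The clustering and packing steps named in this commit can be sketched as follows. This is a Python illustration of union-find clustering and greedy batch packing, not the clusterPairs / packClusters code in dedup-embed.ts; the function names, the `max_pages` parameter, and the keep-clusters-whole policy are assumptions.

```python
def cluster_pairs(pairs):
    """Union-find over candidate pairs; returns clusters as sorted lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


def pack_clusters(clusters, max_pages):
    """Greedily pack whole clusters into batches of at most max_pages pages.

    A cluster is never split; one larger than max_pages gets its own batch.
    """
    batches, current, size = [], [], 0
    for cluster in clusters:
        if current and size + len(cluster) > max_pages:
            batches.append(current)
            current, size = [], 0
        current.append(cluster)
        size += len(cluster)
    if current:
        batches.append(current)
    return batches


pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]
clusters = cluster_pairs(pairs)
print(clusters)
print(pack_clusters(clusters, max_pages=4))
```

Keeping a cluster inside one batch means the confirm pass always sees every page of a possible duplicate group together.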

The confirm calls used the v0.4.21 dedup wrapper (temperature only) — no reasoning-off, no max_tokens — so a batch could run unbounded (observed: 8+ min on one batch), recreating the original runaway. Add a local buildConfirmLlmCall that disables thinking and caps output (max_tokens=4096), shrink batches to 60 pages, and run batches with bounded concurrency (4) so total wall-clock ≈ one batch instead of the sum. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
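The bounded-concurrency idea can be sketched with an asyncio semaphore. This is a Python illustration rather than the TypeScript in this PR; `run_bounded`, `fake_confirm`, and the sleep that stands in for the LLM call are hypothetical.

```python
import asyncio


async def run_bounded(batches, confirm, limit=4):
    """Run confirm(batch) for every batch with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(batch):
        async with gate:
            return await confirm(batch)

    # gather returns results in the same order as the input batches.
    return await asyncio.gather(*(guarded(b) for b in batches))


async def demo():
    """Drive run_bounded with a fake confirm call and record peak concurrency."""
    state = {"in_flight": 0, "peak": 0}

    async def fake_confirm(batch):
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # stands in for the capped LLM confirm call
        state["in_flight"] -= 1
        return len(batch)

    batches = [list(range(n)) for n in (3, 1, 4, 1, 5, 9, 2, 6, 5, 3)]
    results = await run_bounded(batches, fake_confirm, limit=4)
    return results, state["peak"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The cap bounds load on the chat endpoint, while the per-call output cap described above bounds how long any single batch can run.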

…hter τ

The embedding scan surfaced a lot of false groups ("chaff"). Two root causes, both addressed:

1. Date-suffixed snapshot pages (X-YYYY-MM-DD) are templated stubs that embed alike, so embeddings mis-clustered unrelated entities (wayside-inn-<date> with whispering-woods-<date>). New LEXICAL lane groups them by stripped base slug instead (exact: a base page + its dated snapshots share a base name; unrelated stubs don't). They're excluded from the embedding lane entirely. (21 clean date-dup groups on the test wiki, zero cross-entity chaff.)
2. Embedding similarity ≠ duplication: topically-related pages cluster at ~0.08–0.15 cosine and the permissive detector grouped them as "medium" ("political-system" vs "political-structure"). Fixes: a STRICT confirm prompt (same-entity-only, optional systemPrompt arg on detectDuplicateGroups), drop low-confidence groups, and tighten the default τ 0.15 → 0.10 (now UI-tunable). On the test wiki this cut the embedding lane to only genuine dups.

Also dedupe basenames within a lexical group (entity/X + concept/X collision, F1). Adds unit tests for the date-suffix helpers; 50 dedup tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
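The lexical date lane described in this commit can be sketched like this. It is a Python illustration, not the date-suffix helpers in dedup-embed.ts; the regex, the function names, and the example dates are assumptions.

```python
import re
from collections import defaultdict

# Matches a trailing -YYYY-MM-DD; it does not validate month or day ranges.
DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def strip_date_suffix(slug):
    """Return the base slug with any trailing date removed."""
    return DATE_SUFFIX.sub("", slug)


def has_date_suffix(slug):
    return DATE_SUFFIX.search(slug) is not None


def group_date_snapshots(slugs):
    """Group a base page with its dated snapshots by stripped base slug.

    Only groups with at least two members and at least one dated slug are kept.
    """
    groups = defaultdict(list)
    for slug in slugs:
        groups[strip_date_suffix(slug)].append(slug)
    return {
        base: sorted(members)
        for base, members in groups.items()
        if len(members) > 1 and any(has_date_suffix(m) for m in members)
    }


slugs = [
    "wayside-inn",
    "wayside-inn-2026-01-05",
    "wayside-inn-2026-02-11",
    "whispering-woods-2026-01-05",
    "storm-king",
]
print(group_date_snapshots(slugs))
```

Because grouping is by exact base name, the whispering-woods snapshot never joins the wayside-inn group, which is the cross-entity chaff the embedding lane produced.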

Adds a link-suggestion lane to the lint subsystem:

- runLinkSuggestions / resolveOrphansByEmbedding surface related-page links for orphans via embeddings, with bestLexicalSlug fallback
- buildBrokenLinkStub + addRelatedLink apply fixes; LintResult gains brokenTarget / suggestedTarget / suggestedSource, threaded through lint-store and the lint-view UI
- dedup-embed: dedup duplicate groups by normalized slug-set key (LLM can emit the same pair twice / in slug-order variants); export servicePost for reuse
- tests: new lint-links.test.ts; expanded lint-view.test.ts

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dedup-embed.ts's "Scan (embeddings)" lane POSTs to a turbovecdb HTTP service (/clear, /upsert, /candidate_pairs) that previously lived only as an uncommitted script outside the repo, making the feature unrunnable from a checkout. Vendor it under turbovecdb-service/.

- service.py: stdlib-only http.server layer over the turbovecdb library; per-project db_path, single "pages" collection, works around turbovecdb gaps G1 (no native clear) and G2 (dup ids in a batch). Route names match the dedup-embed.ts client exactly.
- README: run instructions + explicit Requirements note that the turbovecdb library itself is a separate local (non-PyPI, non-vendored) dependency that must be importable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

turbovecdb is published on PyPI, so the service is installable from a clean checkout. Add requirements.txt (turbovecdb>=0.1.0, which pulls turbovec/numpy/filelock transitively) and update the README to "pip install -r requirements.txt" instead of the bring-your-own-library note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kostadis · 2026-06-11T06:57:29Z

Using these fixes, I was able to clean up a large llm-wiki that was generated from 200+ files.

kostadis and others added 10 commits June 10, 2026 22:41

docs: add turbovecdb dedup findings/gap list next to the probes

d6e4ba1


kostadis wants to merge 10 commits into nashsu:main from kostadis:dedup-embeddings-turbovecdb
