feat: embedding-based dedup candidates + lint link suggestions#381
Open
kostadis wants to merge 10 commits into
Exercise the existing (unchanged) turbovecdb against the dedup
candidate-generation contract — synthetic R1–R10 probe plus a real
~1000-page Storm King run via nomic-embed-text.
Outcome: existing turbovecdb is sufficient as-is. Every requirement
passes; candidate-gen over 986 real pages runs in ~2.0s (vs the 30-min
single-prompt LLM timeout). Four minor turbovecdb gaps recorded for a
separate plan (G1 no native clear, G2 opaque duplicate-id upsert error,
G3 no batch all-pairs, G4 can't filter primary id). The real bottleneck
is embedding-input quality (nomic collapses short/stub-page inputs),
which is an llm_wiki-side concern, not turbovecdb.
Adds scripts/dedup_prototype/{tvdb_contract_probe,skt_candidate_gen}.py.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
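For readers unfamiliar with the candidate-generation step, here is a minimal brute-force sketch of what "candidate pairs by cosine similarity" means. It is not the turbovecdb implementation: the names `cosine_distance` and `candidate_pairs`, the `tau` parameter, and the zero-vector handling are all assumptions made for illustration, and a real index would avoid the all-pairs loop.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat zero vectors as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def candidate_pairs(vectors, tau):
    """vectors: dict of page id -> embedding (list of floats).

    Returns (id_a, id_b, distance) tuples whose cosine distance is at
    most tau, closest pairs first. Brute force, O(n^2) comparisons.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        dist = cosine_distance(vec_a, vec_b)
        if dist <= tau:
            pairs.append((id_a, id_b, dist))
    return sorted(pairs, key=lambda p: p[2])


# Hypothetical toy embeddings: "a" and "b" nearly coincide, "c" is orthogonal.
toy = {"a": [1.0, 0.0], "b": [1.0, 0.01], "c": [0.0, 1.0]}
pairs = candidate_pairs(toy, tau=0.1)
```

Only the near-identical pair survives the threshold, which is the property the probe checks at scale.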
docs/ is gitignored (internal, not shipped), so keep the findings with the reproducible scripts under scripts/dedup_prototype/FINDINGS.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Storm King: only 15 frontmatter-only stubs (prose=0), mostly distinct (1 real typo dup); excluding them surfaces real content dupes but leaves a residual collapse cluster of thin/templated pages. Stubs need lexical (not embedding) dedup; thin pages need full-body embedding input.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 0.000 'collapse' pairs are mostly REAL duplicate content (identical bodies: curse/sacrifice/grandfather-tree, giant-reward-offered/giant-reward), not noise; the embedding candidate-gen was working. Genuine noise is small (15 prose=0 stubs + a few placeholders). Records the three-lane model and the turbovecdb two-collection (rich/thin) architecture, validated via db.list_collections().
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…embeddings)")
Adds a scalable duplicate-detection path that avoids the one-giant-prompt scan's timeout on large wikis. Rich (substantive-prose) entity/concept pages are embedded once, indexed in the turbovecdb HTTP service, and reduced to candidate pairs by cosine similarity; those are union-find clustered and confirmed with the EXISTING LLM detector (same DETECTOR_SYSTEM_PROMPT, parsing, and notDuplicates whitelist) over bounded batches, so the detector sees only the candidate pages, not the whole wiki.
- src/lib/dedup-embed.ts: loadRichPageRecords / embed / service client / clusterPairs (union-find) / packClusters (bounded batches) / runEmbeddingDuplicateDetection -> DuplicateGroup[] (the shape the Maintenance UI + merge queue already consume). Stubs/placeholders excluded (they collapse under embeddings; see scripts/dedup_prototype/FINDINGS.md).
- maintenance-section.tsx: a parallel "Scan (embeddings, beta)" button next to the existing scan (unchanged), a turbovecdb-service URL field (localStorage), and a progress line. Reuses the store's embeddingConfig.
- Pure helpers unit-tested; full pipeline validated end-to-end on the ~1000-page Storm King wiki (208 rich pages -> 201 candidate pairs -> 23 real duplicate groups in ~70s vs the prior 30-min hang). e2e_pipeline.py reproduces it.
The turbovecdb HTTP service lives standalone at ~/src/turbovecdb-service.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
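The clustering and batching steps named above can be sketched roughly as follows. This is a Python illustration of the idea only, not a port of the TypeScript `clusterPairs` / `packClusters` helpers; the function names, the greedy packing strategy, and the `max_pages` parameter are assumptions.

```python
def cluster_pairs(pairs):
    """Union-find over (a, b) candidate pairs; returns sorted clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


def pack_clusters(clusters, max_pages):
    """Greedily pack whole clusters into batches of at most max_pages pages.

    A cluster is never split, so one larger than max_pages ends up alone
    in an oversized batch (an assumption of this sketch).
    """
    batches, current, size = [], [], 0
    for cluster in clusters:
        if current and size + len(cluster) > max_pages:
            batches.append(current)
            current, size = [], 0
        current.append(cluster)
        size += len(cluster)
    if current:
        batches.append(current)
    return batches


# Hypothetical candidate pairs: a-b and b-c chain into one cluster.
clusters = cluster_pairs([("a", "b"), ("b", "c"), ("x", "y")])
batches = pack_clusters(clusters, max_pages=3)
```

Keeping clusters intact matters because the confirm pass can only judge two pages as duplicates if it sees both in the same batch.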
The confirm calls used the v0.4.21 dedup wrapper (temperature only), with no reasoning-off and no max_tokens, so a batch could run unbounded (observed: 8+ min on one batch), recreating the original runaway. Add a local buildConfirmLlmCall that disables thinking and caps output (max_tokens=4096), shrink batches to 60 pages, and run batches with bounded concurrency (4) so total wall-clock ≈ one batch instead of the sum.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
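A rough sketch of the bounded-concurrency pattern, written in Python with a thread pool rather than the TypeScript used in the PR. `run_batches` and `fake_confirm` are hypothetical names, and `fake_confirm` merely stands in for the capped LLM confirm call so that the peak number of in-flight calls can be observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_batches(batches, confirm, max_concurrency=4):
    """Call confirm(batch) for every batch, at most max_concurrency at once.

    pool.map returns results in batch order; they are flattened into one list.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        per_batch = list(pool.map(confirm, batches))
    return [group for groups in per_batch for group in groups]


# Stand-in for the LLM confirm call; it records how many calls overlap.
stats = {"in_flight": 0, "peak": 0}
stats_lock = threading.Lock()


def fake_confirm(batch):
    with stats_lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)  # simulate latency
    with stats_lock:
        stats["in_flight"] -= 1
    return [batch]  # pretend each batch confirms as one group


demo_batches = [[f"page-{i}-a", f"page-{i}-b"] for i in range(8)]
confirmed = run_batches(demo_batches, fake_confirm, max_concurrency=4)
```

With the batch count at or below the concurrency limit, wall-clock time approaches that of the slowest single batch; beyond that it grows with the number of rounds.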
…hter τ
The embedding scan surfaced a lot of false groups ("chaff"). Two root causes,
both addressed:
1. Date-suffixed snapshot pages (X-YYYY-MM-DD) are templated stubs that embed
alike, so embeddings mis-clustered unrelated entities (wayside-inn-<date> with
whispering-woods-<date>). New LEXICAL lane groups them by stripped base slug
instead — exact: a base page + its dated snapshots share a base name; unrelated
stubs don't. They're excluded from the embedding lane entirely. (21 clean
date-dup groups on the test wiki, zero cross-entity chaff.)
2. Embedding similarity ≠ duplication: topically-related pages cluster at
~0.08–0.15 cosine and the permissive detector grouped them as "medium"
("political-system" vs "political-structure"). Fixes: a STRICT confirm prompt
(same-entity-only, optional systemPrompt arg on detectDuplicateGroups), drop
low-confidence groups, and tighten the default τ 0.15 → 0.10 (now UI-tunable).
On the test wiki this cut the embedding lane to only genuine dups.
Also dedupe basenames within a lexical group (entity/X + concept/X collision, F1).
Adds unit tests for the date-suffix helpers; 50 dedup tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
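The lexical lane described in point 1 can be sketched as below. This is a Python illustration rather than the PR's date-suffix helpers; the regex, the function names, and the rule that a group needs at least one dated snapshot are assumptions, and the dates are invented.

```python
import re
from collections import defaultdict

# Matches a trailing -YYYY-MM-DD suffix (digits only; no calendar validation).
DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def strip_date_suffix(slug):
    """Return the base slug with any trailing date suffix removed."""
    return DATE_SUFFIX.sub("", slug)


def is_dated_snapshot(slug):
    return DATE_SUFFIX.search(slug) is not None


def lexical_date_groups(slugs):
    """Group slugs by base name; keep groups with 2+ pages and a snapshot."""
    by_base = defaultdict(set)  # a set also drops repeated slugs
    for slug in slugs:
        by_base[strip_date_suffix(slug)].add(slug)
    return {
        base: sorted(members)
        for base, members in by_base.items()
        if len(members) >= 2 and any(is_dated_snapshot(m) for m in members)
    }


# Hypothetical slugs: the two inn pages group, the woods snapshot stays alone.
groups = lexical_date_groups([
    "wayside-inn",
    "wayside-inn-2024-01-05",
    "whispering-woods-2024-01-05",
    "giant-reward",
])
```

Because grouping is by exact base name, two unrelated snapshots taken on the same date can never land in the same group, which is where the embedding lane went wrong.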
Adds a link-suggestion lane to the lint subsystem:
- runLinkSuggestions / resolveOrphansByEmbedding surface related-page links for orphans via embeddings, with a bestLexicalSlug fallback
- buildBrokenLinkStub + addRelatedLink apply fixes; LintResult gains brokenTarget / suggestedTarget / suggestedSource, threaded through lint-store and the lint-view UI
- dedup-embed: dedupe duplicate groups by normalized slug-set key (the LLM can emit the same pair twice or in slug-order variants); export servicePost for reuse
- tests: new lint-links.test.ts; expanded lint-view.test.ts
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
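The slug-set key used to drop repeated groups might look something like this. It is a Python sketch under stated assumptions: the `slugs` field name, the trim-and-lowercase normalization, and the `|` separator are illustrative and not taken from dedup-embed.ts.

```python
def group_key(slugs):
    """Order-independent key: normalize each slug, dedupe, sort, join."""
    normalized = {slug.strip().lower() for slug in slugs}
    return "|".join(sorted(normalized))


def dedupe_groups(groups):
    """Keep the first group seen for each slug-set key, preserving order."""
    seen = set()
    unique = []
    for group in groups:
        key = group_key(group["slugs"])
        if key in seen:
            continue  # same pages, just reordered or repeated
        seen.add(key)
        unique.append(group)
    return unique


# Hypothetical detector output: the second group repeats the first in a
# different slug order and case.
raw = [
    {"slugs": ["giant-reward", "giant-reward-offered"]},
    {"slugs": ["Giant-Reward-Offered", "giant-reward"]},
    {"slugs": ["curse", "sacrifice"]},
]
unique = dedupe_groups(raw)
```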
dedup-embed.ts's "Scan (embeddings)" lane POSTs to a turbovecdb HTTP service (/clear, /upsert, /candidate_pairs) that previously lived only as an uncommitted script outside the repo, making the feature unrunnable from a checkout. Vendor it under turbovecdb-service/.
- service.py: stdlib-only http.server layer over the turbovecdb library; per-project db_path, single "pages" collection, works around turbovecdb gaps G1 (no native clear) and G2 (dup ids in a batch). Route names match the dedup-embed.ts client exactly.
- README: run instructions + explicit Requirements note that the turbovecdb library itself is a separate local (non-PyPI, non-vendored) dependency that must be importable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
turbovecdb is published on PyPI, so the service is installable from a clean checkout. Add requirements.txt (turbovecdb>=0.1.0, which pulls turbovec/numpy/filelock transitively) and update the README to "pip install -r requirements.txt" instead of the bring-your-own-library note.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Using these fixes, I was able to clean up a large llm-wiki that was generated from 200+ files.
Summary

Builds on v0.4.23. Two features plus the service and research behind them.

1. Embedding-based dedup candidate generation

`dedup-embed.ts`: generates duplicate-candidate groups for rich pages via embeddings plus a bounded, parallelized per-batch LLM confirm pass, surfaced as a parallel "Scan (embeddings)" lane.

2. Lint link suggestions + broken-link stubs

`lint.ts`: `runLinkSuggestions` / `resolveOrphansByEmbedding` surface related-page links for orphan pages via embeddings, with a `bestLexicalSlug` fallback. `buildBrokenLinkStub` + `addRelatedLink` apply fixes; `LintResult` gains `brokenTarget` / `suggestedTarget` / `suggestedSource`, threaded through the store and `lint-view` UI.

3. `turbovecdb-service/` (vendored HTTP adapter)

The "Scan (embeddings)" lane POSTs to a small HTTP service (`/clear`, `/upsert`, `/candidate_pairs`). This PR vendors it under `turbovecdb-service/` (`service.py`, a stdlib-only `http.server` layer; route names match the `dedup-embed.ts` client exactly) so the feature is reproducible from a checkout.

4. Research artifacts

`scripts/dedup_prototype/`: reproducible probes (turbovecdb contract probe, candidate-gen, e2e pipeline) and `FINDINGS.md` documenting the dedup taxonomy, two-index architecture, and the turbovecdb gaps (G1–G4) the service works around. Happy to drop these if you'd prefer to keep the repo lean.

Requirements / how to run the embedding lane

The embedding-dedup lane needs the `turbovecdb-service` running. It has one third-party dependency, `turbovecdb`, published on PyPI; install it via the service's `requirements.txt` (`pip install -r requirements.txt`). It also needs an OpenAI-compatible embeddings endpoint and chat endpoint (for the confirm pass), configured in settings.

Without the service running, the rest of the lint/dedup features are unaffected; only the embedding candidate-gen lane is inert. See `turbovecdb-service/README.md` for the full API.

Tests

New/expanded vitest coverage: `lint-links.test.ts`, `dedup-embed.test.ts`, `lint-view.test.ts`. The affected suites pass locally (38 tests across the lint/dedup blast radius).

Notes

`package-lock.json` reconciled against `v0.4.23` via `npm install` (no new runtime deps from this branch).