fix: add missing wikilink enrichment step to ingest pipeline #422
Open
SunRicardo wants to merge 1 commit into
2d715df to 103c684
The enrichWithWikilinks function was implemented but never called in the ingest pipeline, causing generated wiki pages to lack [[wikilinks]] in their body text. This made pages invisible in the knowledge graph.

Changes:
- Add Step 6 (enrichment) before Step 7 (embeddings) in autoIngestImpl
- Add AbortSignal support with checks after readFile and before writeFile
- Fix findUnlinkedOccurrence to properly detect [[...]] boundaries
- Add target validation against wiki index (extractPageNamesFromIndex)
- Add cache-hit enrichment for older cached pages
- Export shouldSkipWikilinkEnrichment for testability
- Improve logging: activity.updateItem, console.warn for failures
- Restore real language directive tests (check streamChat system prompt)
- Add comprehensive skip helper tests (9 cases)
- Add tests for frontmatter preservation, wikilink boundary, abort scenarios

Tests: all 1533 tests pass
npm run test:llm not run (requires real LLM credentials)
GitHub Actions: action_required (needs manual approval)
103c684 to 4ff8b07
Summary
The `enrichWithWikilinks` function was implemented but never called in the ingest pipeline, so generated wiki pages lacked `[[wikilinks]]` in their body text. This made pages invisible in the knowledge graph.

Problem
- Generated wiki pages had wikilinks in their `related:` frontmatter, but the body text used plain text.
- The graph builder (`wiki-graph.ts`) only uses body `[[wikilinks]]` to create edges.
- This made ~14% of pages invisible in the knowledge graph.
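To make the failure mode concrete, here is a minimal sketch of a graph builder that derives edges only from body wikilinks. It is not the actual `wiki-graph.ts` code: `stripFrontmatter`, `bodyLinkTargets`, and the sample page are hypothetical names and data chosen for illustration, and the frontmatter format is assumed to be a `---` delimited block.

```typescript
// Hypothetical sketch; the real wiki-graph.ts may work differently.

/** Strip a leading frontmatter block delimited by `---` lines (assumed format). */
function stripFrontmatter(markdown: string): string {
  const match = /^---\r?\n[\s\S]*?\r?\n---\r?\n?/.exec(markdown);
  return match ? markdown.slice(match[0].length) : markdown;
}

/** Collect unique wikilink targets from the body only, ignoring `|alias` parts. */
function bodyLinkTargets(markdown: string): string[] {
  const body = stripFrontmatter(markdown);
  const pattern = /\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]/g;
  const targets: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(body)) !== null) {
    const target = m[1].trim();
    if (targets.indexOf(target) === -1) targets.push(target);
  }
  return targets;
}

// A page whose only wikilink lives in the frontmatter.
const page = [
  "---",
  'related: ["[[Vector Search]]"]',
  "---",
  "Vector Search is used by the retrieval layer.",
].join("\n");

// The same page after the body mention has been turned into a wikilink.
const enriched = page.replace("Vector Search is", "[[Vector Search]] is");

console.log(bodyLinkTargets(page));     // [] : no edges, the page is isolated
console.log(bodyLinkTargets(enriched)); // [ 'Vector Search' ]
```

Under these assumptions, a page that mentions related pages only in plain text contributes no edges, which matches the symptom described above.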
Root Cause
The ingest LLM prompt correctly instructs the model to use `[[wikilink]]` syntax in the body. However, mid-size models often ignore this instruction and output plain-text entity/concept names instead.

The `enrichWithWikilinks` function was designed as a post-save enrichment step to fix this: it asks the LLM to return a JSON list of `{term, target}` substitutions, then applies them safely. But this function was never imported or called in the ingest pipeline.

Solution
Added Step 6 (wikilink enrichment) before Step 7 (embeddings) in `autoIngestImpl`.

Key Changes
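As a rough illustration of the step ordering and cancellation behaviour this PR describes, the sketch below runs enrichment for every page before computing embeddings, and checks the signal after `readFile` and before `writeFile`. Everything except the idea itself is an assumption: `PageIO`, `Enricher`, `enrichPage`, `runLateSteps`, and `SignalLike` are hypothetical names, file access is injected so the sketch stays self-contained, and the real `autoIngestImpl` is certainly more involved.

```typescript
// Hypothetical sketch; not the code from src/lib/ingest.ts.

/** Minimal cancellation shape; a real AbortSignal satisfies it structurally. */
type SignalLike = { readonly aborted: boolean };

/** Injected file access so the sketch does not depend on a filesystem. */
interface PageIO {
  readFile(path: string): Promise<string>;
  writeFile(path: string, content: string): Promise<void>;
}

/** Stand-in for the LLM-backed rewrite; assumed to return the enriched text. */
type Enricher = (content: string) => Promise<string>;

/** Returns true if the page was rewritten, false if unchanged or aborted. */
async function enrichPage(
  io: PageIO,
  path: string,
  enrich: Enricher,
  signal?: SignalLike,
): Promise<boolean> {
  const original = await io.readFile(path);
  if (signal?.aborted) return false; // after readFile: skip the LLM call

  const updated = await enrich(original);
  if (signal?.aborted) return false; // before writeFile: leave the file untouched
  if (updated === original) return false;

  await io.writeFile(path, updated);
  return true;
}

/** Step 6 (enrichment) runs before Step 7 (embeddings) so embeddings see final text. */
async function runLateSteps(
  io: PageIO,
  paths: string[],
  enrich: Enricher,
  embed: (path: string, content: string) => Promise<void>,
  signal?: SignalLike,
): Promise<void> {
  for (const path of paths) {
    try {
      await enrichPage(io, path, enrich, signal); // Step 6
    } catch (err) {
      // One failing page should not stop the rest of the pipeline.
      console.warn(`wikilink enrichment failed for ${path}`, err);
    }
  }
  if (signal?.aborted) return;
  for (const path of paths) {
    await embed(path, await io.readFile(path)); // Step 7
  }
}
```

Re-reading each file in Step 7 is one simple way to guarantee that embeddings are computed from the enriched content rather than from a stale in-memory copy.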
- Reordered execution steps: running enrichment before embeddings ensures the embeddings reflect the final content, including wikilinks.
- AbortSignal support: `enrichWithWikilinks` accepts an optional `AbortSignal`, checked after `readFile` and before `writeFile`.
- Wikilink boundary detection: `findUnlinkedOccurrence` pre-scans all `[[...]]` intervals to prevent nested or corrupted wikilinks.
- Target validation: `extractPageNamesFromIndex` validates that LLM-returned targets exist in the wiki index before writing.
- Cache-hit enrichment: the cache-hit branch also runs wikilink enrichment, so older pages generated before this pipeline step existed can be repaired when the same source is re-ingested.
- Helper extraction: `shouldSkipWikilinkEnrichment` is extracted and exported for DRY and testability.
- Improved logging: `activity.updateItem` shows progress; `console.warn` reports individual failures.

Verification
Local verification: all 1533 tests pass.
- `npm run test:llm` was not run because it requires real LLM credentials.
- GitHub Actions reports `action_required` with no jobs, likely because this PR comes from an external contributor/fork and requires maintainer approval to run workflows.

Tests
Language directive prompt tests
- Uses the real `buildLanguageDirective` implementation and checks the actual system prompt passed to `streamChat`.

Core enrichment tests
- Frontmatter preservation is asserted with `toContain`.
- Terms inside `[[...]]` target/alias parts are skipped.

AbortSignal tests
- Abort after `readFile`: `streamChat` is not called and the file is not written.
- Abort after `streamChat` returns: the file is not written.

Skip helper tests
- `shouldSkipWikilinkEnrichment` covers: `wiki/index.md`, `wiki/log.md`, `wiki/overview.md`, `wiki/sources/*`, nested paths, and normal concept/entity pages.

Files Modified
- `src/lib/ingest.ts`: Step 6/7 reorder, cache-hit enrichment, exported helper
- `src/lib/enrich-wikilinks.ts`: signal checks, target validation, boundary detection
- `src/lib/enrich-wikilinks.test.ts`: 17 test cases covering all scenarios
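The boundary detection and target validation listed under Key Changes can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/lib/enrich-wikilinks.ts`: `wikilinkIntervals`, `findUnlinked`, and `applySubstitutions` are hypothetical names, the list of known pages is passed in directly rather than parsed from the wiki index, and the `[[target|term]]` alias form is assumed for cases where the matched term differs from the page name.

```typescript
// Hypothetical sketch; the real findUnlinkedOccurrence and
// extractPageNamesFromIndex may differ in signature and behavior.

type Substitution = { term: string; target: string };

/** Pre-scan the text for every [[...]] span as a [start, end) interval. */
function wikilinkIntervals(text: string): Array<[number, number]> {
  const intervals: Array<[number, number]> = [];
  const pattern = /\[\[[^\[\]]*\]\]/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    intervals.push([m.index, m.index + m[0].length]);
  }
  return intervals;
}

/** Index of the first occurrence of `term` outside every wikilink, or -1. */
function findUnlinked(text: string, term: string): number {
  if (term.length === 0) return -1;
  const intervals = wikilinkIntervals(text);
  let from = 0;
  while (true) {
    const at = text.indexOf(term, from);
    if (at === -1) return -1;
    const end = at + term.length;
    const overlaps = intervals.some(([s, e]) => at < e && end > s);
    if (!overlaps) return at;
    from = at + 1; // occurrence sits inside a wikilink: keep searching
  }
}

/** Link the first free occurrence of each term whose target is a known page. */
function applySubstitutions(
  text: string,
  subs: Substitution[],
  knownPages: string[],
): string {
  let result = text;
  for (const { term, target } of subs) {
    if (knownPages.indexOf(target) === -1) continue; // drop unknown targets
    const at = findUnlinked(result, term);
    if (at === -1) continue;
    const link = term === target ? `[[${target}]]` : `[[${target}|${term}]]`;
    result = result.slice(0, at) + link + result.slice(at + term.length);
  }
  return result;
}

const text = "See [[Vector Search|search]] for details; search is fast and uses BM25.";
const out = applySubstitutions(
  text,
  [
    { term: "search", target: "Vector Search" },
    { term: "BM25", target: "Okapi" }, // not a known page, so it is ignored
  ],
  ["Vector Search"],
);
console.log(out);
// See [[Vector Search|search]] for details; [[Vector Search|search]] is fast and uses BM25.
```

Re-scanning the intervals on every call keeps the sketch simple; it also means links inserted by earlier substitutions are respected by later ones.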