fix: add missing wikilink enrichment step to ingest pipeline#422

Open
SunRicardo wants to merge 1 commit into
nashsu:mainfrom
SunRicardo:fix/add-wikilink-enrichment-step

@SunRicardo SunRicardo commented Jun 18, 2026

Summary

The enrichWithWikilinks function was implemented but never called in the ingest pipeline, so generated wiki pages lacked [[wikilinks]] in their body text. As a result, the affected pages were invisible in the knowledge graph.

Problem

  • Generated wiki pages had wikilinks in their related: frontmatter, but the body text used plain-text names

  • The graph builder (wiki-graph.ts) only uses body [[wikilinks]] to create edges

  • This made ~14% of pages invisible in the knowledge graph

Root Cause

The ingest LLM prompt correctly instructs the model to use [[wikilink]] syntax in the body. However, mid-size models often ignore this instruction and output plain-text entity and concept names instead.

The enrichWithWikilinks function was designed as a post-save enrichment step to fix this: it asks the LLM to return a JSON list of {term, target} substitutions, then applies them safely. But this function was never imported or called in the ingest pipeline.
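The {term, target} substitution flow described above can be sketched roughly as follows. This is an illustration only, not the code in src/lib/enrich-wikilinks.ts: the Substitution type and the parseSubstitutions and applySubstitutions helpers are hypothetical names, and replacing only the first occurrence with [[target|term]] alias syntax is an assumption.

```typescript
// Hypothetical shape of one LLM-suggested substitution.
interface Substitution {
  term: string;   // plain-text phrase found in the page body
  target: string; // wiki page the phrase should link to
}

// Parse the LLM reply defensively: anything that is not a JSON array of
// {term, target} string pairs is dropped rather than trusted.
function parseSubstitutions(raw: string): Substitution[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (item): item is Substitution =>
      typeof item === "object" &&
      item !== null &&
      typeof item.term === "string" &&
      typeof item.target === "string" &&
      item.term.length > 0,
  );
}

// Replace the first plain-text occurrence of each term with a wikilink.
// Assumption: [[target|term]] alias syntax is used when the two differ.
function applySubstitutions(body: string, subs: Substitution[]): string {
  let result = body;
  for (const { term, target } of subs) {
    const index = result.indexOf(term);
    if (index === -1) continue;
    const link = term === target ? `[[${target}]]` : `[[${target}|${term}]]`;
    result = result.slice(0, index) + link + result.slice(index + term.length);
  }
  return result;
}
```

This sketch leaves out protection against matching inside an existing [[...]] span, which a safe implementation would need before writing anything back.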

Solution

Added Step 6 (wikilink enrichment) before Step 7 (embeddings) in autoIngestImpl.
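The step ordering can be pictured with a small sketch. The real autoIngestImpl is not reproduced on this page, so runPostSaveSteps and the PageStep type below are hypothetical, and continuing after a single page fails is an assumption.

```typescript
// Hypothetical step signature; the real autoIngestImpl is not shown here.
type PageStep = (pagePath: string) => Promise<void>;

async function runPostSaveSteps(
  pagePaths: string[],
  enrich: PageStep, // Step 6: wikilink enrichment
  embed: PageStep,  // Step 7: embeddings
): Promise<void> {
  // Step 6 first, so every page body is final before anything is embedded.
  for (const pagePath of pagePaths) {
    try {
      await enrich(pagePath);
    } catch (err) {
      // Assumption: one failed page should not abort the whole ingest.
      console.warn(`wikilink enrichment failed for ${pagePath}:`, err);
    }
  }
  // Step 7 now sees the enriched content.
  for (const pagePath of pagePaths) {
    await embed(pagePath);
  }
}
```

Running the two steps in the opposite order would embed the pre-enrichment body, leaving the embeddings out of sync with what is on disk.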

Key Changes

  1. Reordered execution steps: enrichment runs before embeddings, so embeddings reflect the final content with wikilinks.

  2. AbortSignal support: enrichWithWikilinks accepts an optional AbortSignal and checks it after readFile and before writeFile.

  3. Wikilink boundary detection: findUnlinkedOccurrence pre-scans all [[...]] intervals to prevent nested or corrupted wikilinks.

  4. Target validation: extractPageNamesFromIndex validates that LLM-returned targets exist in the wiki index before writing.

  5. Cache-hit enrichment: the cache-hit branch also runs wikilink enrichment on cached pages, so older pages generated before this pipeline step can be repaired when the same source is re-ingested.

  6. Helper extraction: shouldSkipWikilinkEnrichment is extracted and exported for DRY and testability.

  7. Improved logging: activity.updateItem shows progress; console.warn reports individual failures.
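The boundary detection in item 3 can be sketched along these lines. This is not the findUnlinkedOccurrence from the PR: findWikilinkIntervals and findUnlinkedIndex are hypothetical names, and the regex assumes wikilinks never contain a ] character.

```typescript
// Collect [start, end) offsets of every existing [[...]] span in the text.
function findWikilinkIntervals(text: string): Array<[number, number]> {
  const intervals: Array<[number, number]> = [];
  const pattern = /\[\[[^\]]*\]\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    intervals.push([match.index, match.index + match[0].length]);
  }
  return intervals;
}

// Return the index of the first occurrence of `term` that does not overlap
// any existing wikilink, or -1 if every occurrence is already inside one.
function findUnlinkedIndex(text: string, term: string): number {
  if (term.length === 0) return -1;
  const intervals = findWikilinkIntervals(text);
  let from = 0;
  while (true) {
    const index = text.indexOf(term, from);
    if (index === -1) return -1;
    const end = index + term.length;
    const overlaps = intervals.some(([s, e]) => index < e && end > s);
    if (!overlaps) return index;
    from = index + 1; // keep searching past the protected occurrence
  }
}
```

Scanning the intervals once up front means a term that appears in either the target or the alias part of an existing link is skipped, which is what prevents output such as [[A|[[B]]]].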

Verification

Local verification:

npm run typecheck
npm run test:mocks

Result:

  • ✅ TypeScript type check passed locally
  • ✅ Mock test suite passed locally: 1533 tests
  • ⚠️ npm run test:llm was not run because it requires real LLM credentials
  • ⚠️ GitHub Actions currently shows action_required with no jobs, likely because this PR is from an external contributor/fork and requires maintainer approval to run workflows

Tests

Language directive prompt tests

  • Verifies language directive is built at call time
  • Verifies successive calls pick up changed output language
  • Uses the real buildLanguageDirective implementation and checks the actual system prompt passed to streamChat

Core enrichment tests

  • Frontmatter preservation: exact frontmatter match, not just toContain
  • Existing wikilink boundary protection: terms inside [[...]] target/alias parts are skipped
  • Invalid target filtering: targets not in the wiki index are rejected
  • No-op when there are no valid links: the file is not written when the LLM returns empty or invalid links
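The invalid-target filtering covered by these tests could look roughly like the sketch below. It does not reproduce extractPageNamesFromIndex: collectIndexedPageNames and keepKnownTargets are hypothetical helpers, and the assumption that the index lists pages as [[Page Name]] or [[Page Name|alias]] entries may not match the real index format.

```typescript
// Assumption: the wiki index lists each page as a [[Page Name]] or
// [[Page Name|alias]] entry; the real index format may differ.
function collectIndexedPageNames(indexMarkdown: string): Set<string> {
  const names = new Set<string>();
  const pattern = /\[\[([^\]|]+)(?:\|[^\]]*)?\]\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(indexMarkdown)) !== null) {
    names.add(match[1].trim());
  }
  return names;
}

// Keep only the links whose target is a page the index already knows about.
function keepKnownTargets<T extends { target: string }>(
  links: T[],
  known: Set<string>,
): T[] {
  return links.filter((link) => known.has(link.target.trim()));
}
```

If keepKnownTargets returns an empty array, the caller can return early without writing, which matches the no-op case in the last bullet.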

AbortSignal tests

  • Early abort before readFile: streamChat not called, file not written
  • Abort after streamChat returns: file not written
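The two abort checkpoints these tests exercise can be sketched as follows. The EnrichIo interface, the rewrite hook that stands in for the streamChat round trip, and enrichPage itself are hypothetical and exist only to keep the example self-contained.

```typescript
// Hypothetical I/O hooks so the sketch stays independent of any file API.
interface EnrichIo {
  readFile(path: string): Promise<string>;
  writeFile(path: string, content: string): Promise<void>;
  rewrite(content: string): Promise<string>; // stands in for the LLM round trip
}

function throwIfAborted(signal?: AbortSignal): void {
  if (signal?.aborted) throw new Error("enrichment aborted");
}

async function enrichPage(
  path: string,
  io: EnrichIo,
  signal?: AbortSignal,
): Promise<boolean> {
  const original = await io.readFile(path);
  throwIfAborted(signal); // checkpoint 1: after the read, before the LLM call
  const updated = await io.rewrite(original);
  throwIfAborted(signal); // checkpoint 2: after the LLM call, before the write
  if (updated === original) return false; // no-op: leave the file untouched
  await io.writeFile(path, updated);
  return true;
}
```

With a signal that is already aborted, the first checkpoint throws before the LLM is contacted; an abort that arrives during the LLM call is caught by the second checkpoint, so the file on disk is never partially updated.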

Skip helper tests

  • shouldSkipWikilinkEnrichment covers: wiki/index.md, wiki/log.md, wiki/overview.md, wiki/sources/*, nested paths, normal concept/entity pages
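A skip helper matching the cases listed above might look like this sketch. It is not the exported shouldSkipWikilinkEnrichment: the function name, the assumption that paths are project-relative, and the backslash normalization are all illustrative, and wiki/concepts/ in the usage line is a made-up directory.

```typescript
// Files that are generated or maintained by the pipeline itself.
const SKIPPED_FILES = ["wiki/index.md", "wiki/log.md", "wiki/overview.md"];
// Everything under this directory is skipped, including nested paths.
const SKIPPED_DIR = "wiki/sources/";

// Assumption: paths are project-relative; backslashes are normalized so
// Windows-style paths behave the same way.
function shouldSkipEnrichmentSketch(relativePath: string): boolean {
  const normalized = relativePath.replace(/\\/g, "/");
  if (SKIPPED_FILES.indexOf(normalized) !== -1) return true;
  return normalized.indexOf(SKIPPED_DIR) === 0;
}

// Usage: ordinary pages are enriched, bookkeeping pages are not.
const pagesToEnrich = ["wiki/index.md", "wiki/concepts/graph.md"].filter(
  (p) => !shouldSkipEnrichmentSketch(p),
);
```

Keeping the predicate as a pure function of the path is what makes it straightforward to cover each case with a one-line test.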

Files Modified

  • src/lib/ingest.ts — Step 6/7 reorder, cache-hit enrichment, exported helper
  • src/lib/enrich-wikilinks.ts — Signal checks, target validation, boundary detection
  • src/lib/enrich-wikilinks.test.ts — 17 test cases covering all scenarios

@SunRicardo SunRicardo force-pushed the fix/add-wikilink-enrichment-step branch 2 times, most recently from 2d715df to 103c684 Compare June 18, 2026 06:25
The enrichWithWikilinks function was implemented but never called in
the ingest pipeline, causing generated wiki pages to lack [[wikilinks]]
in their body text. This made pages invisible in the knowledge graph.

Changes:
- Add Step 6 (enrichment) before Step 7 (embeddings) in autoIngestImpl
- Add AbortSignal support with checks after readFile and before writeFile
- Fix findUnlinkedOccurrence to properly detect [[...]] boundaries
- Add target validation against wiki index (extractPageNamesFromIndex)
- Add cache-hit enrichment for older cached pages
- Export shouldSkipWikilinkEnrichment for testability
- Improve logging: activity.updateItem, console.warn for failures
- Restore real language directive tests (check streamChat system prompt)
- Add comprehensive skip helper tests (9 cases)
- Add tests for frontmatter preservation, wikilink boundary, abort scenarios

Tests: all 1533 tests pass
npm run test:llm not run (requires real LLM credentials)
GitHub Actions: action_required (needs manual approval)
@SunRicardo SunRicardo force-pushed the fix/add-wikilink-enrichment-step branch from 103c684 to 4ff8b07 Compare June 18, 2026 06:39