wiki-update: split tri-protocol ingest pipeline, integrate read-pdf fanout substrate #20
Open
nsmiller2501 wants to merge 2 commits into
Restructures /wiki-update around three ingest protocols (E/M/S) selected by upstream substrate availability. Hoists per-protocol prompts and the final wiki synthesis into sibling files; extracts scripts. Depends on /read-pdf providing the cached neutral extract and fanout substrate (see companion PR scunning1975#19).

Skill structure:
- New protocol_e.md: "Existing extract". When a cached _text.md is already available from a prior /read-pdf or /split-pdf run, use it directly.
- New protocol_m.md: "Marker". When /read-pdf's marker conversion is installed, route the paper through prepare_substrate.py for clean fanout reading.
- New protocol_s.md: "Split". Fallback when marker is unavailable; uses /split-pdf's vision-batch path.
- New common.md: shared steps across all three protocols (scaffold, citation overlap check, summary write, bib-update invocation).
- New wiki_synthesis.md: project-lens summarization prompt and rules, loaded once at the end of each ingest.
- SKILL.md now orchestrates protocol selection and delegates to the appropriate protocol_*.md, then hands off to wiki_synthesis.md.

Scripts:
- New scripts/scaffold_wiki.sh: shell scaffolder for references/raw/, references/wiki/, and references/CLAUDE.md on first invocation.
- New scripts/citation_overlap.py: detects when two papers cite the same prior work and flags it in the wiki cross-reference index.
- New scripts/copy_marker_figure.py: pulls figure images out of marker output and into references/wiki/<paper>/figures/.

skills/wiki-update/README.md updated to describe the new tri-protocol ingest pipeline.
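The citation-overlap check listed under Scripts can be pictured with a small sketch. The PR does not show the contents of scripts/citation_overlap.py, so everything below is an assumption for illustration: the function name, the input shape (a mapping from paper to the set of citation keys it cites), and the citation keys themselves are hypothetical. The sketch builds an inverted index from cited work to citing papers and reports each pair of papers that share at least one cited work.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def citation_overlap(citations: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of papers to the cited works they have in common.

    `citations` maps a paper name to the set of citation keys it cites.
    This input shape is an assumption, not the real script's interface.
    """
    # Inverted index: cited work -> papers that cite it.
    cited_by: Dict[str, Set[str]] = defaultdict(set)
    for paper, keys in citations.items():
        for key in keys:
            cited_by[key].add(paper)

    # Every pair of papers citing the same work shares that work.
    overlap: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for key, papers in cited_by.items():
        for first, second in combinations(sorted(papers), 2):
            overlap[(first, second)].add(key)
    return dict(overlap)


if __name__ == "__main__":
    # Hypothetical papers and citation keys, for illustration only.
    example = {
        "smith2020": {"card1994", "angrist1996"},
        "lee2021": {"card1994"},
        "wu2022": {"imbens2015"},
    }
    print(citation_overlap(example))
```

Pairs are keyed in sorted order so that each pair appears once regardless of input order; how the real script records the overlap in the wiki cross-reference index is not specified here.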
Summary
Follow-on to #8 (wiki-update). Restructures /wiki-update around three ingest protocols (E/M/S) selected automatically by substrate availability, hoists per-protocol prompts into sibling files, and extracts deterministic scripts.
What changed
Tri-protocol ingest
The skill now auto-detects which ingest path to use for each new PDF:
- protocol_e.md: "Existing extract." If a cached _text.md is already present (from a prior /read-pdf or /split-pdf run), use it directly. Cheapest path; zero PDF re-reading.
- protocol_m.md: "Marker." When /read-pdf's marker conversion is installed, route the paper through prepare_substrate.py for clean fanout reading. Highest fidelity for tables/equations/figures.
- protocol_s.md: "Split." Fallback when marker is unavailable; uses the legacy /split-pdf vision-batch path.
Sibling-file structure
- common.md: shared steps across all three protocols (first-run scaffold, citation-overlap check, project-lens summary, final /bib-update invocation).
- wiki_synthesis.md: project-lens summarization prompt and rules, loaded once per ingest.
- SKILL.md: now an orchestrator. It detects the substrate, picks the protocol, delegates to the chosen protocol_*.md, and finishes with wiki_synthesis.md.
Scripts
- scripts/scaffold_wiki.sh: shell scaffolder for references/raw/, references/wiki/, and references/CLAUDE.md on first invocation.
- scripts/citation_overlap.py: detects when two papers cite the same prior work and flags it in the wiki cross-reference index.
- scripts/copy_marker_figure.py: pulls figure images out of marker output and into references/wiki/<paper>/figures/.
skills/wiki-update/README.md updated to describe the tri-protocol ingest pipeline.
Why
- /wiki-update carried all three ingest paths in one SKILL.md, paying the full token cost on every invocation regardless of which path actually fired. The protocol split lets the orchestrator load only the relevant path.
- /read-pdf's prepare_substrate.py unlocks marker-fidelity figure/table/equation extraction for the wiki workflow. This is the same upgrade /read-pdf itself gets in PR #19 (read-pdf: become canonical PDF-reading skill, absorb split-pdf as compat wrapper).
Testing
- /wiki-update against a references/raw/ containing three new PDFs.
- _text.md and marker was installed → Protocol M fired, ran prepare_substrate.py, produced a clean wiki summary with embedded figures.
- _text.md from an old /split-pdf run → Protocol E fired, reused the extract.
- /bib-update is invoked as the final step and successfully appends new .bib entries.
- references/raw/, references/wiki/, and references/CLAUDE.md if absent.
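
The protocol selection exercised in these tests could be sketched roughly as below. This is an illustrative sketch only, not the logic in SKILL.md: the function name, the assumption that the cached extract sits next to the PDF as <stem>_text.md, the precedence order (E before M before S), and passing marker availability in as a boolean are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def select_protocol(pdf: Path, marker_installed: bool,
                    extract_dir: Optional[Path] = None) -> str:
    """Return "E", "M", or "S" for one PDF.

    Assumptions (not confirmed by the PR): the cached extract is named
    <stem>_text.md and lives in `extract_dir` (default: next to the PDF),
    and a cached extract takes precedence over marker.
    """
    search_dir = extract_dir if extract_dir is not None else pdf.parent
    cached_extract = search_dir / f"{pdf.stem}_text.md"

    if cached_extract.is_file():
        return "E"  # Existing extract: reuse it, no PDF re-reading
    if marker_installed:
        return "M"  # Marker: route through the marker substrate
    return "S"      # Split: fallback vision-batch path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        paper = Path(tmp) / "paper.pdf"
        paper.touch()
        print(select_protocol(paper, marker_installed=True))
        (Path(tmp) / "paper_text.md").write_text("cached extract")
        print(select_protocol(paper, marker_installed=True))
```

Taking marker availability as a parameter keeps the sketch free of any guess about how the real orchestrator detects the marker installation.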