wiki-update: split tri-protocol ingest pipeline, integrate read-pdf fanout substrate #20

Open
nsmiller2501 wants to merge 2 commits into scunning1975:main from nsmiller2501:followup/wiki-update-fanout

@nsmiller2501

Summary

Follow-on to #8 (wiki-update). Restructures /wiki-update around three ingest protocols (E/M/S) selected automatically by substrate availability, hoists per-protocol prompts into sibling files, and extracts deterministic scripts.

⚠️ Depends on /read-pdf from PR #19 for the marker-fanout ingest path. Protocol M (the marker-conversion path) uses /read-pdf/scripts/prepare_substrate.py to generate the per-chunk substrate. PR #19 is itself a follow-on to #6, so Protocol M works once both #6 and #19 land. If only #6 lands, Protocol M falls back to Protocol S (vision-batch) automatically and the skill still functions. I can't see a case for accepting /wiki-update (#8) while rejecting /read-pdf (#6), since #8 already names read-pdf as a dependency.

What changed

Tri-protocol ingest

The skill now auto-detects which ingest path to use for each new PDF:

  • protocol_e.md — "Existing extract." If a cached _text.md is already present (from a prior /read-pdf or /split-pdf run), use it directly. Cheapest path; zero PDF re-reading.
  • protocol_m.md — "Marker." When /read-pdf's marker conversion is installed, route the paper through prepare_substrate.py for clean fanout reading. Highest fidelity for tables/equations/figures.
  • protocol_s.md — "Split." Fallback when marker is unavailable; uses the legacy /split-pdf vision-batch path.
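The selection order described above (cached extract first, then marker, then the split fallback) can be sketched roughly as follows. This is only an illustration, not the logic in SKILL.md: the `<stem>_text.md` location next to the PDF, the `Protocol` enum, and the `marker_installed` flag are assumptions made for the example.

```python
from enum import Enum
from pathlib import Path


class Protocol(Enum):
    """The three ingest protocols, mapped to their prompt files."""
    E = "protocol_e.md"  # existing extract
    M = "protocol_m.md"  # marker fanout
    S = "protocol_s.md"  # split / vision-batch fallback


def cached_extract_path(pdf: Path) -> Path:
    # Assumption: the cached extract lives beside the PDF as <stem>_text.md.
    return pdf.with_name(f"{pdf.stem}_text.md")


def select_protocol(pdf: Path, marker_installed: bool) -> Protocol:
    """Pick the cheapest protocol whose substrate is available."""
    # Protocol E: a cached extract exists, so the PDF is not re-read.
    if cached_extract_path(pdf).is_file():
        return Protocol.E
    # Protocol M: marker conversion is available (detection is left to the
    # caller here; how the skill actually detects it is not shown in the PR).
    if marker_installed:
        return Protocol.M
    # Protocol S: fallback when neither substrate is available.
    return Protocol.S
```

A runtime fallback from M to S (for example, when marker fails on a scanned PDF) would sit outside this function, after the Protocol M attempt.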

Sibling-file structure

  • common.md: shared steps across all three protocols (first-run scaffold, citation-overlap check, project-lens summary, final /bib-update invocation).
  • wiki_synthesis.md: project-lens summarization prompt and rules, loaded once per ingest.
  • SKILL.md: now an orchestrator — detects substrate, picks the protocol, delegates to the chosen protocol_*.md, finishes with wiki_synthesis.md.

Scripts

  • New scripts/scaffold_wiki.sh: shell scaffolder for references/raw/, references/wiki/, and references/CLAUDE.md on first invocation.
  • New scripts/citation_overlap.py: detects when two papers cite the same prior work and flags it in the wiki cross-reference index.
  • New scripts/copy_marker_figure.py: pulls figure images out of marker output and into references/wiki/<paper>/figures/.
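For readers unfamiliar with the citation-overlap idea, here is a minimal sketch of one way to detect shared citations: invert the paper-to-citations mapping and keep any cited work that appears under two or more papers. It is not the code in scripts/citation_overlap.py; the function names, the input shape, and the table layout are all hypothetical.

```python
from collections import defaultdict


def shared_citations(citations_by_paper: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each cited work to the papers citing it, keeping only works cited by 2+ papers."""
    citing: dict[str, set[str]] = defaultdict(set)
    for paper, cited_keys in citations_by_paper.items():
        for key in cited_keys:
            citing[key].add(paper)
    # Sort both the cited keys and the citing papers so the output is deterministic.
    return {
        key: sorted(papers)
        for key, papers in sorted(citing.items())
        if len(papers) >= 2
    }


def render_cross_reference(shared: dict[str, list[str]]) -> str:
    """Render the overlap as a Markdown table (hypothetical layout)."""
    lines = ["| Cited work | Cited by |", "| --- | --- |"]
    for key, papers in shared.items():
        lines.append(f"| {key} | {', '.join(papers)} |")
    return "\n".join(lines)
```

Sorting keeps the generated table stable across runs, which avoids noisy diffs in a wiki index that is regenerated on every ingest.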

skills/wiki-update/README.md updated to describe the tri-protocol ingest pipeline.

Why

Testing

  • Ran /wiki-update against a references/raw/ containing three new PDFs.
    • First PDF had no _text.md and marker was installed → Protocol M fired, ran prepare_substrate.py, produced a clean wiki summary with embedded figures.
    • Second PDF had a leftover _text.md from an old /split-pdf run → Protocol E fired and reused the extract.
    • Third PDF was a scanned image-based PDF that marker couldn't handle → automatic fallback to Protocol S (vision-batch).
  • Verified the citation-overlap script identifies shared citations across the three papers and emits the cross-reference table.
  • Confirmed /bib-update is invoked as the final step and successfully appends new .bib entries.
  • Verified first-run scaffold creates references/raw/, references/wiki/, and references/CLAUDE.md if absent.
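The first-run scaffold check in the last bullet could be reproduced with something like the sketch below. The real scaffolder is the shell script scripts/scaffold_wiki.sh; this is a Python rendition for illustration only, and the function name, return value, and placeholder CLAUDE.md content are assumptions.

```python
from pathlib import Path


def scaffold_wiki(project_root: Path) -> list[Path]:
    """Create references/raw/, references/wiki/, and references/CLAUDE.md if absent.

    Returns the paths that were newly created, so a second run returns an empty list.
    """
    created: list[Path] = []
    references = project_root / "references"
    for directory in (references / "raw", references / "wiki"):
        if not directory.is_dir():
            directory.mkdir(parents=True)
            created.append(directory)
    claude_md = references / "CLAUDE.md"
    if not claude_md.exists():
        # Placeholder content; what the real script writes is not shown in the PR.
        claude_md.write_text("# References wiki\n", encoding="utf-8")
        created.append(claude_md)
    return created
```

Because existing paths are skipped, calling it on an already scaffolded project leaves everything untouched, which matches the "if absent" behaviour described above.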

Restructures /wiki-update around three ingest protocols (E/M/S) selected
by upstream substrate availability. Hoists per-protocol prompts and the
final wiki synthesis into sibling files; extracts scripts.

Depends on /read-pdf providing the cached neutral extract and fanout
substrate (see companion PR scunning1975#19).

Skill structure:
- New protocol_e.md: "Existing extract" — when a cached _text.md is
  already available from a prior /read-pdf or /split-pdf run, use it
  directly.
- New protocol_m.md: "Marker" — when /read-pdf's marker conversion is
  installed, route the paper through prepare_substrate.py for clean
  fanout reading.
- New protocol_s.md: "Split" — fallback when marker is unavailable;
  uses /split-pdf's vision-batch path.
- New common.md: shared steps across all three protocols (scaffold,
  citation overlap check, summary write, bib-update invocation).
- New wiki_synthesis.md: project-lens summarization prompt and rules,
  loaded once at the end of each ingest.
- SKILL.md now orchestrates protocol selection and delegates to the
  appropriate protocol_*.md, then hands off to wiki_synthesis.md.

Scripts:
- New scripts/scaffold_wiki.sh: shell scaffolder for references/raw/,
  references/wiki/, and references/CLAUDE.md on first invocation.
- New scripts/citation_overlap.py: detects when two papers cite the
  same prior work and flags it in the wiki cross-reference index.
- New scripts/copy_marker_figure.py: pulls figure images out of
  marker output and into references/wiki/<paper>/figures/.

skills/wiki-update/README.md updated to describe the new tri-protocol
ingest pipeline.
nsmiller2501 marked this pull request as ready for review May 23, 2026 22:51