read-pdf: become canonical PDF-reading skill, absorb split-pdf as compat wrapper#19

Open
nsmiller2501 wants to merge 2 commits into scunning1975:main from nsmiller2501:followup/read-pdf-canonical

Summary

Follow-on to #6 (read-pdf) that also transforms /split-pdf into a thin compatibility wrapper around /read-pdf --split. This is the natural endpoint of the marker-pipeline architecture introduced in #6 — the layout-aware marker conversion becomes the default path, and the split-and-vision-read workflow becomes a fallback mode reachable via --split.

⚠️ Touches /split-pdf as well as /read-pdf. This PR also makes small changes to /split-pdf (which is the subject of #5 and PR #15). The changes to split-pdf are part of the fold: SKILL.md becomes a one-line "use /read-pdf --split" pointer, scripts/split.py becomes a runpy shim that executes read-pdf's splitter, and an agent_isolation.md is added that defers to read-pdf's isolation patterns. Conceptually this PR can't be merged independently of #5 and #6 — the fold only makes sense if both predecessor skills exist. (Note: PR #15 modernizes split-pdf's standalone splitter to pypdf. If both this PR and #15 land, the pypdf swap in #15 is moot because split-pdf no longer has its own splitter — the runpy shim delegates to read-pdf's. That's fine and the two PRs do not conflict.)

What changed

/read-pdf evolution

  • New scripts/prepare_substrate.py (~400 lines): marker-output cleanup and per-chunk substrate preparation for the fanout reader. Sanitizes headings on non-academic PDFs and wires marker's paginate_output into page anchors.
  • New scripts/cache_text.py: persists the cleaned neutral extract cross-project so re-reading the same PDF is free.
  • New scripts/split.py: pypdf-based 4-page splitter used by --split mode and downstream fallbacks.
  • New fanout_synthesis.md, fanout_worker.md, extraction_schema.md: separates substrate preparation, per-chunk worker prompt, and final synthesis so each step has a focused context.
  • New isolation files (isolation_common.md, isolation_read.md, isolation_split.md, agent_isolation.md): per-mode subagent isolation patterns.
  • install.py: idempotent marker venv installer with major-update advisory and skip-on-reuse.
  • convert.py: SHA-256-keyed extract cache so repeat conversions of the same PDF are free.
  • SKILL.md: synthesis order made explicit; --split documented as the legacy vision-batch fallback.
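
The 4-page splitting step described above could look roughly like the following. This is a minimal sketch, not the PR's actual scripts/split.py: the function names `chunk_ranges` and `split_pdf` and the output file naming are assumptions made for illustration. Only the chunk size of 4 and the use of pypdf come from the description.

```python
from pathlib import Path


def chunk_ranges(num_pages: int, chunk_size: int = 4) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page ranges that cover num_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_pdf(pdf_path: Path, out_dir: Path, chunk_size: int = 4) -> list[Path]:
    """Write one PDF per chunk and return the written paths (needs pypdf)."""
    # Imported lazily so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(pdf_path))
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: 1-based, inclusive page numbers.
        target = out_dir / f"{pdf_path.stem}_pp{start + 1}-{stop}.pdf"
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Under these assumptions, a 10-page PDF yields three chunks covering pages 1-4, 5-8, and 9-10; the last chunk is simply shorter rather than padded.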

/split-pdf becomes a compat wrapper

  • SKILL.md is now short: "use /read-pdf --split <args>". All actual reading logic moved to read-pdf.
  • New agent_isolation.md points at /read-pdf/isolation_split.md.
  • scripts/split.py is a 16-line runpy shim that executes the canonical splitter at /read-pdf/scripts/split.py. No duplicated PDF-splitting code.
  • methodology.md and the example directory under skills/split-pdf/ are unchanged.
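
For readers unfamiliar with the pattern, a runpy shim of this kind might look like the sketch below. It is an illustration rather than the 16-line file in this PR: the helper name `run_canonical` is hypothetical, and the relative location of the canonical splitter is assumed from the skills/read-pdf and skills/split-pdf directory names mentioned in this PR.

```python
import runpy
import sys
from pathlib import Path


def run_canonical(target: Path, argv: list[str]) -> dict:
    """Execute target as if it had been invoked directly as a script."""
    if not target.is_file():
        raise FileNotFoundError(f"canonical splitter not found: {target}")
    saved_argv = sys.argv
    # The delegated script sees its own path plus the caller's arguments.
    sys.argv = [str(target), *argv]
    try:
        # run_name="__main__" makes the target's main guard fire.
        return runpy.run_path(str(target), run_name="__main__")
    finally:
        sys.argv = saved_argv


# Assumed layout (hypothetical): skills/split-pdf/scripts/split.py delegating
# to skills/read-pdf/scripts/split.py. A shim body could then be:
#
#     skills_dir = Path(__file__).resolve().parents[2]
#     run_canonical(skills_dir / "read-pdf" / "scripts" / "split.py", sys.argv[1:])
```

Because the shim executes the canonical file rather than importing or copying it, any fix to read-pdf's splitter is picked up by /split-pdf without a second change.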

Why

  • /read-pdf (introduced in #6, "Add /read-pdf skill with cached marker-based PDF-to-markdown conversion") is the right home for marker-based layout-aware extraction. /split-pdf's vision-batch path remains useful as a fallback (no marker setup, marker conversion failures, triage), but it shouldn't be the canonical entry point.
  • Folding cuts duplicated splitting and isolation logic: one splitter, one set of isolation patterns, one synthesis pipeline.
  • Existing /split-pdf muscle memory still works — /split-pdf paper.pdf invokes the wrapper which delegates to /read-pdf --split paper.pdf.

Testing

  • Ran /read-pdf (default marker path) end-to-end on a 40-page paper; confirmed substrate prep, fanout workers, and synthesis produce the same _text.md shape as the prior /split-pdf output.
  • Ran /read-pdf --split on the same paper; confirmed the legacy vision-batch path still works.
  • Ran /split-pdf paper.pdf directly and verified that the wrapper delegates successfully and produces output identical to that of /read-pdf --split.
  • Verified the extract cache: a second invocation of /read-pdf on the same SHA-256 returns cached _text.md without re-running marker.
  • Verified marker install path: install.py skips a reusable existing venv and emits the major-update advisory when applicable.
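
The cache check in the list above relies on keying the extract by the PDF's SHA-256 digest. A minimal sketch of that idea follows; it is not the code in convert.py or cache_text.py. The names `sha256_of` and `cached_extract`, the cache file naming, and the `convert` callable standing in for the marker conversion are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file in blocks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_extract(
    pdf: Path, cache_dir: Path, convert: Callable[[Path], str]
) -> tuple[str, bool]:
    """Return (text, was_cached); convert runs only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical cache file name: <sha256>_text.md
    entry = cache_dir / f"{sha256_of(pdf)}_text.md"
    if entry.exists():
        return entry.read_text(encoding="utf-8"), True
    text = convert(pdf)
    entry.write_text(text, encoding="utf-8")
    return text, False
```

Keying on content rather than on the file path means a renamed or moved copy of the same PDF still hits the cache, while any change to the bytes forces a fresh conversion.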

Restructures /read-pdf as the canonical academic-PDF reading skill and
demotes /split-pdf to a thin compatibility wrapper that delegates to
/read-pdf --split.

Read-pdf changes:
- New scripts/prepare_substrate.py: marker-output cleanup and
  per-chunk substrate preparation for the fanout reader.
- New scripts/cache_text.py: persists the cleaned neutral extract
  cross-project so re-reading the same PDF is free.
- New scripts/split.py: pypdf-based 4-page splitter (used by
  --split mode and downstream fallbacks).
- New fanout_synthesis.md, fanout_worker.md, extraction_schema.md:
  separate substrate-prep from synthesis from the worker prompt.
- New isolation_common.md, isolation_read.md, isolation_split.md,
  agent_isolation.md: subagent isolation patterns split by mode.
- install.py: idempotent marker venv installer with major-update
  advisory and skip-on-reuse.
- convert.py: SHA-256-keyed extract cache for marker output.
- SKILL.md: synthesis order made explicit; --split documented as
  the legacy vision-batch fallback.

Split-pdf changes (delegating to /read-pdf):
- SKILL.md becomes a short compat wrapper: "use /read-pdf --split".
- New agent_isolation.md points at read-pdf/isolation_split.md.
- scripts/split.py becomes a runpy shim that executes read-pdf's
  scripts/split.py.
- Existing methodology.md and README content unchanged.

skills/read-pdf/README.md and skills/split-pdf/README.md updated
to describe the new pipeline.