read-pdf: become canonical PDF-reading skill, absorb split-pdf as compat wrapper by nsmiller2501 · Pull Request #19 · scunning1975/MixtapeTools

nsmiller2501 · 2026-05-23T22:37:59Z

Summary

Follow-on to #6 (read-pdf) that also transforms /split-pdf into a thin compatibility wrapper around /read-pdf --split. This is the natural endpoint of the marker-pipeline architecture introduced in #6 — the layout-aware marker conversion becomes the default path, and the split-and-vision-read workflow becomes a fallback mode reachable via --split.

⚠️ Touches /split-pdf as well as /read-pdf. This PR also makes small changes to /split-pdf (which is the subject of #5 and PR #15). The changes to split-pdf are part of the fold: SKILL.md becomes a one-line "use /read-pdf --split" pointer, scripts/split.py becomes a runpy shim that executes read-pdf's splitter, and an agent_isolation.md is added that defers to read-pdf's isolation patterns. Conceptually this PR can't be merged independently of #5 and #6 — the fold only makes sense if both predecessor skills exist. (Note: PR #15 modernizes split-pdf's standalone splitter to pypdf. If both this PR and #15 land, the pypdf swap in #15 is moot because split-pdf no longer has its own splitter — the runpy shim delegates to read-pdf's. That's fine and the two PRs do not conflict.)

What changed

`/read-pdf` evolution

New scripts/prepare_substrate.py (~400 lines): marker-output cleanup and per-chunk substrate preparation for the fanout reader. Sanitizes headings on non-academic PDFs and wires marker's paginate_output into page anchors.
New scripts/cache_text.py: persists the cleaned neutral extract cross-project so re-reading the same PDF is free.
New scripts/split.py: pypdf-based 4-page splitter used by --split mode and downstream fallbacks.
New fanout_synthesis.md, fanout_worker.md, extraction_schema.md: separates substrate preparation, per-chunk worker prompt, and final synthesis so each step has a focused context.
New isolation files (isolation_common.md, isolation_read.md, isolation_split.md, agent_isolation.md): per-mode subagent isolation patterns.
install.py: idempotent marker venv installer with major-update advisory and skip-on-reuse.
convert.py: SHA-256-keyed extract cache so repeat conversions of the same PDF are free.
SKILL.md: synthesis order made explicit; --split documented as the legacy vision-batch fallback.
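As a rough illustration of the 4-page splitter listed above, the sketch below groups pages into fixed-size chunks and writes each chunk with pypdf. It is not the code in scripts/split.py: the function names, the output file naming, and the 1-based page labels are assumptions made for the example, and only `PdfReader`, `PdfWriter`, `add_page`, and `write` are actual pypdf calls.

```python
from pathlib import Path


def chunk_ranges(n_pages: int, size: int = 4) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page ranges covering n_pages in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [(start, min(start + size, n_pages)) for start in range(0, n_pages, size)]


def split_pdf(src: Path, out_dir: Path, size: int = 4) -> list[Path]:
    """Write one PDF per chunk and return the chunk paths (hypothetical naming)."""
    # Imported lazily so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    for start, stop in chunk_ranges(len(reader.pages), size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File name uses 1-based, inclusive page numbers (an assumption).
        target = out_dir / f"{src.stem}_pp{start + 1}-{stop}.pdf"
        with target.open("wb") as handle:
            writer.write(handle)
        chunks.append(target)
    return chunks
```

Under these assumptions a 40-page paper yields ten chunks, and a document whose page count is not a multiple of four ends with a shorter final chunk.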

`/split-pdf` becomes a compat wrapper

SKILL.md is now short: "use /read-pdf --split <args>". All actual reading logic moved to read-pdf.
New agent_isolation.md points at /read-pdf/isolation_split.md.
scripts/split.py is a 16-line runpy shim that executes the canonical splitter at /read-pdf/scripts/split.py. No duplicated PDF-splitting code.
methodology.md and the example directory under skills/split-pdf/ are unchanged.
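The runpy shim described above could look roughly like the following. This is a sketch rather than the 16-line file in the PR: it assumes the two skills sit side by side under a common skills/ directory, and the helper names `canonical_splitter` and `run_canonical` are hypothetical.

```python
import runpy
from pathlib import Path


def canonical_splitter(shim_file: Path) -> Path:
    """Locate read-pdf's splitter relative to the shim (assumed sibling layout)."""
    # <skills>/split-pdf/scripts/split.py -> <skills>/read-pdf/scripts/split.py
    skills_dir = shim_file.resolve().parents[2]
    return skills_dir / "read-pdf" / "scripts" / "split.py"


def run_canonical(shim_file: Path) -> dict:
    """Execute the canonical splitter as if it had been invoked directly."""
    target = canonical_splitter(shim_file)
    if not target.is_file():
        raise FileNotFoundError(f"canonical splitter not found: {target}")
    # run_name="__main__" makes the target's own entry-point guard fire.
    # Arguments after sys.argv[0] are left untouched, so the target sees
    # whatever the caller passed to the shim.
    return runpy.run_path(str(target), run_name="__main__")
```

In an actual shim the module would end with a call such as `run_canonical(Path(__file__))`. Delegating through `runpy.run_path` keeps a single copy of the splitting logic while the old script path continues to work.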

Why

/read-pdf (introduced in #6, "Add /read-pdf skill with cached marker-based PDF-to-markdown conversion") is the right home for marker-based layout-aware extraction. /split-pdf's vision-batch path remains useful as a fallback (no marker setup, marker conversion failures, triage), but it shouldn't be the canonical entry point.
Folding cuts duplicated splitting and isolation logic: one splitter, one set of isolation patterns, one synthesis pipeline.
Existing /split-pdf muscle memory still works — /split-pdf paper.pdf invokes the wrapper which delegates to /read-pdf --split paper.pdf.

Testing

Ran /read-pdf (default marker path) end-to-end on a 40-page paper; confirmed substrate prep, fanout workers, and synthesis produce the same _text.md shape as the prior /split-pdf output.
Ran /read-pdf --split on the same paper; confirmed the legacy vision-batch path still works.
Ran /split-pdf paper.pdf directly and verified the wrapper successfully delegates and produces identical output to /read-pdf --split.
Verified the extract cache: a second invocation of /read-pdf on the same SHA-256 returns cached _text.md without re-running marker.
Verified marker install path: install.py skips a reusable existing venv and emits the major-update advisory when applicable.
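The cache behaviour checked above (a second read of the same PDF returns the stored extract without re-running marker) can be sketched as a content-addressed lookup. This is an illustration only, not the code in convert.py or cache_text.py: the cache directory layout, the `<digest>_text.md` file name, and the `convert` callable standing in for the marker conversion are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file contents in blocks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_extract(pdf: Path, cache_dir: Path, convert: Callable[[Path], str]) -> str:
    """Return the extract for pdf, calling convert only on a cache miss."""
    # Hypothetical layout: one markdown file per content digest.
    cached = cache_dir / f"{sha256_of(pdf)}_text.md"
    if cached.is_file():
        return cached.read_text(encoding="utf-8")
    text = convert(pdf)  # stand-in for the expensive marker conversion
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text
```

Keying on the content digest rather than the file path means a renamed or relocated copy of the same PDF still hits the cache, which is one way to get the cross-project reuse the PR describes.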

Restructures /read-pdf as the canonical academic-PDF reading skill and demotes /split-pdf to a thin compatibility wrapper that delegates to /read-pdf --split.

Read-pdf changes:
- New scripts/prepare_substrate.py: marker-output cleanup and per-chunk substrate preparation for the fanout reader.
- New scripts/cache_text.py: persists the cleaned neutral extract cross-project so re-reading the same PDF is free.
- New scripts/split.py: pypdf-based 4-page splitter (used by --split mode and downstream fallbacks).
- New fanout_synthesis.md, fanout_worker.md, extraction_schema.md: separate substrate-prep from synthesis from the worker prompt.
- New isolation_common.md, isolation_read.md, isolation_split.md, agent_isolation.md: subagent isolation patterns split by mode.
- install.py: idempotent marker venv installer with major-update advisory and skip-on-reuse.
- convert.py: SHA-256-keyed extract cache for marker output.
- SKILL.md: synthesis order made explicit; --split documented as the legacy vision-batch fallback.

Split-pdf changes (delegating to /read-pdf):
- SKILL.md becomes a short compat wrapper: "use /read-pdf --split".
- New agent_isolation.md points at read-pdf/isolation_split.md.
- scripts/split.py becomes a runpy shim that executes read-pdf's scripts/split.py.
- Existing methodology.md and README content unchanged.

skills/read-pdf/README.md and skills/split-pdf/README.md updated to describe the new pipeline.

nsmiller2501 added 2 commits May 8, 2026 11:24

Add read-pdf marker conversion skill

c8fc440

nsmiller2501 mentioned this pull request May 23, 2026

wiki-update: split tri-protocol ingest pipeline, integrate read-pdf fanout substrate #20

Open

nsmiller2501 marked this pull request as ready for review May 23, 2026 22:48

nsmiller2501 wants to merge 2 commits into scunning1975:main from nsmiller2501:followup/read-pdf-canonical
