read-pdf: become canonical PDF-reading skill, absorb split-pdf as compat wrapper #19
Open
nsmiller2501 wants to merge 2 commits into
Restructures /read-pdf as the canonical academic-PDF reading skill and demotes /split-pdf to a thin compatibility wrapper that delegates to /read-pdf --split.

Read-pdf changes:
- New scripts/prepare_substrate.py: marker-output cleanup and per-chunk substrate preparation for the fanout reader.
- New scripts/cache_text.py: persists the cleaned neutral extract cross-project so re-reading the same PDF is free.
- New scripts/split.py: pypdf-based 4-page splitter (used by --split mode and downstream fallbacks).
- New fanout_synthesis.md, fanout_worker.md, extraction_schema.md: separate substrate-prep from synthesis from the worker prompt.
- New isolation_common.md, isolation_read.md, isolation_split.md, agent_isolation.md: subagent isolation patterns split by mode.
- install.py: idempotent marker venv installer with major-update advisory and skip-on-reuse.
- convert.py: SHA-256-keyed extract cache for marker output.
- SKILL.md: synthesis order made explicit; --split documented as the legacy vision-batch fallback.

Split-pdf changes (delegating to /read-pdf):
- SKILL.md becomes a short compat wrapper: "use /read-pdf --split".
- New agent_isolation.md points at read-pdf/isolation_split.md.
- scripts/split.py becomes a runpy shim that executes read-pdf's scripts/split.py.
- Existing methodology.md and README content unchanged.

skills/read-pdf/README.md and skills/split-pdf/README.md updated to describe the new pipeline.
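For reviewers unfamiliar with the pattern, the runpy shim described for split-pdf's scripts/split.py can be sketched roughly as follows. This is an illustration, not the 16-line file from this PR: the helper names canonical_splitter and delegate are hypothetical, and the directory layout (skills/split-pdf/scripts/ sitting next to skills/read-pdf/scripts/) is an assumption inferred from the paths mentioned in the description.

```python
"""Sketch of a compat shim that runs another script as if invoked directly."""
import runpy
import sys
from pathlib import Path


def canonical_splitter(shim_file: Path) -> Path:
    """Locate read-pdf's splitter relative to the shim (assumed layout).

    Assumes skills/split-pdf/scripts/split.py -> skills/read-pdf/scripts/split.py.
    """
    skills_dir = shim_file.resolve().parents[2]
    return skills_dir / "read-pdf" / "scripts" / "split.py"


def delegate(target: Path, argv: list[str]) -> dict:
    """Execute `target` as __main__, passing `argv` as its CLI arguments."""
    if not target.is_file():
        raise FileNotFoundError(f"canonical script not found: {target}")
    saved_argv = sys.argv
    # The target sees its own path as argv[0], as in a direct invocation.
    sys.argv = [str(target), *argv]
    try:
        # run_path returns the globals of the executed script.
        return runpy.run_path(str(target), run_name="__main__")
    finally:
        # Restore argv even if the target raises or calls sys.exit().
        sys.argv = saved_argv
```

In a real shim, the module body would presumably end with a call such as delegate(canonical_splitter(Path(__file__)), sys.argv[1:]) so that existing invocations of the old script path keep working without any duplicated splitting code.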
Summary
Follow-on to #6 (read-pdf) that also transforms /split-pdf into a thin compatibility wrapper around /read-pdf --split. This is the natural endpoint of the marker-pipeline architecture introduced in #6: the layout-aware marker conversion becomes the default path, and the split-and-vision-read workflow becomes a fallback mode reachable via --split.

What changed
/read-pdf evolution
- scripts/prepare_substrate.py (~400 lines): marker-output cleanup and per-chunk substrate preparation for the fanout reader. Sanitizes headings on non-academic PDFs and wires marker's paginate_output into page anchors.
- scripts/cache_text.py: persists the cleaned neutral extract cross-project so re-reading the same PDF is free.
- scripts/split.py: pypdf-based 4-page splitter used by --split mode and downstream fallbacks.
- fanout_synthesis.md, fanout_worker.md, extraction_schema.md: separate substrate preparation, the per-chunk worker prompt, and final synthesis so each step has a focused context.
- isolation_common.md, isolation_read.md, isolation_split.md, agent_isolation.md: per-mode subagent isolation patterns.
- install.py: idempotent marker venv installer with major-update advisory and skip-on-reuse.
- convert.py: SHA-256-keyed extract cache so repeat conversions of the same PDF are free.
- SKILL.md: synthesis order made explicit; --split documented as the legacy vision-batch fallback.

/split-pdf becomes a compat wrapper
- SKILL.md is now short: "use /read-pdf --split <args>". All actual reading logic moved to read-pdf.
- agent_isolation.md points at /read-pdf/isolation_split.md.
- scripts/split.py is a 16-line runpy shim that executes the canonical splitter at /read-pdf/scripts/split.py. No duplicated PDF-splitting code.
- methodology.md and the example directory under skills/split-pdf/ are unchanged.

Why
- /read-pdf (introduced in #6, "Add /read-pdf skill with cached marker-based PDF-to-markdown conversion") is the right home for marker-based layout-aware extraction.
- /split-pdf's vision-batch path remains useful as a fallback (no marker setup, marker conversion failures, triage), but it shouldn't be the canonical entry point.
- /split-pdf muscle memory still works: /split-pdf paper.pdf invokes the wrapper, which delegates to /read-pdf --split paper.pdf.

Testing
- Ran /read-pdf (default marker path) end-to-end on a 40-page paper; confirmed substrate prep, fanout workers, and synthesis produce the same _text.md shape as the prior /split-pdf output.
- Ran /read-pdf --split on the same paper; confirmed the legacy vision-batch path still works.
- Ran /split-pdf paper.pdf directly and verified the wrapper successfully delegates and produces identical output to /read-pdf --split.
- /read-pdf on the same SHA-256 returns cached _text.md without re-running marker.
- install.py skips a reusable existing venv and emits the major-update advisory when applicable.
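To make the cache behaviour exercised in the testing notes concrete, here is a minimal sketch of a SHA-256-keyed extract cache. It is not the code in convert.py: the function names, the cache file naming scheme, and the convert callback (standing in for the marker conversion) are all assumptions for illustration only.

```python
"""Sketch of a content-addressed cache for PDF text extracts."""
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in fixed-size blocks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_extract(pdf: Path, cache_dir: Path,
                   convert: Callable[[Path], str]) -> str:
    """Return the extract for `pdf`, calling `convert` only on a cache miss.

    `convert` is a hypothetical stand-in for the expensive conversion step.
    """
    key = sha256_of_file(pdf)
    entry = cache_dir / f"{key}.md"  # hypothetical naming scheme
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    text = convert(pdf)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename so a crash never leaves a partial entry.
    tmp = entry.with_suffix(".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(entry)
    return text
```

Keying on the content hash rather than the file path means a renamed or copied PDF still hits the cache, which is consistent with the cross-project reuse described for scripts/cache_text.py.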