From 901fa54d6175cf148f2756c3a1be8175df4ec689 Mon Sep 17 00:00:00 2001 From: Noah Miller Date: Fri, 8 May 2026 11:20:04 -0400 Subject: [PATCH 1/3] Add split-pdf bibliographic metadata output --- .claude/skills/split-pdf/README.md | 142 +++++++++++++++++++++++++++++ .claude/skills/split-pdf/SKILL.md | 35 +++++-- 2 files changed, 168 insertions(+), 9 deletions(-) create mode 100644 .claude/skills/split-pdf/README.md diff --git a/.claude/skills/split-pdf/README.md b/.claude/skills/split-pdf/README.md new file mode 100644 index 0000000..c925565 --- /dev/null +++ b/.claude/skills/split-pdf/README.md @@ -0,0 +1,142 @@ +# `/split-pdf` — Download, Split, and Deep-Read Academic Papers + +**Skill location:** [`.claude/skills/split-pdf/SKILL.md`](../../.claude/skills/split-pdf/SKILL.md) + +--- + +## What This Skill Does + +You give Claude a paper — either a local PDF file or a search query like "Gentzkow Shapiro 2014 competition newspapers" — and it does the rest. It finds the paper online and downloads it (or uses your local file in place), splits it into 4-page chunks using PyPDF2, then reads those chunks in small batches (3 at a time, ~12 pages), pausing between each batch for your review. As it reads, it writes structured notes into a `notes.md` file, extracting a bibliographic metadata block plus eight research dimensions. When finished, it saves a persistent `_text.md` extraction alongside the source PDF so future invocations can skip re-reading entirely. + +--- + +## Why It Exists + +Claude Code can read PDFs, but long academic papers cause two failures: + +1. **Session crash.** PDFs are token-expensive (fonts, vector graphics, tables, math notation). A 40-page paper can exceed the context window, producing an unrecoverable "prompt too long" error that destroys the entire session and all context. + +2. 
**Shallow reading.** Even when the PDF fits, Claude's attention degrades over long documents — it reads the abstract carefully, skims the methodology, and often hallucinates details from the results. You get a confident summary that's subtly wrong. + +These are related but distinct problems. The first kills the session. The second produces unreliable output while the session continues normally. Splitting addresses both. + +--- + +## The Solution + +Split the PDF into 4-page chunks, read 3 chunks at a time (~12 pages), and write structured notes incrementally. + +### How It Works + +| Step | Action | +|------|--------| +| **Acquire** | Download the PDF (via web search) or use a local file in place | +| **Check** | Look for existing `_text.md` extract or existing splits — offer to reuse | +| **Split** | PyPDF2 splits into 4-page chunks in `_build/split_/` | +| **Read** | Read 3 splits at a time, pause after each batch | +| **Extract** | Update running `notes.md` with structured information | +| **Persist** | Save final extraction to `_text.md` alongside the source PDF | +| **Confirm** | Wait for user approval before continuing to next batch | + +### Usage + +``` +/split-pdf path/to/paper.pdf +/split-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers" +``` + +**You must tell Claude what paper to read.** Claude cannot webcrawl for a paper it doesn't know exists. Provide either a local file path or a search query specific enough to find the paper — an author name, title, keywords, year, or some combination. If you just type `/split-pdf` with nothing else, Claude will ask you what you're looking for. + +### What Gets Extracted + +The skill produces a **structured extraction** — more detailed and specific than a typical summary. It starts with a `## Bibliographic metadata` block, then records the dimensions a researcher needs to build on or replicate the work: + +0. 
**Bibliographic metadata** — DOI, authors, title, year, venue, and venue type when visible on the title page +1. **Research question** — What is the paper asking and why does it matter? +2. **Audience** — Which sub-community of researchers cares about this? +3. **Method** — How do they answer the question? What is the identification strategy? +4. **Data** — What data do they use? Where did they find it? Unit of observation? Sample size? Time period? +5. **Statistical methods** — What econometric or statistical techniques? Key specifications? +6. **Findings** — Main results? Key coefficient estimates and standard errors? +7. **Contributions** — What is learned that we didn't know before? +8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs? + +--- + +## Key Features + +### In-place PDF handling + +The skill uses the PDF wherever it already lives. No copying to a centralized `articles/` folder. This lets the skill work inside any project folder without rearranging your file structure. + +### Persistent extraction (`_text.md`) + +After all batches are read, the skill writes a structured plain-text extraction as `_text.md` next to the source PDF. On future invocations, the skill checks for this file first and offers to reuse it — skipping re-reading entirely. This saves tokens and time on previously processed papers. + +### Split reuse + +If splits already exist in the build directory from a previous run, the skill offers to reuse them instead of re-splitting. + +### Build directory convention + +Splits go into `_build/split_/` rather than directly alongside the PDF. This keeps working artifacts (splits, intermediate notes) separate from source files and finished outputs. Multiple PDFs in the same folder share one build directory. + +### Agent isolation protocol + +When another skill calls `/split-pdf` (for example, `/beautiful_deck` reading a paper before generating slides), the PDF reading runs inside a subagent. 
Each PDF page renders as image data that accumulates permanently in the conversation context. A 35-page paper can add 10-20MB. Without isolation, two or three large PDFs crash the session by hitting the API request size limit. The subagent reads the pages, writes plain-text output, and the parent skill only reads the text. + +Standalone invocations (user calls `/split-pdf` directly) use the interactive pause-and-confirm protocol in the main conversation. + +--- + +## Why This Design + +**Why 4-page chunks?** Small enough for careful attention, large enough to keep logical sections (a methodology subsection, a results table with discussion) together. A 40-page paper becomes 10 chunks read in 4 rounds. + +**Why 3 chunks per batch (~12 pages)?** Balances throughput against attention quality. Twelve pages is enough to make progress but not so much that comprehension degrades. + +**Why pause between batches?** So you can: +- Review intermediate output and catch errors before they compound +- Redirect the reading or ask follow-up questions +- Skip sections that aren't relevant +- Control pacing for sections that need more care + +**Why incremental notes instead of a final summary?** When Claude reads a full paper at once, it produces a summary — lossy compression. When it reads in batches and updates running notes, it accumulates detail. The final notes are richer than any one-shot summary. + +**Why persist the extraction?** A 40-page paper costs ~4 rounds of PDF image rendering. Doing that twice is waste. The `_text.md` file lets you come back to the paper weeks later without re-reading a single page. + +For the full methodology, see [`.claude/skills/split-pdf/methodology.md`](../../.claude/skills/split-pdf/methodology.md). 
+ +--- + +## Directory Structure After Running + +``` +articles/ # any working folder +├── smith_2024.pdf # original PDF — ALWAYS preserved, never deleted +├── smith_2024_text.md # structured extract — reusable across sessions +└── articles_build/ # _build/ — shared build folder + └── split_smith_2024/ # split_/ + ├── smith_2024_pp1-4.pdf # 4-page chunks + ├── smith_2024_pp5-8.pdf + ├── smith_2024_pp9-12.pdf + ├── ... + └── notes.md # working copy of structured notes +``` + +**The original PDF is never deleted.** Whether Claude downloaded it via web search or you pointed it to a local file, the original always stays where it was. The split files are derivatives. If anything goes wrong — a corrupted split, a re-read with different parameters — you can always re-split from the original. + +--- + +## Limitations + +- **It is slow.** A 37-page paper requires ~4 rounds of reading with user confirmation between each. This is a deliberate trade-off: careful reading over fast reading. +- **Notes can become repetitive** if the paper revisits themes. Some manual editing of the final notes may be useful. +- **Not for triage.** If you just need to decide whether a paper is relevant, read only the first split (pages 1-4, which usually contains the abstract and introduction). You don't need the full protocol. +- **Papers under ~15 pages** can be read directly without splitting. + +--- + +## Acknowledgments + +The in-place PDF handling, persistent `_text.md` extraction, split reuse, build directory convention, and agent isolation protocol were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin), who adapted the original skill for his own workflows and shared his findings (April 2026). His version demonstrated that subagent isolation prevents context bloat when reading multiple large PDFs in a single session — a critical reliability improvement. 
The implementation here is independently written but the ideas are his. diff --git a/.claude/skills/split-pdf/SKILL.md b/.claude/skills/split-pdf/SKILL.md index 7fd966f..2fad007 100644 --- a/.claude/skills/split-pdf/SKILL.md +++ b/.claude/skills/split-pdf/SKILL.md @@ -123,6 +123,18 @@ Do NOT read ahead. Do NOT read all splits at once. The pause-and-confirm protoco As you read, collect information along these dimensions and write them into `notes.md`: +0. **Bibliographic metadata** — From the first split (title page), extract: + ``` + ## Bibliographic metadata + doi: <10.xxxx/yyyy if present on the title page, else null> + authors: [LastName1, LastName2, ...] + title: + year: + venue: + venue_type: journal | working_paper | book_chapter | other + ``` + If a field is not visible on the title page, record `null`. + 1. **Research question** — What is the paper asking and why does it matter? 2. **Audience** — Which sub-community of researchers cares about this? 3. **Method** — How do they answer the question? What is the identification strategy? @@ -136,11 +148,11 @@ These questions extract what a researcher needs to **build on or replicate** the ## The Notes File -The working notes file is `notes.md` in the split subdirectory, updated incrementally after each batch. Structure it with clear headers for each of the 8 dimensions. After each batch, update whichever dimensions have new information — do not rewrite from scratch. +The working notes file is `notes.md` in the split subdirectory, updated incrementally after each batch. Structure it with clear headers for the bibliographic metadata block and each of the eight research dimensions. After each batch, update whichever dimensions have new information — do not rewrite from scratch. By the time all splits are read, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. Not a summary — a structured extraction. 
-**After all batches are complete**, write the final notes to `_text.md` in the same folder as the source PDF: +**After all batches are complete**, write the final notes to `_text.md` in the same folder as the source PDF, with the `## Bibliographic metadata` block first: ``` articles/smith_2024_text.md @@ -172,9 +184,18 @@ Text output: Process: 1. Read 3 PDF files at a time using the Read tool 2. After each batch, update the notes file with extracted content -3. Extract: research question, audience, method, data (sources, sample size, time period), +3. From the first split (title page), extract a bibliographic metadata block: + ## Bibliographic metadata + doi: <10.xxxx/yyyy if present on the title page, else null> + authors: [LastName1, LastName2, ...] + title: + year: + venue: + venue_type: journal | working_paper | book_chapter | other +4. Extract: research question, audience, method, data (sources, sample size, time period), statistical methods, findings, contributions, replication feasibility -4. Write the final structured extraction to the text output path +5. Write the final structured extraction to the text output path, with the + ## Bibliographic metadata block first, followed by the research notes. Report when done: pages read, figures/tables found, one-sentence content summary. ``` @@ -201,8 +222,4 @@ After the agent returns, the parent reads the output files (plain markdown, not | **Persist** | Save final extraction to `_text.md` alongside the source PDF | | **Confirm** | Ask user before continuing to next batch | -## Acknowledgments - -The in-place PDF handling, persistent `_text.md` extraction, split reuse, build directory convention, and agent isolation protocol were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin), who adapted the original skill for his own workflows and shared his findings (April 2026). 
His version demonstrated that subagent isolation prevents context bloat when reading multiple large PDFs in a single session — a critical reliability improvement. The implementation here is independently written but the ideas are his. - -For detailed explanation of why the batched-reading method works, see [methodology.md](methodology.md). +For detailed explanation of why the batched-reading method works, see [methodology.md](methodology.md). Acknowledgments and credits live in [README.md](README.md). From 56f0e7fe04cb3d2cc3444835681ad746e5c4a7d5 Mon Sep 17 00:00:00 2001 From: Noah Miller Date: Fri, 15 May 2026 16:34:34 -0400 Subject: [PATCH 2/3] Extract split-pdf splitting script --- .claude/skills/split-pdf/SKILL.md | 37 ++------------- .claude/skills/split-pdf/scripts/split.py | 58 +++++++++++++++++++++++ 2 files changed, 62 insertions(+), 33 deletions(-) create mode 100755 .claude/skills/split-pdf/scripts/split.py diff --git a/.claude/skills/split-pdf/SKILL.md b/.claude/skills/split-pdf/SKILL.md index 2fad007..751eb7f 100644 --- a/.claude/skills/split-pdf/SKILL.md +++ b/.claude/skills/split-pdf/SKILL.md @@ -43,46 +43,17 @@ If found, ask: This prevents redundant re-reading of papers you have already processed. The `_text.md` file is a structured plain-text extraction that is far cheaper to read than re-processing the PDF page images. -**If no extract exists, check for existing splits.** Determine the build directory: - -```python -import os -folder_path = os.path.dirname(os.path.abspath(pdf_path)) -foldername = os.path.basename(folder_path) -pdf_basename = os.path.splitext(os.path.basename(pdf_path))[0] -build_dir = os.path.join(folder_path, foldername + '_build') -split_dir = os.path.join(build_dir, 'split_' + pdf_basename) -``` +**If no extract exists, check for existing splits.** Use the build directory convention `_build/split_/`. If `split_dir` already exists and contains `.pdf` files, ask: > "Splits already exist for `` (N chunks in `_build/split_/`). 
Reuse existing splits, or re-split from scratch?" - **Reuse**: skip splitting, proceed to Step 3 using the existing files in `split_dir` - **Re-split**: delete the existing split folder, then proceed with splitting below -Create splits in `_build/split_/` and run the splitting script: - -```python -from PyPDF2 import PdfReader, PdfWriter -import os, sys - -def split_pdf(input_path, output_dir, pages_per_chunk=4): - os.makedirs(output_dir, exist_ok=True) - reader = PdfReader(input_path) - total = len(reader.pages) - prefix = os.path.splitext(os.path.basename(input_path))[0] - - for start in range(0, total, pages_per_chunk): - end = min(start + pages_per_chunk, total) - writer = PdfWriter() - for i in range(start, end): - writer.add_page(reader.pages[i]) - - out_name = f"{prefix}_pp{start+1}-{end}.pdf" - out_path = os.path.join(output_dir, out_name) - with open(out_path, "wb") as f: - writer.write(f) +Create splits by running: - print(f"Split {total} pages into {-(-total // pages_per_chunk)} chunks in {output_dir}") +```bash +python3 ~/.claude/skills/split-pdf/scripts/split.py path/to/paper.pdf ``` **Directory convention:** diff --git a/.claude/skills/split-pdf/scripts/split.py b/.claude/skills/split-pdf/scripts/split.py new file mode 100755 index 0000000..0113d2d --- /dev/null +++ b/.claude/skills/split-pdf/scripts/split.py @@ -0,0 +1,58 @@ +#!/usr/bin/env python3 +"""Split a PDF into fixed-size page chunks using the skill directory convention.""" + +from __future__ import annotations + +import argparse +import math +from pathlib import Path + +from PyPDF2 import PdfReader, PdfWriter + + +def default_split_dir(pdf_path: Path) -> Path: + folder_path = pdf_path.resolve().parent + folder_name = folder_path.name + return folder_path / f"{folder_name}_build" / f"split_{pdf_path.stem}" + + +def split_pdf(input_path: Path, output_dir: Path, pages_per_chunk: int) -> tuple[int, int]: + output_dir.mkdir(parents=True, exist_ok=True) + reader = PdfReader(str(input_path)) + 
total_pages = len(reader.pages) + + for start in range(0, total_pages, pages_per_chunk): + end = min(start + pages_per_chunk, total_pages) + writer = PdfWriter() + + for page_index in range(start, end): + writer.add_page(reader.pages[page_index]) + + output_path = output_dir / f"{input_path.stem}_pp{start + 1}-{end}.pdf" + with output_path.open("wb") as handle: + writer.write(handle) + + return total_pages, math.ceil(total_pages / pages_per_chunk) + + +def main() -> None: + parser = argparse.ArgumentParser(description="Split a PDF into fixed-size page chunks.") + parser.add_argument("pdf_path", type=Path, help="PDF to split") + parser.add_argument("--output-dir", type=Path, default=None, help="Directory for split PDFs") + parser.add_argument("--pages-per-chunk", type=int, default=4, help="Pages per split PDF") + args = parser.parse_args() + + if args.pages_per_chunk < 1: + raise SystemExit("--pages-per-chunk must be at least 1") + + pdf_path = args.pdf_path.expanduser().resolve() + if not pdf_path.is_file(): + raise SystemExit(f"PDF not found: {pdf_path}") + + output_dir = args.output_dir.expanduser().resolve() if args.output_dir else default_split_dir(pdf_path) + total_pages, chunk_count = split_pdf(pdf_path, output_dir, args.pages_per_chunk) + print(f"Split {total_pages} pages into {chunk_count} chunks in {output_dir}") + + +if __name__ == "__main__": + main() From a597d1d781e55cdfc1ed541982b886b09e2b0b8b Mon Sep 17 00:00:00 2001 From: Noah Miller Date: Sat, 23 May 2026 17:16:51 -0400 Subject: [PATCH 3/3] Update split-pdf backend to pypdf PyPDF2 is unmaintained. pypdf is the maintained successor with a drop-in-compatible PdfReader/PdfWriter API. Swap the import in scripts/split.py and update SKILL.md / README.md text. 
--- .claude/skills/split-pdf/README.md | 4 ++-- .claude/skills/split-pdf/SKILL.md | 2 +- .claude/skills/split-pdf/scripts/split.py | 2 +- skills/split-pdf/README.md | 4 ++-- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/.claude/skills/split-pdf/README.md b/.claude/skills/split-pdf/README.md index c925565..2d4a169 100644 --- a/.claude/skills/split-pdf/README.md +++ b/.claude/skills/split-pdf/README.md @@ -6,7 +6,7 @@ ## What This Skill Does -You give Claude a paper — either a local PDF file or a search query like "Gentzkow Shapiro 2014 competition newspapers" — and it does the rest. It finds the paper online and downloads it (or uses your local file in place), splits it into 4-page chunks using PyPDF2, then reads those chunks in small batches (3 at a time, ~12 pages), pausing between each batch for your review. As it reads, it writes structured notes into a `notes.md` file, extracting a bibliographic metadata block plus eight research dimensions. When finished, it saves a persistent `_text.md` extraction alongside the source PDF so future invocations can skip re-reading entirely. +You give Claude a paper — either a local PDF file or a search query like "Gentzkow Shapiro 2014 competition newspapers" — and it does the rest. It finds the paper online and downloads it (or uses your local file in place), splits it into 4-page chunks using pypdf, then reads those chunks in small batches (3 at a time, ~12 pages), pausing between each batch for your review. As it reads, it writes structured notes into a `notes.md` file, extracting a bibliographic metadata block plus eight research dimensions. When finished, it saves a persistent `_text.md` extraction alongside the source PDF so future invocations can skip re-reading entirely. 
--- @@ -32,7 +32,7 @@ Split the PDF into 4-page chunks, read 3 chunks at a time (~12 pages), and write |------|--------| | **Acquire** | Download the PDF (via web search) or use a local file in place | | **Check** | Look for existing `_text.md` extract or existing splits — offer to reuse | -| **Split** | PyPDF2 splits into 4-page chunks in `_build/split_/` | +| **Split** | pypdf splits into 4-page chunks in `_build/split_/` | | **Read** | Read 3 splits at a time, pause after each batch | | **Extract** | Update running `notes.md` with structured information | | **Persist** | Save final extraction to `_text.md` alongside the source PDF | diff --git a/.claude/skills/split-pdf/SKILL.md b/.claude/skills/split-pdf/SKILL.md index 751eb7f..0acb0d9 100644 --- a/.claude/skills/split-pdf/SKILL.md +++ b/.claude/skills/split-pdf/SKILL.md @@ -74,7 +74,7 @@ The build directory convention (`_build/`) keeps split artifacts, co The original PDF remains permanently. The splits are working copies. If anything goes wrong, you can always re-split from the original. -If PyPDF2 is not installed, install it: `pip install PyPDF2` +If pypdf is not installed, install it: `pip install pypdf` ## Step 3: Read in Batches of 3 Splits diff --git a/.claude/skills/split-pdf/scripts/split.py b/.claude/skills/split-pdf/scripts/split.py index 0113d2d..03525de 100755 --- a/.claude/skills/split-pdf/scripts/split.py +++ b/.claude/skills/split-pdf/scripts/split.py @@ -7,7 +7,7 @@ import math from pathlib import Path -from PyPDF2 import PdfReader, PdfWriter +from pypdf import PdfReader, PdfWriter def default_split_dir(pdf_path: Path) -> Path: diff --git a/skills/split-pdf/README.md b/skills/split-pdf/README.md index 3cdde58..d70111f 100644 --- a/skills/split-pdf/README.md +++ b/skills/split-pdf/README.md @@ -6,7 +6,7 @@ ## What This Skill Does -You give Claude a paper — either a local PDF file or a search query like "Gentzkow Shapiro 2014 competition newspapers" — and it does the rest. 
It finds the paper online and downloads it (or uses your local file in place), splits it into 4-page chunks using PyPDF2, then reads those chunks in small batches (3 at a time, ~12 pages), pausing between each batch for your review. As it reads, it writes structured notes into a `notes.md` file, extracting specific information across 8 dimensions. When finished, it saves a persistent `_text.md` extraction alongside the source PDF so future invocations can skip re-reading entirely. +You give Claude a paper — either a local PDF file or a search query like "Gentzkow Shapiro 2014 competition newspapers" — and it does the rest. It finds the paper online and downloads it (or uses your local file in place), splits it into 4-page chunks using pypdf, then reads those chunks in small batches (3 at a time, ~12 pages), pausing between each batch for your review. As it reads, it writes structured notes into a `notes.md` file, extracting specific information across 8 dimensions. When finished, it saves a persistent `_text.md` extraction alongside the source PDF so future invocations can skip re-reading entirely. --- @@ -32,7 +32,7 @@ Split the PDF into 4-page chunks, read 3 chunks at a time (~12 pages), and write |------|--------| | **Acquire** | Download the PDF (via web search) or use a local file in place | | **Check** | Look for existing `_text.md` extract or existing splits — offer to reuse | -| **Split** | PyPDF2 splits into 4-page chunks in `_build/split_/` | +| **Split** | pypdf splits into 4-page chunks in `_build/split_/` | | **Read** | Read 3 splits at a time, pause after each batch | | **Extract** | Update running `notes.md` with structured information | | **Persist** | Save final extraction to `_text.md` alongside the source PDF |
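
The chunk and batch arithmetic that the README text and `scripts/split.py` rely on can be sketched without pypdf. This is a minimal illustration under stated assumptions, not code from the patch series: `chunk_ranges`, `chunk_names`, and `batches` are hypothetical helper names. Only the `<stem>_pp<first>-<last>.pdf` naming pattern, the 4-page chunk default, and the 3-chunk batch size are taken from the diffs above.

```python
def chunk_ranges(total_pages: int, pages_per_chunk: int = 4) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) pairs, one per chunk."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start + 1, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def chunk_names(stem: str, total_pages: int, pages_per_chunk: int = 4) -> list[str]:
    """Build file names in the pattern split.py writes: <stem>_pp<first>-<last>.pdf."""
    return [
        f"{stem}_pp{first}-{last}.pdf"
        for first, last in chunk_ranges(total_pages, pages_per_chunk)
    ]


def batches(items: list[str], batch_size: int = 3) -> list[list[str]]:
    """Group chunk names into reading rounds of at most batch_size chunks (hypothetical helper)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # Assumed example input: a 37-page paper named smith_2024.pdf.
    names = chunk_names("smith_2024", 37)
    for round_number, batch in enumerate(batches(names), start=1):
        print(round_number, batch)
```

For a 37-page paper this gives 10 chunks read in 4 rounds, which agrees with the "~4 rounds" estimate in the README's Limitations section. The last chunk holds a single page and is named `smith_2024_pp37-37.pdf`.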