diff --git a/.claude/skills/read-pdf/README.md b/.claude/skills/read-pdf/README.md new file mode 100644 index 0000000..6ddb08a --- /dev/null +++ b/.claude/skills/read-pdf/README.md @@ -0,0 +1,195 @@ +# `/read-pdf` — Download, Convert, Split, and Deep-Read Academic Papers + +`/read-pdf` is the canonical academic-paper reading skill. By default, it uses python:marker to convert the PDF to markdown locally before extracting structured notes. With `--split`, it uses a split-PDF vision-batch path. + +**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](../../.claude/skills/read-pdf/SKILL.md) + +--- + +## What This Skill Does + +You give Claude a paper — either a local PDF file or a search query — and it does the rest. It finds and downloads the paper (or uses your local file in place), reads it through the selected backend, and writes a persistent `_text.md` extraction alongside the source PDF. + +Both modes produce the same output contract: bibliographic metadata, plain-English synthesis, and 12-dimension research notes. + +| Mode | Command | Best for | +|---|---|---| +| Marker conversion | `/read-pdf ` | Tables, equations, figures, repeated processing, batch ingest | +| Split vision reading | `/read-pdf --split ` | Triage, converter failures, no marker setup | + +--- + +## Mode Choice + +Default mode converts the PDF to layout-aware markdown before extraction. Use it for normal paper ingest, tables, equations, figures, repeated processing, and batch wiki updates. + +Use `--split` when marker setup is unavailable, marker cannot parse a malformed PDF, or first-split triage is enough. Split mode has two limitations: + +1. **Equation fidelity.** PDF page images render math as bitmaps. Vision-reading bitmaps can produce approximate LaTeX transcriptions. + +2. **Table structure.** Complex tables (multi-column headers, merged cells, footnotes) are harder to transcribe accurately from images than from a layout-aware text conversion. 
+ +--- + +## Default Mode + +Convert the PDF to markdown with python:marker (layout-aware, GPU-accelerated), build bounded chunks, then extract through worker notes and one neutral synthesis pass. + +### How It Works + +| Step | Action | +|------|--------| +| **Acquire** | Download the PDF (via web search) or use a local file in place | +| **Install** | `install.py` sets up the marker venv on first run (~500 MB, one-time), then reuses it; monthly advisory check for marker major updates | +| **Check cache** | SHA-256 hash check — skip re-conversion if markdown already cached | +| **Convert** | `convert.py` runs marker and writes `markdown.md` to a content-hash cache | +| **Collision** | If `_text.md` already exists, ask: overwrite or save as `_text2.md`? | +| **Check extract cache** | If `text.md` exists in the converter cache, copy it beside the PDF and skip extraction | +| **Prepare substrate** | Split cached `markdown.md` into bounded chunk files plus `manifest.json` | +| **Extract** | Workers read assigned chunks; synthesis reads worker notes and writes bibliographic metadata, plain-English synthesis, and 12-dimension notes | +| **Persist** | Save final extraction to `_text.md` alongside the source PDF and to `text.md` in the converter cache | + +### Usage + +``` +/read-pdf path/to/paper.pdf +/read-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers" +/read-pdf --split path/to/paper.pdf +``` + +You must tell Claude what paper to read. Provide either a local file path or a search query specific enough to find the paper. 
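The three usage forms above differ only in whether `--split` is present and whether the remaining argument is a local file or a search query. A minimal sketch of that classification follows; `Invocation` and `parse_invocation` are hypothetical names for illustration, not part of the skill, and the "existing file means local" rule is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Invocation:
    split: bool   # True when --split was passed
    kind: str     # "local" (existing file) or "search" (query text)
    target: str   # the file path or the search query


def parse_invocation(args: list[str]) -> Invocation:
    """Classify /read-pdf arguments (hypothetical helper, not the skill's code)."""
    split = "--split" in args
    rest = [a for a in args if a != "--split"]
    if not rest:
        # The skill asks the user for a paper; this sketch just raises.
        raise ValueError("no paper specified: give a PDF path or a search query")
    target = " ".join(rest)
    # Assumption: an existing file is a local PDF; anything else is a query.
    kind = "local" if Path(target).expanduser().is_file() else "search"
    return Invocation(split=split, kind=kind, target=target)
```

With no arguments the sketch raises, which corresponds to the point where the skill would ask the user which paper to read.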
+ +--- + +## Split Mode + +`/read-pdf --split` uses this directory convention: + +```text +articles/ +├── smith_2024.pdf +├── smith_2024_text.md +└── articles_build/ + └── split_smith_2024/ + ├── smith_2024_pp1-4.pdf + ├── smith_2024_pp5-8.pdf + ├── smith_2024_pp9-12.pdf + └── notes.md +``` + +The splitter script is: + +```bash +python3 ~/.claude/skills/read-pdf/scripts/split.py path/to/paper.pdf +``` + +### What Gets Extracted + +Both modes extract the same 12 dimensions, plus a bibliographic metadata block and plain-English synthesis at the top of `_text.md`: + +``` +## Bibliographic metadata +doi: <10.xxxx/yyyy or null> +authors: [LastName1, LastName2, ...] +title: +year: +venue: +venue_type: journal | working_paper | book_chapter | other +``` + +1. **Research question** — What is the paper asking and why does it matter? +2. **Audience** — Which sub-community of researchers cares about this? +3. **Method** — How do they answer the question? What is the identification strategy? +4. **Target parameter** — What estimand or causal/statistical object is targeted? +5. **Data** — What data do they use? Where did they find it? Unit of observation? Sample size? Time period? +6. **Statistical methods** — What econometric or statistical techniques? Key specifications? +7. **Findings** — Main results? Key coefficient estimates and standard errors? +8. **Contributions** — What is learned that we didn't know before? +9. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs? +10. **Tables** — Inventory tables; extract machine-readable tables when central. +11. **Figures** — Inventory figures, captions, and key visual claims. +12. **Equations / formal objects** — Inventory equations, model primitives, algorithms, propositions, and labeled specifications. + +--- + +## Key Features + +### Conversion backend: marker + +The conversion backend is **marker** (`marker-pdf`). 
Selected after a head-to-head bake-off against docling on a representative set of empirical-economics PDFs; marker won on equation fidelity, table structure, and figure extraction quality. + +Backend selection is fixed in `convert.py`. There is no runtime override — if the bake-off needs to be redone for a future backend candidate, edit the `BACKEND` constant in `convert.py` explicitly so the cache namespace and venv are regenerated cleanly. + +`install.py` installs the current PyPI `marker-pdf` release only when the marker venv is first created. If marker already imports cleanly, setup reuses it and performs at most one lightweight PyPI check every 30 days. It warns only when PyPI has crossed a marker major-version boundary, and it never auto-upgrades. + +If the user opts into a major upgrade, run: + +```bash +python3 ~/.claude/skills/read-pdf/install.py --upgrade-marker +``` + +Existing cached conversions remain in place. To force fresh conversions after upgrading, delete selected cache entries under `~/.cache/claude-pdf-converter/cache/marker/`, or delete that whole directory. Rebuilding a large cache can be very time-consuming. + +### Born-digital PDFs and OCR + +Most journal PDFs already contain an embedded text layer. For those files, `convert.py` samples the first pages with `pdftotext` and tells marker to use the embedded text rather than re-OCRing the whole document. Marker still performs layout, table, and selected region recognition, but avoids the extremely slow full-document OCR path. If the text-layer sample is missing or too sparse, marker keeps OCR enabled for scanned PDFs. + +### GPU acceleration + +Auto-detected: NVIDIA CUDA → CPU. MPS on Apple Silicon is excluded — surya's layout model crashes at runtime on MPS with an index-bounds error (some surya sub-models already refuse MPS; the layout model does not and fails mid-conversion). A 3–5× speedup on CUDA boxes. No flags needed on any platform. 
+ +### Content-hash cache + +Conversions are cached by SHA-256 of the source PDF bytes at `~/.cache/claude-pdf-converter/cache/marker//`. Re-converting the same PDF (even under a different filename, even in a different project) is a no-op — the cached `markdown.md` is returned immediately. The cache is shared across all projects on the machine. + +Cache entries are not auto-evicted. To force a re-conversion: +```bash +rm -rf ~/.cache/claude-pdf-converter/cache/marker// +``` +To wipe the entire cache (e.g., after a marker upgrade, if you explicitly want all conversions rerun): +```bash +rm -rf ~/.cache/claude-pdf-converter/cache/ +``` +The venv at `~/.cache/claude-pdf-converter/venv-marker/` is untouched. + +### `_text.md` collision handling + +If a `_text.md` already exists alongside the PDF, default mode asks whether to overwrite it or save the new extraction as `_text2.md`. Split mode asks whether to reuse the existing extract or re-read from scratch. + +### Agent isolation protocol + +When another skill calls `/read-pdf`, heavy reading runs inside a subagent. The mode-specific protocols live in: + +- `isolation_read.md` for marker mode. +- `isolation_split.md` for `--split` mode. 
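The content-hash cache described above keys on the PDF's bytes, not its filename. The sketch below mirrors the hashing in `convert.py` to show why a renamed copy of the same PDF resolves to the same cache entry. `cache_markdown_path` is an illustrative name and only computes the path; it does not run a conversion.

```python
import hashlib
from pathlib import Path

# Same layout as the cache root described above.
CACHE_DIR = Path.home() / ".cache" / "claude-pdf-converter" / "cache" / "marker"


def cache_markdown_path(pdf_path: Path) -> Path:
    """Return where the converted markdown for this PDF would be cached."""
    h = hashlib.sha256()
    with pdf_path.open("rb") as f:
        # Stream in 1 MiB blocks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return CACHE_DIR / h.hexdigest() / "markdown.md"
```

Two files with identical bytes map to the same directory regardless of name or project, so only a change in content triggers a new conversion.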
+ +--- + +## Mode Tradeoffs + +| | `--split` | default marker mode | +|---|---|---| +| **Reading mechanism** | Claude vision-reads PDF page images | Marker converts to markdown; Claude reads text | +| **Setup required** | None | `install.py` (~500 MB, one-time) | +| **First-run latency** | None | ~1–3 min (model download + conversion) | +| **Subsequent runs** | — | Instant if cached | +| **Equation fidelity** | Good (vision-based) | Better (native LaTeX extraction) | +| **Table structure** | Good | Better (layout-aware) | +| **Works without internet** | No (unless PDF already local) | Yes (after install) | +| **Output format** | `_text.md` | `_text.md` (same format) | + +Both modes produce identical `_text.md` output format and can be used interchangeably by downstream skills like `/bib-update` and `/wiki-update`. + +--- + +## Limitations + +- **Requires local setup.** First run downloads ~500 MB of models. Not suitable for environments where you can't write to `~/.cache/`. +- **Conversion can fail on malformed PDFs.** If `convert.py` errors, the default mode stops — it does not fall back silently. Use `--split` if you want the vision-batch fallback. +- **Default mode is not ideal for triage.** If you just need to decide whether a paper is relevant, use `--split` and read the first split. + +--- + +## Origin + +This skill is maintained in [Scott Cunningham](https://github.com/scunning1975/MixtapeTools)'s MixtapeTools repository. diff --git a/.claude/skills/read-pdf/SKILL.md b/.claude/skills/read-pdf/SKILL.md new file mode 100644 index 0000000..11b208b --- /dev/null +++ b/.claude/skills/read-pdf/SKILL.md @@ -0,0 +1,209 @@ +--- +name: read-pdf +description: Canonical academic-PDF reading skill. By default, downloads or uses a local PDF, converts it to clean markdown via a local layout-aware converter, then writes structured `_text.md` notes. Use `--split` to force the split-PDF vision path: split into 4-page chunks and read 3 chunks at a time. 
Use default mode for tables, equations, figures, repeated processing, and batch ingest; use `--split` for triage, converter failures, or environments where marker setup is not available. +allowed-tools: Bash(python3:*), Bash(curl:*), Bash(wget:*), Bash(mkdir:*), Bash(rm:*), Read, Write, WebSearch, WebFetch, Agent +argument-hint: [--split] [pdf-path-or-search-query] +--- + +# Read-PDF: Download, Convert, and Deep-Read Academic Papers + +Takes a PDF (local or searched) and produces a structured `_text.md` extraction with a bibliographic metadata block, a plain-English synthesis, and 12-dimension research notes. + +Default mode converts the PDF to markdown locally using python:marker, prepares bounded source chunks, then reads those chunks through a fanout-first extraction workflow. This preserves equation fidelity, table structure, and figure references without image-based context bloat or whole-file `Read` failures. + +`--split` mode splits into 4-page chunks, reads exactly 3 chunks at a time, updates running notes, then writes the same `_text.md` contract. + +## When This Skill Is Invoked + +The user wants to read, review, or summarize an academic paper. The input is either: +- A file path to a local PDF (e.g., `~/Documents/papers/smith_2024.pdf`) +- A search query or paper title (e.g., `"Gentzkow Shapiro Sinkinson 2014 competition newspapers"`) + +**Important:** You cannot search for a paper you don't know exists. Provide either a file path or a specific query. If the user invokes this skill without specifying a paper, ask them. + +## Mode selection + +- **Default marker mode:** use unless the user explicitly asks for `--split`, triage-only reading, or no local converter setup. +- **`--split` mode:** use when the user invokes `/read-pdf --split`, invokes `/split-pdf`, needs first-split triage, or marker conversion fails and the user wants the vision-batch fallback. + +## Prerequisites + +- **Python ≥ 3.10** must be available. 
`install.py` refuses to proceed on Python 3.9 or older. If needed: `brew install python@3.12`, `apt install python3.11`, or python.org installer. +- **Optional GPU acceleration** is auto-detected: NVIDIA CUDA → CPU. (MPS on Apple Silicon is excluded — surya's layout model crashes on MPS at runtime.) + +These prerequisites apply only to default marker mode. `--split` mode requires pypdf for `scripts/split.py`; if missing, install it with `python3 -m pip install pypdf`. + +## Step 1: Acquire the PDF + +**If a local file path is provided:** +- Verify the file exists +- Use the PDF in place. The working directory is the folder containing the PDF. +- Proceed to Step 2 + +**If a search query or paper title is provided:** +1. Use WebSearch to find the paper +2. Use WebFetch or Bash (curl/wget) to download the PDF +3. Save it to the current working directory +4. Proceed to Step 2 + +**CRITICAL: Always preserve the original PDF.** Never delete, move, or overwrite it at any point in this workflow. + +## Default marker mode + +### Step 2: Ensure the converter is installed + +```bash +python3 ~/.claude/skills/read-pdf/install.py +``` + +Idempotent. First run creates a venv at `~/.cache/claude-pdf-converter/venv-marker/` and downloads marker models (~500 MB, 1–3 min). Later runs reuse that venv if `marker` imports cleanly; they do **not** auto-upgrade marker. + +Once every 30 days, `install.py` performs a lazy PyPI check for marker major-version updates. If it prints a `read-pdf notice: marker-pdf has a major update available` advisory, pause and surface it to the user. Ask whether they want to upgrade now with: + +```bash +python3 ~/.claude/skills/read-pdf/install.py --upgrade-marker +``` + +Do not purge caches automatically. Explain that existing cached conversions remain valid but were produced by the older marker version. 
If the user wants fresh conversions after upgrading, delete selected cache entries under `~/.cache/claude-pdf-converter/cache/marker/`, or delete that whole directory; rebuilding a large cache can be very time-consuming. + +Surface the "First run" message to the user verbatim if it appears — they should know why this invocation is slow. + +### Step 3: Convert + +**Before converting, check for a cached conversion.** Compute the SHA-256 hash of the PDF and check whether `markdown.md` already exists in the cache: + +```python +import hashlib, os, sys + +pdf_path = "" + +with open(pdf_path, 'rb') as f: + pdf_hash = hashlib.sha256(f.read()).hexdigest() + +markdown_path = os.path.expanduser( + f'~/.cache/claude-pdf-converter/cache/marker/{pdf_hash}/markdown.md' +) +print(markdown_path if os.path.exists(markdown_path) else "NOT_CACHED") +``` + +- **If cached:** tell the user "Using cached markdown conversion (SHA-256 match), skipping re-conversion." Use the printed path as `markdown_path`. +- **If not cached:** run: + ```bash + python3 ~/.claude/skills/read-pdf/convert.py "" + ``` + It prints the absolute path to `markdown.md` on success and exits 0. For born-digital PDFs with a usable embedded text layer, `convert.py` uses that text layer and disables marker's full-document OCR path while preserving marker's layout/table processing. **Do not fall back to pdftotext or any other tool on failure** — surface the error and stop. The whole point of this skill is the layout-aware conversion; a degraded fallback produces silently-wrong output. + +### Step 4: Check for existing `_text.md` + +Look for `_text.md` in the same folder as the PDF. + +If found, ask: +> "An extract already exists (`_text.md`). Overwrite it, or save the new extraction as `_text2.md`?" + +Proceed using whichever filename the user chooses. + +If no local extract exists, check for a cache-level neutral extract at `/text.md`. 
+ +Run: + +```bash +python3 ~/.claude/skills/read-pdf/scripts/cache_text.py check "" +``` + +- If it prints a cache path, run: + ```bash + python3 ~/.claude/skills/read-pdf/scripts/cache_text.py pull "" "" + ``` + Then skip Steps 5–6 and notify the user: *"Using cached neutral extract from converter cache; copied to `_text.md`."* +- If it prints `NOT_CACHED`, continue to Step 5. + +### Step 5: Prepare extraction substrate + +Run the deterministic substrate builder: + +```bash +python3 ~/.claude/skills/read-pdf/scripts/prepare_substrate.py "" +``` + +It writes bounded chunk files and `manifest.json` beside the marker cache. The script performs no scholarly interpretation; it only creates a structural manifest over the converted markdown. + +### Step 6: Structured Extraction + +Use `fanout_worker.md` and `fanout_synthesis.md` with the generated manifest. Run worker bundles sequentially by default. Each worker reads only its assigned chunk paths and writes durable local notes. The synthesis step reads the manifest and worker notes, performs gap-directed rereads of specific chunk files only when needed, and writes the final structured extraction to `_text.md`. + +The final extraction follows `extraction_schema.md`: a `## Bibliographic metadata` block from the title section, then the research dimensions. Read `extraction_schema.md` before synthesis so the output contract is explicit. + +Write the final structured extraction to `_text.md` (or `_text2.md` if chosen in Step 4) in the same folder as the source PDF, with the `## Bibliographic metadata` block first. Then cache the same neutral extract: + +```bash +python3 ~/.claude/skills/read-pdf/scripts/cache_text.py push "" "" +``` + +Then notify the user: *"Extract saved to `_text.md` alongside the source PDF and cached as `text.md` in the converter cache."* + +## `--split` mode + +Use this branch only when selected by the Mode selection rules above. + +**Critical rule:** Never read a full PDF in split mode. 
Only read the 4-page split files, and only 3 splits at a time (~12 pages). + +### Step S2: Reuse or split + +1. Look for `_text.md` next to the PDF. If found, ask: *"An extract already exists (`_text.md`). Use it, or re-read from scratch?"* On **Use**, read `_text.md` as the source notes and skip the rest of split mode. On **Re-read**, continue. +2. Look for `_build/split_/*.pdf`. If found, ask: *"Splits already exist (N chunks). Reuse, or re-split?"* On **Reuse**, proceed with existing files. On **Re-split**, delete the split folder and continue. + +Create splits by running: + +```bash +python3 ~/.claude/skills/read-pdf/scripts/split.py path/to/paper.pdf +``` + +Directory convention: + +```text +articles/ +├── smith_2024.pdf +├── smith_2024_text.md +└── articles_build/ + └── split_smith_2024/ + ├── smith_2024_pp1-4.pdf + ├── smith_2024_pp5-8.pdf + ├── smith_2024_pp9-12.pdf + └── notes.md +``` + +### Step S3: Read in batches of 3 splits + +Read exactly 3 split files at a time. After each batch: + +1. Read the 3 split PDFs using the Read tool. +2. Update `notes.md` in the split directory. +3. Pause and tell the user: *"I have finished reading splits [X-Y] and updated the notes. I have [N] more splits remaining. Would you like me to continue with the next 3?"* +4. Wait for confirmation before reading the next batch. + +Do not read ahead. + +### Step S4: Structured extraction + +As you read, collect notes into `notes.md` following `extraction_schema.md`. After all batches are complete, write the final notes to `_text.md` next to the source PDF, with the `## Bibliographic metadata` block first. Keep both `notes.md` and `_text.md`. + +## Agent Isolation + +When `/read-pdf` is invoked by another skill or workflow, the heavy reading step must run in a subagent. See `agent_isolation.md` for the mode router and `isolation_read.md` / `isolation_split.md` for branch-specific launch patterns. 
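For the `--split` branch above, the 4-page split and 3-splits-per-batch rules can be sketched as plain arithmetic. This is an illustration only: it does not call `scripts/split.py`, the helper names are hypothetical, and it assumes a final split shorter than four pages is named by its actual page range.

```python
def split_ranges(page_count: int, pages_per_split: int = 4) -> list[tuple[int, int]]:
    """Return 1-based inclusive page ranges: (1, 4), (5, 8), ..."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return [
        (start, min(start + pages_per_split - 1, page_count))
        for start in range(1, page_count + 1, pages_per_split)
    ]


def split_name(stem: str, first: int, last: int) -> str:
    """Build a split filename following the directory convention above."""
    return f"{stem}_pp{first}-{last}.pdf"


def batches(items: list, size: int = 3) -> list[list]:
    """Group splits into reading batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Under these assumptions a 30-page paper yields eight splits and three reading batches, with a user confirmation expected between batches.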
+ +## Files in this skill + +- `SKILL.md` — this file (acquire → default marker mode or `--split` mode → extract workflow) +- `extraction_schema.md` — bibliographic metadata block + 12 research dimensions +- `fanout_worker.md` — bounded worker-note prompt for marker chunks +- `fanout_synthesis.md` — synthesis prompt for worker notes and final `_text.md` +- `agent_isolation.md` — isolation mode router +- `isolation_common.md` — shared parent/subagent rule +- `isolation_read.md` — marker-mode isolation pattern +- `isolation_split.md` — split-mode isolation pattern +- `install.py` — idempotent marker venv installer with monthly advisory check +- `convert.py` — PDF → markdown converter (writes to SHA-256-keyed cache) +- `scripts/prepare_substrate.py` — marker markdown → bounded chunks + manifest +- `scripts/cache_text.py` — check/pull/push project-neutral `text.md` extracts in the converter cache +- `scripts/split.py` — pypdf 4-page splitter used by `--split` mode and downstream fallbacks +- `README.md` — backend details, cache management, GPU notes diff --git a/.claude/skills/read-pdf/agent_isolation.md b/.claude/skills/read-pdf/agent_isolation.md new file mode 100644 index 0000000..967666a --- /dev/null +++ b/.claude/skills/read-pdf/agent_isolation.md @@ -0,0 +1,8 @@ +# Agent Isolation Protocol + +Read `isolation_common.md` first. + +Then choose the branch: + +- Default marker mode: `isolation_read.md` +- `--split` mode: `isolation_split.md` diff --git a/.claude/skills/read-pdf/convert.py b/.claude/skills/read-pdf/convert.py new file mode 100644 index 0000000..5e99ec9 --- /dev/null +++ b/.claude/skills/read-pdf/convert.py @@ -0,0 +1,326 @@ +#!/usr/bin/env python3 +""" +read-pdf converter — PDF → markdown + figures (marker backend). + +Caches by SHA-256 of the PDF bytes. Re-running on the same content is free. + +Usage: + python3 convert.py + +Prints the absolute path to the cached markdown.md on success (exit 0). 
+On backend failure, exits non-zero with the error on stderr — no fallback. + +Cache layout: + ~/.cache/claude-pdf-converter/cache/marker// + markdown.md # conversion with inline ![](figures/*) + figures/* # extracted figures with byte-matching extensions + meta.json # backend, version, page/figure counts, source path +""" + +import hashlib +import importlib.metadata +import json +import os +import re +import subprocess +import sys +import time +from pathlib import Path + +BACKEND = "marker" +CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter" +CACHE_DIR = CACHE_ROOT / "cache" / BACKEND +VENV_PYTHON = CACHE_ROOT / f"venv-{BACKEND}" / "bin" / "python" +SKILL_DIR = Path(__file__).resolve().parent + + +def detect_torch_device() -> str: + """Pick best available torch device: cuda > cpu. MPS excluded — surya's layout + model crashes on Apple Silicon MPS with an index-bounds error at runtime.""" + try: + import torch + except ImportError: + return "cpu" + if torch.cuda.is_available(): + return "cuda" + return "cpu" + + +def normalize_footnotes(text: str) -> str: + """ + Rewrite marker's bare-number footnote encoding as Pandoc-style markdown footnotes. + + Marker places footnote superscripts as bare digits attached to the preceding + word/punctuation, then dumps the footnote body as a standalone paragraph + starting with the matching number at the next page-break boundary. This + function detects matched anchor/definition pairs and rewrites them: + + ...coefficient.12 We then... → ...coefficient.[^12] We then... + 12The county-level cluster... → (removed from body) + + A definitions block is appended at the end of the document: + + [^12]: The county-level cluster... + + Guards: code fences, table rows, display-math paragraphs, and numbered list + items (digit followed by ". " or ") ") are left untouched. + Only numbers that appear as BOTH an anchor and a definition are rewritten — + this is the primary false-positive guard. 
+ """ + paragraphs = re.split(r'\n\n+', text) + + # --- Pass 1: find definition paragraphs --- + # Matches: bare 1–3 digit number at paragraph start, NOT followed by ". " + # or ") " (numbered list items), then optional whitespace, then the body. + # No mandatory space between number and body (handles OCR gaps in old scans). + fn_def_re = re.compile(r'^(\d{1,3})(?!\.\s|\)\s)\s*(\S.+)', re.DOTALL) + + footnote_defs: dict[str, str] = {} + def_para_indices: set[int] = set() + in_fence = False + + for i, para in enumerate(paragraphs): + stripped = para.strip() + # Track code-fence state across paragraphs + if stripped.count('```') % 2 != 0: + in_fence = not in_fence + if in_fence: + continue + # Skip tables, display math, and code fences + if re.match(r'\s*(\||```|\$\$)', stripped): + continue + m = fn_def_re.match(stripped) + if m: + num, body = m.group(1), m.group(2).strip() + if body and not body.isdigit(): + footnote_defs[num] = body + def_para_indices.add(i) + + if not footnote_defs: + return text + + # --- Pass 2: replace anchors in body paragraphs --- + # Anchor: one of the known footnote numbers immediately following a word + # character or sentence-ending punctuation, not preceded by '[' (citation). + # Lookahead: whitespace, sentence punctuation, closing bracket, or EOL. + nums_alt = '|'.join(re.escape(n) for n in sorted(footnote_defs, key=lambda x: -len(x))) + anchor_re = re.compile( + r'(?<=[a-zA-Z.,;:!?\'")\]])(? 
str: + """Return a common lowercase suffix for a PIL image format.""" + if image_format == "JPEG": + return ".jpg" + if image_format == "PNG": + return ".png" + if image_format: + return f".{image_format.lower()}" + return fallback.lower() + + +def normalize_cached_figures(out_dir: Path) -> None: + """Ensure cached figure refs include figures/ and extensions match bytes.""" + figures_dir = out_dir / "figures" + md_path = out_dir / "markdown.md" + if not figures_dir.is_dir() or not md_path.is_file(): + return + + try: + from PIL import Image + except ImportError: + return + + rewrites: dict[str, str] = {} + for path in figures_dir.iterdir(): + if not path.is_file(): + continue + + try: + with Image.open(path) as image: + suffix = canonical_image_suffix(image.format, path.suffix) + except Exception as exc: # pragma: no cover + print(f"warn: cached figure {path.name} inspection failed: {exc}", file=sys.stderr) + continue + + canonical_path = path.with_suffix(suffix) + new_ref = f"figures/{canonical_path.name}" + + if canonical_path != path and not canonical_path.exists(): + path.rename(canonical_path) + + rewrites[path.name] = new_ref + rewrites[f"figures/{path.name}"] = new_ref + + if rewrites: + text = md_path.read_text(encoding="utf-8", errors="replace") + for old_ref in sorted(rewrites, key=len, reverse=True): + new_ref = rewrites[old_ref] + text = text.replace(old_ref, new_ref) + md_path.write_text(text, encoding="utf-8") + + +def sha256_of(path: Path) -> str: + h = hashlib.sha256() + with path.open("rb") as f: + for chunk in iter(lambda: f.read(1 << 20), b""): + h.update(chunk) + return h.hexdigest() + + +def text_layer_chars(path: Path, pages: int = 3) -> int: + """Return non-whitespace chars extracted from the PDF text layer sample.""" + try: + result = subprocess.run( + ["pdftotext", "-l", str(pages), str(path), "-"], + check=False, + capture_output=True, + text=True, + timeout=30, + ) + except (FileNotFoundError, subprocess.TimeoutExpired): + return 0 + if 
result.returncode != 0: + return 0 + return sum(1 for ch in result.stdout if not ch.isspace()) + + +def in_venv() -> bool: + return Path(sys.prefix).resolve() == VENV_PYTHON.parent.parent.resolve() + + +def reexec_in_venv(args: list[str]) -> None: + """Re-run this script under the backend venv's Python.""" + if not VENV_PYTHON.exists(): + installer = SKILL_DIR / "install.py" + subprocess.run([sys.executable, str(installer)], check=True) + os.execv(str(VENV_PYTHON), [str(VENV_PYTHON), str(Path(__file__).resolve()), *args]) + + +def convert_with_marker(pdf_path: Path, out_dir: Path) -> dict: + from marker.converters.pdf import PdfConverter + from marker.models import create_model_dict + from marker.output import text_from_rendered + + text_chars = text_layer_chars(pdf_path) + use_text_layer = text_chars >= 500 + # paginate_output makes marker emit `{N}` page boundary markers in + # the markdown stream, where N is the 0-based page index. prepare_substrate + # consumes those to populate per-chunk page_anchors so reader agents can + # cite "p. N" instead of line ranges. 
+ config: dict = {"paginate_output": True} + if use_text_layer: + config["disable_ocr"] = True + converter = PdfConverter(artifact_dict=create_model_dict(), config=config) + rendered = converter(str(pdf_path)) + text, _, images = text_from_rendered(rendered) + + figures_dir = out_dir / "figures" + figures_dir.mkdir(exist_ok=True) + + image_path_rewrites: dict[str, str] = {} + fig_count = 0 + for name, img in (images or {}).items(): + source_name = Path(name).name + try: + suffix = canonical_image_suffix(getattr(img, "format", None), Path(source_name).suffix) + out_name = figures_dir / f"{Path(source_name).stem}{suffix}" + img.save(out_name) + new_ref = f"figures/{out_name.name}" + image_path_rewrites[source_name] = new_ref + image_path_rewrites[f"figures/{source_name}"] = new_ref + fig_count += 1 + except Exception as exc: # pragma: no cover + print(f"warn: figure {name} save failed: {exc}", file=sys.stderr) + + text = normalize_footnotes(text) + for old_path in sorted(image_path_rewrites, key=len, reverse=True): + text = text.replace(old_path, image_path_rewrites[old_path]) + (out_dir / "markdown.md").write_text(text, encoding="utf-8") + + return { + "backend": "marker", + "backend_package": "marker-pdf", + "backend_package_version": importlib.metadata.version("marker-pdf"), + "page_count": None, + "figure_count": fig_count, + "text_layer_chars_sample": text_chars, + "ocr_disabled": use_text_layer, + "equation_extraction_mode": "native", # marker emits LaTeX directly + } + + +def main() -> int: + if len(sys.argv) != 2: + print("usage: convert.py ", file=sys.stderr) + return 2 + + pdf_path = Path(sys.argv[1]).expanduser().resolve() + if not pdf_path.is_file(): + print(f"error: not a file: {pdf_path}", file=sys.stderr) + return 2 + + if not in_venv(): + reexec_in_venv([str(pdf_path)]) + + # Marker reads TORCH_DEVICE at import time. Set before importing the + # backend, after we're inside the venv (so torch is the venv's torch). 
+ if "TORCH_DEVICE" not in os.environ: + os.environ["TORCH_DEVICE"] = detect_torch_device() + + digest = sha256_of(pdf_path) + out_dir = CACHE_DIR / digest + md_path = out_dir / "markdown.md" + if md_path.is_file(): + normalize_cached_figures(out_dir) + print(str(md_path)) + return 0 + + out_dir.mkdir(parents=True, exist_ok=True) + started = time.time() + info = convert_with_marker(pdf_path, out_dir) + info.update( + { + "source_path": str(pdf_path), + "sha256": digest, + "elapsed_seconds": round(time.time() - started, 2), + "torch_device": os.environ.get("TORCH_DEVICE", "cpu"), + "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), + } + ) + (out_dir / "meta.json").write_text( + json.dumps(info, indent=2), encoding="utf-8" + ) + normalize_cached_figures(out_dir) + print(str(md_path)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/skills/read-pdf/extraction_schema.md b/.claude/skills/read-pdf/extraction_schema.md new file mode 100644 index 0000000..47c8931 --- /dev/null +++ b/.claude/skills/read-pdf/extraction_schema.md @@ -0,0 +1,42 @@ +# Extraction Schema + +The structured-extraction contract shared by `/read-pdf` default mode, `/read-pdf --split` mode, and downstream `wiki-update` marker ingest. Output is a single project-neutral markdown file (`_text.md`) consisting of an optional title, a bibliographic metadata block, plain-English synthesis, and research notes in that order. The bibliographic metadata block must not appear after the research dimensions. + +## Bibliographic metadata (always first) + +From the title page (or title section of the converted markdown), extract: + +``` +## Bibliographic metadata +doi: <10.xxxx/yyyy if present on the title page, else null> +authors: [LastName1, LastName2, ...] +title: +year: +venue: +venue_type: journal | working_paper | book_chapter | other +``` + +If a field is not visible on the title page, record `null`. Do not guess. 
+ +## Plain-English synthesis + +Hard cap: ~200 words. No jargon. Cover the research question, why it matters, what the paper estimates and how in plain terms, what it finds, and the main take-away. + +## Research dimensions + +1. **Research question** — What is the paper asking and why does it matter? +2. **Audience** — Which sub-community of researchers cares about this? +3. **Method** — How do they answer the question? What is the identification strategy? +4. **Target parameter** — What estimand or causal/statistical object is being targeted? +5. **Data** — What data do they use? Where precisely did they find it? What is the unit of observation? Sample size? Time period? +6. **Statistical methods / specifications** — What econometric or statistical techniques do they use? What are the key specifications? +7. **Findings** — What are the main results? Key coefficient estimates and standard errors? +8. **Contributions** — What is learned from this exercise that we didn't know before? +9. **Replication feasibility** — Is the data publicly available? Is there a replication archive? A data appendix? URLs for the underlying data? +10. **Tables** — Inventory tables, extracting machine-readable tables when central to understanding or replication. +11. **Figures** — Inventory figures, captions, and key visual claims. +12. **Equations / formal objects** — Inventory equations, formal models, propositions, algorithms, and labeled specifications. + +## Tone + +A structured extraction more detailed and specific than a typical summary — what a researcher needs to **build on or replicate** the work. By the time the extraction is finished, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. Not a summary — a structured extraction. 
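The ordering rule in this schema (metadata first, then synthesis, then research dimensions) is easy to check mechanically. The sketch below is one possible check, not something the skill ships: `check_section_order` and `REQUIRED_ORDER` are hypothetical names, and the code assumes each required heading appears on its own line exactly as written in the schema.

```python
import re

# Required top-level sections, in the order the schema demands.
REQUIRED_ORDER = [
    "## Bibliographic metadata",
    "## Plain-English synthesis",
    "## Research dimensions",
]


def check_section_order(text: str) -> list[str]:
    """Return a list of problems; an empty list means the order is valid."""
    problems: list[str] = []
    positions: list[tuple[int, str]] = []
    for heading in REQUIRED_ORDER:
        # Match the heading only when it occupies a full line.
        match = re.search(rf"^{re.escape(heading)}\s*$", text, re.MULTILINE)
        if match is None:
            problems.append(f"missing heading: {heading}")
        else:
            positions.append((match.start(), heading))
    # Headings that are present must appear in schema order.
    if [h for _, h in sorted(positions)] != [h for _, h in positions]:
        problems.append("headings are out of order")
    return problems
```

A check like this could run before an extract is pushed to the cache, so a misordered `_text.md` is caught while the worker notes are still available for a rewrite.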
diff --git a/.claude/skills/read-pdf/fanout_synthesis.md b/.claude/skills/read-pdf/fanout_synthesis.md new file mode 100644 index 0000000..8916184 --- /dev/null +++ b/.claude/skills/read-pdf/fanout_synthesis.md @@ -0,0 +1,37 @@ +# Fanout Synthesis Prompt + +Use this prompt after all worker bundles have durable notes. + +## Inputs + +- `manifest.json` +- all worker note paths +- output `_text.md` path +- `extraction_schema.md` + +## Task + +Read the manifest and every worker note. Write one coherent, project-neutral `_text.md` using `extraction_schema.md`. + +Required output order: + +1. Optional top-level paper title (`# ...`) if useful. +2. `## Bibliographic metadata` +3. `## Plain-English synthesis` +4. `## Research dimensions`, with dimensions 1 through 12 in schema order. + +Do not put the bibliographic metadata block after the research dimensions. + +## Rules + +- Treat worker notes as local evidence, not final interpretation. +- Do gap-directed rereads only: reread source chunks when notes omit a needed table, figure, equation, result, or ambiguous claim. +- Do not read the full marker `markdown.md`. +- Do not read project wiki pages, project context files, citation-overlap JSON, or downstream workflow files. +- Do not write source pages, concept/wiki pages, index entries, log entries, or figure files. +- Preserve exact coefficients, standard errors, sample details, equation labels, and table/figure captions when available. +- Keep `_text.md` project-neutral. Downstream skills apply project relevance gates after this file exists. + +## Outputs + +- `_text.md` diff --git a/.claude/skills/read-pdf/fanout_worker.md b/.claude/skills/read-pdf/fanout_worker.md new file mode 100644 index 0000000..e7e40ba --- /dev/null +++ b/.claude/skills/read-pdf/fanout_worker.md @@ -0,0 +1,68 @@ +# Fanout Worker Prompt + +Use this prompt for one bounded worker bundle from `prepare_substrate.py`. 
+ +## Inputs + +- bundle id and bundle excerpt from `worker_bundles` +- output note path +- position: `front_matter`, `body`, `back_matter`, or `full_paper` + +## Task + +Read only assigned chunk paths. Write local extraction notes to the output note path. Do not write paper-level conclusions, `_text.md`, wiki pages, index entries, or log entries. + +## Position-Specific Emphasis + +`front_matter`: +- Bibliographic candidates: title, authors, year, venue, DOI. +- Abstract, introduction framing, research question, stated contribution. +- Any early equations, figures, or tables. + +`body`: +- Local evidence only: methods, data, specifications, findings, tables, figures, and equations found in assigned chunks. +- Do not reconstruct bibliography unless assigned chunks contain new or contradictory metadata. + +`back_matter`: +- Robustness, appendices, limitations, replication/data availability, and references-section clues. +- Record references only when they matter for DOI/bibliographic candidates or citation-overlap checks. + +`full_paper`: +- Apply all extraction categories across the assigned chunks. Do not over-prioritize title/abstract material just because the paper fit in one bundle. + +## Note Format + +```markdown +# Worker notes: + +## Source chunks +- + +## Local extraction +- Research question / motivation evidence: +- Method / identification evidence: +- Target parameter evidence: +- Data evidence: +- Statistical methods / specifications: +- Findings: +- Contributions: +- Replication feasibility: + +## Formal-object inventory +- Tables: +- Figures: +- Equations/specifications: +- Other formal objects: + +## Bibliographic candidates +- doi: +- authors: +- title: +- year: +- venue: + +## Unresolved gaps +- +``` + +Preserve exact numbers, equation labels, table/figure captions, and page anchors when present. Keep notes compact, but do not omit formal objects. 
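The four position labels follow from how many worker bundles a paper produces. The sketch below restates that rule as a standalone function, assuming the labeling used by `bundle_chunks` in `prepare_substrate.py`: the first bundle is `front_matter`, the last is `back_matter`, and a single bundle is `full_paper`. The helper name `assign_positions` is hypothetical.

```python
def assign_positions(bundle_count: int) -> list[str]:
    """Return the position label for each bundle, in bundle order."""
    if bundle_count <= 0:
        return []
    if bundle_count == 1:
        # A paper that fits in one bundle gets every extraction category.
        return ["full_paper"]
    # First bundle is front matter, last is back matter, the rest are body.
    return ["front_matter"] + ["body"] * (bundle_count - 2) + ["back_matter"]
```

With two bundles there is no `body` worker at all, so the front-matter and back-matter workers between them need to capture methods and findings as well as their position-specific emphasis.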
diff --git a/.claude/skills/read-pdf/install.py b/.claude/skills/read-pdf/install.py new file mode 100644 index 0000000..affbbac --- /dev/null +++ b/.claude/skills/read-pdf/install.py @@ -0,0 +1,338 @@ +#!/usr/bin/env python3 +""" +read-pdf installer: sets up the local PDF→markdown converter (marker backend). + +Idempotent. First run creates a venv at ~/.cache/claude-pdf-converter/venv-marker/, +installs marker-pdf, and warms up model downloads. Subsequent runs reuse the +existing install if marker imports cleanly. They never auto-upgrade marker. +At most once every 30 days they query PyPI for a newer marker major version +and print an advisory when one is available. + +The venv lives outside any git repo so that backend models (hundreds of MB) +do not pollute the skills checkout. + +OS-agnostic: searches for a Python ≥ 3.10 across macOS, Linux, and Windows +common install locations. If none is found, prints a platform-aware install hint. +""" + +import argparse +import json +import os +import platform +import re +import shutil +import subprocess +import sys +import time +import urllib.error +import urllib.request +from pathlib import Path + +BACKEND = "marker" +PACKAGE = "marker-pdf" +PINS = [PACKAGE] # unpinned first install; no automatic upgrades after setup +CHECK_INTERVAL_SECONDS = 30 * 24 * 60 * 60 + +PY_MIN = (3, 10) + +# Names to try on PATH. Cross-platform: same on macOS/Linux/Windows because +# python.org and most package managers install with these names. +PY_NAMES = ["python3.13", "python3.12", "python3.11", "python3.10", "python3", "python"] + +# Absolute fallback paths by OS, only consulted if PATH search fails. 
+def _fallback_paths() -> list[str]: + sysname = platform.system() + if sysname == "Darwin": + return [ + f"/Library/Frameworks/Python.framework/Versions/{v}/bin/python{v}" + for v in ("3.13", "3.12", "3.11", "3.10") + ] + [ + f"/opt/homebrew/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10") + ] + [ + f"/usr/local/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10") + ] + if sysname == "Linux": + return [f"/usr/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10")] + if sysname == "Windows": + # Each candidate must be a single executable name or path, so version + # flags such as -3.12 cannot be listed as separate entries. The bare py + # launcher selects its default Python, which _check_version validates. + return ["py"] + return [] + + +def _install_hint() -> str: + sysname = platform.system() + if sysname == "Darwin": + return "Install Python 3.10+ via `brew install python@3.12` or python.org installer." + if sysname == "Linux": + return "Install Python 3.10+ via your package manager (e.g. `apt install python3.12` or `dnf install python3.12`)." + if sysname == "Windows": + return "Install Python 3.10+ via `winget install Python.Python.3.12` or python.org installer." + return "Install Python 3.10 or newer." + + +def _check_version(path: str) -> bool: + try: + out = subprocess.check_output( + [path, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"], + text=True, stderr=subprocess.DEVNULL, + ).strip() + major, minor = (int(x) for x in out.split(".")) + return (major, minor) >= PY_MIN + except Exception: + return False + + +def find_python() -> str: + """Return path to a Python ≥3.10. 
Prefers the running interpreter if it qualifies.""" + if sys.version_info >= PY_MIN: + return sys.executable + for name in PY_NAMES: + path = shutil.which(name) + if path and _check_version(path): + return path + for cand in _fallback_paths(): + path = cand if Path(cand).exists() else shutil.which(cand) + if path and _check_version(path): + return path + print( + f"error: need Python ≥{PY_MIN[0]}.{PY_MIN[1]} but found only " + f"{sys.version_info.major}.{sys.version_info.minor}.\n" + f"{_install_hint()}", + file=sys.stderr, + ) + sys.exit(2) + + +CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter" +VENV_DIR = CACHE_ROOT / f"venv-{BACKEND}" +CHECK_FILE = CACHE_ROOT / f"version-check-{BACKEND}.json" + + +def venv_python() -> Path: + # Windows venvs put python in Scripts/, not bin/ + if platform.system() == "Windows": + return VENV_DIR / "Scripts" / "python.exe" + return VENV_DIR / "bin" / "python" + + +def venv_exists() -> bool: + return venv_python().exists() + + +def backend_imports() -> bool: + if not venv_exists(): + return False + result = subprocess.run( + [str(venv_python()), "-c", "import marker"], + capture_output=True, + ) + return result.returncode == 0 + + +def installed_marker_version() -> str | None: + """Return the installed marker-pdf version inside the backend venv.""" + if not venv_exists(): + return None + result = subprocess.run( + [ + str(venv_python()), + "-c", + ( + "import importlib.metadata; " + f"print(importlib.metadata.version('{PACKAGE}'))" + ), + ], + capture_output=True, + text=True, + ) + if result.returncode != 0: + return None + return result.stdout.strip() + + +def version_major(version: str | None) -> int | None: + """Extract a simple PEP 440-style leading major version.""" + if not version: + return None + match = re.match(r"^\s*(\d+)", version) + return int(match.group(1)) if match else None + + +def latest_marker_version() -> str | None: + """Fetch the latest marker-pdf release from PyPI. 
Network failures are nonfatal.""" + url = f"https://pypi.org/pypi/{PACKAGE}/json" + try: + with urllib.request.urlopen(url, timeout=5) as response: + payload = json.load(response) + except (OSError, urllib.error.URLError, TimeoutError, json.JSONDecodeError): + return latest_marker_version_from_pip() + return payload.get("info", {}).get("version") + + +def latest_marker_version_from_pip() -> str | None: + """Use pip's index command as a fallback when Python TLS certs are unavailable.""" + if not venv_exists(): + return None + try: + result = subprocess.run( + [str(venv_python()), "-m", "pip", "index", "versions", PACKAGE], + capture_output=True, + text=True, + timeout=30, + ) + except (OSError, subprocess.TimeoutExpired): + return None + if result.returncode != 0: + return None + match = re.search(r"^\s*LATEST:\s*(\S+)", result.stdout, re.MULTILINE) + if match: + return match.group(1) + match = re.search(rf"^{re.escape(PACKAGE)}\s+\(([^)]+)\)", result.stdout) + return match.group(1) if match else None + + +def monthly_check_due(force: bool = False) -> bool: + """Return true when the lazy update check should hit PyPI.""" + if force: + return True + try: + checked_at = json.loads(CHECK_FILE.read_text(encoding="utf-8")).get( + "checked_at", 0 + ) + except (FileNotFoundError, json.JSONDecodeError, OSError): + return True + return (time.time() - float(checked_at)) >= CHECK_INTERVAL_SECONDS + + +def record_version_check(installed: str | None, latest: str | None) -> None: + """Persist the last version check so normal invocations avoid network calls.""" + CACHE_ROOT.mkdir(parents=True, exist_ok=True) + CHECK_FILE.write_text( + json.dumps( + { + "checked_at": time.time(), + "package": PACKAGE, + "installed_version": installed, + "latest_version": latest, + }, + indent=2, + ), + encoding="utf-8", + ) + + +def check_for_major_update(force: bool = False) -> None: + """Warn when PyPI has crossed a marker-pdf major-version boundary.""" + if not monthly_check_due(force): + return 
+ installed = installed_marker_version() + latest = latest_marker_version() + if latest is None: + return + record_version_check(installed, latest) + installed_major = version_major(installed) + latest_major = version_major(latest) + if ( + installed + and latest + and installed_major is not None + and latest_major is not None + and latest_major > installed_major + ): + print( + "\nread-pdf notice: marker-pdf has a major update available.\n" + f" installed: {installed}\n" + f" latest: {latest}\n" + "Major marker updates may change PDF conversion behavior. " + "Upgrade only when you are ready to review conversion output.\n" + f"To upgrade: {sys.executable} {Path(__file__).resolve()} --upgrade-marker\n" + "Existing cached conversions will be left in place. To force " + "re-conversion after upgrading, delete selected cache entries under " + f"{CACHE_ROOT / 'cache' / BACKEND}, or delete that whole directory. " + "Large caches may take a long time to rebuild.", + flush=True, + ) + + +def create_venv() -> None: + CACHE_ROOT.mkdir(parents=True, exist_ok=True) + print( + f"First run: creating venv at {VENV_DIR} and installing " + f"{BACKEND} (~500MB, 1–3 min, one-time).", + flush=True, + ) + base_python = find_python() + subprocess.run([base_python, "-m", "venv", str(VENV_DIR)], check=True) + subprocess.run( + [str(venv_python()), "-m", "pip", "install", "--upgrade", "pip"], + check=True, + ) + subprocess.run( + [str(venv_python()), "-m", "pip", "install", *PINS], + check=True, + ) + + +def upgrade_marker() -> None: + """Explicit opt-in marker upgrade; never called during normal setup.""" + if not venv_exists(): + create_venv() + before = installed_marker_version() or "not installed" + subprocess.run( + [str(venv_python()), "-m", "pip", "install", "--upgrade", PACKAGE], + check=True, + ) + after = installed_marker_version() or "unknown" + print(f"marker-pdf upgraded: {before} -> {after}", flush=True) + record_version_check(after, after) + + +def warmup_models() -> None: + 
"""Trigger first-run model download so the first conversion is fast.""" + print("Downloading layout/OCR models (one-time)...", flush=True) + subprocess.run( + [str(venv_python()), "-c", + "from marker.models import create_model_dict; create_model_dict()"], + check=True, + ) + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Install the read-pdf marker backend.") + parser.add_argument( + "--upgrade-marker", + action="store_true", + help="Explicitly upgrade marker-pdf in the read-pdf venv.", + ) + parser.add_argument( + "--force-version-check", + action="store_true", + help="Check PyPI for marker-pdf major-version updates even if checked recently.", + ) + return parser.parse_args() + + +def main() -> int: + args = parse_args() + if args.upgrade_marker: + upgrade_marker() + warmup_models() + return 0 + if backend_imports(): + print( + "read-pdf setup already present. Reusing existing marker install " + "(no automatic update).", + flush=True, + ) + check_for_major_update(force=args.force_version_check) + return 0 + if not venv_exists(): + create_venv() + warmup_models() + record_version_check(installed_marker_version(), installed_marker_version()) + print(f"read-pdf setup complete. Backend: {BACKEND}", flush=True) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/skills/read-pdf/isolation_common.md b/.claude/skills/read-pdf/isolation_common.md new file mode 100644 index 0000000..6321796 --- /dev/null +++ b/.claude/skills/read-pdf/isolation_common.md @@ -0,0 +1,10 @@ +# Isolation Common + +When `/read-pdf` is invoked by another skill or workflow, heavy reading runs in a subagent. The parent may run lightweight shell steps, choose the mode, check cache/extract collisions, and read the final `_text.md`. The parent must not read bulky intermediate inputs (`markdown.md` or split PDF images) directly. + +Use: + +- `isolation_read.md` for default marker mode. +- `isolation_split.md` for `--split` mode. 
+ +Standalone invocations may read in the main conversation because there is no larger workflow context to protect. diff --git a/.claude/skills/read-pdf/isolation_read.md b/.claude/skills/read-pdf/isolation_read.md new file mode 100644 index 0000000..9c0e28e --- /dev/null +++ b/.claude/skills/read-pdf/isolation_read.md @@ -0,0 +1,27 @@ +# Isolation: Default Marker Mode + +The parent runs install/cache/convert steps. If `cache_text.py check ` finds a cached neutral extract, the parent runs `cache_text.py pull ` and skips extraction. Otherwise the parent prepares the extraction substrate, launches bounded workers, then runs one synthesis bottleneck. + +```text +Prepare converted marker markdown for bounded extraction. + +Markdown input: +Text output: +Manifest: /substrate/manifest.json +Schema: ~/.claude/skills/read-pdf/extraction_schema.md +Worker prompt: ~/.claude/skills/read-pdf/fanout_worker.md +Synthesis: ~/.claude/skills/read-pdf/fanout_synthesis.md + +Process: +1. Parent runs: + python3 ~/.claude/skills/read-pdf/scripts/prepare_substrate.py +2. Parent launches worker bundles sequentially from manifest.worker_bundles. +3. Each worker reads only assigned chunk paths and writes one durable note file. +4. Synthesis reads manifest + worker notes, gap-rereads specific chunks only when needed, and writes . +5. Parent runs: + python3 ~/.claude/skills/read-pdf/scripts/cache_text.py push + +Report when done: page count if available, figures/tables found, one-sentence content summary. +``` + +After the subagent returns, the parent reads `_text.md` only. diff --git a/.claude/skills/read-pdf/isolation_split.md b/.claude/skills/read-pdf/isolation_split.md new file mode 100644 index 0000000..2deaf75 --- /dev/null +++ b/.claude/skills/read-pdf/isolation_split.md @@ -0,0 +1,23 @@ +# Isolation: Split Mode + +The parent acquires the PDF, resolves existing extract/split reuse, and runs `scripts/split.py` if needed. 
The parent then launches a subagent for split-PDF reading and extraction. + +```text +Read PDF split files and produce structured extraction notes. + +Split directory: +Files, in order: +Notes output: +Text output: +Schema: ~/.claude/skills/read-pdf/extraction_schema.md + +Process: +1. Read 3 PDF files at a time using the Read tool. +2. After each batch, update with extracted content. +3. Extract the bibliographic metadata block and 12 research dimensions as specified in extraction_schema.md. +4. Write the final structured extraction to , with the ## Bibliographic metadata block first. + +Report when done: splits read, figures/tables found, one-sentence content summary. +``` + +After the subagent returns, the parent reads `_text.md` only. diff --git a/.claude/skills/read-pdf/scripts/cache_text.py b/.claude/skills/read-pdf/scripts/cache_text.py new file mode 100755 index 0000000..73b0a91 --- /dev/null +++ b/.claude/skills/read-pdf/scripts/cache_text.py @@ -0,0 +1,77 @@ +#!/usr/bin/env python3 +"""Manage neutral read-pdf extracts in the converter cache.""" + +from __future__ import annotations + +import argparse +import shutil +from pathlib import Path + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Check, pull, or push cached text.md extracts.") + subparsers = parser.add_subparsers(dest="command", required=True) + + check = subparsers.add_parser("check", help="Print cached text.md path or NOT_CACHED.") + check.add_argument("markdown", type=Path, help="Path to converter cache markdown.md") + + pull = subparsers.add_parser("pull", help="Copy cached text.md to a local _text.md path.") + pull.add_argument("markdown", type=Path, help="Path to converter cache markdown.md") + pull.add_argument("local_text", type=Path, help="Destination local _text.md path") + + push = subparsers.add_parser("push", help="Copy local _text.md to cache text.md.") + push.add_argument("markdown", type=Path, help="Path to converter cache markdown.md") 
+ push.add_argument("local_text", type=Path, help="Source local _text.md path") + + return parser.parse_args() + + +def cache_text_path(markdown_path: Path) -> Path: + markdown_path = markdown_path.expanduser().resolve() + if not markdown_path.is_file(): + raise SystemExit(f"markdown not found: {markdown_path}") + if markdown_path.name != "markdown.md": + raise SystemExit(f"expected markdown.md, got: {markdown_path.name}") + return markdown_path.parent / "text.md" + + +def require_nonempty_file(path: Path, label: str) -> Path: + path = path.expanduser().resolve() + if not path.is_file(): + raise SystemExit(f"{label} not found: {path}") + if path.stat().st_size == 0: + raise SystemExit(f"{label} is empty: {path}") + return path + + +def main() -> int: + args = parse_args() + cached_text = cache_text_path(args.markdown) + + if args.command == "check": + print(cached_text if cached_text.is_file() and cached_text.stat().st_size > 0 else "NOT_CACHED") + return 0 + + if args.command == "pull": + source = require_nonempty_file(cached_text, "cached text.md") + destination = args.local_text.expanduser().resolve() + if destination.exists(): + raise SystemExit(f"destination already exists: {destination}") + if not destination.parent.is_dir(): + raise SystemExit(f"destination parent not found: {destination.parent}") + shutil.copy2(source, destination) + print(destination) + return 0 + + if args.command == "push": + source = require_nonempty_file(args.local_text, "local _text.md") + cached_text.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(source, cached_text) + print(cached_text) + return 0 + + raise SystemExit(f"unknown command: {args.command}") + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/read-pdf/scripts/prepare_substrate.py b/.claude/skills/read-pdf/scripts/prepare_substrate.py new file mode 100755 index 0000000..6184ec1 --- /dev/null +++ b/.claude/skills/read-pdf/scripts/prepare_substrate.py @@ -0,0 +1,405 @@ 
+#!/usr/bin/env python3 +"""Prepare bounded marker-markdown chunks for agent extraction. + +This script is intentionally mechanical: it reads marker's ``markdown.md``, +splits it into bounded source chunks, and writes a manifest that helps agents +navigate the chunks without asking any script to summarize the paper. +""" + +from __future__ import annotations + +import argparse +import hashlib +import json +import re +from dataclasses import dataclass +from pathlib import Path +from typing import Iterable + + +DEFAULT_CHUNK_CHAR_LIMIT = 24000 +DEFAULT_TINY_CHUNK_CHAR_LIMIT = 4000 +DEFAULT_WORKER_SOURCE_CHAR_LIMIT = 50000 + +# Heading hygiene: marker sometimes promotes long paragraphs to `#` lines and +# mangles section numbers like `1.6` into `1.[^6]`. We sanitize and guard so +# that the manifest's heading/navigation signal stays useful on non-academic +# PDFs (engineering references, reports with quirky front-matter). +HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$") +HEADING_FOOTNOTE_ARTIFACT_RE = re.compile(r"\[\^([^\]]+)\]") +HEADING_SUP_FOOTNOTE_RE = re.compile(r"<sup>[^<]*</sup>", re.IGNORECASE) +HEADING_EMPHASIS_RE = re.compile(r"\*+") +HEADING_WHITESPACE_RE = re.compile(r"\s+") +HEADING_MAX_LENGTH = 120 +HEADING_DISPLAY_CAP = 80 +MERGED_HEADING_MAX_PARTS = 2 +# Marker's markdown renderer emits `{N}` between pages when run +# with paginate_output=True. N is the 0-based page index (see convert.py). +# We normalize to 1-based `page-N` strings so the manifest stays +# human-friendly. Anchor 'page-12' means physical page 12 in the source PDF. +PAGE_ANCHOR_RE = re.compile(r"^\{(\d+)\}-{20,}", re.MULTILINE) +FIGURE_RE = re.compile(r"!\[[^\]]*\]\(([^)]+)\)") +DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\])}>\"']+", re.IGNORECASE) + + +def sanitize_heading(text: str) -> str: + # Strip footnote artifacts (`1.[^6]` -> `1.6`), HTML superscript + # footnote refs (`<sup>1</sup>`), markdown emphasis, and collapse + # whitespace. Leaves real heading content untouched. 
+ text = HEADING_FOOTNOTE_ARTIFACT_RE.sub(r"\1", text) + text = HEADING_SUP_FOOTNOTE_RE.sub("", text) + text = HEADING_EMPHASIS_RE.sub("", text) + text = HEADING_WHITESPACE_RE.sub(" ", text).strip() + return text + + +def format_merged_heading(parts: list[str]) -> str: + if len(parts) <= MERGED_HEADING_MAX_PARTS: + return " / ".join(parts) + head = " / ".join(parts[:MERGED_HEADING_MAX_PARTS]) + extra = len(parts) - MERGED_HEADING_MAX_PARTS + return f"{head} (+{extra} more)" + + +def display_heading(heading: str, cap: int = HEADING_DISPLAY_CAP) -> str: + if len(heading) <= cap: + return heading + return heading[: cap - 3].rstrip() + "..." + + +@dataclass +class Section: + heading: str + level: int + start_line: int + end_line: int + text: str + + +@dataclass +class Chunk: + index: int + heading: str + path: Path + start_line: int + end_line: int + char_count: int + sha256: str + page_anchors: list[str] + figures: list[str] + doi_candidates: list[str] + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Create bounded chunk files and manifest for marker markdown." 
+ ) + parser.add_argument("markdown", type=Path, help="Path to marker markdown.md") + parser.add_argument( + "--output-dir", + type=Path, + default=None, + help="Directory for chunks and manifest; defaults beside markdown.md", + ) + parser.add_argument( + "--chunk-char-limit", + type=int, + default=DEFAULT_CHUNK_CHAR_LIMIT, + help=f"Hard max characters per chunk (default {DEFAULT_CHUNK_CHAR_LIMIT})", + ) + parser.add_argument( + "--tiny-chunk-char-limit", + type=int, + default=DEFAULT_TINY_CHUNK_CHAR_LIMIT, + help=f"Adjacent chunks below this size may be merged (default {DEFAULT_TINY_CHUNK_CHAR_LIMIT})", + ) + parser.add_argument( + "--worker-source-char-limit", + type=int, + default=DEFAULT_WORKER_SOURCE_CHAR_LIMIT, + help=f"Max source characters per worker bundle (default {DEFAULT_WORKER_SOURCE_CHAR_LIMIT})", + ) + return parser.parse_args() + + +def slugify(value: str, fallback: str) -> str: + slug = re.sub(r"[^a-zA-Z0-9]+", "-", value.lower()).strip("-") + slug = re.sub(r"-+", "-", slug) + return (slug[:70].strip("-") or fallback).lower() + + +def split_sections(lines: list[str]) -> list[Section]: + starts: list[tuple[int, int, str]] = [] + for i, line in enumerate(lines, start=1): + match = HEADING_RE.match(line) + if not match: + continue + clean = sanitize_heading(match.group(2)) + # Reject suspiciously long "headings": marker sometimes promotes a + # full body paragraph (e.g. a mailing-address block) to a `#` line. + # Treat those as body content so the section boundary isn't false. 
+ if not clean or len(clean) > HEADING_MAX_LENGTH: + continue + starts.append((i, len(match.group(1)), clean)) + + if not starts or starts[0][0] != 1: + starts.insert(0, (1, 0, "front-matter")) + + sections: list[Section] = [] + for pos, (start, level, heading) in enumerate(starts): + end = starts[pos + 1][0] - 1 if pos + 1 < len(starts) else len(lines) + text = "".join(lines[start - 1 : end]) + sections.append(Section(heading, level, start, end, text)) + return sections + + +def split_oversized_section(section: Section, char_limit: int) -> list[Section]: + if len(section.text) <= char_limit: + return [section] + + pieces: list[Section] = [] + current_lines: list[str] = [] + current_start = section.start_line + current_chars = 0 + + def flush(end_line: int) -> None: + nonlocal current_lines, current_start, current_chars + if not current_lines: + return + label = section.heading if not pieces else f"{section.heading} part {len(pieces) + 1}" + pieces.append( + Section(label, section.level, current_start, end_line, "".join(current_lines)) + ) + current_lines = [] + current_start = end_line + 1 + current_chars = 0 + + for rel_i, line in enumerate(section.text.splitlines(keepends=True), start=0): + abs_line = section.start_line + rel_i + line_len = len(line) + paragraph_break = not line.strip() + would_exceed = current_lines and current_chars + line_len > char_limit + if would_exceed and paragraph_break: + flush(abs_line - 1) + elif would_exceed and current_chars >= int(char_limit * 0.85): + flush(abs_line - 1) + + current_lines.append(line) + current_chars += line_len + + if current_chars >= char_limit: + flush(abs_line) + + flush(section.end_line) + return pieces + + +def merge_tiny_sections(sections: Iterable[Section], tiny_limit: int, hard_limit: int) -> list[Section]: + merged: list[Section] = [] + pending: Section | None = None + # Track the original heading parts that were merged into `pending` so the + # resulting heading can be summarized (first N + "(+K 
more)") instead of + # concatenating every subsection name into a multi-hundred-char blob. + pending_parts: list[str] = [] + + for section in sections: + if pending is None: + pending = section + pending_parts = [section.heading] + continue + + combined_len = len(pending.text) + len(section.text) + if len(pending.text) < tiny_limit and combined_len <= hard_limit: + pending_parts.append(section.heading) + pending = Section( + heading=format_merged_heading(pending_parts), + level=min(pending.level, section.level), + start_line=pending.start_line, + end_line=section.end_line, + text=pending.text + section.text, + ) + else: + merged.append(pending) + pending = section + pending_parts = [section.heading] + + if pending is not None: + merged.append(pending) + return merged + + +def collect_page_markers(lines: list[str]) -> list[tuple[int, int]]: + # Return (line_number, page_id) for every paginate_output marker line. + # 1-based line numbers to match Section.start_line semantics. + markers: list[tuple[int, int]] = [] + for i, line in enumerate(lines, start=1): + m = PAGE_ANCHOR_RE.match(line) + if m: + markers.append((i, int(m.group(1)))) + return markers + + +def pages_for_range( + markers: list[tuple[int, int]], start_line: int, end_line: int +) -> list[str]: + # Page anchors a chunk overlaps: the most recent marker at-or-before + # start_line (carryover, so chunks that begin mid-page still get a page + # number), plus every marker inside [start_line, end_line]. 
+ page_ids: set[int] = set() + carryover: int | None = None + for line_no, page_id in markers: + if line_no <= start_line: + carryover = page_id + elif line_no <= end_line: + page_ids.add(page_id) + else: + break + if carryover is not None: + page_ids.add(carryover) + return [f"page-{pid + 1}" for pid in sorted(page_ids)] + + +def metadata_for_chunk(text: str) -> tuple[list[str], list[str], str]: + figures = sorted(set(FIGURE_RE.findall(text))) + doi_candidates = sorted(set(match.rstrip(".,;") for match in DOI_RE.findall(text))) + digest = hashlib.sha256(text.encode("utf-8")).hexdigest() + return figures, doi_candidates, digest + + +def write_chunks( + sections: list[Section], + chunks_dir: Path, + page_markers: list[tuple[int, int]], +) -> list[Chunk]: + chunks_dir.mkdir(parents=True, exist_ok=True) + for old_chunk in chunks_dir.glob("chunk_*.md"): + old_chunk.unlink() + + chunks: list[Chunk] = [] + for index, section in enumerate(sections, start=1): + slug = slugify(section.heading, f"chunk-{index:03d}") + path = chunks_dir / f"chunk_{index:03d}-{slug}.md" + figures, doi_candidates, digest = metadata_for_chunk(section.text) + page_anchors = pages_for_range( + page_markers, section.start_line, section.end_line + ) + path.write_text(section.text, encoding="utf-8") + chunks.append( + Chunk( + index=index, + heading=section.heading, + path=path, + start_line=section.start_line, + end_line=section.end_line, + char_count=len(section.text), + sha256=digest, + page_anchors=page_anchors, + figures=figures, + doi_candidates=doi_candidates, + ) + ) + return chunks + + +def bundle_chunks(chunks: list[Chunk], source_limit: int) -> list[dict[str, object]]: + bundles: list[dict[str, object]] = [] + current: list[Chunk] = [] + current_chars = 0 + + def flush() -> None: + nonlocal current, current_chars + if not current: + return + position = "body" + if not bundles: + position = "front_matter" + bundles.append( + { + "bundle_id": f"bundle_{len(bundles) + 1:03d}", + "position": 
position, + "chunk_indexes": [chunk.index for chunk in current], + "chunk_paths": [str(chunk.path) for chunk in current], + "char_count": current_chars, + "headings": [display_heading(chunk.heading) for chunk in current], + } + ) + current = [] + current_chars = 0 + + for chunk in chunks: + if current and current_chars + chunk.char_count > source_limit: + flush() + current.append(chunk) + current_chars += chunk.char_count + flush() + + if len(bundles) > 1: + bundles[-1]["position"] = "back_matter" + elif len(bundles) == 1: + bundles[0]["position"] = "full_paper" + return bundles + + +def main() -> int: + args = parse_args() + markdown_path = args.markdown.expanduser().resolve() + if not markdown_path.exists(): + raise SystemExit(f"markdown not found: {markdown_path}") + + output_dir = ( + args.output_dir.expanduser().resolve() + if args.output_dir + else markdown_path.parent / "substrate" + ) + chunks_dir = output_dir / "chunks" + manifest_path = output_dir / "manifest.json" + output_dir.mkdir(parents=True, exist_ok=True) + + text = markdown_path.read_text(encoding="utf-8", errors="replace") + lines = text.splitlines(keepends=True) + page_markers = collect_page_markers(lines) + sections = split_sections(lines) + + bounded_sections: list[Section] = [] + for section in sections: + bounded_sections.extend(split_oversized_section(section, args.chunk_char_limit)) + bounded_sections = merge_tiny_sections( + bounded_sections, args.tiny_chunk_char_limit, args.chunk_char_limit + ) + + chunks = write_chunks(bounded_sections, chunks_dir, page_markers) + bundles = bundle_chunks(chunks, args.worker_source_char_limit) + + manifest = { + "schema_version": 1, + "source_markdown": str(markdown_path), + "source_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(), + "source_line_count": len(lines), + "source_char_count": len(text), + "chunk_char_limit": args.chunk_char_limit, + "worker_source_char_limit": args.worker_source_char_limit, + "chunks_dir": str(chunks_dir), + 
"chunks": [ + { + "index": chunk.index, + "path": str(chunk.path), + "heading": chunk.heading, + "line_range": [chunk.start_line, chunk.end_line], + "char_count": chunk.char_count, + "sha256": chunk.sha256, + "page_anchors": chunk.page_anchors, + "figures": chunk.figures, + "doi_candidates": chunk.doi_candidates, + } + for chunk in chunks + ], + "worker_bundles": bundles, + "doi_candidates": sorted({doi for chunk in chunks for doi in chunk.doi_candidates}), + } + manifest_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8") + print(manifest_path) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/read-pdf/scripts/split.py b/.claude/skills/read-pdf/scripts/split.py new file mode 100755 index 0000000..03525de --- /dev/null +++ b/.claude/skills/read-pdf/scripts/split.py @@ -0,0 +1,58 @@ +#!/usr/bin/env python3 +"""Split a PDF into fixed-size page chunks using the skill directory convention.""" + +from __future__ import annotations + +import argparse +import math +from pathlib import Path + +from pypdf import PdfReader, PdfWriter + + +def default_split_dir(pdf_path: Path) -> Path: + folder_path = pdf_path.resolve().parent + folder_name = folder_path.name + return folder_path / f"{folder_name}_build" / f"split_{pdf_path.stem}" + + +def split_pdf(input_path: Path, output_dir: Path, pages_per_chunk: int) -> tuple[int, int]: + output_dir.mkdir(parents=True, exist_ok=True) + reader = PdfReader(str(input_path)) + total_pages = len(reader.pages) + + for start in range(0, total_pages, pages_per_chunk): + end = min(start + pages_per_chunk, total_pages) + writer = PdfWriter() + + for page_index in range(start, end): + writer.add_page(reader.pages[page_index]) + + output_path = output_dir / f"{input_path.stem}_pp{start + 1}-{end}.pdf" + with output_path.open("wb") as handle: + writer.write(handle) + + return total_pages, math.ceil(total_pages / pages_per_chunk) + + +def main() -> None: + parser = 
argparse.ArgumentParser(description="Split a PDF into fixed-size page chunks.") + parser.add_argument("pdf_path", type=Path, help="PDF to split") + parser.add_argument("--output-dir", type=Path, default=None, help="Directory for split PDFs") + parser.add_argument("--pages-per-chunk", type=int, default=4, help="Pages per split PDF") + args = parser.parse_args() + + if args.pages_per_chunk < 1: + raise SystemExit("--pages-per-chunk must be at least 1") + + pdf_path = args.pdf_path.expanduser().resolve() + if not pdf_path.is_file(): + raise SystemExit(f"PDF not found: {pdf_path}") + + output_dir = args.output_dir.expanduser().resolve() if args.output_dir else default_split_dir(pdf_path) + total_pages, chunk_count = split_pdf(pdf_path, output_dir, args.pages_per_chunk) + print(f"Split {total_pages} pages into {chunk_count} chunks in {output_dir}") + + +if __name__ == "__main__": + main() diff --git a/.claude/skills/split-pdf/SKILL.md b/.claude/skills/split-pdf/SKILL.md index 7fd966f..3fee6d2 100644 --- a/.claude/skills/split-pdf/SKILL.md +++ b/.claude/skills/split-pdf/SKILL.md @@ -1,208 +1,23 @@ --- name: split-pdf -description: Download, split, and deeply read academic PDFs. Use when asked to read, review, or summarize an academic paper. Splits PDFs into 4-page chunks, reads them in small batches, and produces structured reading notes — avoiding context window crashes and shallow comprehension. +description: Compatibility wrapper for `/read-pdf --split`. Use only when the user explicitly invokes `/split-pdf` or asks for the legacy split-PDF vision-batch workflow. For new paper-reading requests, prefer `/read-pdf`; use `/read-pdf --split` for triage, converter failures, or no marker setup. 
allowed-tools: Bash(python*), Bash(pip*), Bash(curl*), Bash(wget*), Bash(mkdir*), Bash(ls*), Read, Write, Edit, WebSearch, WebFetch, Agent argument-hint: [pdf-path-or-search-query] --- -# Split-PDF: Download, Split, and Deep-Read Academic Papers +# Split-PDF Compatibility Wrapper -**CRITICAL RULE: Never read a full PDF. Never.** Only read the 4-page split files, and only 3 splits at a time (~12 pages). Reading a full PDF will either crash the session with an unrecoverable "prompt too long" error — destroying all context — or produce shallow, hallucinated output. There are no exceptions. +This skill is retained for existing slash-command muscle memory. Execute the request as: -## When This Skill Is Invoked - -The user wants you to read, review, or summarize an academic paper. The input is either: -- A file path to a local PDF (e.g., `~/Documents/papers/smith_2024.pdf`) -- A search query or paper title (e.g., `"Gentzkow Shapiro Sinkinson 2014 competition newspapers"`) - -**Important:** You cannot search for a paper you don't know exists. The user MUST provide either a file path or a specific search query — an author name, a title, keywords, a year, or some combination that identifies the paper. If the user invokes this skill without specifying what paper to read, ask them. Do not guess. - -## Step 1: Acquire the PDF - -**If a local file path is provided:** -- Verify the file exists -- Use the PDF in place. The working directory is the folder containing the PDF. -- Proceed to Step 2 - -**If a search query or paper title is provided:** -1. Use WebSearch to find the paper -2. Use WebFetch or Bash (curl/wget) to download the PDF -3. Save it to the current working directory (create the directory if needed) -4. Proceed to Step 2 - -**CRITICAL: Always preserve the original PDF.** The source PDF must NEVER be deleted, moved, or overwritten at any point in this workflow. The split files are derivatives; the original is the permanent artifact. 
Do not clean up, do not remove, do not tidy. The original stays. - -## Step 2: Split the PDF - -**Before splitting, check for an existing extract.** Look for `_text.md` in the same folder as the PDF. - -If found, ask: -> "An extract from a previous deep-read exists (`_text.md`). Use it for this request, or re-read the PDF from scratch?" -- **Use extract**: read `_text.md` and use it as the source notes — skip the rest of Steps 2 and 3 entirely -- **Re-read**: proceed with splitting below - -This prevents redundant re-reading of papers you have already processed. The `_text.md` file is a structured plain-text extraction that is far cheaper to read than re-processing the PDF page images. - -**If no extract exists, check for existing splits.** Determine the build directory: - -```python -import os -folder_path = os.path.dirname(os.path.abspath(pdf_path)) -foldername = os.path.basename(folder_path) -pdf_basename = os.path.splitext(os.path.basename(pdf_path))[0] -build_dir = os.path.join(folder_path, foldername + '_build') -split_dir = os.path.join(build_dir, 'split_' + pdf_basename) -``` - -If `split_dir` already exists and contains `.pdf` files, ask: -> "Splits already exist for `` (N chunks in `_build/split_/`). Reuse existing splits, or re-split from scratch?" 
-- **Reuse**: skip splitting, proceed to Step 3 using the existing files in `split_dir` -- **Re-split**: delete the existing split folder, then proceed with splitting below - -Create splits in `_build/split_/` and run the splitting script: - -```python -from PyPDF2 import PdfReader, PdfWriter -import os, sys - -def split_pdf(input_path, output_dir, pages_per_chunk=4): - os.makedirs(output_dir, exist_ok=True) - reader = PdfReader(input_path) - total = len(reader.pages) - prefix = os.path.splitext(os.path.basename(input_path))[0] - - for start in range(0, total, pages_per_chunk): - end = min(start + pages_per_chunk, total) - writer = PdfWriter() - for i in range(start, end): - writer.add_page(reader.pages[i]) - - out_name = f"{prefix}_pp{start+1}-{end}.pdf" - out_path = os.path.join(output_dir, out_name) - with open(out_path, "wb") as f: - writer.write(f) - - print(f"Split {total} pages into {-(-total // pages_per_chunk)} chunks in {output_dir}") -``` - -**Directory convention:** -``` -articles/ # any working folder -├── smith_2024.pdf # original PDF — NEVER DELETE THIS -├── smith_2024_text.md # structured extract — created after deep-read -└── articles_build/ # _build/ — shared build folder - └── split_smith_2024/ # split_/ - ├── smith_2024_pp1-4.pdf - ├── smith_2024_pp5-8.pdf - ├── smith_2024_pp9-12.pdf - ├── notes.md # working copy — source for _text.md - └── ... -``` - -The build directory convention (`_build/`) keeps split artifacts, compilation intermediates, and other working files separate from the source material and finished outputs. Multiple PDFs in the same folder share one build directory, each with its own `split_/` subdirectory inside it. - -The original PDF remains permanently. The splits are working copies. If anything goes wrong, you can always re-split from the original. - -If PyPDF2 is not installed, install it: `pip install PyPDF2` - -## Step 3: Read in Batches of 3 Splits - -Read **exactly 3 split files at a time** (~12 pages). 
After each batch: - -1. **Read** the 3 split PDFs using the Read tool -2. **Update** the running notes file (`notes.md` in the split subdirectory) -3. **Pause** and tell the user: - -> "I have finished reading splits [X-Y] and updated the notes. I have [N] more splits remaining. Would you like me to continue with the next 3?" - -4. **Wait** for the user to confirm before reading the next batch - -Do NOT read ahead. Do NOT read all splits at once. The pause-and-confirm protocol is mandatory. - -## Step 4: Structured Extraction - -As you read, collect information along these dimensions and write them into `notes.md`: - -1. **Research question** — What is the paper asking and why does it matter? -2. **Audience** — Which sub-community of researchers cares about this? -3. **Method** — How do they answer the question? What is the identification strategy? -4. **Data** — What data do they use? Where precisely did they find it? What is the unit of observation? Sample size? Time period? -5. **Statistical methods** — What econometric or statistical techniques do they use? What are the key specifications? -6. **Findings** — What are the main results? Key coefficient estimates and standard errors? -7. **Contributions** — What is learned from this exercise that we didn't know before? -8. **Replication feasibility** — Is the data publicly available? Is there a replication archive? A data appendix? URLs for the underlying data? - -These questions extract what a researcher needs to **build on or replicate** the work — a structured extraction more detailed and specific than a typical summary. - -## The Notes File - -The working notes file is `notes.md` in the split subdirectory, updated incrementally after each batch. Structure it with clear headers for each of the 8 dimensions. After each batch, update whichever dimensions have new information — do not rewrite from scratch. 
- -By the time all splits are read, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. Not a summary — a structured extraction. - -**After all batches are complete**, write the final notes to `_text.md` in the same folder as the source PDF: - -``` -articles/smith_2024_text.md +```text +/read-pdf --split ``` -Then notify the user: -> "Extract saved to `smith_2024_text.md` alongside the source PDF. Future requests on this paper can reuse it without re-reading." - -This file is the persistent, reusable artifact. The `notes.md` in the build directory is the working copy. Both are kept — never delete either. - -## Agent Isolation Protocol - -**When split-pdf is invoked by another skill or workflow** (any process that continues working after the PDF has been read), the PDF reading MUST run inside a subagent to prevent context bloat in the parent conversation. - -**Why:** Each PDF page rendered by the Read tool produces image data in the conversation context. A 35-page PDF (9 chunks) can add 10-20MB of image data that accumulates permanently. After reading one or two large PDFs on top of prior work, the conversation hits the API request size limit and becomes unrecoverable: no subsequent Read calls succeed, and rewinding does not free sufficient space. - -**Pattern:** - -The parent skill handles splitting (Step 2's Python script) in its own context; this is lightweight. Then it launches an Agent to perform all the reading: - -``` -Read PDF split files and produce structured extraction notes. - -Split directory: -Files (read in this order, 3 at a time): -Notes output: -Text output: - -Process: -1. Read 3 PDF files at a time using the Read tool -2. After each batch, update the notes file with extracted content -3. Extract: research question, audience, method, data (sources, sample size, time period), - statistical methods, findings, contributions, replication feasibility -4. 
Write the final structured extraction to the text output path - -Report when done: pages read, figures/tables found, one-sentence content summary. -``` - -After the agent returns, the parent reads the output files (plain markdown, not PDF images) and continues its workflow. - -**Standalone invocations** (user calls `/split-pdf` directly) use the interactive protocol above with reads in the main conversation and the pause-and-confirm protocol. - -## When NOT to Split - -- Papers shorter than ~15 pages: read directly (still use the Read tool, not Bash) -- Policy briefs or non-technical documents: a rough summary is fine -- Triage only: read just the first split (pages 1-4) for abstract and introduction - -## Quick Reference - -| Step | Action | -|------|--------| -| **Acquire** | Download to the current working directory or use existing local file in place | -| **Check** | Look for existing `_text.md` extract or existing splits — offer to reuse | -| **Split** | 4-page chunks into `_build/split_/` | -| **Read** | 3 splits at a time, pause after each batch | -| **Write** | Update `notes.md` with structured extraction | -| **Persist** | Save final extraction to `_text.md` alongside the source PDF | -| **Confirm** | Ask user before continuing to next batch | - -## Acknowledgments +Follow `.claude/skills/read-pdf/SKILL.md`, specifically the `--split` mode branch. -The in-place PDF handling, persistent `_text.md` extraction, split reuse, build directory convention, and agent isolation protocol were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin), who adapted the original skill for his own workflows and shared his findings (April 2026). His version demonstrated that subagent isolation prevents context bloat when reading multiple large PDFs in a single session — a critical reliability improvement. 
The implementation here is independently written but the ideas are his. +Compatibility notes: -For detailed explanation of why the batched-reading method works, see [methodology.md](methodology.md). +- The original PDF is preserved. +- Split files still use `_build/split_/`. +- The output remains `_text.md` with the same bibliographic metadata block and 8 research dimensions. +- For subagent isolation, use `.claude/skills/read-pdf/isolation_split.md`. diff --git a/.claude/skills/split-pdf/agent_isolation.md b/.claude/skills/split-pdf/agent_isolation.md new file mode 100644 index 0000000..4ea609f --- /dev/null +++ b/.claude/skills/split-pdf/agent_isolation.md @@ -0,0 +1,8 @@ +# Agent Isolation Protocol + +`/split-pdf` is now a compatibility wrapper for `/read-pdf --split`. + +Use: + +- `.claude/skills/read-pdf/isolation_common.md` +- `.claude/skills/read-pdf/isolation_split.md` diff --git a/.claude/skills/split-pdf/scripts/split.py b/.claude/skills/split-pdf/scripts/split.py new file mode 100755 index 0000000..a13eaef --- /dev/null +++ b/.claude/skills/split-pdf/scripts/split.py @@ -0,0 +1,16 @@ +#!/usr/bin/env python3 +"""Compatibility shim for the canonical read-pdf split backend.""" + +from __future__ import annotations + +import runpy +from pathlib import Path + + +CANONICAL_SPLITTER = ( + Path(__file__).resolve().parents[2] / "read-pdf" / "scripts" / "split.py" +) + + +if __name__ == "__main__": + runpy.run_path(str(CANONICAL_SPLITTER), run_name="__main__") diff --git a/skills/read-pdf/README.md b/skills/read-pdf/README.md new file mode 100644 index 0000000..0293b05 --- /dev/null +++ b/skills/read-pdf/README.md @@ -0,0 +1,51 @@ +# `/read-pdf` — Canonical Academic PDF Reader + +**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](../../.claude/skills/read-pdf/SKILL.md) + +`/read-pdf` reads academic papers and writes reusable `_text.md` notes with bibliographic metadata plus 8 research dimensions. 
+ +## Modes + +| Mode | Command | Best for | +|---|---|---| +| Marker conversion | `/read-pdf ` | Tables, equations, figures, repeated processing, batch ingest | +| Split vision reading | `/read-pdf --split ` | Triage, converter failures, no marker setup, legacy `/split-pdf` behavior | + +Default mode converts the PDF to markdown locally with marker, using: + +```bash +python3 ~/.claude/skills/read-pdf/install.py +python3 ~/.claude/skills/read-pdf/convert.py path/to/paper.pdf +``` + +Split mode creates 4-page chunks with: + +```bash +python3 ~/.claude/skills/read-pdf/scripts/split.py path/to/paper.pdf +``` + +`/split-pdf` remains as a compatibility wrapper for `/read-pdf --split`. + +## Output + +Both modes preserve the original PDF and write: + +```text +paper.pdf +paper_text.md +``` + +Split mode also writes working files under: + +```text +_build/split_/ +``` + +The structured extraction contract lives in [`.claude/skills/read-pdf/extraction_schema.md`](../../.claude/skills/read-pdf/extraction_schema.md). + +## Isolation + +When another skill calls `/read-pdf`, heavy reading runs in a subagent: + +- marker mode: [`.claude/skills/read-pdf/isolation_read.md`](../../.claude/skills/read-pdf/isolation_read.md) +- split mode: [`.claude/skills/read-pdf/isolation_split.md`](../../.claude/skills/read-pdf/isolation_split.md) diff --git a/skills/split-pdf/README.md b/skills/split-pdf/README.md index 3cdde58..177514d 100644 --- a/skills/split-pdf/README.md +++ b/skills/split-pdf/README.md @@ -1,194 +1,21 @@ -# `/split-pdf` — Download, Split, and Deep-Read Academic Papers +# `/split-pdf` — Compatibility Wrapper **Skill location:** [`.claude/skills/split-pdf/SKILL.md`](../../.claude/skills/split-pdf/SKILL.md) ---- +`/split-pdf` is retained for existing slash-command muscle memory. Treat it as: -## What This Skill Does - -You give Claude a paper — either a local PDF file or a search query like "Gentzkow Shapiro 2014 competition newspapers" — and it does the rest. 
It finds the paper online and downloads it (or uses your local file in place), splits it into 4-page chunks using PyPDF2, then reads those chunks in small batches (3 at a time, ~12 pages), pausing between each batch for your review. As it reads, it writes structured notes into a `notes.md` file, extracting specific information across 8 dimensions. When finished, it saves a persistent `_text.md` extraction alongside the source PDF so future invocations can skip re-reading entirely. - ---- - -## Why It Exists - -Claude Code can read PDFs, but long academic papers cause two failures: - -1. **Session crash.** PDFs are token-expensive (fonts, vector graphics, tables, math notation). A 40-page paper can exceed the context window, producing an unrecoverable "prompt too long" error that destroys the entire session and all context. - -2. **Shallow reading.** Even when the PDF fits, Claude's attention degrades over long documents — it reads the abstract carefully, skims the methodology, and often hallucinates details from the results. You get a confident summary that's subtly wrong. - -These are related but distinct problems. The first kills the session. The second produces unreliable output while the session continues normally. Splitting addresses both. - ---- - -## The Solution - -Split the PDF into 4-page chunks, read 3 chunks at a time (~12 pages), and write structured notes incrementally. 
- -### How It Works - -| Step | Action | -|------|--------| -| **Acquire** | Download the PDF (via web search) or use a local file in place | -| **Check** | Look for existing `_text.md` extract or existing splits — offer to reuse | -| **Split** | PyPDF2 splits into 4-page chunks in `_build/split_/` | -| **Read** | Read 3 splits at a time, pause after each batch | -| **Extract** | Update running `notes.md` with structured information | -| **Persist** | Save final extraction to `_text.md` alongside the source PDF | -| **Confirm** | Wait for user approval before continuing to next batch | - -### Usage - -``` -/split-pdf path/to/paper.pdf -/split-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers" -``` - -**You must tell Claude what paper to read.** Claude cannot webcrawl for a paper it doesn't know exists. Provide either a local file path or a search query specific enough to find the paper — an author name, title, keywords, year, or some combination. If you just type `/split-pdf` with nothing else, Claude will ask you what you're looking for. - -### What Gets Extracted (8 Dimensions) - -The skill produces a **structured extraction** — more detailed and specific than a typical summary, organized around the dimensions a researcher needs to build on or replicate the work: - -1. **Research question** — What is the paper asking and why does it matter? -2. **Audience** — Which sub-community of researchers cares about this? -3. **Method** — How do they answer the question? What is the identification strategy? -4. **Data** — What data do they use? Where did they find it? Unit of observation? Sample size? Time period? -5. **Statistical methods** — What econometric or statistical techniques? Key specifications? -6. **Findings** — Main results? Key coefficient estimates and standard errors? -7. **Contributions** — What is learned that we didn't know before? -8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs? 
- ---- - -## Key Features - -### In-place PDF handling - -The skill uses the PDF wherever it already lives. No copying to a centralized `articles/` folder. This lets the skill work inside any project folder without rearranging your file structure. - -### Persistent extraction (`_text.md`) - -After all batches are read, the skill writes a structured plain-text extraction as `_text.md` next to the source PDF. On future invocations, the skill checks for this file first and offers to reuse it — skipping re-reading entirely. This saves tokens and time on previously processed papers. - -### Split reuse - -If splits already exist in the build directory from a previous run, the skill offers to reuse them instead of re-splitting. - -### Build directory convention - -Splits go into `_build/split_/` rather than directly alongside the PDF. This keeps working artifacts (splits, intermediate notes) separate from source files and finished outputs. Multiple PDFs in the same folder share one build directory. - -### Agent isolation protocol - -When another skill calls `/split-pdf` (for example, `/beautiful_deck` reading a paper before generating slides), the PDF reading runs inside a subagent. Each PDF page renders as image data that accumulates permanently in the conversation context. A 35-page paper can add 10-20MB. Without isolation, two or three large PDFs crash the session by hitting the API request size limit. The subagent reads the pages, writes plain-text output, and the parent skill only reads the text. - -Standalone invocations (user calls `/split-pdf` directly) use the interactive pause-and-confirm protocol in the main conversation. - ---- - -## Why This Design - -**Why 4-page chunks?** Small enough for careful attention, large enough to keep logical sections (a methodology subsection, a results table with discussion) together. A 40-page paper becomes 10 chunks read in 4 rounds. - -**Why 3 chunks per batch (~12 pages)?** Balances throughput against attention quality. 
Twelve pages is enough to make progress but not so much that comprehension degrades. - -**Why pause between batches?** So you can: -- Review intermediate output and catch errors before they compound -- Redirect the reading or ask follow-up questions -- Skip sections that aren't relevant -- Control pacing for sections that need more care - -**Why incremental notes instead of a final summary?** When Claude reads a full paper at once, it produces a summary — lossy compression. When it reads in batches and updates running notes, it accumulates detail. The final notes are richer than any one-shot summary. - -**Why persist the extraction?** A 40-page paper costs ~4 rounds of PDF image rendering. Doing that twice is waste. The `_text.md` file lets you come back to the paper weeks later without re-reading a single page. - -For the full methodology, see [`.claude/skills/split-pdf/methodology.md`](../../.claude/skills/split-pdf/methodology.md). - ---- - -## Directory Structure After Running - -``` -articles/ # any working folder -├── smith_2024.pdf # original PDF — ALWAYS preserved, never deleted -├── smith_2024_text.md # structured extract — reusable across sessions -└── articles_build/ # _build/ — shared build folder - └── split_smith_2024/ # split_/ - ├── smith_2024_pp1-4.pdf # 4-page chunks - ├── smith_2024_pp5-8.pdf - ├── smith_2024_pp9-12.pdf - ├── ... - └── notes.md # working copy of structured notes -``` - -**The original PDF is never deleted.** Whether Claude downloaded it via web search or you pointed it to a local file, the original always stays where it was. The split files are derivatives. If anything goes wrong — a corrupted split, a re-read with different parameters — you can always re-split from the original. - ---- - -## Example: Gentzkow, Shapiro & Sinkinson (AER 2014) - -This directory contains a complete worked example showing the full pipeline from start to finish. 
- -### What happened - -The user typed: - -``` -/split-pdf "Gentzkow Shapiro Sinkinson competition ideological diversity newspapers" +```text +/read-pdf --split ``` -Claude searched the web, found the paper on Matt Gentzkow's website at `https://web.stanford.edu/~gentzkow/research/competition.pdf`, downloaded it to a local `articles/` directory, and then split and read it. The original PDF was **kept** — only the splits were created alongside it. - -### What's in this directory - -``` -gentskow_shapiro_competition/ -├── gs_competition_pp1-4.pdf # pages 1-4 (intro, motivation, preview) -├── gs_competition_pp5-8.pdf # pages 5-8 (data, descriptive analysis) -├── gs_competition_pp9-12.pdf # pages 9-12 (descriptive results, model setup) -├── gs_competition_pp13-16.pdf # pages 13-16 (model continued, identification) -├── gs_competition_pp17-20.pdf # pages 17-20 (model structure, assumptions) -├── gs_competition_pp21-24.pdf # pages 21-24 (estimation, demand side) -├── gs_competition_pp25-28.pdf # pages 25-28 (supply estimation, parameter estimates) -├── gs_competition_pp29-32.pdf # pages 29-32 (model fit, welfare analysis) -├── gs_competition_pp33-36.pdf # pages 33-36 (policy experiments, robustness) -├── gs_competition_pp37-40.pdf # pages 37-40 (appendix, references) -├── gs_competition_pp41-42.pdf # pages 41-42 (remaining references) -└── notes.md # structured reading notes (all 8 dimensions) -``` - -The original PDF would sit one level up in `articles/` — it is not included here to keep the repo size reasonable, but in actual use it is always preserved. - -### The reading process - -The 42-page paper was split into 11 chunks (ten 4-page chunks + one 2-page chunk). 
Claude read them in 4 rounds: - -| Round | Splits read | Pages | What was covered | -|-------|-------------|-------|------------------| -| 1 | pp1-4, pp5-8, pp9-12 | 1-12 | Introduction, data, descriptive results | -| 2 | pp13-16, pp17-20, pp21-24 | 13-24 | Structural model, identification, estimation | -| 3 | pp25-28, pp29-32, pp33-36 | 25-36 | Parameter estimates, welfare, policy experiments | -| 4 | pp37-40, pp41-42 | 37-42 | Appendix robustness, references | - -After each round, Claude paused and asked whether to continue. The `notes.md` file was updated incrementally after each batch. - -### What the notes look like - -Open [`notes.md`](gentskow_shapiro_competition/notes.md) to see the full output. It's a structured extraction across all 8 dimensions — more detailed than a typical summary — including specific coefficient estimates, standard errors, equation numbers, exact data sources with where they were obtained, sample sizes, and a detailed assessment of replication feasibility. - ---- - -## Limitations - -- **It is slow.** A 37-page paper requires ~4 rounds of reading with user confirmation between each. This is a deliberate trade-off: careful reading over fast reading. -- **Notes can become repetitive** if the paper revisits themes. Some manual editing of the final notes may be useful. -- **Not for triage.** If you just need to decide whether a paper is relevant, read only the first split (pages 1-4, which usually contains the abstract and introduction). You don't need the full protocol. -- **Papers under ~15 pages** can be read directly without splitting. +The canonical split workflow now lives in [`.claude/skills/read-pdf/SKILL.md`](../../.claude/skills/read-pdf/SKILL.md), under `--split` mode. ---- +## What Still Works -## Acknowledgments +- Original PDF is preserved. +- Split files are written under `_build/split_/`. +- Working notes are accumulated as `notes.md`. +- Final reusable extraction is saved as `_text.md`. 
+- Old script path `~/.claude/skills/split-pdf/scripts/split.py` remains as a shim to the canonical `~/.claude/skills/read-pdf/scripts/split.py`. -The in-place PDF handling, persistent `_text.md` extraction, split reuse, build directory convention, and agent isolation protocol were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin), who adapted the original skill for his own workflows and shared his findings (April 2026). His version demonstrated that subagent isolation prevents context bloat when reading multiple large PDFs in a single session — a critical reliability improvement. The implementation here is independently written but the ideas are his. +For batching rationale, see [`.claude/skills/split-pdf/methodology.md`](../../.claude/skills/split-pdf/methodology.md).
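The split-file naming that both the canonical splitter and the shim rely on can be sketched without touching a real PDF. This is a minimal illustration, assuming the `_pp{first}-{last}.pdf` naming and the 4-page default shown in `scripts/split.py`. The helpers `chunk_ranges` and `split_names` are hypothetical names introduced here for illustration and are not part of the patch.

```python
from __future__ import annotations


def chunk_ranges(total_pages: int, pages_per_chunk: int = 4) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) pairs, one per split file.

    Hypothetical helper: mirrors the loop arithmetic in split_pdf, no PDF I/O.
    """
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    ranges: list[tuple[int, int]] = []
    for start in range(0, total_pages, pages_per_chunk):
        # The final chunk is clamped to the last page of the document.
        end = min(start + pages_per_chunk, total_pages)
        ranges.append((start + 1, end))
    return ranges


def split_names(stem: str, total_pages: int, pages_per_chunk: int = 4) -> list[str]:
    """Build the split file names for a PDF stem (hypothetical helper)."""
    return [
        f"{stem}_pp{first}-{last}.pdf"
        for first, last in chunk_ranges(total_pages, pages_per_chunk)
    ]


if __name__ == "__main__":
    names = split_names("gs_competition", 42)
    print(len(names), names[0], names[-1])
```

For the 42-page worked example in the removed README text, this yields eleven names, ending in a two-page `pp41-42` file, which matches the directory listing shown there.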