scunning1975 · nsmiller2501 · May 8, 2026
diff --git a/.claude/skills/read-pdf/README.md b/.claude/skills/read-pdf/README.md
@@ -0,0 +1,144 @@
+# `/read-pdf` — Download, Convert, and Deep-Read Academic Papers
+
+**Same workflow as `/split-pdf`, but uses python:marker to convert the PDF to markdown locally first, instead of having Claude vision-read PDF page images.** This makes equation, table, and figure extraction more faithful, and avoids image-based context bloat in the parent conversation.
+
+**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](../../.claude/skills/read-pdf/SKILL.md)
+
+---
+
+## What This Skill Does
+
+You give Claude a paper — either a local PDF file or a search query — and it does the rest. It finds and downloads the paper (or uses your local file in place), converts it to clean markdown using python:marker, then reads that markdown to write structured notes. When finished, it saves a persistent `_text.md` extraction alongside the source PDF, in the same format produced by `/split-pdf`.
+
+---
+
+## Why It Exists
+
+`/split-pdf` reads PDFs by having Claude vision-read page images in batches. This works well for most papers but has two limitations:
+
+1. **Equation fidelity.** PDF page images render math as bitmaps. Vision-reading bitmaps produces approximate LaTeX transcriptions. Papers heavy with structural equations (e.g., structural IO, dynamic programming models) benefit from native math extraction.
+
+2. **Table structure.** Complex tables (multi-column headers, merged cells, footnotes) are harder to transcribe accurately from images than from a layout-aware text conversion.
+
+`/read-pdf` addresses both by running a local conversion step first. The result is a `markdown.md` file in which equations are written in native LaTeX math mode and tables use pipe-syntax markdown, so both are readable as text rather than as image bitmaps.
+
+---
+
+## The Solution
+
+Convert the PDF to markdown with python:marker (layout-aware, GPU-accelerated), then read the text.
+
+### How It Works
+
+| Step | Action |
+|------|--------|
+| **Acquire** | Download the PDF (via web search) or use a local file in place |
+| **Install** | `install.py` sets up the marker venv on first run (~500 MB, one-time) |
+| **Check cache** | SHA-256 hash check — skip re-conversion if markdown already cached |
+| **Convert** | `convert.py` runs marker and writes `markdown.md` to a content-hash cache |
+| **Collision** | If `_text.md` already exists, ask: overwrite or save as `_text2.md`? |
+| **Extract** | Read `markdown.md`, write bibliographic metadata + 8-dimension notes |
+| **Persist** | Save final extraction to `<basename>_text.md` alongside the source PDF |
+
+### Usage
+
+```
+/read-pdf path/to/paper.pdf
+/read-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers"
+```
+
+As with `/split-pdf`, you must tell Claude what paper to read. Provide either a local file path or a search query specific enough to find the paper.
+
+### What Gets Extracted
+
+Same 8 dimensions as `/split-pdf`, plus a bibliographic metadata block at the top of `_text.md`:
+
+```
+## Bibliographic metadata
+doi: <10.xxxx/yyyy or null>
+authors: [LastName1, LastName2, ...]
+title: <verbatim title>
+year: <year>
+venue: <journal/working paper series/etc.>
+venue_type: journal | working_paper | book_chapter | other
+```
+
+1. **Research question** — What is the paper asking and why does it matter?
+2. **Audience** — Which sub-community of researchers cares about this?
+3. **Method** — How do they answer the question? What is the identification strategy?
+4. **Data** — What data do they use? Where did they find it? Unit of observation? Sample size? Time period?
+5. **Statistical methods** — What econometric or statistical techniques? Key specifications?
+6. **Findings** — Main results? Key coefficient estimates and standard errors?
+7. **Contributions** — What is learned that we didn't know before?
+8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs?
+
+---
+
+## Key Features
+
+### Conversion backend: marker
+
+The conversion backend is **marker** (`marker-pdf`). It was selected after a head-to-head bake-off against docling on a representative set of empirical-economics PDFs; marker won on equation fidelity, table structure, and figure extraction quality.
+
+Backend selection is fixed in `convert.py`. There is no runtime override — if the bake-off needs to be redone for a future backend candidate, edit the `BACKEND` constant in `convert.py` explicitly so the cache namespace and venv are regenerated cleanly.
+
+### Born-digital PDFs and OCR
+
+Most journal PDFs already contain an embedded text layer. For those files, `convert.py` samples the first pages with `pdftotext` and tells marker to use the embedded text rather than re-OCRing the whole document. Marker still performs layout, table, and selected region recognition, but avoids the extremely slow full-document OCR path. If the text-layer sample is missing or too sparse, marker keeps OCR enabled for scanned PDFs.
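+
+A minimal sketch of how such a text-layer check could look, assuming `pdftotext` (from poppler) is on `PATH`. The function names, the three-page sample, and the 200-characters-per-page threshold are illustrative assumptions, not values taken from `convert.py`.
+
+```python
+import shutil
+import subprocess
+
+SAMPLE_PAGES = 3          # assumed sample size, not taken from convert.py
+MIN_CHARS_PER_PAGE = 200  # assumed sparsity threshold, not taken from convert.py
+
+
+def sample_text_layer(pdf_path: str, pages: int = SAMPLE_PAGES) -> str:
+    """Return embedded text from the first `pages` pages, or "" if unavailable."""
+    if shutil.which("pdftotext") is None:
+        return ""
+    # "-f"/"-l" select the first and last page; "-" sends the text to stdout.
+    result = subprocess.run(
+        ["pdftotext", "-f", "1", "-l", str(pages), pdf_path, "-"],
+        capture_output=True, text=True, errors="replace",
+    )
+    return result.stdout if result.returncode == 0 else ""
+
+
+def has_usable_text_layer(sample: str, pages: int = SAMPLE_PAGES,
+                          min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> bool:
+    """Treat the text layer as usable when the sample is dense enough."""
+    non_whitespace = sum(1 for ch in sample if not ch.isspace())
+    return non_whitespace >= pages * min_chars_per_page
+```
+
+In this sketch an empty or sparse sample returns `False`, which corresponds to the scanned-PDF case where OCR stays enabled.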
+
+### GPU acceleration
+
+The device is auto-detected: NVIDIA CUDA if available, otherwise CPU. MPS on Apple Silicon is excluded because surya's layout model crashes at runtime on MPS with an index-bounds error (some surya sub-models already refuse MPS; the layout model does not, and fails mid-conversion). Expect a 3–5× speedup on CUDA machines. No flags are needed on any platform.
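+
+The detection order can be sketched as follows. This is an illustration only: `pick_device` is a hypothetical helper, and the sketch assumes PyTorch is importable inside the marker venv. It does not show how `convert.py` hands the result to marker.
+
+```python
+def pick_device() -> str:
+    """Return "cuda" when an NVIDIA GPU is usable, otherwise "cpu"."""
+    try:
+        import torch  # assumed to be present inside the marker venv
+    except ImportError:
+        return "cpu"
+    # MPS is deliberately never considered, even if torch reports it available.
+    return "cuda" if torch.cuda.is_available() else "cpu"
+```
+
+Because the function never returns `"mps"`, an Apple Silicon machine falls through to CPU.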
+
+### Content-hash cache
+
+Conversions are cached by SHA-256 of the source PDF bytes at `~/.cache/claude-pdf-converter/cache/marker/<hash>/`. Re-converting the same PDF (even under a different filename, even in a different project) is a no-op — the cached `markdown.md` is returned immediately. The cache is shared across all projects on the machine.
+
+Cache entries are not auto-evicted. To force a re-conversion:
+```bash
+rm -rf ~/.cache/claude-pdf-converter/cache/marker/<hash>/
+```
+To wipe the entire cache (e.g., after a backend upgrade):
+```bash
+rm -rf ~/.cache/claude-pdf-converter/cache/
+```
+The venv at `~/.cache/claude-pdf-converter/venv-marker/` is untouched.
+
+### `_text.md` collision handling
+
+If a `_text.md` already exists alongside the PDF (e.g., from a prior `/split-pdf` run), the skill asks whether to overwrite it or save the new extraction as `_text2.md`. This lets you compare extractions from both methods on the same paper without losing earlier work.
+
+### Agent isolation protocol
+
+When another skill calls `/read-pdf`, the conversion runs in the parent context (lightweight bash call) and the reading runs inside a subagent. The subagent reads `markdown.md`, writes plain-text `_text.md`, and the parent reads only the text output. This prevents the converted markdown from accumulating token cost in a busy workflow conversation.
+
+---
+
+## `/read-pdf` vs `/split-pdf` — When to Use Which
+
+| | `/split-pdf` | `/read-pdf` |
+|---|---|---|
+| **Reading mechanism** | Claude vision-reads PDF page images | Marker converts to markdown; Claude reads text |
+| **Setup required** | None | `install.py` (~500 MB, one-time) |
+| **First-run latency** | None | ~1–3 min (model download + conversion) |
+| **Subsequent runs** | — | Instant if cached |
+| **Equation fidelity** | Good (vision-based) | Better (native LaTeX extraction) |
+| **Table structure** | Good | Better (layout-aware) |
+| **Works without internet** | No (unless PDF already local) | Yes (after install) |
+| **Output format** | `_text.md` | `_text.md` (same format) |
+
+Both skills produce the same `_text.md` output format, so downstream skills like `/bib-update` and `/wiki-update` can use either one interchangeably.
+
+---
+
+## Limitations
+
+- **Requires local setup.** First run downloads ~500 MB of models. Not suitable for environments where you can't write to `~/.cache/`.
+- **Conversion can fail on malformed PDFs.** If `convert.py` errors, the skill stops — it does not fall back to a degraded alternative. Fix the PDF or use `/split-pdf` instead.
+- **Not for triage.** If you just need to decide whether a paper is relevant, use `/split-pdf` (no setup, works immediately on first split).
+
+---
+
+## Acknowledgments
+
+The in-place PDF handling, persistent `_text.md` extraction, build directory convention, and agent isolation protocol follow conventions established in the `/split-pdf` skill, where they were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin). The marker integration (`convert.py`, `install.py`) and content-hash caching design are original to this skill.
diff --git a/.claude/skills/read-pdf/SKILL.md b/.claude/skills/read-pdf/SKILL.md
@@ -0,0 +1,164 @@
+---
+name: read-pdf
+description: Download or use a local academic PDF, convert to clean markdown locally (python:marker, layout-aware), then extract structured reading notes into `_text.md`. Same output contract as /split-pdf — bibliographic metadata block + 8-dimension research notes — but uses local conversion instead of Claude vision-reading PDF images. Preserves equation fidelity, table structure, and figure references. Use when you want higher-fidelity math/table extraction, or when you already have a local file.
+allowed-tools: Bash(python3:*), Bash(curl:*), Bash(wget:*), Bash(mkdir:*), Read, Write, WebSearch, WebFetch, Agent
+argument-hint: [pdf-path-or-search-query]
+---
+
+# Read-PDF: Download, Convert, and Deep-Read Academic Papers
+
+Same I/O contract as /split-pdf: takes a PDF (local or searched), produces a structured `_text.md` extraction with a bibliographic metadata block and 8-dimension research notes. The difference is the reading mechanism: instead of Claude vision-reading PDF page images in chunks, read-pdf converts the PDF to markdown locally using python:marker, then reads the text. This preserves equation fidelity, table structure, and figure references without image-based context bloat.
+
+## When This Skill Is Invoked
+
+The user wants to read, review, or summarize an academic paper and either: (a) wants layout-aware equation/table extraction, or (b) already has a local PDF. The input is either:
+- A file path to a local PDF (e.g., `~/Documents/papers/smith_2024.pdf`)
+- A search query or paper title (e.g., `"Gentzkow Shapiro Sinkinson 2014 competition newspapers"`)
+
+**Important:** You cannot search for a paper you don't know exists. Provide either a file path or a specific query. If the user invokes this skill without specifying a paper, ask them.
+
+## Prerequisites
+
+- **Python ≥ 3.10** must be available. `install.py` refuses to proceed on Python 3.9 or older. If needed, install a newer version with `brew install python@3.12`, `apt install python3.11`, or the python.org installer.
+- **Optional GPU acceleration** is auto-detected: NVIDIA CUDA → CPU. (MPS on Apple Silicon is excluded — surya's layout model crashes on MPS at runtime.)
+
+## Step 1: Acquire the PDF
+
+**If a local file path is provided:**
+- Verify the file exists
+- Use the PDF in place. The working directory is the folder containing the PDF.
+- Proceed to Step 2
+
+**If a search query or paper title is provided:**
+1. Use WebSearch to find the paper
+2. Use WebFetch or Bash (curl/wget) to download the PDF
+3. Save it to the current working directory
+4. Proceed to Step 2
+
+**CRITICAL: Always preserve the original PDF.** Never delete, move, or overwrite it at any point in this workflow.
+
+## Step 2: Ensure the converter is installed
+
+```bash
+python3 ~/.claude/skills/read-pdf/install.py
+```
+
+Idempotent. First run creates a venv at `~/.cache/claude-pdf-converter/venv-marker/` and downloads marker models (~500 MB, 1–3 min). Surface the "First run" message to the user verbatim if it appears — they should know why this invocation is slow.
+
+## Step 3: Convert
+
+**Before converting, check for a cached conversion.** Compute the SHA-256 hash of the PDF and check whether `markdown.md` already exists in the cache:
+
+```python
+import hashlib
+import os
+
+pdf_path = "<absolute-pdf-path>"
+
+# The cache key is the SHA-256 of the PDF bytes, so the filename is irrelevant.
+with open(pdf_path, 'rb') as f:
+    pdf_hash = hashlib.sha256(f.read()).hexdigest()
+
+markdown_path = os.path.expanduser(
+    f'~/.cache/claude-pdf-converter/cache/marker/{pdf_hash}/markdown.md'
+)
+# Print the cached path on a hit, or the NOT_CACHED sentinel on a miss.
+print(markdown_path if os.path.exists(markdown_path) else "NOT_CACHED")
+```
+
+- **If cached:** tell the user "Using cached markdown conversion (SHA-256 match), skipping re-conversion." Use the printed path as `markdown_path`.
+- **If not cached:** run:
+  ```bash
+  python3 ~/.claude/skills/read-pdf/convert.py "<pdf-path>"
+  ```
+  It prints the absolute path to `markdown.md` on success and exits 0. For born-digital PDFs with a usable embedded text layer, `convert.py` uses that text layer and disables marker's full-document OCR path while preserving marker's layout/table processing. **Do not fall back to pdftotext or any other tool on failure** — surface the error and stop. The whole point of this skill is the layout-aware conversion; a degraded fallback produces silently-wrong output.
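+
+A hedged sketch of the "surface the error and stop" behaviour when `convert.py` is driven from Python rather than from a bash call. `run_converter` is a hypothetical wrapper, and reading the path from the last line of stdout is an assumption about the output format.
+
+```python
+import subprocess
+import sys
+
+
+def run_converter(convert_script: str, pdf_path: str) -> str:
+    """Run the converter and return the markdown path it prints; raise on failure."""
+    result = subprocess.run(
+        [sys.executable, convert_script, pdf_path],
+        capture_output=True, text=True,
+    )
+    if result.returncode != 0:
+        # No fallback: surface the converter's own error message and stop.
+        raise RuntimeError(f"convert.py failed:\n{result.stderr.strip()}")
+    lines = result.stdout.strip().splitlines()
+    if not lines:
+        raise RuntimeError("convert.py exited 0 but printed no path")
+    # Assumption: the markdown path is the last line of stdout.
+    return lines[-1]
+```
+
+The wrapper has no `except` branch that tries another tool, so a failed conversion cannot silently produce degraded output.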
+
+## Step 4: Check for existing `_text.md`
+
+Look for `<basename>_text.md` in the same folder as the PDF.
+
+If found, ask:
+> "An extract already exists (`<basename>_text.md`). Overwrite it, or save the new extraction as `<basename>_text2.md`?"
+
+Proceed using whichever filename the user chooses.
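+
+The filename choice can be sketched as below. The helper names are hypothetical; `overwrite` stands for the user's answer to the question above, which only matters when `<basename>_text.md` already exists.
+
+```python
+from pathlib import Path
+
+
+def extraction_paths(pdf_path: str) -> tuple[Path, Path]:
+    """Return the default `_text.md` path and the `_text2.md` alternative."""
+    pdf = Path(pdf_path)
+    return (pdf.with_name(f"{pdf.stem}_text.md"),
+            pdf.with_name(f"{pdf.stem}_text2.md"))
+
+
+def choose_output(pdf_path: str, overwrite: bool) -> Path:
+    """Pick the output file once the user has answered the collision question."""
+    primary, alternate = extraction_paths(pdf_path)
+    if primary.exists() and not overwrite:
+        return alternate
+    return primary
+```
+
+For `smith_2024.pdf` this yields `smith_2024_text.md` unless that file exists and the user declines to overwrite it, in which case it yields `smith_2024_text2.md`.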
+
+## Step 5: Structured Extraction
+
+Read `markdown.md` and collect information along these dimensions:
+
+0. **Bibliographic metadata** — From the title section of the markdown, extract:
+   ```
+   ## Bibliographic metadata
+   doi: <10.xxxx/yyyy if present, else null>
+   authors: [LastName1, LastName2, ...]
+   title: <verbatim title>
+   year: <year>
+   venue: <journal/working paper series/etc., verbatim>
+   venue_type: journal | working_paper | book_chapter | other
+   ```
+   If a field is not visible, record `null`.
+
+1. **Research question** — What is the paper asking and why does it matter?
+2. **Audience** — Which sub-community of researchers cares about this?
+3. **Method** — How do they answer the question? What is the identification strategy?
+4. **Data** — What data do they use? Where precisely did they find it? Unit of observation? Sample size? Time period?
+5. **Statistical methods** — What econometric or statistical techniques? Key specifications?
+6. **Findings** — Main results? Key coefficient estimates and standard errors?
+7. **Contributions** — What is learned that we didn't know before?
+8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs?
+
+## The Output File
+
+Write the final structured extraction to `<basename>_text.md` (or `_text2.md` if chosen in Step 4) in the same folder as the source PDF, with the `## Bibliographic metadata` block first, followed by the research notes.
+
+Notify the user:
+> "Extract saved to `smith_2024_text.md` alongside the source PDF. Future requests on this paper can reuse it without re-reading."
+
+This file is the persistent, reusable artifact.
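+
+As an illustration of reuse, a downstream consumer might read the metadata block back out of `_text.md` roughly like this. `parse_metadata` is a hypothetical helper; the sketch assumes one `key: value` pair per line and that the block ends at the next markdown heading.
+
+```python
+def parse_metadata(text: str) -> dict:
+    """Collect `key: value` lines under the metadata heading; `null` becomes None."""
+    fields = {}
+    in_block = False
+    for line in text.splitlines():
+        stripped = line.strip()
+        if stripped == "## Bibliographic metadata":
+            in_block = True
+            continue
+        if in_block:
+            if stripped.startswith("#"):
+                break  # the next heading ends the block
+            if ":" in stripped:
+                # Split on the first colon only, so titles may contain colons.
+                key, value = stripped.split(":", 1)
+                value = value.strip()
+                fields[key.strip()] = None if value == "null" else value
+    return fields
+```
+
+Values are kept as raw strings (for example, `authors` stays in its bracketed form), leaving any further parsing to the consumer.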
+
+## Agent Isolation Protocol
+
+**When read-pdf is invoked by another skill**, the conversion steps (Steps 2–3) and the `_text.md` collision check (Step 4) run in the parent context; the conversion steps are lightweight bash calls. The reading and extraction (Step 5) MUST run inside a subagent. The converted `markdown.md` can be large, and reading it in the parent context of an active workflow accumulates permanent token cost. The subagent reads `markdown.md`, writes plain-text `_text.md`, and the parent reads only that.
+
+**Pattern:**
+
+The parent skill handles install.py, the SHA-256 cache check, convert.py if needed, and the `_text.md` collision check. Then it launches an Agent:
+
+```
+Read a converted markdown file and produce structured extraction notes.
+
+Markdown input: <markdown_path>
+Text output: <text_path>
+
+Process:
+1. Read <markdown_path> using the Read tool
+2. From the title section, extract a bibliographic metadata block:
+   ## Bibliographic metadata
+   doi: <10.xxxx/yyyy if present, else null>
+   authors: [LastName1, LastName2, ...]
+   title: <verbatim title>
+   year: <year>
+   venue: <journal/working paper series/etc., verbatim>
+   venue_type: journal | working_paper | book_chapter | other
+3. Extract: research question, audience, method, data (sources, sample size, time period),
+   statistical methods, findings, contributions, replication feasibility
+4. Write the final structured extraction to <text_path>, with the
+   ## Bibliographic metadata block first, followed by the research notes.
+
+Report when done: page count, figures/tables found, one-sentence content summary.
+```
+
+After the agent returns, the parent reads `_text.md` (plain text, not the large `markdown.md`) and continues its workflow.
+
+**Standalone invocations** (user calls `/read-pdf` directly) read `markdown.md` in the main conversation and write `_text.md` directly — no subagent needed for a one-off read.
+
+## Quick Reference
+
+| Step | Action |
+|------|--------|
+| **Acquire** | Download via web search or use local file in place |
+| **Install** | `python3 ~/.claude/skills/read-pdf/install.py` (idempotent; downloads models on first run) |
+| **Check cache** | SHA-256 → `~/.cache/claude-pdf-converter/cache/marker/<hash>/markdown.md` |
+| **Convert** | `python3 ~/.claude/skills/read-pdf/convert.py <pdf>` if not cached |
+| **Collision** | Ask overwrite vs `_text2.md` if `_text.md` already exists |
+| **Extract** | Bibliographic metadata + 8-dimension notes from `markdown.md` |
+| **Persist** | Save to `<basename>_text.md` alongside the source PDF |
+
+For backend details, cache management, and GPU notes, see [README.md](README.md).