From c8fc440c5f12a80e636bc054d399b5b5c8ec5be0 Mon Sep 17 00:00:00 2001
From: Noah Miller <noah.miller.012@gmail.com>
Date: Fri, 8 May 2026 11:24:45 -0400
Subject: [PATCH] Add read-pdf marker conversion skill

---
 .claude/skills/read-pdf/README.md  | 144 ++++++++++++++++
 .claude/skills/read-pdf/SKILL.md   | 164 ++++++++++++++++++
 .claude/skills/read-pdf/convert.py | 261 +++++++++++++++++++++++++++++
 .claude/skills/read-pdf/install.py | 163 ++++++++++++++++++
 skills/read-pdf/README.md          | 119 +++++++++++++
 5 files changed, 851 insertions(+)
 create mode 100644 .claude/skills/read-pdf/README.md
 create mode 100644 .claude/skills/read-pdf/SKILL.md
 create mode 100644 .claude/skills/read-pdf/convert.py
 create mode 100644 .claude/skills/read-pdf/install.py
 create mode 100644 skills/read-pdf/README.md
diff --git a/.claude/skills/read-pdf/README.md b/.claude/skills/read-pdf/README.md
new file mode 100644
index 0000000..38c6039
--- /dev/null
+++ b/.claude/skills/read-pdf/README.md
@@ -0,0 +1,144 @@
+# `/read-pdf` — Download, Convert, and Deep-Read Academic Papers
+
+**Same workflow as `/split-pdf`, but uses python:marker to convert the PDF to markdown locally first, instead of having Claude vision-read PDF page images.** This makes equation, table, and figure extraction more faithful, and avoids image-based context bloat in the parent conversation.
+
+**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](SKILL.md)
+
+---
+
+## What This Skill Does
+
+You give Claude a paper — either a local PDF file or a search query — and it does the rest. It finds and downloads the paper (or uses your local file in place), converts it to clean markdown using python:marker, then reads that markdown to write structured notes. When finished, it saves a persistent `_text.md` extraction alongside the source PDF, in the same format produced by `/split-pdf`.
+
+---
+
+## Why It Exists
+
+`/split-pdf` reads PDFs by having Claude vision-read page images in batches. This works well for most papers but has two limitations:
+
+1. **Equation fidelity.** PDF page images render math as bitmaps. Vision-reading bitmaps produces approximate LaTeX transcriptions. Papers heavy with structural equations (e.g., structural IO, dynamic programming models) benefit from native math extraction.
+
+2. **Table structure.** Complex tables (multi-column headers, merged cells, footnotes) are harder to transcribe accurately from images than from a layout-aware text conversion.
+
+`/read-pdf` addresses both by running a local conversion step first. The result is a `markdown.md` file where equations are native LaTeX math mode and tables are pipe-syntax markdown — readable as text rather than image bitmaps.
+
+---
+
+## The Solution
+
+Convert the PDF to markdown with python:marker (layout-aware, GPU-accelerated), then read the text.
+
+### How It Works
+
+| Step | Action |
+|------|--------|
+| **Acquire** | Download the PDF (via web search) or use a local file in place |
+| **Install** | `install.py` sets up the marker venv on first run (~500 MB, one-time) |
+| **Check cache** | SHA-256 hash check — skip re-conversion if markdown already cached |
+| **Convert** | `convert.py` runs marker and writes `markdown.md` to a content-hash cache |
+| **Collision** | If `_text.md` already exists, ask: overwrite or save as `_text2.md`? |
+| **Extract** | Read `markdown.md`, write bibliographic metadata + 8-dimension notes |
+| **Persist** | Save final extraction to `<basename>_text.md` alongside the source PDF |
+
+### Usage
+
+```
+/read-pdf path/to/paper.pdf
+/read-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers"
+```
+
+As with `/split-pdf`, you must tell Claude what paper to read. Provide either a local file path or a search query specific enough to find the paper.
+
+### What Gets Extracted
+
+Same 8 dimensions as `/split-pdf`, plus a bibliographic metadata block at the top of `_text.md`:
+
+```
+## Bibliographic metadata
+doi: <10.xxxx/yyyy or null>
+authors: [LastName1, LastName2, ...]
+title: <verbatim title>
+year: <year>
+venue: <journal/working paper series/etc.>
+venue_type: journal | working_paper | book_chapter | other
+```
+
+1. **Research question** — What is the paper asking and why does it matter?
+2. **Audience** — Which sub-community of researchers cares about this?
+3. **Method** — How do they answer the question? What is the identification strategy?
+4. **Data** — What data do they use? Where did they find it? Unit of observation? Sample size? Time period?
+5. **Statistical methods** — What econometric or statistical techniques? Key specifications?
+6. **Findings** — Main results? Key coefficient estimates and standard errors?
+7. **Contributions** — What is learned that we didn't know before?
+8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs?
+
+---
+
+## Key Features
+
+### Conversion backend: marker
+
+The conversion backend is **marker** (`marker-pdf`). It was selected after a head-to-head bake-off against docling on a representative set of empirical-economics PDFs; marker won on equation fidelity, table structure, and figure extraction quality.
+
+Backend selection is fixed in `convert.py`, with no runtime override. If the bake-off is redone for a future backend candidate, edit the `BACKEND` constant explicitly in both `convert.py` and `install.py` (each defines its own copy) so the cache namespace and venv are regenerated cleanly.
+
+### Born-digital PDFs and OCR
+
+Most journal PDFs already contain an embedded text layer. To detect this, `convert.py` samples the first three pages with `pdftotext`. If the sample holds at least 500 non-whitespace characters, it tells marker to use the embedded text rather than re-OCRing the whole document. Marker still performs layout, table, and selected region recognition, but avoids the extremely slow full-document OCR path. If the sample is sparser than that, or `pdftotext` is missing or fails, marker keeps OCR enabled, which is the right path for scanned PDFs.
+
+### GPU acceleration
+
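+The device is auto-detected: NVIDIA CUDA if available, otherwise CPU. MPS on Apple Silicon is excluded because surya's layout model crashes at runtime on MPS with an index-bounds error (some surya sub-models already refuse MPS; the layout model does not and fails mid-conversion). CUDA gives a 3–5× speedup over CPU. No flags are needed on any platform; `convert.py` only sets `TORCH_DEVICE` when the environment does not already define it, so an existing value acts as an override.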
+
+### Content-hash cache
+
+Conversions are cached by SHA-256 of the source PDF bytes at `~/.cache/claude-pdf-converter/cache/marker/<hash>/`. Re-converting the same PDF (even under a different filename, even in a different project) is a no-op — the cached `markdown.md` is returned immediately. The cache is shared across all projects on the machine.
+
+Cache entries are not auto-evicted. To force a re-conversion:
+```bash
+rm -rf ~/.cache/claude-pdf-converter/cache/marker/<hash>/
+```
+To wipe the entire cache (e.g., after a backend upgrade):
+```bash
+rm -rf ~/.cache/claude-pdf-converter/cache/
+```
+The venv at `~/.cache/claude-pdf-converter/venv-marker/` is untouched.
+
+### `_text.md` collision handling
+
+If a `_text.md` already exists alongside the PDF (e.g., from a prior `/split-pdf` run), the skill asks whether to overwrite it or save the new extraction as `_text2.md`. This lets you compare extractions from both methods on the same paper without losing earlier work.
+
+### Agent isolation protocol
+
+When another skill calls `/read-pdf`, the conversion runs in the parent context (lightweight bash call) and the reading runs inside a subagent. The subagent reads `markdown.md`, writes plain-text `_text.md`, and the parent reads only the text output. This prevents the converted markdown from accumulating token cost in a busy workflow conversation.
+
+---
+
+## `/read-pdf` vs `/split-pdf` — When to Use Which
+
+| | `/split-pdf` | `/read-pdf` |
+|---|---|---|
+| **Reading mechanism** | Claude vision-reads PDF page images | Marker converts to markdown; Claude reads text |
+| **Setup required** | None | `install.py` (~500 MB, one-time) |
+| **First-run latency** | None | ~1–3 min (model download + conversion) |
+| **Subsequent runs** | — | Instant if cached |
+| **Equation fidelity** | Good (vision-based) | Better (native LaTeX extraction) |
+| **Table structure** | Good | Better (layout-aware) |
+| **Works without internet** | No (unless PDF already local) | Yes (after install) |
+| **Output format** | `_text.md` | `_text.md` (same format) |
+
+Both skills produce the same `_text.md` output format and can be used interchangeably by downstream skills like `/bib-update` and `/wiki-update`.
+
+---
+
+## Limitations
+
+- **Requires local setup.** First run downloads ~500 MB of models. Not suitable for environments where you can't write to `~/.cache/`.
+- **Conversion can fail on malformed PDFs.** If `convert.py` errors, the skill stops — it does not fall back to a degraded alternative. Fix the PDF or use `/split-pdf` instead.
+- **Not for triage.** If you just need to decide whether a paper is relevant, use `/split-pdf` (no setup, works immediately on first split).
+
+---
+
+## Acknowledgments
+
+The in-place PDF handling, persistent `_text.md` extraction, build directory convention, and agent isolation protocol follow conventions established in the `/split-pdf` skill, where they were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin). The marker integration (`convert.py`, `install.py`) and content-hash caching design are original to this skill.
diff --git a/.claude/skills/read-pdf/SKILL.md b/.claude/skills/read-pdf/SKILL.md
new file mode 100644
index 0000000..1c59ad3
--- /dev/null
+++ b/.claude/skills/read-pdf/SKILL.md
@@ -0,0 +1,164 @@
+---
+name: read-pdf
+description: Download or use a local academic PDF, convert to clean markdown locally (python:marker, layout-aware), then extract structured reading notes into `_text.md`. Same output contract as /split-pdf — bibliographic metadata block + 8-dimension research notes — but uses local conversion instead of Claude vision-reading PDF images. Preserves equation fidelity, table structure, and figure references. Use when you want higher-fidelity math/table extraction, or when you already have a local file.
+allowed-tools: Bash(python3:*), Bash(curl:*), Bash(wget:*), Bash(mkdir:*), Read, Write, WebSearch, WebFetch, Agent
+argument-hint: [pdf-path-or-search-query]
+---
+
+# Read-PDF: Download, Convert, and Deep-Read Academic Papers
+
+Same I/O contract as /split-pdf: takes a PDF (local or searched), produces a structured `_text.md` extraction with a bibliographic metadata block and 8-dimension research notes. The difference is the reading mechanism: instead of Claude vision-reading PDF page images in chunks, read-pdf converts the PDF to markdown locally using python:marker, then reads the text. This preserves equation fidelity, table structure, and figure references without image-based context bloat.
+
+## When This Skill Is Invoked
+
+The user wants to read, review, or summarize an academic paper and either: (a) wants layout-aware equation/table extraction, or (b) already has a local PDF. The input is either:
+- A file path to a local PDF (e.g., `~/Documents/papers/smith_2024.pdf`)
+- A search query or paper title (e.g., `"Gentzkow Shapiro Sinkinson 2014 competition newspapers"`)
+
+**Important:** The skill cannot guess which paper is meant. It needs either a file path or a specific query, so if the user invokes this skill without specifying a paper, ask them.
+
+## Prerequisites
+
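+- **Python ≥ 3.10** must be available. `install.py` uses the running interpreter if it qualifies, otherwise searches PATH and common install locations, and exits with an error if it finds only Python 3.9 or older. If needed: `brew install python@3.12`, `apt install python3.11`, or the python.org installer.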
+- **Optional GPU acceleration** is auto-detected: NVIDIA CUDA → CPU. (MPS on Apple Silicon is excluded — surya's layout model crashes on MPS at runtime.)
+
+## Step 1: Acquire the PDF
+
+**If a local file path is provided:**
+- Verify the file exists
+- Use the PDF in place. The working directory is the folder containing the PDF.
+- Proceed to Step 2
+
+**If a search query or paper title is provided:**
+1. Use WebSearch to find the paper
+2. Use WebFetch or Bash (curl/wget) to download the PDF
+3. Save it to the current working directory
+4. Proceed to Step 2
+
+**CRITICAL: Always preserve the original PDF.** Never delete, move, or overwrite it at any point in this workflow.
+
+## Step 2: Ensure the converter is installed
+
+```bash
+python3 ~/.claude/skills/read-pdf/install.py
+```
+
+Idempotent. First run creates a venv at `~/.cache/claude-pdf-converter/venv-marker/` and downloads marker models (~500 MB, 1–3 min). Surface the "First run" message to the user verbatim if it appears — they should know why this invocation is slow.
+
+## Step 3: Convert
+
+**Before converting, check for a cached conversion.** Compute the SHA-256 hash of the PDF and check whether `markdown.md` already exists in the cache:
+
+```python
+import hashlib, os
+
+pdf_path = "<absolute-pdf-path>"
+
+with open(pdf_path, 'rb') as f:
+    pdf_hash = hashlib.sha256(f.read()).hexdigest()
+
+markdown_path = os.path.expanduser(
+    f'~/.cache/claude-pdf-converter/cache/marker/{pdf_hash}/markdown.md'
+)
+print(markdown_path if os.path.exists(markdown_path) else "NOT_CACHED")
+```
+
+- **If cached:** tell the user "Using cached markdown conversion (SHA-256 match), skipping re-conversion." Use the printed path as `markdown_path`.
+- **If not cached:** run:
+  ```bash
+  python3 ~/.claude/skills/read-pdf/convert.py "<pdf-path>"
+  ```
+  It prints the absolute path to `markdown.md` on success and exits 0. For born-digital PDFs with a usable embedded text layer, `convert.py` uses that text layer and disables marker's full-document OCR path while preserving marker's layout/table processing. **Do not fall back to pdftotext or any other tool on failure** — surface the error and stop. The whole point of this skill is the layout-aware conversion; a degraded fallback produces silently-wrong output.
+
+## Step 4: Check for existing `_text.md`
+
+Look for `<basename>_text.md` in the same folder as the PDF.
+
+If found, ask:
+> "An extract already exists (`<basename>_text.md`). Overwrite it, or save the new extraction as `<basename>_text2.md`?"
+
+Proceed using whichever filename the user chooses.
+
+## Step 5: Structured Extraction
+
+Read `markdown.md` and collect information along these dimensions:
+
+0. **Bibliographic metadata** — From the title section of the markdown, extract:
+   ```
+   ## Bibliographic metadata
+   doi: <10.xxxx/yyyy if present, else null>
+   authors: [LastName1, LastName2, ...]
+   title: <verbatim title>
+   year: <year>
+   venue: <journal/working paper series/etc., verbatim>
+   venue_type: journal | working_paper | book_chapter | other
+   ```
+   If a field is not visible, record `null`.
+
+1. **Research question** — What is the paper asking and why does it matter?
+2. **Audience** — Which sub-community of researchers cares about this?
+3. **Method** — How do they answer the question? What is the identification strategy?
+4. **Data** — What data do they use? Where precisely did they find it? Unit of observation? Sample size? Time period?
+5. **Statistical methods** — What econometric or statistical techniques? Key specifications?
+6. **Findings** — Main results? Key coefficient estimates and standard errors?
+7. **Contributions** — What is learned that we didn't know before?
+8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs?
+
+## The Output File
+
+Write the final structured extraction to `<basename>_text.md` (or `_text2.md` if chosen in Step 4) in the same folder as the source PDF, with the `## Bibliographic metadata` block first, followed by the research notes.
+
+Notify the user:
+> "Extract saved to `smith_2024_text.md` alongside the source PDF. Future requests on this paper can reuse it without re-reading."
+
+This file is the persistent, reusable artifact.
+
+## Agent Isolation Protocol
+
+**When read-pdf is invoked by another skill**, the install, cache check, and conversion (Steps 2–3) and the `_text.md` collision check (Step 4) run in the parent context; they are lightweight calls. The reading and extraction (Step 5) MUST run inside a subagent. The converted `markdown.md` can be large, and reading it in the parent context of an active workflow accumulates permanent token cost. The subagent reads `markdown.md`, writes plain-text `_text.md`, and the parent reads only that.
+
+**Pattern:**
+
+The parent skill handles install.py, the SHA-256 cache check, convert.py if needed, and the `_text.md` collision check. Then it launches an Agent:
+
+```
+Read a converted markdown file and produce structured extraction notes.
+
+Markdown input: <markdown_path>
+Text output: <text_path>
+
+Process:
+1. Read <markdown_path> using the Read tool
+2. From the title section, extract a bibliographic metadata block:
+   ## Bibliographic metadata
+   doi: <10.xxxx/yyyy if present, else null>
+   authors: [LastName1, LastName2, ...]
+   title: <verbatim title>
+   year: <year>
+   venue: <journal/working paper series/etc., verbatim>
+   venue_type: journal | working_paper | book_chapter | other
+3. Extract: research question, audience, method, data (sources, sample size, time period),
+   statistical methods, findings, contributions, replication feasibility
+4. Write the final structured extraction to <text_path>, with the
+   ## Bibliographic metadata block first, followed by the research notes.
+
+Report when done: figures/tables found and a one-sentence content summary.
+```
+
+After the agent returns, the parent reads `_text.md` (plain text, not the large `markdown.md`) and continues its workflow.
+
+**Standalone invocations** (user calls `/read-pdf` directly) read `markdown.md` in the main conversation and write `_text.md` directly — no subagent needed for a one-off read.
+
+## Quick Reference
+
+| Step | Action |
+|------|--------|
+| **Acquire** | Download via web search or use local file in place |
+| **Install** | `python3 ~/.claude/skills/read-pdf/install.py` (idempotent; downloads models on first run) |
+| **Check cache** | SHA-256 → `~/.cache/claude-pdf-converter/cache/marker/<hash>/markdown.md` |
+| **Convert** | `python3 ~/.claude/skills/read-pdf/convert.py <pdf>` if not cached |
+| **Collision** | Ask overwrite vs `_text2.md` if `_text.md` already exists |
+| **Extract** | Bibliographic metadata + 8-dimension notes from `markdown.md` |
+| **Persist** | Save to `<basename>_text.md` alongside the source PDF |
+
+For backend details, cache management, and GPU notes, see [README.md](README.md).
diff --git a/.claude/skills/read-pdf/convert.py b/.claude/skills/read-pdf/convert.py
new file mode 100644
index 0000000..72deebe
--- /dev/null
+++ b/.claude/skills/read-pdf/convert.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""
+read-pdf converter — PDF → markdown + figures (marker backend).
+
+Caches by SHA-256 of the PDF bytes. Re-running on the same content is free.
+
+Usage:
+    python3 convert.py <pdf-path>
+
+Prints the absolute path to the cached markdown.md on success (exit 0).
+On backend failure, exits non-zero with the error on stderr — no fallback.
+
+Cache layout:
+    ~/.cache/claude-pdf-converter/cache/marker/<sha256>/
+        markdown.md       # verbatim conversion with inline ![](figures/...)
+        figures/*.png     # extracted figures
+        meta.json         # backend, figure count, OCR mode, device, timing, source path
+"""
+
+import hashlib
+import json
+import os
+import platform
+import re
+import subprocess
+import sys
+import time
+from pathlib import Path
+
+BACKEND = "marker"
+CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter"
+CACHE_DIR = CACHE_ROOT / "cache" / BACKEND
+VENV_DIR = CACHE_ROOT / f"venv-{BACKEND}"
+VENV_PYTHON = (
+    VENV_DIR / "Scripts" / "python.exe"
+    if platform.system() == "Windows"
+    else VENV_DIR / "bin" / "python"
+)
+SKILL_DIR = Path(__file__).resolve().parent
+
+
+def detect_torch_device() -> str:
+    """Pick best available torch device: cuda > cpu. MPS excluded — surya's layout
+    model crashes on Apple Silicon MPS with an index-bounds error at runtime."""
+    try:
+        import torch
+    except ImportError:
+        return "cpu"
+    if torch.cuda.is_available():
+        return "cuda"
+    return "cpu"
+
+
+def normalize_footnotes(text: str) -> str:
+    """
+    Rewrite marker's bare-number footnote encoding as Pandoc-style markdown footnotes.
+
+    Marker places footnote superscripts as bare digits attached to the preceding
+    word/punctuation, then dumps the footnote body as a standalone paragraph
+    starting with the matching number at the next page-break boundary. This
+    function detects matched anchor/definition pairs and rewrites them:
+
+        ...coefficient.12 We then...      →  ...coefficient.[^12] We then...
+        12The county-level cluster...      →  (removed from body)
+
+    A definitions block is appended at the end of the document:
+
+        [^12]: The county-level cluster...
+
+    Guards: code fences, table rows, display-math paragraphs, and numbered list
+    items (digit followed by ". " or ") ") are left untouched.
+    Only anchors whose number has a matching definition paragraph are rewritten;
+    definition paragraphs are moved to the end even when no anchor is found.
+    """
+    paragraphs = re.split(r'\n\n+', text)
+
+    # --- Pass 1: find definition paragraphs ---
+    # Matches: bare 1–3 digit number at paragraph start, NOT followed by a digit
+    # ("2014 ..." is not footnote 201), ". " or ") " (numbered list items), then
+    # optional whitespace and the body (no mandatory space: OCR gaps in old scans).
+    fn_def_re = re.compile(r'^(\d{1,3})(?!\d|\.\s|\)\s)\s*(\S.+)', re.DOTALL)
+
+    footnote_defs: dict[str, str] = {}
+    def_para_indices: set[int] = set()
+    in_fence = False
+
+    for i, para in enumerate(paragraphs):
+        stripped = para.strip()
+        # Track code-fence state across paragraphs
+        if stripped.count('```') % 2 != 0:
+            in_fence = not in_fence
+        if in_fence:
+            continue
+        # Skip tables, display math, and code fences
+        if re.match(r'\s*(\||```|\$\$)', stripped):
+            continue
+        m = fn_def_re.match(stripped)
+        if m:
+            num, body = m.group(1), m.group(2).strip()
+            if body and not body.isdigit():
+                footnote_defs[num] = body
+                def_para_indices.add(i)
+
+    if not footnote_defs:
+        return text
+
+    # --- Pass 2: replace anchors in body paragraphs ---
+    # Anchor: one of the known footnote numbers immediately following a word
+    # character or punctuation, but not after "<digit>." / "<digit>," (decimals).
+    # Lookahead: whitespace, sentence punctuation, closing bracket, or EOL.
+    nums_alt = '|'.join(re.escape(n) for n in sorted(footnote_defs, key=lambda x: -len(x)))
+    anchor_re = re.compile(
+        r'(?<=[a-zA-Z.,;:!?\'")\]])(?<!\d[.,])(' + nums_alt + r')(?=[\s,.:;!?\n\)\]]|$)'
+    )
+
+    result_paras: list[str] = []
+    in_fence = False
+    for i, para in enumerate(paragraphs):
+        stripped = para.strip()
+        if stripped.count('```') % 2 != 0:
+            in_fence = not in_fence
+
+        if i in def_para_indices:
+            continue  # will be collected at end
+
+        # Skip anchor replacement inside protected blocks
+        if in_fence or re.match(r'\s*(\||```|\$\$)', stripped):
+            result_paras.append(para)
+        else:
+            result_paras.append(anchor_re.sub(lambda m: f'[^{m.group(1)}]', para))
+
+    # Append all definitions in numerical order
+    defs_block = '\n\n'.join(
+        f'[^{n}]: {footnote_defs[n]}'
+        for n in sorted(footnote_defs, key=int)
+    )
+    result_paras.append(defs_block)
+
+    return '\n\n'.join(result_paras)
+
+
+def sha256_of(path: Path) -> str:
+    h = hashlib.sha256()
+    with path.open("rb") as f:
+        for chunk in iter(lambda: f.read(1 << 20), b""):
+            h.update(chunk)
+    return h.hexdigest()
+
+
+def text_layer_chars(path: Path, pages: int = 3) -> int:
+    """Return non-whitespace chars extracted from the PDF text layer sample."""
+    try:
+        result = subprocess.run(
+            ["pdftotext", "-l", str(pages), str(path), "-"],
+            check=False,
+            capture_output=True,
+            text=True,
+            timeout=30,
+        )
+    except (FileNotFoundError, subprocess.TimeoutExpired):
+        return 0
+    if result.returncode != 0:
+        return 0
+    return sum(1 for ch in result.stdout if not ch.isspace())
+
+
+def in_venv() -> bool:
+    return Path(sys.prefix).resolve() == VENV_DIR.resolve()
+
+
+def reexec_in_venv(args: list[str]) -> None:
+    """Re-run this script under the backend venv's Python."""
+    if not VENV_PYTHON.exists():
+        installer = SKILL_DIR / "install.py"
+        subprocess.run([sys.executable, str(installer)], check=True)
+    os.execv(str(VENV_PYTHON), [str(VENV_PYTHON), str(Path(__file__).resolve()), *args])
+
+
+def convert_with_marker(pdf_path: Path, out_dir: Path) -> dict:
+    from marker.converters.pdf import PdfConverter
+    from marker.models import create_model_dict
+    from marker.output import text_from_rendered
+
+    text_chars = text_layer_chars(pdf_path)
+    use_text_layer = text_chars >= 500
+    config = {"disable_ocr": True} if use_text_layer else {}
+    converter = PdfConverter(artifact_dict=create_model_dict(), config=config)
+    rendered = converter(str(pdf_path))
+    text, _, images = text_from_rendered(rendered)
+
+    figures_dir = out_dir / "figures"
+    figures_dir.mkdir(exist_ok=True)
+
+    fig_count = 0
+    for name, img in (images or {}).items():
+        out_name = figures_dir / Path(name).name
+        try:
+            img.save(out_name)
+            fig_count += 1
+        except Exception as exc:  # pragma: no cover
+            print(f"warn: figure {name} save failed: {exc}", file=sys.stderr)
+
+    text = normalize_footnotes(text)
+    (out_dir / "markdown.md").write_text(text, encoding="utf-8")
+
+    return {
+        "backend": "marker",
+        "page_count": None,
+        "figure_count": fig_count,
+        "text_layer_chars_sample": text_chars,
+        "ocr_disabled": use_text_layer,
+        "equation_extraction_mode": "native",  # marker emits LaTeX directly
+    }
+
+
+def main() -> int:
+    if len(sys.argv) != 2:
+        print("usage: convert.py <pdf-path>", file=sys.stderr)
+        return 2
+
+    pdf_path = Path(sys.argv[1]).expanduser().resolve()
+    if not pdf_path.is_file():
+        print(f"error: not a file: {pdf_path}", file=sys.stderr)
+        return 2
+
+    if not in_venv():
+        reexec_in_venv([str(pdf_path)])
+
+    # Marker reads TORCH_DEVICE at import time. Set before importing the
+    # backend, after we're inside the venv (so torch is the venv's torch).
+    if "TORCH_DEVICE" not in os.environ:
+        os.environ["TORCH_DEVICE"] = detect_torch_device()
+
+    digest = sha256_of(pdf_path)
+    out_dir = CACHE_DIR / digest
+    md_path = out_dir / "markdown.md"
+    if md_path.is_file():
+        print(str(md_path))
+        return 0
+
+    out_dir.mkdir(parents=True, exist_ok=True)
+    started = time.time()
+    info = convert_with_marker(pdf_path, out_dir)
+    info.update(
+        {
+            "source_path": str(pdf_path),
+            "sha256": digest,
+            "elapsed_seconds": round(time.time() - started, 2),
+            "torch_device": os.environ.get("TORCH_DEVICE", "cpu"),
+            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+        }
+    )
+    (out_dir / "meta.json").write_text(
+        json.dumps(info, indent=2), encoding="utf-8"
+    )
+    print(str(md_path))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/.claude/skills/read-pdf/install.py b/.claude/skills/read-pdf/install.py
new file mode 100644
index 0000000..5d2c7fa
--- /dev/null
+++ b/.claude/skills/read-pdf/install.py
@@ -0,0 +1,163 @@
+#!/usr/bin/env python3
+"""
+read-pdf installer — sets up the local PDF→markdown converter (marker backend).
+
+Idempotent. First run creates a venv at ~/.cache/claude-pdf-converter/venv-marker/,
+installs marker-pdf, and warms up model downloads. Subsequent runs short-circuit
+if marker imports cleanly under the venv.
+
+The venv lives outside any git repo so that backend models (~hundreds of MB)
+do not pollute the skills checkout.
+
+OS-agnostic: searches for a Python ≥ 3.10 across macOS, Linux, and Windows
+common install locations. If none is found, prints a platform-aware install hint.
+"""
+
+import platform
+import shutil
+import subprocess
+import sys
+from pathlib import Path
+
+BACKEND = "marker"
+PINS = ["marker-pdf"]  # latest; pin a version after a regression is observed
+
+PY_MIN = (3, 10)
+
+# Names to try on PATH. Cross-platform: same on macOS/Linux/Windows because
+# python.org and most package managers install with these names.
+PY_NAMES = ["python3.13", "python3.12", "python3.11", "python3.10", "python3", "python"]
+
+# Absolute fallback paths by OS, only consulted if PATH search fails.
+def _fallback_paths() -> list[list[str]]:
+    sysname = platform.system()
+    if sysname == "Darwin":
+        return [[path] for path in (
+            f"/Library/Frameworks/Python.framework/Versions/{v}/bin/python{v}"
+            for v in ("3.13", "3.12", "3.11", "3.10")
+        )] + [[path] for path in (
+            f"/opt/homebrew/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10")
+        )] + [[path] for path in (
+            f"/usr/local/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10")
+        )]
+    if sysname == "Linux":
+        return [[f"/usr/bin/python{v}"] for v in ("3.13", "3.12", "3.11", "3.10")]
+    if sysname == "Windows":
+        # py launcher handles version selection on Windows
+        return [["py", f"-{v}"] for v in ("3.13", "3.12", "3.11", "3.10")]
+    return []
+
+
+def _install_hint() -> str:
+    sysname = platform.system()
+    if sysname == "Darwin":
+        return "Install Python 3.10+ via `brew install python@3.12` or python.org installer."
+    if sysname == "Linux":
+        return "Install Python 3.10+ via your package manager (e.g. `apt install python3.12` or `dnf install python3.12`)."
+    if sysname == "Windows":
+        return "Install Python 3.10+ via `winget install Python.Python.3.12` or python.org installer."
+    return "Install Python 3.10 or newer."
+
+
+def _check_version(cmd: list[str]) -> bool:
+    try:
+        out = subprocess.check_output(
+            [*cmd, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
+            text=True, stderr=subprocess.DEVNULL,
+        ).strip()
+        major, minor = (int(x) for x in out.split("."))
+        return (major, minor) >= PY_MIN
+    except Exception:
+        return False
+
+
+def find_python() -> list[str]:
+    """Return command for a Python ≥3.10. Prefers the running interpreter if it qualifies."""
+    if sys.version_info >= PY_MIN:
+        return [sys.executable]
+    for name in PY_NAMES:
+        path = shutil.which(name)
+        if path and _check_version([path]):
+            return [path]
+    for cand in _fallback_paths():
+        executable = cand[0]
+        path = executable if Path(executable).exists() else shutil.which(executable)
+        if path:
+            cmd = [path, *cand[1:]]
+            if _check_version(cmd):
+                return cmd
+    print(
+        f"error: need Python ≥{PY_MIN[0]}.{PY_MIN[1]} but found only "
+        f"{sys.version_info.major}.{sys.version_info.minor}.\n"
+        f"{_install_hint()}",
+        file=sys.stderr,
+    )
+    sys.exit(2)
+
+
+CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter"
+VENV_DIR = CACHE_ROOT / f"venv-{BACKEND}"
+
+
+def venv_python() -> Path:
+    # Windows venvs put python in Scripts/, not bin/
+    if platform.system() == "Windows":
+        return VENV_DIR / "Scripts" / "python.exe"
+    return VENV_DIR / "bin" / "python"
+
+
+def venv_exists() -> bool:
+    return venv_python().exists()
+
+
+def backend_imports() -> bool:
+    if not venv_exists():
+        return False
+    result = subprocess.run(
+        [str(venv_python()), "-c", "import marker"],
+        capture_output=True,
+    )
+    return result.returncode == 0
+
+
+def create_venv() -> None:
+    CACHE_ROOT.mkdir(parents=True, exist_ok=True)
+    print(
+        f"First run: creating venv at {VENV_DIR} and installing "
+        f"{BACKEND} (~500MB, 1–3 min, one-time).",
+        flush=True,
+    )
+    base_python = find_python()
+    subprocess.run([*base_python, "-m", "venv", str(VENV_DIR)], check=True)
+    subprocess.run(
+        [str(venv_python()), "-m", "pip", "install", "--upgrade", "pip"],
+        check=True,
+    )
+    subprocess.run(
+        [str(venv_python()), "-m", "pip", "install", *PINS],
+        check=True,
+    )
+
+
+def warmup_models() -> None:
+    """Trigger first-run model download so the first conversion is fast."""
+    print("Downloading layout/OCR models (one-time)...", flush=True)
+    subprocess.run(
+        [str(venv_python()), "-c",
+         "from marker.models import create_model_dict; create_model_dict()"],
+        check=True,
+    )
+
+
+def main() -> int:
+    if backend_imports():
+        return 0
+    # Venv is missing or incomplete (e.g. an interrupted install): (re)create it and reinstall.
+    create_venv()
+    warmup_models()
+    print(f"read-pdf setup complete. Backend: {BACKEND}", flush=True)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/skills/read-pdf/README.md b/skills/read-pdf/README.md
new file mode 100644
index 0000000..3c1e22b
--- /dev/null
+++ b/skills/read-pdf/README.md
@@ -0,0 +1,119 @@
+# `/read-pdf` — Download, Convert, and Deep-Read Academic Papers
+
+**Same workflow as `/split-pdf`, but uses python:marker to convert the PDF to markdown locally first, instead of having Claude vision-read PDF page images.** This makes equation, table, and figure extraction more faithful, and avoids image-based context bloat in the parent conversation.
+
+**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](../../.claude/skills/read-pdf/SKILL.md)
+
+---
+
+## What This Skill Does
+
+You give Claude a paper, either a local PDF file or a search query, and it does the rest. It finds and downloads the paper (or uses your local file in place), converts it to clean markdown using python:marker, then reads that markdown to write structured notes. When finished, it saves a persistent `_text.md` extraction alongside the source PDF, in the same format produced by `/split-pdf`.
+
+---
+
+## Why It Exists
+
+`/split-pdf` reads PDFs by having Claude vision-read page images in batches. This works well for most papers but has two limitations:
+
+1. **Equation fidelity.** PDF page images render math as bitmaps. Vision-reading bitmaps produces approximate LaTeX transcriptions. Papers heavy with structural equations benefit from native math extraction.
+2. **Table structure.** Complex tables are harder to transcribe accurately from images than from a layout-aware text conversion.
+
+`/read-pdf` addresses both by running a local conversion step first. The result is a `markdown.md` file in which equations are written in native LaTeX math mode and tables use pipe-syntax markdown, so both are readable as text rather than as image bitmaps.
+
+---
+
+## How It Works
+
+```
+~/.cache/claude-pdf-converter/
+├── venv-marker/                         # one-time install of marker-pdf
+└── cache/
+    └── marker/
+        └── <sha256-of-pdf>/
+            ├── markdown.md              # conversion + inline ![](figures/...)
+            ├── figures/
+            │   ├── fig_1.png
+            │   └── fig_2.png
+            └── meta.json                # backend, page/figure counts, timestamp
+```
+
+| Step | Action |
+|------|--------|
+| **Acquire** | Download the PDF via web search or use a local file in place |
+| **Install** | `install.py` sets up the marker venv on first run (~500 MB, one-time) |
+| **Check cache** | SHA-256 hash check — skip re-conversion if markdown already exists |
+| **Convert** | `convert.py` runs marker and writes `markdown.md` to the content-hash cache |
+| **Collision** | If `_text.md` already exists, ask: overwrite or save as `_text2.md`? |
+| **Extract** | Read `markdown.md`, write bibliographic metadata + 8-dimension notes |
+| **Persist** | Save final extraction to `<basename>_text.md` alongside the source PDF |
+
+### Usage
+
+```
+/read-pdf path/to/paper.pdf
+/read-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers"
+```
+
+Another skill can invoke `convert.py` directly via bash rather than spawning `/read-pdf` as a slash command; the script is the conversion contract.
+
+### First-run cost
+
+The first invocation on a fresh machine creates a venv at `~/.cache/claude-pdf-converter/venv-marker/`, installs `marker-pdf` into it, and downloads marker's layout/OCR models (~500 MB, 1–3 min). The skill prints a one-line notice so the user knows why it is slow. Every subsequent invocation skips this setup entirely.
+
+The venv lives **outside any git repo** so the model files do not pollute a checkout.
+
+---
+
+## Conversion Backend
+
+The backend is fixed to **marker** (`marker-pdf`). Marker was selected after a bake-off on empirical-economics PDFs because it performed well on equation fidelity, table structure, and figure extraction quality.
+
+Backend selection is not exposed as a runtime option. If a future backend candidate should replace marker, edit the `BACKEND` constant in `convert.py` so the cache namespace and venv are regenerated cleanly.
+
+### Born-digital PDFs and OCR
+
+Most journal PDFs already contain an embedded text layer. For those files, `convert.py` samples the first pages with `pdftotext` and tells marker to use the embedded text rather than re-OCRing the whole document. Marker still performs layout, table, and selected region recognition, but avoids the slow full-document OCR path. If the text-layer sample is missing or too sparse, marker keeps OCR enabled for scanned PDFs.
+
+### GPU acceleration
+
+The device is auto-detected: NVIDIA CUDA when available, otherwise CPU. MPS on Apple Silicon is excluded because surya's layout model crashes at runtime on MPS with an index-bounds error. No flags are needed on any platform.
+
+---
+
+## Output Contract
+
+`/read-pdf` writes the same `_text.md` format as `/split-pdf`: a bibliographic metadata block followed by eight research-note dimensions.
+
+```
+## Bibliographic metadata
+doi: <10.xxxx/yyyy or null>
+authors: [LastName1, LastName2, ...]
+title: <verbatim title>
+year: <year>
+venue: <journal/working paper series/etc.>
+venue_type: journal | working_paper | book_chapter | other
+```
+
+This means downstream skills like `/bib-update` and `/wiki-update` can consume outputs from either `/split-pdf` or `/read-pdf`.
+
+---
+
+## Failure Mode
+
+Hard fail. If marker errors on a given PDF (for example, because the file is encrypted or malformed, or because OCR fails), the script exits non-zero and the caller surfaces the error. There is no silent fallback to `pdftotext` or any other tool, because silent fallbacks can produce wrong conversions that look plausible on inspection.
+
+---
+
+## Limitations
+
+- **First run is slow.** Venv creation plus model download takes 1–3 minutes. After that, converting a typical 30-page paper takes roughly 30 seconds to 2 minutes, depending on hardware.
+- **Requires writable cache space** at `~/.cache/claude-pdf-converter/`.
+- **Conversion can fail on malformed PDFs.** If `convert.py` errors, use `/split-pdf` instead.
+- **Cache is not auto-evicted.** Re-converting the same PDF is free, but the cache grows monotonically. Wipe it with `rm -rf ~/.cache/claude-pdf-converter/cache/` if needed.
+
+---
+
+## Acknowledgments
+
+The in-place PDF handling, persistent `_text.md` extraction, build directory convention, and agent isolation protocol follow conventions established in the `/split-pdf` skill, where they were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin). The marker integration (`convert.py`, `install.py`) and content-hash caching design are original to this skill.