From c8fc440c5f12a80e636bc054d399b5b5c8ec5be0 Mon Sep 17 00:00:00 2001 From: Noah Miller Date: Fri, 8 May 2026 11:24:45 -0400 Subject: [PATCH] Add read-pdf marker conversion skill --- .claude/skills/read-pdf/README.md | 144 ++++++++++++++++ .claude/skills/read-pdf/SKILL.md | 164 ++++++++++++++++++ .claude/skills/read-pdf/convert.py | 261 +++++++++++++++++++++++++++++ .claude/skills/read-pdf/install.py | 163 ++++++++++++++++++ skills/read-pdf/README.md | 119 +++++++++++++ 5 files changed, 851 insertions(+) create mode 100644 .claude/skills/read-pdf/README.md create mode 100644 .claude/skills/read-pdf/SKILL.md create mode 100644 .claude/skills/read-pdf/convert.py create mode 100644 .claude/skills/read-pdf/install.py create mode 100644 skills/read-pdf/README.md diff --git a/.claude/skills/read-pdf/README.md b/.claude/skills/read-pdf/README.md new file mode 100644 index 0000000..38c6039 --- /dev/null +++ b/.claude/skills/read-pdf/README.md @@ -0,0 +1,144 @@ +# `/read-pdf` — Download, Convert, and Deep-Read Academic Papers + +**Same workflow as `/split-pdf`, but uses python:marker to convert the PDF to markdown locally first, instead of having Claude vision-read PDF page images.** This makes equation, table, and figure extraction more faithful, and avoids image-based context bloat in the parent conversation. + +**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](SKILL.md) + +--- + +## What This Skill Does + +You give Claude a paper — either a local PDF file or a search query — and it does the rest. It finds and downloads the paper (or uses your local file in place), converts it to clean markdown using python:marker, then reads that markdown to write structured notes. When finished, it saves a persistent `_text.md` extraction alongside the source PDF, in the same format produced by `/split-pdf`. + +--- + +## Why It Exists + +`/split-pdf` reads PDFs by having Claude vision-read page images in batches. 
This works well for most papers but has two limitations: + +1. **Equation fidelity.** PDF page images render math as bitmaps. Vision-reading bitmaps produces approximate LaTeX transcriptions. Papers heavy with structural equations (e.g., structural IO, dynamic programming models) benefit from native math extraction. + +2. **Table structure.** Complex tables (multi-column headers, merged cells, footnotes) are harder to transcribe accurately from images than from a layout-aware text conversion. + +`/read-pdf` addresses both by running a local conversion step first. The result is a `markdown.md` file where equations are native LaTeX math mode and tables are pipe-syntax markdown — readable as text rather than image bitmaps. + +--- + +## The Solution + +Convert the PDF to markdown with python:marker (layout-aware, GPU-accelerated), then read the text. + +### How It Works + +| Step | Action | +|------|--------| +| **Acquire** | Download the PDF (via web search) or use a local file in place | +| **Install** | `install.py` sets up the marker venv on first run (~500 MB, one-time) | +| **Check cache** | SHA-256 hash check — skip re-conversion if markdown already cached | +| **Convert** | `convert.py` runs marker and writes `markdown.md` to a content-hash cache | +| **Collision** | If `_text.md` already exists, ask: overwrite or save as `_text2.md`? | +| **Extract** | Read `markdown.md`, write bibliographic metadata + 8-dimension notes | +| **Persist** | Save final extraction to `_text.md` alongside the source PDF | + +### Usage + +``` +/read-pdf path/to/paper.pdf +/read-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers" +``` + +As with `/split-pdf`, you must tell Claude what paper to read. Provide either a local file path or a search query specific enough to find the paper. 
+ +### What Gets Extracted + +Same 8 dimensions as `/split-pdf`, plus a bibliographic metadata block at the top of `_text.md`: + +``` +## Bibliographic metadata +doi: <10.xxxx/yyyy or null> +authors: [LastName1, LastName2, ...] +title: +year: +venue: +venue_type: journal | working_paper | book_chapter | other +``` + +1. **Research question** — What is the paper asking and why does it matter? +2. **Audience** — Which sub-community of researchers cares about this? +3. **Method** — How do they answer the question? What is the identification strategy? +4. **Data** — What data do they use? Where did they find it? Unit of observation? Sample size? Time period? +5. **Statistical methods** — What econometric or statistical techniques? Key specifications? +6. **Findings** — Main results? Key coefficient estimates and standard errors? +7. **Contributions** — What is learned that we didn't know before? +8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs? + +--- + +## Key Features + +### Conversion backend: marker + +The conversion backend is **marker** (`marker-pdf`). It was selected after a head-to-head bake-off against docling on a representative set of empirical-economics PDFs; marker won on equation fidelity, table structure, and figure extraction quality. + +Backend selection is fixed in code, with no runtime override. If the bake-off is redone and a future backend candidate wins, edit the `BACKEND` constant explicitly in both `convert.py` and `install.py` (each defines its own copy) so the cache namespace and venv are regenerated cleanly. + +### Born-digital PDFs and OCR + +Most journal PDFs already contain an embedded text layer. For those files, `convert.py` samples the first three pages with `pdftotext` and, when the sample contains at least 500 non-whitespace characters, tells marker to use the embedded text rather than re-OCRing the whole document. Marker still performs layout, table, and selected region recognition, but avoids the extremely slow full-document OCR path. 
If the text-layer sample is missing or too sparse, or `pdftotext` is unavailable, marker keeps OCR enabled so that scanned PDFs still convert. + +### GPU acceleration + +Auto-detected: NVIDIA CUDA → CPU. MPS on Apple Silicon is excluded — surya's layout model crashes at runtime on MPS with an index-bounds error (some surya sub-models already refuse MPS; the layout model does not and fails mid-conversion). CUDA gives a 3–5× speedup over CPU. No flags are needed on any platform. + +### Content-hash cache + +Conversions are cached by SHA-256 of the source PDF bytes at `~/.cache/claude-pdf-converter/cache/marker//`, where each PDF gets a directory named after its hash. Re-converting the same PDF (even under a different filename, even in a different project) is a no-op — the cached `markdown.md` is returned immediately. The cache is shared across all projects on the machine. + +Cache entries are not auto-evicted. To force a re-conversion, delete that PDF's hash-named directory: +```bash +rm -rf ~/.cache/claude-pdf-converter/cache/marker// +``` +To wipe the entire cache (e.g., after a backend upgrade): +```bash +rm -rf ~/.cache/claude-pdf-converter/cache/ +``` +The venv at `~/.cache/claude-pdf-converter/venv-marker/` is untouched. + +### `_text.md` collision handling + +If a `_text.md` already exists alongside the PDF (e.g., from a prior `/split-pdf` run), the skill asks whether to overwrite it or save the new extraction as `_text2.md`. This lets you compare extractions from both methods on the same paper without losing earlier work. + +### Agent isolation protocol + +When another skill calls `/read-pdf`, the conversion runs in the parent context (lightweight bash call) and the reading runs inside a subagent. The subagent reads `markdown.md`, writes plain-text `_text.md`, and the parent reads only the text output. This prevents the converted markdown from accumulating token cost in a busy workflow conversation. 
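As a rough illustration of the `_text.md` collision rule described in this section, the sketch below picks the output filename once the user has answered the overwrite question. It assumes the extract is named after the PDF stem, which is inferred from the `smith_2024_text.md` example in `SKILL.md`. `choose_extract_path` and its `overwrite` flag are hypothetical names, not part of `convert.py` or `install.py`.

```python
from pathlib import Path


def choose_extract_path(pdf_path: Path, overwrite: bool) -> Path:
    """Return where the new extraction should be written (hypothetical helper)."""
    # Assumed naming: paper.pdf -> paper_text.md, next to the PDF.
    primary = pdf_path.with_name(f"{pdf_path.stem}_text.md")
    if overwrite or not primary.exists():
        return primary
    # Keep the earlier extract and write the new one beside it.
    return pdf_path.with_name(f"{pdf_path.stem}_text2.md")
```

The helper only computes a path; asking the user and writing the file stay with the skill, so the source PDF is never touched.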
+ +--- + +## `/read-pdf` vs `/split-pdf` — When to Use Which + +| | `/split-pdf` | `/read-pdf` | +|---|---|---| +| **Reading mechanism** | Claude vision-reads PDF page images | Marker converts to markdown; Claude reads text | +| **Setup required** | None | `install.py` (~500 MB, one-time) | +| **First-run latency** | None | ~1–3 min (model download + conversion) | +| **Subsequent runs** | — | Instant if cached | +| **Equation fidelity** | Good (vision-based) | Better (native LaTeX extraction) | +| **Table structure** | Good | Better (layout-aware) | +| **Works without internet** | No (unless PDF already local) | Yes (after install) | +| **Output format** | `_text.md` | `_text.md` (same format) | + +Both skills produce the same `_text.md` format, so downstream skills like `/bib-update` and `/wiki-update` can consume either one interchangeably. + +--- + +## Limitations + +- **Requires local setup.** First run downloads ~500 MB of models. Not suitable for environments where you can't write to `~/.cache/`. +- **Conversion can fail on malformed PDFs.** If `convert.py` errors, the skill stops — it does not fall back to a degraded alternative. Fix the PDF or use `/split-pdf` instead. +- **Not for triage.** If you just need to decide whether a paper is relevant, use `/split-pdf` (no setup, works immediately on first split). + +--- + +## Acknowledgments + +The in-place PDF handling, persistent `_text.md` extraction, build directory convention, and agent isolation protocol follow conventions established in the `/split-pdf` skill, where they were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin). The marker integration (`convert.py`, `install.py`) and content-hash caching design are original to this skill. 
diff --git a/.claude/skills/read-pdf/SKILL.md b/.claude/skills/read-pdf/SKILL.md new file mode 100644 index 0000000..1c59ad3 --- /dev/null +++ b/.claude/skills/read-pdf/SKILL.md @@ -0,0 +1,164 @@ +--- +name: read-pdf +description: Download or use a local academic PDF, convert to clean markdown locally (python:marker, layout-aware), then extract structured reading notes into `_text.md`. Same output contract as /split-pdf — bibliographic metadata block + 8-dimension research notes — but uses local conversion instead of Claude vision-reading PDF images. Preserves equation fidelity, table structure, and figure references. Use when you want higher-fidelity math/table extraction, or when you already have a local file. +allowed-tools: Bash(python3:*), Bash(curl:*), Bash(wget:*), Bash(mkdir:*), Read, Write, WebSearch, WebFetch, Agent +argument-hint: [pdf-path-or-search-query] +--- + +# Read-PDF: Download, Convert, and Deep-Read Academic Papers + +Same I/O contract as /split-pdf: takes a PDF (local or searched), produces a structured `_text.md` extraction with a bibliographic metadata block and 8-dimension research notes. The difference is the reading mechanism: instead of Claude vision-reading PDF page images in chunks, read-pdf converts the PDF to markdown locally using python:marker, then reads the text. This preserves equation fidelity, table structure, and figure references without image-based context bloat. + +## When This Skill Is Invoked + +The user wants to read, review, or summarize an academic paper and either: (a) wants layout-aware equation/table extraction, or (b) already has a local PDF. The input is either: +- A file path to a local PDF (e.g., `~/Documents/papers/smith_2024.pdf`) +- A search query or paper title (e.g., `"Gentzkow Shapiro Sinkinson 2014 competition newspapers"`) + +**Important:** You cannot search for a paper you don't know exists. Provide either a file path or a specific query. 
If the user invokes this skill without specifying a paper, ask them. + +## Prerequisites + +- **Python ≥ 3.10** must be available. `install.py` refuses to proceed on Python 3.9 or older. If needed: `brew install python@3.12`, `apt install python3.11`, or python.org installer. +- **Optional GPU acceleration** is auto-detected: NVIDIA CUDA → CPU. (MPS on Apple Silicon is excluded — surya's layout model crashes on MPS at runtime.) + +## Step 1: Acquire the PDF + +**If a local file path is provided:** +- Verify the file exists +- Use the PDF in place. The working directory is the folder containing the PDF. +- Proceed to Step 2 + +**If a search query or paper title is provided:** +1. Use WebSearch to find the paper +2. Use WebFetch or Bash (curl/wget) to download the PDF +3. Save it to the current working directory +4. Proceed to Step 2 + +**CRITICAL: Always preserve the original PDF.** Never delete, move, or overwrite it at any point in this workflow. + +## Step 2: Ensure the converter is installed + +```bash +python3 ~/.claude/skills/read-pdf/install.py +``` + +Idempotent. First run creates a venv at `~/.cache/claude-pdf-converter/venv-marker/` and downloads marker models (~500 MB, 1–3 min). Surface the "First run" message to the user verbatim if it appears — they should know why this invocation is slow. + +## Step 3: Convert + +**Before converting, check for a cached conversion.** Compute the SHA-256 hash of the PDF and check whether `markdown.md` already exists in the cache: + +```python +import hashlib, os, sys + +pdf_path = "" + +with open(pdf_path, 'rb') as f: + pdf_hash = hashlib.sha256(f.read()).hexdigest() + +markdown_path = os.path.expanduser( + f'~/.cache/claude-pdf-converter/cache/marker/{pdf_hash}/markdown.md' +) +print(markdown_path if os.path.exists(markdown_path) else "NOT_CACHED") +``` + +- **If cached:** tell the user "Using cached markdown conversion (SHA-256 match), skipping re-conversion." Use the printed path as `markdown_path`. 
+- **If not cached:** run: + ```bash + python3 ~/.claude/skills/read-pdf/convert.py "" + ``` + It prints the absolute path to `markdown.md` on success and exits 0. For born-digital PDFs with a usable embedded text layer, `convert.py` uses that text layer and disables marker's full-document OCR path while preserving marker's layout/table processing. **Do not fall back to pdftotext or any other tool on failure** — surface the error and stop. The whole point of this skill is the layout-aware conversion; a degraded fallback produces silently-wrong output. + +## Step 4: Check for existing `_text.md` + +Look for `_text.md` in the same folder as the PDF. + +If found, ask: +> "An extract already exists (`_text.md`). Overwrite it, or save the new extraction as `_text2.md`?" + +Proceed using whichever filename the user chooses. + +## Step 5: Structured Extraction + +Read `markdown.md` and collect information along these dimensions: + +0. **Bibliographic metadata** — From the title section of the markdown, extract: + ``` + ## Bibliographic metadata + doi: <10.xxxx/yyyy if present, else null> + authors: [LastName1, LastName2, ...] + title: + year: + venue: + venue_type: journal | working_paper | book_chapter | other + ``` + If a field is not visible, record `null`. + +1. **Research question** — What is the paper asking and why does it matter? +2. **Audience** — Which sub-community of researchers cares about this? +3. **Method** — How do they answer the question? What is the identification strategy? +4. **Data** — What data do they use? Where precisely did they find it? Unit of observation? Sample size? Time period? +5. **Statistical methods** — What econometric or statistical techniques? Key specifications? +6. **Findings** — Main results? Key coefficient estimates and standard errors? +7. **Contributions** — What is learned that we didn't know before? +8. **Replication feasibility** — Public data? Replication archive? Data appendix? URLs? 
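To make the metadata block from item 0 concrete, here is a minimal sketch of how the six fields could be rendered, writing `null` for anything not visible in the paper. The `BibMeta` class and `render_metadata` function are illustrative names only, not part of `convert.py` or `install.py`; the authors and year in the usage note are taken from the example search query in this file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BibMeta:
    """Hypothetical container for the six metadata fields."""
    doi: Optional[str] = None
    authors: Optional[list[str]] = None
    title: Optional[str] = None
    year: Optional[int] = None
    venue: Optional[str] = None
    venue_type: Optional[str] = None  # journal | working_paper | book_chapter | other


def render_metadata(meta: BibMeta) -> str:
    """Render the block in field order, using null for missing values."""
    def show(value: object) -> str:
        return "null" if value is None else str(value)

    # An empty or missing author list is recorded as null, like any other field.
    authors = "null" if not meta.authors else "[" + ", ".join(meta.authors) + "]"
    return "\n".join([
        "## Bibliographic metadata",
        f"doi: {show(meta.doi)}",
        f"authors: {authors}",
        f"title: {show(meta.title)}",
        f"year: {show(meta.year)}",
        f"venue: {show(meta.venue)}",
        f"venue_type: {show(meta.venue_type)}",
    ])
```

For example, `render_metadata(BibMeta(authors=["Gentzkow", "Shapiro", "Sinkinson"], year=2014))` yields a block containing `doi: null` and `authors: [Gentzkow, Shapiro, Sinkinson]`.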
+ +## The Output File + +Write the final structured extraction to `_text.md` (or `_text2.md` if chosen in Step 4) in the same folder as the source PDF, with the `## Bibliographic metadata` block first, followed by the research notes. + +Notify the user: +> "Extract saved to `smith_2024_text.md` alongside the source PDF. Future requests on this paper can reuse it without re-reading." + +This file is the persistent, reusable artifact. + +## Agent Isolation Protocol + +**When read-pdf is invoked by another skill**, the conversion steps (Steps 2–3) run in the parent context — they are lightweight bash calls. The reading and extraction (Steps 4–5) MUST run inside a subagent. The converted `markdown.md` can be large, and reading it in the parent context of an active workflow accumulates permanent token cost. The subagent reads `markdown.md`, writes plain-text `_text.md`, and the parent reads only that. + +**Pattern:** + +The parent skill handles install.py, the SHA-256 cache check, convert.py if needed, and the `_text.md` collision check. Then it launches an Agent: + +``` +Read a converted markdown file and produce structured extraction notes. + +Markdown input: +Text output: + +Process: +1. Read using the Read tool +2. From the title section, extract a bibliographic metadata block: + ## Bibliographic metadata + doi: <10.xxxx/yyyy if present, else null> + authors: [LastName1, LastName2, ...] + title: + year: + venue: + venue_type: journal | working_paper | book_chapter | other +3. Extract: research question, audience, method, data (sources, sample size, time period), + statistical methods, findings, contributions, replication feasibility +4. Write the final structured extraction to , with the + ## Bibliographic metadata block first, followed by the research notes. + +Report when done: page count, figures/tables found, one-sentence content summary. 
+``` + +After the agent returns, the parent reads `_text.md` (plain text, not the large `markdown.md`) and continues its workflow. + +**Standalone invocations** (user calls `/read-pdf` directly) read `markdown.md` in the main conversation and write `_text.md` directly — no subagent needed for a one-off read. + +## Quick Reference + +| Step | Action | +|------|--------| +| **Acquire** | Download via web search or use local file in place | +| **Install** | `python3 ~/.claude/skills/read-pdf/install.py` (idempotent; downloads models on first run) | +| **Check cache** | SHA-256 → `~/.cache/claude-pdf-converter/cache/marker//markdown.md` | +| **Convert** | `python3 ~/.claude/skills/read-pdf/convert.py ` if not cached | +| **Collision** | Ask overwrite vs `_text2.md` if `_text.md` already exists | +| **Extract** | Bibliographic metadata + 8-dimension notes from `markdown.md` | +| **Persist** | Save to `_text.md` alongside the source PDF | + +For backend details, cache management, and GPU notes, see [README.md](README.md). diff --git a/.claude/skills/read-pdf/convert.py b/.claude/skills/read-pdf/convert.py new file mode 100644 index 0000000..72deebe --- /dev/null +++ b/.claude/skills/read-pdf/convert.py @@ -0,0 +1,261 @@ +#!/usr/bin/env python3 +""" +read-pdf converter — PDF → markdown + figures (marker backend). + +Caches by SHA-256 of the PDF bytes. Re-running on the same content is free. + +Usage: + python3 convert.py + +Prints the absolute path to the cached markdown.md on success (exit 0). +On backend failure, exits non-zero with the error on stderr — no fallback. + +Cache layout: + ~/.cache/claude-pdf-converter/cache/marker// + markdown.md # verbatim conversion with inline ![](figures/...) 
+ figures/*.png # extracted figures + meta.json # backend, version, page/figure counts, source path +""" + +import hashlib +import json +import os +import platform +import re +import subprocess +import sys +import time +from pathlib import Path + +BACKEND = "marker" +CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter" +CACHE_DIR = CACHE_ROOT / "cache" / BACKEND +VENV_DIR = CACHE_ROOT / f"venv-{BACKEND}" +VENV_PYTHON = ( + VENV_DIR / "Scripts" / "python.exe" + if platform.system() == "Windows" + else VENV_DIR / "bin" / "python" +) +SKILL_DIR = Path(__file__).resolve().parent + + +def detect_torch_device() -> str: + """Pick best available torch device: cuda > cpu. MPS excluded — surya's layout + model crashes on Apple Silicon MPS with an index-bounds error at runtime.""" + try: + import torch + except ImportError: + return "cpu" + if torch.cuda.is_available(): + return "cuda" + return "cpu" + + +def normalize_footnotes(text: str) -> str: + """ + Rewrite marker's bare-number footnote encoding as Pandoc-style markdown footnotes. + + Marker places footnote superscripts as bare digits attached to the preceding + word/punctuation, then dumps the footnote body as a standalone paragraph + starting with the matching number at the next page-break boundary. This + function detects matched anchor/definition pairs and rewrites them: + + ...coefficient.12 We then... → ...coefficient.[^12] We then... + 12The county-level cluster... → (removed from body) + + A definitions block is appended at the end of the document: + + [^12]: The county-level cluster... + + Guards: code fences, table rows, display-math paragraphs, and numbered list + items (digit followed by ". " or ") ") are left untouched. + Only numbers that appear as BOTH an anchor and a definition are rewritten — + this is the primary false-positive guard. 
+ """ + paragraphs = re.split(r'\n\n+', text) + + # --- Pass 1: find definition paragraphs --- + # Matches: bare 1–3 digit number at paragraph start, NOT followed by ". " + # or ") " (numbered list items), then optional whitespace, then the body. + # No mandatory space between number and body (handles OCR gaps in old scans). + fn_def_re = re.compile(r'^(\d{1,3})(?!\.\s|\)\s)\s*(\S.+)', re.DOTALL) + + footnote_defs: dict[str, str] = {} + def_para_indices: set[int] = set() + in_fence = False + + for i, para in enumerate(paragraphs): + stripped = para.strip() + # Track code-fence state across paragraphs + if stripped.count('```') % 2 != 0: + in_fence = not in_fence + if in_fence: + continue + # Skip tables, display math, and code fences + if re.match(r'\s*(\||```|\$\$)', stripped): + continue + m = fn_def_re.match(stripped) + if m: + num, body = m.group(1), m.group(2).strip() + if body and not body.isdigit(): + footnote_defs[num] = body + def_para_indices.add(i) + + if not footnote_defs: + return text + + # --- Pass 2: replace anchors in body paragraphs --- + # Anchor: one of the known footnote numbers immediately following a word + # character or sentence-ending punctuation, not preceded by '[' (citation). + # Lookahead: whitespace, sentence punctuation, closing bracket, or EOL. + nums_alt = '|'.join(re.escape(n) for n in sorted(footnote_defs, key=lambda x: -len(x))) + anchor_re = re.compile( + r'(?<=[a-zA-Z.,;:!?\'")\]])(? 
str: + h = hashlib.sha256() + with path.open("rb") as f: + for chunk in iter(lambda: f.read(1 << 20), b""): + h.update(chunk) + return h.hexdigest() + + +def text_layer_chars(path: Path, pages: int = 3) -> int: + """Return non-whitespace chars extracted from the PDF text layer sample.""" + try: + result = subprocess.run( + ["pdftotext", "-l", str(pages), str(path), "-"], + check=False, + capture_output=True, + text=True, + timeout=30, + ) + except (FileNotFoundError, subprocess.TimeoutExpired): + return 0 + if result.returncode != 0: + return 0 + return sum(1 for ch in result.stdout if not ch.isspace()) + + +def in_venv() -> bool: + return Path(sys.prefix).resolve() == VENV_DIR.resolve() + + +def reexec_in_venv(args: list[str]) -> None: + """Re-run this script under the backend venv's Python.""" + if not VENV_PYTHON.exists(): + installer = SKILL_DIR / "install.py" + subprocess.run([sys.executable, str(installer)], check=True) + os.execv(str(VENV_PYTHON), [str(VENV_PYTHON), str(Path(__file__).resolve()), *args]) + + +def convert_with_marker(pdf_path: Path, out_dir: Path) -> dict: + from marker.converters.pdf import PdfConverter + from marker.models import create_model_dict + from marker.output import text_from_rendered + + text_chars = text_layer_chars(pdf_path) + use_text_layer = text_chars >= 500 + config = {"disable_ocr": True} if use_text_layer else {} + converter = PdfConverter(artifact_dict=create_model_dict(), config=config) + rendered = converter(str(pdf_path)) + text, _, images = text_from_rendered(rendered) + + figures_dir = out_dir / "figures" + figures_dir.mkdir(exist_ok=True) + + fig_count = 0 + for name, img in (images or {}).items(): + out_name = figures_dir / Path(name).name + try: + img.save(out_name) + fig_count += 1 + except Exception as exc: # pragma: no cover + print(f"warn: figure {name} save failed: {exc}", file=sys.stderr) + + text = normalize_footnotes(text) + (out_dir / "markdown.md").write_text(text, encoding="utf-8") + + return { + 
"backend": "marker", + "page_count": None, + "figure_count": fig_count, + "text_layer_chars_sample": text_chars, + "ocr_disabled": use_text_layer, + "equation_extraction_mode": "native", # marker emits LaTeX directly + } + + +def main() -> int: + if len(sys.argv) != 2: + print("usage: convert.py ", file=sys.stderr) + return 2 + + pdf_path = Path(sys.argv[1]).expanduser().resolve() + if not pdf_path.is_file(): + print(f"error: not a file: {pdf_path}", file=sys.stderr) + return 2 + + if not in_venv(): + reexec_in_venv([str(pdf_path)]) + + # Marker reads TORCH_DEVICE at import time. Set before importing the + # backend, after we're inside the venv (so torch is the venv's torch). + if "TORCH_DEVICE" not in os.environ: + os.environ["TORCH_DEVICE"] = detect_torch_device() + + digest = sha256_of(pdf_path) + out_dir = CACHE_DIR / digest + md_path = out_dir / "markdown.md" + if md_path.is_file(): + print(str(md_path)) + return 0 + + out_dir.mkdir(parents=True, exist_ok=True) + started = time.time() + info = convert_with_marker(pdf_path, out_dir) + info.update( + { + "source_path": str(pdf_path), + "sha256": digest, + "elapsed_seconds": round(time.time() - started, 2), + "torch_device": os.environ.get("TORCH_DEVICE", "cpu"), + "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), + } + ) + (out_dir / "meta.json").write_text( + json.dumps(info, indent=2), encoding="utf-8" + ) + print(str(md_path)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/skills/read-pdf/install.py b/.claude/skills/read-pdf/install.py new file mode 100644 index 0000000..5d2c7fa --- /dev/null +++ b/.claude/skills/read-pdf/install.py @@ -0,0 +1,163 @@ +#!/usr/bin/env python3 +""" +read-pdf installer — sets up the local PDF→markdown converter (marker backend). + +Idempotent. First run creates a venv at ~/.cache/claude-pdf-converter/venv-marker/, +installs marker-pdf, and warms up model downloads. 
Subsequent runs short-circuit +if marker imports cleanly under the venv. + +The venv lives outside any git repo so that backend models (~hundreds of MB) +do not pollute the skills checkout. + +OS-agnostic: searches for a Python ≥ 3.10 across macOS, Linux, and Windows +common install locations. If none is found, prints a platform-aware install hint. +""" + +import platform +import shutil +import subprocess +import sys +from pathlib import Path + +BACKEND = "marker" +PINS = ["marker-pdf"] # latest; pin a version after a regression is observed + +PY_MIN = (3, 10) + +# Names to try on PATH. Cross-platform: same on macOS/Linux/Windows because +# python.org and most package managers install with these names. +PY_NAMES = ["python3.13", "python3.12", "python3.11", "python3.10", "python3", "python"] + +# Absolute fallback paths by OS, only consulted if PATH search fails. +def _fallback_paths() -> list[list[str]]: + sysname = platform.system() + if sysname == "Darwin": + return [[path] for path in ( + f"/Library/Frameworks/Python.framework/Versions/{v}/bin/python{v}" + for v in ("3.13", "3.12", "3.11", "3.10") + )] + [[path] for path in ( + f"/opt/homebrew/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10") + )] + [[path] for path in ( + f"/usr/local/bin/python{v}" for v in ("3.13", "3.12", "3.11", "3.10") + )] + if sysname == "Linux": + return [[f"/usr/bin/python{v}"] for v in ("3.13", "3.12", "3.11", "3.10")] + if sysname == "Windows": + # py launcher handles version selection on Windows + return [["py", f"-{v}"] for v in ("3.13", "3.12", "3.11", "3.10")] + return [] + + +def _install_hint() -> str: + sysname = platform.system() + if sysname == "Darwin": + return "Install Python 3.10+ via `brew install python@3.12` or python.org installer." + if sysname == "Linux": + return "Install Python 3.10+ via your package manager (e.g. `apt install python3.12` or `dnf install python3.12`)." 
+ if sysname == "Windows": + return "Install Python 3.10+ via `winget install Python.Python.3.12` or python.org installer." + return "Install Python 3.10 or newer." + + +def _check_version(cmd: list[str]) -> bool: + try: + out = subprocess.check_output( + [*cmd, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"], + text=True, stderr=subprocess.DEVNULL, + ).strip() + major, minor = (int(x) for x in out.split(".")) + return (major, minor) >= PY_MIN + except Exception: + return False + + +def find_python() -> list[str]: + """Return command for a Python ≥3.10. Prefers the running interpreter if it qualifies.""" + if sys.version_info >= PY_MIN: + return [sys.executable] + for name in PY_NAMES: + path = shutil.which(name) + if path and _check_version([path]): + return [path] + for cand in _fallback_paths(): + executable = cand[0] + path = executable if Path(executable).exists() else shutil.which(executable) + if path: + cmd = [path, *cand[1:]] + if _check_version(cmd): + return cmd + print( + f"error: need Python ≥{PY_MIN[0]}.{PY_MIN[1]} but found only " + f"{sys.version_info.major}.{sys.version_info.minor}.\n" + f"{_install_hint()}", + file=sys.stderr, + ) + sys.exit(2) + + +CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter" +VENV_DIR = CACHE_ROOT / f"venv-{BACKEND}" + + +def venv_python() -> Path: + # Windows venvs put python in Scripts/, not bin/ + if platform.system() == "Windows": + return VENV_DIR / "Scripts" / "python.exe" + return VENV_DIR / "bin" / "python" + + +def venv_exists() -> bool: + return venv_python().exists() + + +def backend_imports() -> bool: + if not venv_exists(): + return False + result = subprocess.run( + [str(venv_python()), "-c", "import marker"], + capture_output=True, + ) + return result.returncode == 0 + + +def create_venv() -> None: + CACHE_ROOT.mkdir(parents=True, exist_ok=True) + print( + f"First run: creating venv at {VENV_DIR} and installing " + f"{BACKEND} (~500MB, 1–3 min, one-time).", + flush=True, + ) + 
+ base_python = find_python() + subprocess.run([*base_python, "-m", "venv", str(VENV_DIR)], check=True) + subprocess.run( + [str(venv_python()), "-m", "pip", "install", "--upgrade", "pip"], + check=True, + ) + subprocess.run( + [str(venv_python()), "-m", "pip", "install", *PINS], + check=True, + ) + + +def warmup_models() -> None: + """Trigger first-run model download so the first conversion is fast.""" + print("Downloading layout/OCR models (one-time)...", flush=True) + subprocess.run( + [str(venv_python()), "-c", + "from marker.models import create_model_dict; create_model_dict()"], + check=True, + ) + + +def main() -> int: + if backend_imports(): + return 0 + # marker is not importable at this point, so (re)build even if a half-finished venv exists + create_venv() + warmup_models() + print(f"read-pdf setup complete. Backend: {BACKEND}", flush=True) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/read-pdf/README.md b/skills/read-pdf/README.md new file mode 100644 index 0000000..3c1e22b --- /dev/null +++ b/skills/read-pdf/README.md @@ -0,0 +1,119 @@ +# `/read-pdf` — Download, Convert, and Deep-Read Academic Papers + +**Same workflow as `/split-pdf`, but uses python:marker to convert the PDF to markdown locally first, instead of having Claude vision-read PDF page images.** This makes equation, table, and figure extraction more faithful, and avoids image-based context bloat in the parent conversation. + +**Skill location:** [`.claude/skills/read-pdf/SKILL.md`](../../.claude/skills/read-pdf/SKILL.md) + +--- + +## What This Skill Does + +You give Claude a paper — either a local PDF file or a search query — and it does the rest. It finds and downloads the paper (or uses your local file in place), converts it to clean markdown using python:marker, then reads that markdown to write structured notes. When finished, it saves a persistent `_text.md` extraction alongside the source PDF, in the same format produced by `/split-pdf`. 
+ +--- + +## Why It Exists + +`/split-pdf` reads PDFs by having Claude vision-read page images in batches. This works well for most papers but has two limitations: + +1. **Equation fidelity.** PDF page images render math as bitmaps. Vision-reading bitmaps produces approximate LaTeX transcriptions. Papers heavy with structural equations benefit from native math extraction. +2. **Table structure.** Complex tables are harder to transcribe accurately from images than from a layout-aware text conversion. + +`/read-pdf` addresses both by running a local conversion step first. The result is a `markdown.md` file where equations are native LaTeX math mode and tables are pipe-syntax markdown — readable as text rather than image bitmaps. + +--- + +## How It Works + +``` +~/.cache/claude-pdf-converter/ +├── venv-marker/ # one-time install of marker-pdf +└── cache/ + └── marker/ + └── / + ├── markdown.md # conversion + inline ![](figures/...) + ├── figures/ + │ ├── fig_1.png + │ └── fig_2.png + └── meta.json # backend, page/figure counts, timestamp +``` + +| Step | Action | +|------|--------| +| **Acquire** | Download the PDF via web search or use a local file in place | +| **Install** | `install.py` sets up the marker venv on first run (~500 MB, one-time) | +| **Check cache** | SHA-256 hash check — skip re-conversion if markdown already exists | +| **Convert** | `convert.py` runs marker and writes `markdown.md` to the content-hash cache | +| **Collision** | If `_text.md` already exists, ask: overwrite or save as `_text2.md`? 
| +| **Extract** | Read `markdown.md`, write bibliographic metadata + 8-dimension notes | +| **Persist** | Save final extraction to `_text.md` alongside the source PDF | + +### Usage + +``` +/read-pdf path/to/paper.pdf +/read-pdf "Gentzkow Shapiro Sinkinson 2014 competition newspapers" +``` + +When called by another skill, the caller can invoke `convert.py` directly via bash rather than spawning `/read-pdf` as a slash command — the script is the conversion contract. + +### First-run cost + +The first invocation on a fresh machine creates a venv at `~/.cache/claude-pdf-converter/venv-marker/` and downloads marker's layout/OCR models (~500 MB, 1–3 min). The skill prints a one-line warning so the user knows why it is slow. Every subsequent invocation skips this setup entirely. + +The venv lives **outside any git repo** so the model files do not pollute a checkout. + +--- + +## Conversion Backend + +The backend is fixed to **marker** (`marker-pdf`). Marker was selected after a bake-off on empirical-economics PDFs because it performed well on equation fidelity, table structure, and figure extraction quality. + +Backend selection is not exposed as a runtime option. If a future backend candidate should replace marker, edit the `BACKEND` constant in `convert.py` so the cache namespace and venv are regenerated cleanly. + +### Born-digital PDFs and OCR + +Most journal PDFs already contain an embedded text layer. For those files, `convert.py` samples the first pages with `pdftotext` and tells marker to use the embedded text rather than re-OCRing the whole document. Marker still performs layout, table, and selected region recognition, but avoids the slow full-document OCR path. If the text-layer sample is missing or too sparse, marker keeps OCR enabled for scanned PDFs. + +### GPU acceleration + +Auto-detected: NVIDIA CUDA → CPU. MPS on Apple Silicon is excluded because surya's layout model crashes at runtime on MPS with an index-bounds error. 
No flags are needed on any platform. + +--- + +## Output Contract + +`/read-pdf` writes the same `_text.md` format as `/split-pdf`: a bibliographic metadata block followed by eight research-note dimensions. + +``` +## Bibliographic metadata +doi: <10.xxxx/yyyy or null> +authors: [LastName1, LastName2, ...] +title: +year: +venue: +venue_type: journal | working_paper | book_chapter | other +``` + +This means downstream skills like `/bib-update` and `/wiki-update` can consume outputs from either `/split-pdf` or `/read-pdf`. + +--- + +## Failure Mode + +Hard fail. If marker errors on a given PDF (encrypted, malformed, OCR fails), the script exits non-zero and the caller surfaces the error. There is no silent fallback to `pdftotext` or any other tool — silent fallbacks can produce wrong conversions that look plausible on inspection. + +--- + +## Limitations + +- **First-run is slow** — venv creation + model download takes 1–3 minutes. After that, conversion of a typical 30-page paper takes ~30s–2min depending on hardware. +- **Requires writable cache space** at `~/.cache/claude-pdf-converter/`. +- **Conversion can fail on malformed PDFs.** If `convert.py` errors, use `/split-pdf` instead. +- **Cache is not auto-evicted** — re-converting the same PDF is free, but the cache grows monotonically. Wipe with `rm -rf ~/.cache/claude-pdf-converter/cache/` if needed. + +--- + +## Acknowledgments + +The in-place PDF handling, persistent `_text.md` extraction, build directory convention, and agent isolation protocol follow conventions established in the `/split-pdf` skill, where they were inspired by improvements identified by [Ben Bentzin](https://www.mccombs.utexas.edu) (Associate Professor of Instruction, McCombs School of Business, University of Texas at Austin). The marker integration (`convert.py`, `install.py`) and content-hash caching design are original to this skill.
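For readers who want to see the cache check from "How It Works" in code, the following is a rough sketch, not the contents of `convert.py`. The cache root comes from the directory layout above; the function names `pdf_sha256` and `cached_markdown`, and the assumption that each cache entry is a directory named after the hex SHA-256 of the PDF bytes, are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Cache root as documented above; the per-PDF directory name is an assumption.
CACHE_ROOT = Path.home() / ".cache" / "claude-pdf-converter" / "cache" / "marker"


def pdf_sha256(pdf_path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the PDF in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_markdown(pdf_path: Path, cache_root: Path = CACHE_ROOT) -> Optional[Path]:
    """Return the cached markdown.md for this PDF, or None if it must be converted."""
    candidate = cache_root / pdf_sha256(pdf_path) / "markdown.md"
    return candidate if candidate.is_file() else None
```

Because the key depends only on file contents, renaming or moving a PDF would still hit the cache, while any change to the bytes would trigger a fresh conversion.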