From 9cf92d7724b7585683e8041be6b876ed2a5a14cc Mon Sep 17 00:00:00 2001
From: Noah Miller
Date: Fri, 8 May 2026 11:32:49 -0400
Subject: [PATCH] Add wiki-update reference ingest skill

---
 .claude/skills/wiki-update/README.md | 113 ++++
 .claude/skills/wiki-update/SKILL.md | 542 ++++++++++++++++++
 .../templates/references_CLAUDE.md | 106 ++++
 skills/README.md | 1 +
 skills/wiki-update/README.md | 124 ++++
 5 files changed, 886 insertions(+)
 create mode 100644 .claude/skills/wiki-update/README.md
 create mode 100644 .claude/skills/wiki-update/SKILL.md
 create mode 100644 .claude/skills/wiki-update/templates/references_CLAUDE.md
 create mode 100644 skills/wiki-update/README.md

diff --git a/.claude/skills/wiki-update/README.md b/.claude/skills/wiki-update/README.md
new file mode 100644
index 0000000..d6e43ff
--- /dev/null
+++ b/.claude/skills/wiki-update/README.md
@@ -0,0 +1,113 @@
+# Wiki-Update (`/wiki-update`)
+
+> **Ingest new PDFs from a project's `references/raw/` into the project's wiki — relevance-filtered, atomically per paper, with BibTeX maintenance handed to `/bib-update`.**
+
+`/wiki-update` is the maintenance skill for the project wiki created by `/newproject`. You drop new papers into `references/raw/`, invoke the skill, and it discovers them, summarizes them through the lens of your project's research focus, writes new wiki pages, updates existing pages with cross-references, and then calls `/bib-update` to refresh the central BibTeX file — all without re-reading papers it has already processed.
+
+## When to use it
+
+After adding one or more PDFs to `references/raw/`. Triggers include:
+- "ingest new references"
+- "update the wiki"
+- "process the new papers I added"
+
+The skill is also designed to be safe to re-invoke: if a previous run failed mid-way, the next invocation rediscovers the unprocessed papers and retries from scratch.
+
+## Prerequisites
+
+**This skill expects** (or creates on first invocation):
+
+- `references/raw/` (PDFs land here)
+- `references/wiki/` (concept pages and the running log)
+- `references/CLAUDE.md` (project-specific wiki conventions)
+- A filled-in project root `CLAUDE.md` with research question, data sources, and identification strategy
+
+**First-run note.** If `references/raw/` does not exist when you invoke this skill, it will be created automatically along with the rest of the wiki structure (`references/wiki/`, `references/CLAUDE.md`, `wiki/index.md`, `wiki/log.md`). Just put PDFs in `references/raw/` for the next invocation to ingest.
+
+The project root `CLAUDE.md` is *not* auto-created — if it's missing or its placeholders are unfilled, `/wiki-update` stops and asks you to fill in the project context first. It will not guess the research question.
+
+## What it does
+
+```
+/wiki-update # ingest all new PDFs
+/wiki-update "focus on IV strategies" # additional batch focus
+```
+
+The skill works in four phases:
+
+### 1. Pre-flight (main session)
+
+- Verifies the project structure exists
+- Reads `references/CLAUDE.md` for project-specific format conventions
+- Reads `references/wiki/log.md` to find previously-ingested filenames
+- Lists new PDFs and proposes filename normalizations (per the convention `Last_Year_Venue.pdf` / `Last1_etal_Year_Venue.pdf`)
+- Presents proposed renames in a single batch for one approve/edit/reject decision
+
+### 2. Per-paper ingest (subagents)
+
+For each new paper, a subagent runs in isolation (so PDF page images don't bloat the main conversation context). The subagent:
+
+- Picks the right ingest protocol:
+  - **Protocol M** — use `/read-pdf`'s marker conversion cache when available
+  - **Protocol E** — use an existing `_text.md` extract directly
+  - **Protocol S** — split the PDF into 4-page chunks and read the splits without the interactive `/split-pdf` confirmation gate
+- Extracts content along the 11 dimensions defined by the skill, including tables and figures
+- Applies the **relevance filter**: sections directly relevant to the project's research focus get full treatment in the wiki; less-relevant sections get a one-liner with a page reference. Nothing is fully omitted.
+- Writes new concept pages and appends to existing pages directly
+- Returns proposed *destructive* edits (rewording existing claims, deleting lines, modifying summaries) to the main session as diffs for user approval
+
+### 3. Per-paper atomicity (main session)
+
+For each paper:
+- Apply approved destructive edits
+- Append a single line to `wiki/log.md` *only after* all wiki edits succeed
+- If anything fails before the log write, the next invocation will rediscover the paper as new and retry
+
+### 4. BibTeX handoff (main session)
+
+After all papers are ingested, call `/bib-update` in append-only mode. `/bib-update` reads the `## Bibliographic metadata` blocks from `_text.md` extracts, skips citation keys already present in `references/references.bib`, and handles DOI/CrossRef/OpenAlex/fallback entry generation.
+
+## Key features
+
+### Three-tier reuse
+
+The skill never re-reads a paper it has already processed. Cached `_text.md` extracts and existing splits are detected automatically and reused. A second invocation on the same papers is essentially free.
+
+### Subagent isolation
+
+Long PDFs render as image data that accumulates permanently in conversation context. Two or three large papers in the main session can crash the conversation. By delegating all PDF reading to subagents, the main session's context stays bounded — the main session only ever reads plain markdown returned by the subagents.
+
+### Atomic per-paper writes
+
+Each paper either fully succeeds at the wiki layer (all wiki edits + log entry) or fully fails with no log entry. If a run fails after the rollback journal is built, created/modified wiki pages are restored before retry. Converter caches, split `notes.md`, and `_text.md` extracts are treated as reusable intermediate artifacts, not part of the wiki rollback journal.
+
+### Project-context-driven relevance filtering
+
+The wiki is not a neutral summary archive — it's a focused map of the literature relevant to *this* project. The skill reads the project's research question and identification strategy from `CLAUDE.md` and uses that to decide what gets full treatment vs. a one-liner. The optional batch focus argument supplements this for unusual cases.
+
+### BibTeX handoff
+
+The wiki ingest step writes the metadata that `/bib-update` needs, then delegates BibTeX maintenance to that skill. Keeping the fetch cascade in `/bib-update` avoids duplicating bibliography rules here.
+
+## Hard rules
+
+- **Never modifies source PDFs without approval** — canonical renames require the batched approval flow
+- **Never reads PDF extracts in the main session** — always delegated to subagents
+- **Never writes the log entry before wiki edits complete** — the log lags, never leads
+- **Never invents project context** — if `CLAUDE.md` placeholders are unfilled, stops and asks
+- **Never renames a PDF without user approval** — even a single non-conforming file goes through the batched propose/approve flow
+
+## Files in this skill
+
+- [`SKILL.md`](SKILL.md) — the full operational protocol (pre-flight, per-paper subagent prompt, BibTeX handoff, error handling)
+
+## Related skills
+
+- **`/newproject`** — creates the directory structure that `/wiki-update` assumes. If you haven't scaffolded a project with `/newproject`, this skill will fail pre-flight.
+- **`/read-pdf`** — preferred Protocol M converter for layout-aware markdown, tables, figures, and equations.
+- **`/split-pdf`** — the underlying batched-reading method. `/wiki-update` inlines the splitting logic rather than invoking `/split-pdf` directly, because `/split-pdf` has a per-batch user-confirmation gate that would deadlock inside a subagent.
+- **`/bib-update`** — maintains `references/references.bib` from the metadata blocks written during ingest.
+
+## Acknowledgments
+
+The conceptual foundation for this skill — maintaining a project-specific LLM-readable wiki that grows alongside the research and is consumed by future LLM sessions as compressed institutional memory — is owed to [Andrej Karpathy's LLMwiki concept](https://x.com/karpathy). `/wiki-update` operationalizes that idea for empirical-economics workflows: relevance-gated ingestion of new papers into a structured wiki that the project's `CLAUDE.md` indexes.
diff --git a/.claude/skills/wiki-update/SKILL.md b/.claude/skills/wiki-update/SKILL.md
new file mode 100644
index 0000000..0afbb9e
--- /dev/null
+++ b/.claude/skills/wiki-update/SKILL.md
@@ -0,0 +1,542 @@
+---
+name: wiki-update
+description: >-
+  Ingest new PDFs from a project's references/raw/ folder into the project's wiki,
+  following the project's wiki conventions and filtering for relevance to the
+  project's research focus. Auto-detects the best ingest path: converted markdown
+  (via read-pdf's converter, if installed) for high-fidelity tables, figures, and
+  equations; cached structured extract (_text.md) if available; or full split-PDF
+  pipeline as fallback. Creates `references/raw/`, `references/wiki/`, and
+  `references/CLAUDE.md` on first invocation if absent. Calls `/bib-update`
+  automatically at the end to refresh `references/references.bib`. Use when the
+  user adds new papers to references/raw/ and asks to update the wiki, or says
+  "ingest new references", "update the wiki", or similar.
+allowed-tools: Read, Edit, Write, Glob, Grep, Bash(ls*), Bash(pdftotext:*), Bash(python3:*), Bash(mv:*), Bash(cp:*), Bash(mkdir:*), Bash(touch:*), Agent
+argument-hint: [optional focus or theme for this batch]
+---
+
+# wiki-update: Ingest new references into the project wiki
+
+Maintains a project's reference wiki by ingesting newly-added PDFs from `references/raw/`, summarizing each through the lens of the project's research focus, and updating the wiki atomically per-paper.
+
+**Ingest path is auto-detected per paper.** If the read-pdf converter is installed, it runs first for high-fidelity markdown (Protocol M: best tables, figures, and equation handling). If only a cached `_text.md` extract exists, that feeds wiki writing directly (Protocol E). Otherwise the full split-PDF vision pipeline runs (Protocol S). All three paths produce the same wiki output — the difference is quality of table and figure capture.
+
+**`pdftotext` is not an ingest source.** It is allowed only for narrow pre-flight tasks: first-page filename proposals when the converter is unavailable, metadata checks needed for `/bib-update`, and other explicit bootstrap/diagnostic checks that do not synthesize wiki content. Once a paper is assigned to Protocol M or Protocol E, do not use `pdftotext` to read, summarize, validate, or supplement substantive content. Wait for the selected input (`markdown.md` or `_text.md`) and read that source only.
+
+## When this skill is invoked
+
+The user has added one or more PDFs to `references/raw/` and wants the wiki updated. The optional argument is a free-form focus string (e.g., "focus on IV strategies and instrument validity") that applies to this batch in addition to the project's standing context.
+
+## Pre-flight (main session)
+
+Run all checks before any ingest work. If anything fails, stop and ask the user.
+
+### 0. Lazy scaffolding (first invocation in a project)
+
+Before the other pre-flight checks, self-bootstrap the wiki structure if it's absent. All steps are idempotent — re-invocations against an already-scaffolded project are no-ops.
+
+**a. Check for `references/`.** If `./references/` does not exist:
+
+1. Create the directory tree:
+   ```bash
+   mkdir -p references/raw references/wiki references/wiki/figures
+   ```
+2. Render `references/CLAUDE.md` from the template at `~/.claude/skills/wiki-update/templates/references_CLAUDE.md`, substituting `{{PROJECT_NAME}}` with the current project's name (use the basename of the project root — typically the current working directory).
+3. Initialize `references/wiki/index.md`:
+   ```markdown
+   # Wiki Index — [project-name]
+
+   | Page | Description |
+   |------|-------------|
+   ```
+4. Initialize `references/wiki/log.md`:
+   ```markdown
+   # Wiki Log — [project-name]
+
+   | Date | Source | Changes |
+   |------|--------|---------|
+   ```
+
+If `./references/` already exists, skip. Do not clobber any existing files.
+
+**b. Append a wiki-references entry to the project's root `CLAUDE.md` (idempotent).** If `./CLAUDE.md` exists at the project root and does NOT already contain a reference to `references/CLAUDE.md` (grep for the literal string `references/CLAUDE.md`), append:
+
+```markdown
+- See `references/CLAUDE.md` for wiki conventions and the project's reference library.
+```
+
+If `./CLAUDE.md` does not exist, skip silently.
+
+After this self-bootstrap, the rest of the pre-flight (steps 1–6 below) runs as before.
+
+### 1. Locate the wiki
+
+Check that `./references/raw/` and `./references/wiki/` both exist relative to the current working directory. If either is still missing after the lazy-scaffolding step, ask the user where the wiki lives. Do not search parent directories.
+
+Read `./references/CLAUDE.md` for project-specific wiki conventions (page format, citation rules, naming). These conventions take precedence over anything in this skill if they conflict — this skill defines *workflow*, not *format*.
+
+### 2. Verify project context is filled in
+
+Read `./CLAUDE.md` (the project root file). Check the "Research Question," "Data Sources," and "Identification Strategy" fields (or their equivalents). If any are still placeholder text — bracketed phrases like `[What are you trying to answer?]`, `[What data are you using?]`, or otherwise unfilled — **stop and ask the user to fill them in first**. Explain that relevance filtering depends on this context.
+
+The optional `[focus]` argument supplements but does not replace the project CLAUDE.md context.
+
+### 3. Discover new papers
+
+Read `./references/wiki/log.md` to find previously-ingested filenames. List files in `./references/raw/` that do not appear in the log.
+
+**Non-PDF files:** If any non-PDF files are present in `raw/`, surface them before continuing:
+
+```
+Non-PDF files found in references/raw/:
+These were skipped for ingest. Move them elsewhere if they don't belong, or tell me if any should be treated differently.
+```
+
+Include skipped filenames in the end-of-run summary under "Skipped (non-PDF)."
+
+Proceed with PDF files only, in filename-sorted order. If no new PDFs are found, report that and exit.
+
+### 4. Normalize filenames
+
+Each new PDF must conform to the project naming convention before ingest. This runs in the **main session** (not subagents) so renames can be batched and approved once.
+
+**Convention:**
+- 1 author → `Last_Year_Venue.pdf`
+- 2 authors → `Last1_Last2_Year_Venue.pdf`
+- 3+ authors → `Last1_etal_Year_Venue.pdf`
+- Venue slug: standard econ journal abbreviation (`AER`, `JPE`, `QJE`, `JEEM`, `JHE`); `NBER` / `SSRN` / `IZA` for known WP series; `WP` for generic working papers; chapter abbrev or `Book` for book chapters.
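
The author-count rule in the convention above can be sketched as a small helper. This is only an illustration: `canonical_name` and its argument names are hypothetical and are not defined anywhere in this patch, and the sketch assumes that last names, year, and venue slug have already been extracted and approved.

```python
def canonical_name(last_names, year, venue):
    """Build a filename from author last names, year, and venue slug.

    Hypothetical helper mirroring the 1 / 2 / 3+ author convention.
    """
    if not last_names:
        raise ValueError("at least one author last name is required")
    if len(last_names) == 1:
        author_part = last_names[0]
    elif len(last_names) == 2:
        author_part = f"{last_names[0]}_{last_names[1]}"
    else:
        # Three or more authors collapse to the first author plus "etal".
        author_part = f"{last_names[0]}_etal"
    return f"{author_part}_{year}_{venue}.pdf"
```

For example, `canonical_name(["Smith"], 2024, "AER")` yields `Smith_2024_AER.pdf`. Choosing the venue slug itself is left out, since that depends on the abbreviation list above.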
+
+**Skip condition.** A filename matching
+```
+^[A-Z][a-zA-Z]+(_[A-Z][a-zA-Z]+|_etal)?(_[A-Z][a-zA-Z]+){0,2}_\d{4}_[A-Z][A-Za-z]+\.pdf$
+```
+is already-conforming and passed through untouched. Non-conforming files go through the propose-and-approve flow below.
+
+**Extracting text for name proposal:**
+
+For each non-conforming file, extract enough text to propose a name. Choose the method based on what's available:
+
+- **If `~/.claude/skills/read-pdf/convert.py` exists:** run
+  ```bash
+  python3 ~/.claude/skills/read-pdf/convert.py ""
+  ```
+  Capture the printed path to `markdown.md`. Read the first ~2000 characters of `markdown.md` — this covers title, authors, year, and venue. This also primes the converter cache for the ingest step that follows (the cache is SHA-keyed, so renaming the PDF after this point does not invalidate it).
+
+- **Otherwise:** run `pdftotext -l 1 "" -` and read the output.
+
+If either method returns empty or <50 chars of non-whitespace, mark the file as **unparseable** and flag for manual handling.
+
+This `pdftotext` fallback is for filename proposal only. Do not reuse its output for paper synthesis, wiki page writing, tables, figures, or relevance filtering.
+
+**Batched approval.** After proposals for all non-conforming files are ready, present as one block:
+
+```
+Proposed renames (N files):
+
+  ...
+
+Already conform (skipped): K files
+
+Unparseable (needs manual decision):
+  ⚠ — extraction failed:
+  Keep as-is / Provide name?
+
+Approve all / Edit (per-file) / Reject all?
+```
+
+- **Approve all** → apply all renames via `mv`.
+- **Edit** → per-file review; for each, user can approve, edit, or skip.
+- **Reject all** → proceed with no renames.
+
+**Collision handling** (before any `mv`): proposed name matches existing file → block and ask user to provide an alternative. Two proposals in the batch collide with each other → flag both, require disambiguation (e.g., appending a title word).
+
+Never silently overwrite. Never proceed past a collision without user input.
+
+After renames are applied, re-list new PDFs under their new names before continuing.
+
+### 5. Pre-scan: classify each paper into an ingest tier
+
+For each new paper (using its post-rename canonical name), determine its ingest protocol. This classification runs entirely in the main session — each subagent receives exactly one protocol with no branching.
+
+**Check order (stop at the first match):**
+
+1. **Tier M — Converted markdown:** `~/.claude/skills/read-pdf/convert.py` exists, **and** running it for this PDF succeeds (cache hit is instant; a miss triggers the full conversion here). Capture the returned `markdown.md` path and cache directory. If `convert.py` was already run during step 4 for this paper, it was cached — re-running is a no-op.
+
+   If `convert.py` exists but fails for a specific paper (conversion error), report the error, skip tier M for that paper, and fall through to tier E or S. Do not use `pdftotext` as a temporary or parallel substitute while conversion is running or after conversion fails.
+
+2. **Tier E — Cached extract:** `references/raw/_text.md` exists. No conversion needed.
+
+3. **Tier S — Split-PDF pipeline:** Neither of the above. Check whether `references/raw/raw_build/split_/` already exists (splits cached from a prior interrupted run) — pass this as `splits_exist=true|false` to the subagent.
+
+**Report tier breakdown once, before spawning subagents:**
+
+```
+Ingest tiers for this batch:
+  M (converted markdown): N papers
+  E (cached extract): M papers
+  S (full pipeline): K papers [X with cached splits]
+
+[If any converter failures:]
+  ⚠ Converter failed for: — falling back to E or S
+```
+
+### 6. Read the wiki index
+
+Load `./references/wiki/index.md` once. Pass it into each per-paper subagent so it can match new concepts against existing pages and avoid creating duplicates.
+
+---
+
+## Per-paper ingest (subagent)
+
+Spawn one Agent per paper, sequentially. The main session must not read PDF extracts or markdown directly — delegate deep reading to subagents to bound context.
+
+Each subagent prompt must be self-contained — the agent has no memory of this conversation. Include:
+
+- Absolute paths: PDF, input source (markdown.md, `_text.md`, or PDF/splits), `references/raw/`, `references/wiki/`, `references/wiki/figures/`, `references/CLAUDE.md`
+- The tier (M, E, or S) and `splits_exist` flag if tier S
+- Current `wiki/index.md` contents (for disambiguation)
+- Project context block: research question, data sources, identification strategy (from `./CLAUDE.md`)
+- Optional batch focus string (if provided as the skill argument)
+- The verbatim protocol for this tier (M, E, or S — from the sections below)
+- The common verbatim sections: structured-extraction dimensions, tables protocol, figures protocol variant for this tier, substantive-change rule, concept page disambiguation, relevance filtering, subagent return value
+
+---
+
+### Protocol M — Converted Markdown
+
+*Input:* path to `markdown.md` (in the converter cache), path to the cache directory (for figures), canonical paper basename.
+
+Protocol M reads only the converted `markdown.md`, `meta.json`, and cache-local figure/equation files. Do not inspect the source PDF with `pdftotext` or any other text extractor for substantive synthesis, even if conversion is slow. If conversion is still running, wait.
+
+**Step 1: Check for equation fallback.**
+
+Read `/meta.json`. If `equation_extraction_mode == "image_fallback"`, equations were extracted as `/figures/eq_*.png` rather than inline LaTeX. Before synthesis, transcribe each:
+
+```
+Read the image at . It is a single equation clipped from an academic paper.
+Transcribe it as LaTeX, in display math mode ($$ ... $$). Output only the LaTeX —
+no commentary, no surrounding text. If the equation is not legible, output "[unreadable equation]".
+```
+
+Edit `/markdown.md` in place to replace each `![](figures/eq_N.png)` with the transcribed LaTeX. (The cache markdown is scratch — overwriting is fine; `convert.py` regenerates it on a hash miss.)
+
+**Step 2: Synthesize `_text.md`.**
+
+Read `markdown.md`. Produce `references/raw/_text.md` following the `_text.md` structure below (bib block, plain-English synthesis, 11 structured dimensions). Write or overwrite if a prior partial file exists.
+
+For the bib metadata block: scan `markdown.md` for the DOI regex `10\.\d{4,}/\S+`. Extract authors, title, year, and venue from the title page text. Record null for any field not found.
+
+**Step 3: Copy and classify relevant figures.**
+
+For each figure in `markdown.md` (referenced as `![](figures/fig_N.png)`):
+1. Identify the paper figure number from surrounding caption text.
+2. Apply the project-relevance filter. Non-relevant: one-line description + page ref only; do not copy.
+3. For relevant figures:
+   - Copy from cache to wiki: `cp /figures/fig_N.png references/wiki/figures/_fig.png` (where M is the paper's figure number). Before the first copy, run `mkdir -p references/wiki/figures` (idempotent).
+   - Classify as Tier A (data figure: scatter, line, bar, coefplot, histogram, density, time series, RD/event-study plot) or Tier B (schematic: DAG, conceptual diagram, map, flowchart, theoretical model). Use the caption text; read the PNG only if the caption is genuinely ambiguous.
+
+**Step 4: Write wiki pages** using the substantive-change rule and relevance filtering below.
+
+For relevant figures embedded in wiki concept pages, use this format regardless of Tier A/B:
+
+```markdown
+**Figure N:** (p. 12)
+
+![](../figures/_figN.png)
+
+- Key visual finding:
+- **Figure notes:**
+```
+
+The Tier A/B distinction lives in `_text.md` only (full optical decomposition for Tier A; schematic one-liner for Tier B). Wiki pages use the same lightweight embed format for all figures.
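
The DOI scan from Step 2 can be sketched with the standard `re` module, using the pattern quoted in this skill. The function name `find_doi` is hypothetical, and stripping trailing punctuation is an added assumption: `\S+` is greedy, so a DOI at the end of a sentence would otherwise keep its final period.

```python
import re

# Pattern copied from the skill text: "10.", four or more digits, "/", then non-whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,}/\S+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if nothing matches."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Assumption (not in the skill): drop trailing sentence punctuation that
    # the greedy \S+ picked up. A DOI that really ends in one of these
    # characters would be truncated by this step.
    return match.group(0).rstrip(".,;)")
```

Returning `None` on a miss lines up with the instruction to record null for any field not found.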
+
+**Return value additions for Protocol M:**
+
+```
+Figures copied: [list of {source_cache_path, dest_wiki_path, paper_figure_label}]
+Equation fallback used: (with count and any "[unreadable equation]" instances if true)
+```
+
+---
+
+### Protocol E — Cached Extract
+
+*Input:* path to `references/raw/_text.md`.
+
+**Step 1: Read the extract.**
+
+Read `_text.md` in full. Extract the `## Bibliographic metadata` block for the return value. Note any CLIP placeholders in the figures sections (these were created by a prior Protocol S run and are still pending).
+
+Protocol E reads only the cached `_text.md` and any figure files it references. Do not re-read the PDF with `pdftotext` to expand or validate the extract.
+
+**Step 2: Write wiki pages** using the substantive-change rule and relevance filtering below.
+
+For figures: if `_text.md` references wiki figure paths that already exist on disk (from a prior Protocol M run), embed them in wiki pages using the same lightweight format as Protocol M. If `_text.md` contains CLIP placeholders, pass them through to the wiki and aggregate them into the Pending CLIPs return field.
+
+Do **not** re-synthesize or overwrite `_text.md` — it is the canonical extract for this paper.
+
+**Return value additions for Protocol E:**
+
+```
+Pending CLIPs: [list of {target_path, source_paper, page_number, one_liner} — forwarded from _text.md]
+```
+
+---
+
+### Protocol S — Split-PDF Pipeline
+
+*Input:* absolute path to the PDF, absolute path to the splits directory (`references/raw/raw_build/split_/`), `splits_exist` boolean.
+
+**Step 1: Split (if needed).**
+
+If `splits_exist=false`: split the PDF into 4-page chunks using PyPDF2, writing to `/`. The canonical splits directory is `references/raw/raw_build/split_/` — use this exact path. Do not derive it yourself.
+
+**Step 2: Read splits in batches of 3.**
+
+Read each split sequentially in batches of 3, without pausing or asking for confirmation. After each batch, append findings to `/notes.md` under the structured-extraction dimensions below, preceded by a batch boundary comment:
+
+```
+
+```
+
+If `notes.md` already exists (prior interrupted run), read it first and resume from where it left off — do not overwrite earlier content. `notes.md` is append-mostly and permanent; never delete it.
+
+**Step 3: Synthesize `_text.md`.**
+
+After all splits are read, write `references/raw/_text.md` from the accumulated `notes.md` content. Follow the `_text.md` structure below (bib block, plain-English synthesis, 11 dimensions).
+
+For the bib metadata block: scan the first split for the DOI regex `10\.\d{4,}/\S+`. Extract authors, title, year, and venue from the first-split text. Record null for any field not found.
+
+`notes.md` is permanent — do not delete it after writing `_text.md`.
+
+**Step 4: Write wiki pages** using the substantive-change rule and relevance filtering below.
+
+For figures: Protocol S does not have extracted figure images. Use CLIP placeholders for all Tier B figures and for any Tier A data figures that cannot be adequately described in text. A structured Tier A block suffices when the data description is complete; use a CLIP placeholder when it isn't.
+
+CLIP placeholder format in `_text.md`:
+
+```
+> **Figure N (CLIP):** (p. 12)
+> One-liner:
+> ACTION: clip from PDF, save to references/wiki/figures/_fig.png
+```
+
+When a wiki page references a CLIP figure, use a broken image link (it renders as a visible TODO):
+
+```markdown
+![](../figures/_figN.png)
+* ([](../log.md), p. 12)*
+```
+
+Before writing any CLIP placeholder that references the figures directory, ensure it exists: `mkdir -p references/wiki/figures`.
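
The arithmetic behind Steps 1 and 2 (4-page chunks, read in batches of 3) can be sketched in plain Python. This shows only the page-range and batching logic under assumed names (`chunk_ranges`, `batches`); writing each chunk to disk with PyPDF2 is deliberately left out.

```python
def chunk_ranges(num_pages, chunk_size=4):
    """Return (start, end) page-index pairs: 0-based, end exclusive."""
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def batches(items, batch_size=3):
    """Group items into consecutive batches; the last one may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical 30-page paper: 8 chunks, read as batches of 3, 3, and 2 splits.
reading_plan = batches(chunk_ranges(30))
```

Each inner list in `reading_plan` corresponds to one pass of Step 2, after which findings would be appended to `notes.md`.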
+
+**Return value additions for Protocol S:**
+
+```
+Pending CLIPs: [list of {target_path, source_paper, page_number, one_liner}]
+```
+
+---
+
+### Common: `_text.md` structure
+
+All protocols that synthesize `_text.md` (M and S) use this layout:
+
+```markdown
+## Bibliographic metadata
+doi: <10.xxxx/yyyy if found, else null>
+authors: [LastName1, LastName2, ...]
+title:
+year:
+venue:
+venue_type: journal | working_paper | book_chapter | other
+
+## Plain-English synthesis
+[~200 words, see below]
+
+## 1. Research question
+...
+## 2. Audience
+...
+[continue through dimension 11]
+```
+
+### Common: Plain-English synthesis block
+
+Hard cap: ~200 words. No jargon. Cover:
+
+- Research question (1 sentence)
+- Motivation / why it matters (1–2 sentences)
+- What they estimate and how, in plain terms (2–3 sentences)
+- What they found (1–2 sentences)
+- The take-away — what someone should walk away believing or doing differently (1 sentence)
+
+This block is the answer to "what's this paper about?" for someone who will not read the rest. Anyone with a college degree should be able to read it without a glossary. If you find yourself writing "endogeneity" or "LATE" or "first-stage F-stat," rewrite in plainer terms.
+
+### Common: Structured-extraction dimensions
+
+1. **Research question** — what the paper asks and why it matters
+2. **Audience** — sub-community of researchers who care
+3. **Method / identification strategy** — how they answer the question
+4. **Target parameter** — the estimand in plain terms (e.g., "ATE of schooling on log wages, conditional on age and state-by-year FE"). Distinct from method and identification assumptions.
+5. **Data** — sources, unit of observation, sample size, time period
+6. **Statistical methods / specifications** — econometric techniques, key specifications, key equations (extract verbatim in LaTeX math mode where available — Protocol M gets these from the converter; Protocol S extracts them from split text)
+7. **Findings** — key coefficients and standard errors
+8. **Contributions** — what is learned that we didn't know before
+9. **Replication feasibility** — data availability, replication archive
+10. **Tables (project-relevance gated)** — see Tables protocol below
+11. **Figures (project-relevance gated)** — see Figures protocol below
+
+### Common: Tables protocol (project-relevance gated)
+
+Apply the project-relevance filter. For tables *directly relevant* to the project's research focus, extract in machine-readable markdown. For non-relevant tables, one-line description with page reference.
+
+For relevant tables:
+
+```
+**Table N:** (p. 12)
+
+| Variable | (1) | (2) | (3) |
+|---|---|---|---|
+| Schooling | 0.087*** | 0.091*** | 0.085*** |
+| | (0.012) | (0.013) | (0.011) |
+| N | 12,450 | 12,450 | 12,450 |
+| R² | 0.34 | 0.36 | 0.38 |
+
+Notes:
+```
+
+Preserve column headers verbatim, numerical values verbatim (including SEs in parentheses and significance stars), and table notes verbatim. Pipe-syntax markdown only; no HTML tables. Table notes are part of the table's content — capture them.
+
+*Protocol M advantage:* the converter already produces pipe-syntax tables from the PDF. Extract them with light cleanup rather than re-reading the figures.
+
+### Common: Figures protocol (project-relevance gated, two-tier)
+
+Apply the project-relevance filter. Non-relevant figures: one-line description with page reference only.
+
+For relevant figures, classify as Tier A or Tier B using caption text:
+
+- **Tier A — Data figures**: scatter, line, bar, coefplot, histogram, density, time series, RD/event-study plot. The data IS the content.
+- **Tier B — Schematic figures**: DAGs, conceptual diagrams, maps, flowcharts, theoretical model schematics. Do NOT attempt optical decomposition. Default to Tier B when uncertain — a structured Tier A block written for a schematic is misleading; a Tier B for a data figure just makes the reader look at the image.
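
As a rough illustration of the caption-based tiering above, the sketch below matches captions against the Tier A figure types listed in this section and falls back to Tier B otherwise. Keyword matching is an assumption made for illustration only. The skill leaves this judgment to the subagent, and `classify_figure` and the pattern list are hypothetical names.

```python
import re

# Hypothetical keyword patterns derived from the Tier A list above.
TIER_A_PATTERNS = [
    r"\bscatter",
    r"\b(line|bar) (plot|chart|graph)",
    r"\bcoefplot",
    r"\bcoefficient plot",
    r"\bhistogram",
    r"\bdensity",
    r"\btime series",
    r"\bevent[- ]study",
    r"\bregression discontinuity",
]


def classify_figure(caption):
    """Return "A" for a likely data figure, otherwise "B" (the safe default)."""
    for pattern in TIER_A_PATTERNS:
        if re.search(pattern, caption, flags=re.IGNORECASE):
            return "A"
    # No Tier A keyword found: treat as schematic, as the rule above requires.
    return "B"
```

A heuristic like this will misfile some captions (a conceptual diagram that mentions "density", for instance), so it can only stand in for the first pass. Genuinely ambiguous captions still need a look at the image.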
+
+**In `_text.md`:**
+
+*Protocol M* — figures are copied to `references/wiki/figures/`. Record:
+
+```
+**Figure N:** (p. 12)
+![](../wiki/figures/_figN.png)
+- Type:
+- X-axis: [Tier A only]
+- Y-axis: [Tier A only]
+- Series / panels: [Tier A only]
+- Key visual finding:
+- Annotations: [Tier A only]
+- **Figure notes:**
+[Tier B: replace the structured block with just: One-liner: ]
+```
+
+*Protocols E and S* — use CLIP placeholders (described in their respective protocol sections).
+
+### Common: Substantive-change rule (passed to subagent verbatim)
+
+The subagent applies non-destructive edits directly. Destructive edits to existing pages must be returned as proposed unified diffs — not applied.
+
+| Edit | Apply directly? |
+|---|---|
+| Create new wiki page | Yes |
+| Append new section / bullet / paragraph to existing page | Yes |
+| Add `[[backlink]]` (inline or under "Related pages") | Yes |
+| Update `**Last updated**` date | Yes |
+| Append a new source to `**Sources**` | Yes |
+| Note a contradiction between sources (additive note) | Yes |
+| Reorganize section order (no content lost) | Yes |
+| Update `wiki/index.md` (append new entries, edit existing one-liners) | Yes |
+| Copy an extracted figure into `references/wiki/figures/` | Yes |
+| Edit the `**Summary**` field on an existing page | **Return as diff** |
+| Delete any existing line | **Return as diff** |
+| Modify the wording of an existing claim | **Return as diff** |
+
+### Common: Concept page disambiguation (subagent)
+
+Before creating a new concept page, check `wiki/index.md` for existing pages covering the same concept — including obvious synonyms (e.g., "RDD" vs "regression discontinuity"). If a near-match exists but you aren't confident, do **not** create a new page; return the ambiguity to the main session as a question for the user.
+ +### Common: Relevance filtering (subagent) + +Apply "compress, don't omit": sections directly relevant to the project's research focus get full treatment. Less-relevant sections get a one-line description plus page reference. Nothing is fully omitted. + +### Common: Subagent return value + +``` +Pages created: [list] +Pages modified non-destructively: [list with brief description] +Proposed destructive edits: [list of {page, unified diff, rationale}] +Disambiguation questions: [list of {concept, candidate existing pages}] +Proposed log entry: [single line for wiki/log.md] +Pending CLIPs: [list of {target_path, source_paper, page_number, one_liner}] +[Protocol M only] Figures copied: [list of {source_cache_path, dest_wiki_path, paper_figure_label}] +[Protocol M only] Equation fallback used: +Errors: [any issues encountered] +``` + +--- + +## Per-paper atomicity (main session) + +For each paper, the main session uses a **journal-and-rollback** pattern to guarantee the wiki is never left in an inconsistent state. + +**Execute the write sequence:** +1. Spawn the subagent and wait for its return summary. +2. If there are disambiguation questions, ask the user; pass answers back via a follow-up SendMessage to the same agent (or apply decisions directly if simple). +3. Build the rollback journal from the returned page lists and proposed diffs: for every wiki page that will be created, note that it does not exist; for every wiki page that will be modified, read and save its current content. +4. Apply all non-destructive edits. +5. If there are proposed destructive edits, present them to the user as a single batched approval request (one prompt per paper). User can approve all, reject all, or selectively approve. Apply approved edits. +6. **Last:** append the log entry to `wiki/log.md`. 
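
Steps 3 through 6 of the write sequence, together with the rollback they make possible, could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the function names (`build_journal`, `roll_back`, `ingest_paper`) and the `apply_edits` callback are hypothetical, and the user-approval step for destructive edits is folded into that callback rather than modeled.

```python
from pathlib import Path
from typing import Callable, Optional

# Journal: page path -> original text, or None if the page did not exist yet.
Journal = dict[Path, Optional[str]]


def build_journal(pages: list[Path]) -> Journal:
    """Record the pre-ingest state of every page the paper will touch."""
    return {
        page: page.read_text(encoding="utf-8") if page.exists() else None
        for page in pages
    }


def roll_back(journal: Journal) -> None:
    """Restore each journaled page: delete new pages, rewrite modified ones."""
    for page, original in journal.items():
        if original is None:
            page.unlink(missing_ok=True)
        else:
            page.write_text(original, encoding="utf-8")


def ingest_paper(pages: list[Path], apply_edits: Callable[[], None],
                 log_path: Path, log_entry: str) -> bool:
    """Apply one paper's edits atomically; return True if it was logged."""
    journal = build_journal(pages)
    try:
        apply_edits()  # hypothetical callback that writes the wiki pages
        # The log entry comes last, only after every wiki edit succeeded.
        with log_path.open("a", encoding="utf-8") as log:
            log.write(log_entry + "\n")
    except Exception:
        roll_back(journal)
        return False  # nothing logged, so the next run retries this paper
    return True
```

Because the log is only appended after the edits succeed, a failed paper leaves no trace in `wiki/log.md` and is picked up again as new on the next invocation.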
+ +**On failure at any step after the rollback journal exists:** roll back: restore each touched page to its journaled state (delete newly-created pages; restore original content for modified pages). Do not write the log entry. The next invocation will rediscover the paper as new and retry cleanly. + +Do not implement partial-resume logic. The journal guarantees retry is always safe. + +After each paper finishes, move to the next. Do not batch papers. + +--- + +## Post-log: update `references/references.bib` + +After **all** papers have been ingested and logged, invoke `/bib-update` in append-only mode. It reads the `## Bibliographic metadata` blocks from each newly-ingested paper's `_text.md`, runs the DOI-direct → CrossRef → OpenAlex → LLM-fallback cascade, and appends new entries to `references/references.bib`. Papers already present in `.bib` are skipped automatically. + +To regenerate `.bib` from scratch, run `/bib-update --rebuild-bib` as a separate, explicit step — not as part of a normal ingest run. + +--- + +## End-of-run summary + +After all papers are processed, report: + +- Papers successfully ingested (with counts of pages created/modified, and for Protocol M: figures copied) +- Papers that failed (with brief reasons; user can re-invoke to retry) +- Any disambiguation decisions the user made +- Any equation-fallback transcriptions marked "[unreadable equation]" (so the user can manually fix them) +- **Pending figure clips (punch-list).** Aggregate every CLIP placeholder from all Protocol E and S subagents: + + ``` + Pending figure clips (N): + 1. references/wiki/figures/Smith_2024_AER_fig2.png + Smith_2024_AER, p. 14 — "DAG of identification strategy" + ... + ``` + + Open each PDF to the indicated page, clip the figure, save under the listed path. Wiki pages already reference these paths — broken-image placeholders resolve silently as each PNG is added. 
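
For illustration, the punch-list above could be assembled from the `Pending CLIPs` entries that each subagent returns. The sketch assumes each entry is a plain dict with the four keys named in the return-value block (`target_path`, `source_paper`, `page_number`, `one_liner`); the function name `format_punch_list` is hypothetical.

```python
def format_punch_list(clips: list[dict]) -> str:
    """Render pending CLIP placeholders as a numbered punch-list."""
    lines = [f"Pending figure clips ({len(clips)}):"]
    for number, clip in enumerate(clips, start=1):
        lines.append(f'{number}. {clip["target_path"]}')
        # \u2014 is the dash used in the punch-list format shown above.
        lines.append(
            f'   {clip["source_paper"]}, p. {clip["page_number"]} '
            f'\u2014 "{clip["one_liner"]}"'
        )
    return "\n".join(lines)
```

Aggregating in the main session keeps the list in one place even though the placeholders originate in separate per-paper subagents.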
+ +--- + +## Rules + +- **Never modify source PDFs in `references/raw/` without approval.** Canonical renames require the batched approval flow. `_text.md` extracts are generated artifacts and may be created by Protocol M/S; Protocol E treats existing `_text.md` extracts as canonical input and does not rewrite them. +- **Never read PDF extracts, markdown, or splits in the main session.** Always delegate deep reading to subagents. The main session's job is orchestration and approval. +- **Never write the log entry before wiki edits complete.** The log is the source of truth for "what's been ingested" — it must lag behind, not lead. +- **Never invent project context.** If `CLAUDE.md` placeholders are unfilled, stop and ask. Do not guess the research question. +- **Project conventions in `references/CLAUDE.md` override this skill** if they conflict on format/naming/citation. This skill owns workflow only. +- **Never rename a PDF without user approval.** Even a single non-conforming file goes through the batched propose/approve flow. No silent `mv`. No overwriting an existing file. +- **Never fall back from the converter silently.** If `convert.py` errors on a PDF, report the error and proceed to tier E or S for that paper — do not substitute pdftotext output without telling the user. +- **Never use `pdftotext` for substantive ingest.** `pdftotext` is limited to first-page metadata/filename/bootstrap checks. It must not be used to summarize, validate, or supplement Protocol M or Protocol E content. diff --git a/.claude/skills/wiki-update/templates/references_CLAUDE.md b/.claude/skills/wiki-update/templates/references_CLAUDE.md new file mode 100644 index 0000000..d9f7e78 --- /dev/null +++ b/.claude/skills/wiki-update/templates/references_CLAUDE.md @@ -0,0 +1,106 @@ +# LLM Wiki + +A personal knowledge base maintained by Claude Code. +Based on [Andrej Karpathy's LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f). 
+ +## Purpose + +This wiki is a structured, interlinked knowledge base for the {{PROJECT_NAME}} research project. +Claude maintains the wiki. The human curates sources, asks questions, and guides the analysis. + +## Folder structure + +``` +raw/ -- source documents (immutable -- never modify these) +wiki/ -- markdown pages maintained by Claude +wiki/index.md -- table of contents for the entire wiki +wiki/log.md -- append-only record of all operations +wiki/figures/ -- extracted figure clips from source PDFs +``` + +## Ingest workflow + +When the user adds a new source to `raw/` and asks you to ingest it: + +1. Read the full source document (via `/wiki-update`'s converter for PDFs, if available) +2. Discuss key takeaways with the user before writing anything +3. Create a summary page in `wiki/` named after the source +4. Create or update concept pages for each major idea or entity +5. Add wiki-links ([[page-name]]) to connect related pages +6. Update `wiki/index.md` with new pages and one-line descriptions +7. Append an entry to `wiki/log.md` with the date, source name, and what changed + +A single source may touch 10-15 wiki pages. That is normal. + +## Page format + +Every wiki page should follow this structure: + +```markdown +# Page Title + +**Summary**: One to two sentences describing this page. + +**Sources**: List of raw source files this page draws from. + +**Last updated**: Date of most recent update. + +--- + +Main content goes here. Use clear headings and short paragraphs. + +Link to related concepts using [[wiki-links]] throughout the text. + +## Related pages + +- [[related-concept-1]] +- [[related-concept-2]] +``` + +## Math notation + +Always write math in LaTeX/Markdown math mode, **never** as Unicode symbols or plaintext. + +- Inline math → `$x_{it}$`, `$\beta^S - \beta^N$`, `$\epsilon_t$` +- Display math → `$$ ... 
$$` on its own line +- Greek letters, subscripts, superscripts, operators (≤, ≥, ∑, ∫, ≈) all go inside `$...$`, never as Unicode + +This applies to every wiki page, every extraction note, and every string Claude writes into this wiki. Unicode math is hard to read, impossible to copy into a .tex file, and inconsistent across pages. + +## Citation rules + +- Every factual claim should reference its source file +- Use the format (source: filename.pdf) after the claim +- If two sources disagree, note the contradiction explicitly +- If a claim has no source, mark it as needing verification + +## Question answering + +When the user asks a question: + +1. Read `wiki/index.md` first to find relevant pages +2. Read those pages and synthesize an answer +3. Cite specific wiki pages in your response +4. If the answer is not in the wiki, say so clearly +5. If the answer is valuable, offer to save it as a new wiki page + +Good answers should be filed back into the wiki so they compound over time. + +## Lint + +When the user asks you to lint or audit the wiki: + +- Check for contradictions between pages +- Find orphan pages (no inbound links from other pages) +- Identify concepts mentioned in pages that lack their own page +- Flag claims that may be outdated based on newer sources +- Check that all pages follow the page format above +- Report findings as a numbered list with suggested fixes + +## Rules + +- Never modify anything in the `raw/` folder +- Always update `wiki/index.md` and `wiki/log.md` after changes +- Keep page names lowercase with hyphens (e.g. 
`machine-learning.md`) +- Write in clear, plain language +- When uncertain about how to categorize something, ask the user diff --git a/skills/README.md b/skills/README.md index ea92517..d4b626f 100644 --- a/skills/README.md +++ b/skills/README.md @@ -20,6 +20,7 @@ This directory contains documentation, methodology, and example output for the s | [**Compile Deck**](compiledeck/) | `/compiledeck` | The mechanical compile loop — preamble templates, palette reference, and TikZ rules. Called by `/beautiful_deck` for compile mechanics. Use directly when editing an existing deck rather than building from scratch. [See documentation →](compiledeck/) | | [**TikZ Audit**](tikz/) | `/tikz` | **A repair tool, not a safety net.** Finds and fixes residual visual collisions in TikZ figures using measurement, not intuition — six-pass protocol covering Bézier curve depths, edge-label gap calculations, boundary clearances, and cross-slide consistency. Catches what `pdflatex` misses. But it cannot reliably fix diagrams that were never built with measurement in mind. The upstream defense is `/beautiful_deck` Step 4.4, which writes safe TikZ from the start. `/tikz` is the downstream check. [See documentation →](tikz/) | | [**Split-PDF**](split-pdf/) | `/split-pdf` | Downloads and deep-reads academic PDFs without crashing the session. Uses the PDF in place (no centralized `articles/` folder), splits into 4-page chunks in a `_build/` directory, reads in batches of ~12 pages, writes structured notes, and saves a persistent `_text.md` extraction so future invocations skip re-reading. When called by another skill, reads inside a subagent to prevent context bloat. [See full walkthrough →](split-pdf/) | +| [**Wiki Update**](wiki-update/) | `/wiki-update` | Ingests PDFs from `references/raw/` into a project-specific markdown wiki. 
Uses project context from `CLAUDE.md`, routes paper reading through `/read-pdf`, cached extracts, or split-PDF fallback, gates destructive wiki edits behind approval, and hands BibTeX maintenance to `/bib-update`. [See documentation →](wiki-update/) | | [**Bibcheck**](bibcheck/) | `/bibcheck` | Many-agent bibliography audit. Spawns one narrow-focus agent per citation (or one specialist per field) to verify each `.bib` entry against canonical sources — DOI, journal landing page, author working paper. Catches the silent errors a single-agent audit misses as attention decays across long lists: mixed-up entries (title of paper A with authors of paper B), wrong years, journal misattributions. Per-field mode launches each specialist as an isolated `claude -p` subprocess so they cannot peek at each other's conclusions. Outputs a `bibcheck_report.md` and a drop-in `corrected.bib`. [See documentation →](bibcheck/) | | [**New Project**](newproject/) | `/newproject` | Scaffolds a new research project with standard directory structure, CLAUDE.md template, and documented README. [See documentation →](newproject/) | | [**New Book**](newbook/) | `/newbook` | Scaffolds a book-shaped project: `memoir`-based LaTeX skeleton, Palatino body, Gov 2001 palette, voiced-sidebar `\voice{}{}` callouts, one chapter per file, bibliography stub, `CLAUDE.md` with voice and lineage rules, `% SUBSTACK MAP:` placeholders in each chapter, and a chapter-per-file structure that converts cleanly to HTML later. Parallel to `/newproject`. This is the skill that produced *AI Agents and the Research Worker*. 
[See documentation →](newbook/) | diff --git a/skills/wiki-update/README.md b/skills/wiki-update/README.md new file mode 100644 index 0000000..109939d --- /dev/null +++ b/skills/wiki-update/README.md @@ -0,0 +1,124 @@ +# `/wiki-update` — Reference wiki ingest + +**Skill location:** [`.claude/skills/wiki-update/SKILL.md`](../../.claude/skills/wiki-update/SKILL.md) + +--- + +## What This Skill Does + +You drop one or more PDFs into a project's `references/raw/` folder, run `/wiki-update`, and the skill ingests each paper into a structured wiki at `references/wiki/`. For each paper it: + +1. **Auto-detects the best ingest path** (see below) and produces or reuses a structured 11-dimension extract (`_text.md`) with a 200-word plain-English synthesis at the top. +2. **Updates the wiki:** creates new concept pages, appends to existing ones, embeds or references figures, and adds backlinks. Destructive edits to existing pages are returned as diffs for user approval rather than applied silently. +3. **Atomically logs** the ingest in `wiki/log.md` only after wiki edits succeed — failed runs leave nothing partially committed. +4. **Updates BibTeX** by calling `/bib-update` after successful wiki ingest. + +The wiki is shaped by per-project conventions in `references/CLAUDE.md`, which the skill reads first and treats as authoritative. 
+ +--- + +## Auto-Detection: Three Ingest Paths + +The skill picks the best available path per paper, in order: + +| Protocol | Condition | What it does | +|---|---|---| +| **M — Converted markdown** | `read-pdf`'s converter is installed | Runs `convert.py` (or uses its cache) → high-fidelity markdown with tables, figures, equations | +| **E — Cached extract** | `_text.md` already exists | Reads the existing structured extract and writes wiki pages directly | +| **S — Split-PDF pipeline** | Neither above | Splits PDF into 4-page chunks, reads in batches with vision, synthesizes `_text.md` | + +**Protocol M** is the richest path: the marker-based layout converter used by `/read-pdf` produces pipe-syntax tables ready for copy-paste, figure PNGs that are copied directly into `references/wiki/figures/`, and LaTeX equations. It requires a one-time venv install (~500 MB, 1–3 min, handled lazily by `/read-pdf`). After that, conversions are cached by content hash — re-ingesting the same PDF is free. + +**Protocol S** is the zero-install fallback. Tables and figures are still captured — tables via careful reading of the PDF splits, figures via CLIP placeholders that you fill in by manually clipping from the PDF. It costs more tokens than Protocol M. + +The skill reports which protocol each paper was assigned before spawning any subagents. + +`pdftotext` is not a substantive ingest path. It may be used only for narrow pre-flight checks: first-page filename proposals when the converter is unavailable, and metadata/bootstrap checks. Protocol M must read from converted `markdown.md`, and Protocol E must read from the cached `_text.md`. 
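
The priority order in the table above can be sketched as a small selection function. This is only an illustration: how the skill detects an installed converter is not specified here, so it is passed in as a boolean, and the function name `choose_protocol` is hypothetical. The extract naming (`Smith_2024_AER.pdf` next to `Smith_2024_AER_text.md`) follows the layout example in this README.

```python
from pathlib import Path


def choose_protocol(pdf: Path, converter_installed: bool) -> str:
    """Pick "M", "E", or "S" for one PDF, in the table's priority order."""
    if converter_installed:
        return "M"  # converted markdown (or its cache) is preferred
    # Assumed naming: Smith_2024_AER.pdf -> Smith_2024_AER_text.md, same folder.
    extract = pdf.with_name(f"{pdf.stem}_text.md")
    if extract.exists():
        return "E"  # reuse the cached structured extract
    return "S"  # zero-install split-PDF fallback
```

The sketch covers only the initial assignment. A `convert.py` failure on a specific paper would need a second pass that reports the error and re-runs the last two checks.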
+ +--- + +## Layout + +``` +project-root/ +├── CLAUDE.md # research question, data, identification (must be filled in) +└── references/ + ├── CLAUDE.md # wiki conventions (rendered from skill template on first run) + ├── references.bib # BibTeX entries (maintained by /bib-update) + ├── raw/ # immutable source PDFs + per-paper structured extracts + │ ├── Smith_2024_AER.pdf + │ ├── Smith_2024_AER_text.md # written by this skill — reusable cross-session + │ └── raw_build/ # splits cache for Protocol S (never modify) + │ └── split_Smith_2024_AER/ + └── wiki/ + ├── index.md # table of contents — appended on each ingest + ├── log.md # append-only ingest log + ├── figures/ # figure clips (Protocol M copies; Protocol S uses placeholders) + │ └── Smith_2024_AER_fig3.png + └── .md # one per concept; updated by ingest, linked via [[wiki-links]] +``` + +--- + +## Usage + +``` +/wiki-update +/wiki-update "focus on IV strategies and instrument validity" +``` + +The optional focus string applies to this batch in addition to the project's standing context (research question, data, identification strategy, read from `./CLAUDE.md`). + +To regenerate `references.bib` from scratch, run `/bib-update --rebuild-bib` explicitly after ingest. + +--- + +## What Gets Extracted (11 Dimensions) + +For each paper, the structured extract covers: + +1. Research question +2. Audience +3. Method / identification strategy +4. **Target parameter** — the estimand in plain terms (distinct from method and identification assumptions) +5. Data — sources, unit of observation, sample size, time period +6. Statistical methods / specifications — including **key equations verbatim** in LaTeX (Protocol M extracts them from the converter; Protocol S reads them from the PDF text) +7. Findings — key coefficients and standard errors +8. Contributions +9. Replication feasibility +10. 
**Tables** — pipe-syntax markdown, project-relevant tables only (Protocol M gets these from the converter output; Protocol S reads them from splits) +11. **Figures** — Tier A (data figures: structured optical description + image embed for M, structured description for S) or Tier B (schematics: image embed for M, CLIP placeholder for S) + +Plus a `Bibliographic metadata` block at the top (DOI, authors, title, year, venue, venue_type) for the BibTeX step. + +--- + +## Pre-Flight Checks + +- **Lazy scaffolding** — creates `references/raw/`, `references/wiki/`, `references/wiki/figures/`, `references/CLAUDE.md`, `wiki/index.md`, `wiki/log.md` on first invocation. Idempotent on re-runs. +- **Project context check** — refuses to proceed if `./CLAUDE.md` has unfilled placeholder fields. Relevance filtering depends on the research question, data sources, and identification strategy. +- **Non-PDF surfacing** — any non-PDF files in `raw/` are reported to the user before ingest starts. +- **Filename normalization** — proposes `Last_Year_Venue.pdf`-style renames for non-conforming PDFs and asks for batched approval before applying `mv`. +- **Tier classification** — runs per-paper detection (converter cache check → `_text.md` check → splits check) and reports the tier breakdown before spawning any subagents. + +--- + +## Per-Paper Subagent Isolation + +Each paper is ingested by a dedicated subagent so that converted markdown and extracted figures don't accumulate in the main session's context across papers. The main session orchestrates: spawning subagents, surfacing user-approval prompts, journal-and-rollback for wiki pages on failure, and writing the log entry only after wiki edits succeed. + +--- + +## What This Skill Does NOT Do + +- **No PDF download.** Drop them into `references/raw/` first. 
+- **No silent converter fallback.** If `convert.py` errors on a specific paper, the skill reports the error and falls through to Protocol E or S for that paper rather than substituting `pdftotext` output silently. +- **No `pdftotext` summaries.** Once a paper is assigned to Protocol M or E, `pdftotext` must not be used to read, summarize, validate, or supplement the paper's substantive content. + +--- + +## Acknowledgments + +Inspired by Andrej Karpathy's [LLM Wiki](https://karpathy.bearblog.dev/llm-wiki/) pattern — a structured, interlinked knowledge base maintained by an LLM, curated by a human. The Tier A / Tier B figure protocol, the project-relevance gate, and the substantive-change rule are workflow refinements specific to academic-paper ingest at scale. + +The local-conversion path (Protocol M) relies on [marker](https://github.com/VikParuchuri/marker), the open-source layout-aware PDF parser used by `/read-pdf`.