55 changes: 55 additions & 0 deletions .claude/skills/wiki-update/README.md
@@ -0,0 +1,55 @@
# Wiki-Update (`/wiki-update`)

`/wiki-update` ingests new PDFs from a project's `references/raw/` folder into the project's literature wiki. It summarizes each paper through the lens of the project's research focus, writes or updates wiki pages, records completed ingests, and refreshes BibTeX metadata.

The executable protocol lives in [`SKILL.md`](SKILL.md). This README is the human overview.

## When To Use It

Use `/wiki-update` after adding one or more PDFs to `references/raw/`.

Natural-language triggers include:

- "ingest new references"
- "update the wiki"
- "process the new papers I added"

The skill is designed to be safe to re-run. Completed papers are identified from the wiki log; unfinished papers are rediscovered and retried.
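The re-run behavior can be pictured with a small Python sketch. It assumes that a completed ingest leaves the PDF's basename somewhere in the log text; the actual log format is defined in `SKILL.md`, and `pending_pdfs` is a hypothetical name used only for illustration.

```python
import re
from pathlib import Path


def pending_pdfs(raw_dir: Path, log_path: Path) -> list[str]:
    """Return basenames of PDFs in raw_dir that the log does not mention.

    Assumption: a finished ingest leaves the basename in the log. The real
    log format lives in SKILL.md and may differ.
    """
    log_text = log_path.read_text(encoding="utf-8") if log_path.exists() else ""
    pending = []
    for pdf in sorted(raw_dir.glob("*.pdf")):
        name = pdf.stem
        # Require a non-word boundary on both sides so "smith_2020" is not
        # treated as done just because "smith_2020b" was logged.
        pattern = rf"(?<![\w-]){re.escape(name)}(?![\w-])"
        if not re.search(pattern, log_text):
            pending.append(name)
    return pending
```

Because the check is derived from the log rather than from a separate state file, a paper whose ingest was interrupted before logging simply shows up as pending again.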

## What It Expects

- `references/raw/` for source PDFs.
- `references/wiki/` for concept pages and the ingest log.
- `references/CLAUDE.md` for wiki conventions.
- A project root `CLAUDE.md` with the research question, data sources, and identification strategy filled in.

On first run, the skill can scaffold the references wiki structure. It will not invent missing project context.
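A quick pre-flight check of the expected layout might look like the following sketch. The path list mirrors the bullets above; `missing_inputs` is a hypothetical helper, and a missing `references/` entry is not necessarily an error because the skill can scaffold that structure on first run.

```python
from pathlib import Path

# Paths named in "What It Expects", relative to the project root.
EXPECTED = (
    "references/raw",
    "references/wiki",
    "references/CLAUDE.md",
    "CLAUDE.md",
)


def missing_inputs(project_root: Path) -> list[str]:
    """List expected paths that do not exist yet under project_root."""
    return [rel for rel in EXPECTED if not (project_root / rel).exists()]
```

This only tests for existence. Whether the root `CLAUDE.md` actually has the research question and identification strategy filled in still needs a human or the skill itself to judge.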

## What It Does

- Finds new PDFs and proposes filename normalization.
- Reads each paper in isolated subagents to avoid PDF image bloat and whole-file markdown reads in the main session.
- Reuses existing `_text.md` extracts or PDF splits when available.
- For marker-converted PDFs, writes a neutral `_text.md` first, then runs a separate project-wiki synthesis pass.
- Applies a project-context relevance filter so important material receives full treatment and less relevant material gets concise page-referenced notes.
- Writes wiki pages atomically per paper, then logs completion only after edits succeed.
- Runs the BibTeX update cascade after ingestion.

## Boundaries

`/wiki-update` owns the project-wiki lifecycle. `/read-pdf` owns standalone paper reading, including the `/read-pdf --split` fallback. The two skills share the same batching idea, but `/wiki-update` uses a non-interactive subagent flow because a per-batch confirmation gate would deadlock inside an ingest subagent.

For exact tier rules, destructive-edit handling, filename checks, log format, and BibTeX behavior, read [`SKILL.md`](SKILL.md).

## Related Skills

- `/newproject` — creates the project structure this skill expects.
- `/read-pdf` — standalone paper reading and reusable `_text.md` extraction.
- `/read-pdf --split` — standalone batched vision reading for individual papers.
- `/bib-update` — refreshes `references/references.bib` from extracted metadata.

---

The conceptual foundation for this skill is owed to [Andrej Karpathy's LLMwiki concept](https://x.com/karpathy). `/wiki-update` operationalizes that idea for empirical-economics workflows.

This skill originated in [Scott Cunningham](https://github.com/scunning1975/MixtapeTools)'s MixtapeTools repository.
291 changes: 291 additions & 0 deletions .claude/skills/wiki-update/SKILL.md

Large diffs are not rendered by default.

149 changes: 149 additions & 0 deletions .claude/skills/wiki-update/common.md
@@ -0,0 +1,149 @@
# Common protocol fragments — wiki-update subagent

These sections are shared across Protocols M, E, and S. The main session passes this file by path into every per-paper subagent prompt, alongside exactly one of `protocol_m.md`, `protocol_e.md`, or `protocol_s.md`.

---

## `_text.md` structure

Protocols that synthesize `_text.md` (Protocol S and the read-pdf fanout synthesis used by Protocol M) use this layout:

```markdown
## Bibliographic metadata
doi: <10.xxxx/yyyy if found, else null>
authors: [LastName1, LastName2, ...]
title: <verbatim title>
year: <year>
venue: <journal/WP series/etc., verbatim>
venue_type: journal | working_paper | book_chapter | other

## Plain-English synthesis
[~200 words, see below]

## 1. Research question
...
## 2. Audience
...
[continue through dimension 12]
```
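For downstream tooling, the metadata block can be read back with a few lines of Python. This is a minimal sketch that assumes the block uses exactly the `key: value` lines shown in the layout; `parse_bib_block` is a hypothetical name, and values are kept as strings rather than being validated.

```python
import re


def parse_bib_block(text: str) -> dict[str, object]:
    """Parse the key: value lines under '## Bibliographic metadata'."""
    match = re.search(
        r"^## Bibliographic metadata\n(.*?)(?=^## |\Z)", text, re.M | re.S
    )
    if match is None:
        return {}
    fields: dict[str, object] = {}
    for line in match.group(1).splitlines():
        # Split on the first colon only, so titles and DOIs may contain colons.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue
        if value in ("", "null"):
            fields[key] = None  # the layout records null for missing fields
        elif value.startswith("[") and value.endswith("]"):
            fields[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            fields[key] = value
    return fields
```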

## Plain-English synthesis block

Hard cap: ~200 words. No jargon. Cover:

- Research question (1 sentence)
- Motivation / why it matters (1–2 sentences)
- What they estimate and how, in plain terms (2–3 sentences)
- What they found (1–2 sentences)
- The take-away — what someone should walk away believing or doing differently (1 sentence)

This block is the answer to "what's this paper about?" for someone who will not read the rest. Anyone with a college degree should be able to read it without a glossary. If you find yourself writing "endogeneity" or "LATE" or "first-stage F-stat," rewrite in plainer terms.
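A rough lint for this block could be sketched as below. The 200-word cap and the three jargon terms come from the text above; treating them as a fixed list is an assumption, and `check_synthesis` is a hypothetical helper, not part of the skill.

```python
import re

WORD_CAP = 200
# Example terms named in the text; a real list would be longer.
JARGON = ("endogeneity", "LATE", "first-stage F-stat")


def check_synthesis(block: str, cap: int = WORD_CAP) -> list[str]:
    """Return human-readable problems found in a plain-English synthesis."""
    problems = []
    n_words = len(block.split())
    if n_words > cap:
        problems.append(f"{n_words} words exceeds cap of {cap}")
    for term in JARGON:
        # All-caps acronyms stay case-sensitive so "LATE" does not flag "late".
        flags = 0 if term.isupper() else re.IGNORECASE
        if re.search(rf"(?<!\w){re.escape(term)}(?!\w)", block, flags):
            problems.append(f"jargon: {term}")
    return problems
```

A keyword list can only catch the obvious cases; the underlying test remains whether a reader without a glossary can follow the paragraph.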

## Structured-extraction dimensions

1. **Research question** — what the paper asks and why it matters
2. **Audience** — sub-community of researchers who care
3. **Method / identification strategy** — how they answer the question
4. **Target parameter** — the estimand in plain terms (e.g., "ATE of schooling on log wages, conditional on age and state-by-year FE"). Distinct from method and identification assumptions.
5. **Data** — sources, unit of observation, sample size, time period
6. **Statistical methods / specifications** — econometric techniques, key specifications, key equations (extract verbatim in LaTeX math mode where available — Protocol M gets these from the converter; Protocol S extracts them from split text)
7. **Findings** — key coefficients and standard errors
8. **Contributions** — what is learned that we didn't know before
9. **Replication feasibility** — data availability, replication archive
10. **Tables (project-relevance gated)** — see Tables protocol below
11. **Figures (project-relevance gated)** — see Figures protocol below
12. **Equations / formal objects** — labeled equations, model primitives, algorithms, propositions, and other formal objects needed to understand or replicate the paper

## Tables protocol (project-relevance gated)

Apply the project-relevance filter. For tables *directly relevant* to the project's research focus, extract the table in machine-readable markdown. For non-relevant tables, give a one-line description with a page reference.

For relevant tables:

```
**Table N:** <verbatim caption> (p. 12)

| Variable | (1) | (2) | (3) |
|---|---|---|---|
| Schooling | 0.087*** | 0.091*** | 0.085*** |
| | (0.012) | (0.013) | (0.011) |
| N | 12,450 | 12,450 | 12,450 |
| R² | 0.34 | 0.36 | 0.38 |

Notes: <verbatim table notes — SE clustering, FE structure, etc.>
```

Preserve column headers verbatim, numerical values verbatim (including SEs in parentheses and significance stars), and table notes verbatim. Pipe-syntax markdown only; no HTML tables. Table notes are part of the table's content — capture them.

*Protocol M advantage:* the converter already produces pipe-syntax tables from the PDF. Extract them with light cleanup rather than re-reading the tables from the PDF.
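One way to sanity-check an extracted table before writing it is to confirm that every row has the same number of cells as the header. The sketch below is illustrative only: `split_row` and `table_problems` are hypothetical names, and escaped pipes (`\|`) inside cells are not handled.

```python
import re


def split_row(line: str) -> list[str]:
    """Split one pipe-syntax row into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def table_problems(lines: list[str]) -> list[str]:
    """Report structural problems in a pipe-syntax markdown table."""
    rows = [split_row(line) for line in lines if line.strip()]
    if len(rows) < 2:
        return ["table needs a header row and a delimiter row"]
    problems = []
    width = len(rows[0])
    if not all(re.fullmatch(r":?-+:?", cell) for cell in rows[1]):
        problems.append("second row is not a |---| delimiter row")
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            problems.append(f"row {number} has {len(row)} cells, expected {width}")
    return problems
```

The check is purely structural. It does not verify that coefficients, standard errors, or stars were copied verbatim, which still requires comparing against the source.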

## Figures protocol (project-relevance gated, two-tier)

Apply the project-relevance filter. Non-relevant figures: one-line description with page reference only.

For relevant figures, classify as Tier A or Tier B using caption text:

- **Tier A — Data figures**: scatter, line, bar, coefplot, histogram, density, time series, RD/event-study plot. The data IS the content.
- **Tier B — Schematic figures**: DAGs, conceptual diagrams, maps, flowcharts, theoretical model schematics. Do NOT attempt optical decomposition. Default to Tier B when uncertain — a structured Tier A block written for a schematic is misleading; a Tier B for a data figure just makes the reader look at the image.
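The "default to Tier B when uncertain" rule can be sketched as a caption keyword check. The keyword lists below are illustrative assumptions drawn from the examples in the two bullets, and `classify_figure` is a hypothetical name; a subagent would normally weigh the full caption rather than rely on keywords alone.

```python
import re

# Illustrative keyword lists based on the tier descriptions above.
DATA_TERMS = (
    "scatter", "line", "bar", "coefplot", "histogram", "density",
    "time series", "event-study", "event study", "regression discontinuity",
)
SCHEMATIC_TERMS = ("dag", "conceptual", "diagram", "map", "flowchart", "schematic")


def _mentions(caption: str, terms: tuple[str, ...]) -> bool:
    """True if any term appears in the caption as a whole word or phrase."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", caption, re.IGNORECASE)
        for term in terms
    )


def classify_figure(caption: str) -> str:
    """Return "A" for data figures, "B" for schematics or unclear captions."""
    if _mentions(caption, SCHEMATIC_TERMS):
        return "B"  # any schematic signal wins, matching the cautious default
    if _mentions(caption, DATA_TERMS):
        return "A"
    return "B"  # uncertain captions fall back to Tier B
```

Checking schematic terms first means a mixed caption lands in Tier B, which matches the asymmetry described above: a wrong Tier A block misleads, while a wrong Tier B only sends the reader to the image.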

**In `_text.md`:**

*Protocol M* — figures are copied to `references/wiki/figures/`. Record:

```
**Figure N:** <verbatim caption> (p. 12)
![<short description>](figures/<basename>_figN.<ext>)
- Type: <for Tier A: scatter / line / bar / etc.>
- X-axis: <variable, units, range> [Tier A only]
- Y-axis: <variable, units, range> [Tier A only]
- Series / panels: <brief list> [Tier A only]
- Key visual finding: <one sentence>
- Annotations: <labels, reference lines, shaded regions> [Tier A only]
- **Figure notes:** <verbatim notes below the figure, if any>
[Tier B: replace the structured block with just: One-liner: <what the figure depicts at a glance>]
```

All wiki source pages and concept pages are written directly under `references/wiki/`, so embedded figure links must be relative to that directory. For Protocol M, use the path printed by `copy_marker_figure.py`, usually `figures/<basename>_figN.jpg` or `figures/<basename>_figN.png`. Do not use `../figures/...` or `../wiki/figures/...` in wiki pages.

*Protocols E and S* — use CLIP placeholders (described in their respective protocol sections).
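The path rule for embedded figures lends itself to a mechanical check. The sketch below assumes every image link in a wiki page is a figure embed, which may not hold if pages also embed external images; `bad_figure_links` is a hypothetical helper.

```python
import re

# Captures the target of a markdown image link: ![alt](target)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def bad_figure_links(markdown: str) -> list[str]:
    """Return image targets that are not of the form figures/<file>."""
    bad = []
    for target in IMAGE_LINK.findall(markdown):
        parts = target.split("/")
        # Wiki pages sit directly under references/wiki/, so a valid link is
        # exactly "figures/<file>" with no parent-directory hops.
        if len(parts) != 2 or parts[0] != "figures" or ".." in parts:
            bad.append(target)
    return bad
```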

## Substantive-change rule

The subagent applies non-destructive edits directly. Destructive edits to existing pages must be returned as proposed unified diffs — not applied.

| Edit | Apply directly? |
|---|---|
| Create new wiki page | Yes |
| Append new section / bullet / paragraph to existing page | Yes |
| Add `[[backlink]]` (inline or under "Related pages") | Yes |
| Update `**Last updated**` date | Yes |
| Append a new source to `**Sources**` | Yes |
| Note a contradiction between sources (additive note) | Yes |
| Reorganize section order (no content lost) | Yes |
| Update `wiki/index.md` (append new entries, edit existing one-liners) | Yes |
| Copy an extracted figure into `references/wiki/figures/` | Yes |
| Edit the `**Summary**` field on an existing page | **Return as diff** |
| Delete any existing line | **Return as diff** |
| Modify the wording of an existing claim | **Return as diff** |
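A conservative approximation of this rule is to treat an edit as destructive whenever a non-blank existing line disappears, ignoring line order and the `**Last updated**` field. The sketch below uses that approximation; it is an assumption, not the skill's actual logic, and it does not special-case the `wiki/index.md` one-liner edits that the table permits. `lost_lines` and `apply_or_propose` are hypothetical names.

```python
import difflib
from collections import Counter


def lost_lines(old: str, new: str) -> list[str]:
    """Lines present in old but absent from new, ignoring order and the date."""
    def keep(line: str) -> bool:
        return bool(line.strip()) and not line.startswith("**Last updated**")

    missing = Counter(filter(keep, old.splitlines()))
    missing -= Counter(filter(keep, new.splitlines()))
    return sorted(missing.elements())


def apply_or_propose(old: str, new: str, page: str) -> tuple[str, str]:
    """Return ("apply", new_text) or ("propose", unified_diff)."""
    if not lost_lines(old, new):
        return ("apply", new)
    diff = "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=page,
            tofile=page,
        )
    )
    return ("propose", diff)
```

Comparing line multisets rather than diffs lets appends and section reordering pass, while a reworded claim or an edited `**Summary**` line registers as a lost line and is returned as a diff.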

## Concept page disambiguation

Before creating a new concept page, check `wiki/index.md` for existing pages covering the same concept — including obvious synonyms (e.g., "RDD" vs "regression discontinuity"). If a near-match exists but you aren't confident, do **not** create a new page; return the ambiguity to the main session as a question for the user.
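The near-match check could be approximated with the standard library's `difflib`. The synonym table here is a hypothetical stand-in seeded with the example pair from the text, and the function takes a plain list of page titles because the layout of `wiki/index.md` is not specified in this file.

```python
import difflib

# Hypothetical synonym table; only the pair mentioned in the text is included.
SYNONYMS = {"rdd": "regression discontinuity"}


def normalize(name: str) -> str:
    """Lowercase, unify hyphens, and expand known synonyms."""
    key = name.strip().lower().replace("-", " ")
    return SYNONYMS.get(key, key)


def candidate_pages(
    concept: str, existing_titles: list[str], cutoff: float = 0.8
) -> list[str]:
    """Return existing titles that may already cover the concept."""
    by_norm = {normalize(title): title for title in existing_titles}
    hits = difflib.get_close_matches(
        normalize(concept), list(by_norm), n=3, cutoff=cutoff
    )
    return [by_norm[hit] for hit in hits]
```

A non-empty result is a prompt to stop and ask, not a decision: under the rule above, an uncertain near-match goes back to the main session as a question for the user.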

## Relevance filtering

Apply "compress, don't omit": sections directly relevant to the project's research focus get full treatment. Less-relevant sections get a one-line description plus page reference. Nothing is fully omitted.

## Subagent return value

```
Pages created: [list]
Pages modified non-destructively: [list with brief description]
Proposed destructive edits: [list of {page, unified diff, rationale}]
Disambiguation questions: [list of {concept, candidate existing pages}]
Proposed log entry: [single line for wiki/log.md]
Pending CLIPs: [list of {target_path, source_paper, page_number, one_liner}]
[Protocol M only] Figures copied: [list of {source_cache_path, dest_wiki_path, paper_figure_label}]
[Protocol M only] Equation fallback used: <true/false>
Errors: [any issues encountered]
```
23 changes: 23 additions & 0 deletions .claude/skills/wiki-update/protocol_e.md
@@ -0,0 +1,23 @@
# Protocol E — Cached Extract

*Input:* path to `references/raw/<basename>_text.md`.

## Step 1: Read the extract

Read `_text.md` in full. Extract the `## Bibliographic metadata` block for the return value. Note any CLIP placeholders in the figures sections.

Protocol E reads only the cached `_text.md` and any figure files it references. Do not re-read the PDF with `pdftotext` to expand or validate the extract.

## Step 2: Write wiki pages

Use the substantive-change rule and relevance filtering in `common.md`.

For figures: if `_text.md` references wiki figure paths that already exist on disk, embed them in wiki pages using the same lightweight format as Protocol M. If `_text.md` contains CLIP placeholders, pass them through to the wiki and aggregate them into the Pending CLIPs return field.

Do **not** re-synthesize or overwrite `_text.md` — it is the canonical extract for this paper.

## Return value additions for Protocol E

```
Pending CLIPs: [list of {target_path, source_paper, page_number, one_liner} — forwarded from _text.md]
```
81 changes: 81 additions & 0 deletions .claude/skills/wiki-update/protocol_m.md
@@ -0,0 +1,81 @@
# Protocol M — Fanout Extract Then Wiki Synthesis

*Input:* path to `manifest.json` produced by `read-pdf/scripts/prepare_substrate.py`, path to the converter cache directory (for figures and `text.md`), canonical paper basename.

Protocol M reads only `manifest.json`, its chunk files, worker notes, cache-local figure files, the neutral `_text.md`, and wiki context files. Do not read the whole converted `markdown.md`. Do not inspect the source PDF with `pdftotext` or any other text extractor for substantive synthesis, even if conversion is slow. If conversion or substrate preparation is still running, wait.

## Step 1: Extract bounded worker notes

The main session spawns one worker agent per `manifest.worker_bundles` entry, sequentially. Each worker receives its bundle excerpt, reads the assigned chunk paths only, follows `~/.claude/skills/read-pdf/fanout_worker.md`, and writes one durable note file under `references/raw/raw_build/<basename>_fanout/worker_notes/`.

If interrupted, completed worker notes are salvageable and should not be deleted.

## Step 2: Synthesize `_text.md`

After all worker notes exist, the main session spawns one read-pdf synthesis agent. The synthesis agent reads `manifest.json` and all worker note files. It uses `~/.claude/skills/read-pdf/fanout_synthesis.md` plus `~/.claude/skills/read-pdf/extraction_schema.md` to produce `references/raw/<basename>_text.md` following the project-neutral `_text.md` structure (bib block, plain-English synthesis, structured dimensions, and formal-object inventories). Gap-reread specific chunk files only when worker notes omit a needed table, figure, equation, or result, or leave a claim ambiguous. Write the file, overwriting any prior partial file.

After the synthesis agent returns, cache the neutral extract with:

```bash
python3 ~/.claude/skills/read-pdf/scripts/cache_text.py push "<cache-dir>/markdown.md" "references/raw/<basename>_text.md"
```

This cached extract is project-neutral, so future projects that ingest a PDF with the same hash can reuse it.

For the bib metadata block, use DOI candidates from `manifest.json` and front-matter worker notes. Extract authors, title, year, and venue from the front-matter chunks and worker notes. Record null for any field not found. Do not read the whole `markdown.md` for metadata.

The read-pdf synthesis agent must not read project wiki pages, project context files, citation-overlap JSON, or downstream wiki prompts. It writes only `_text.md`.

## Step 3: Write project wiki pages

After `_text.md` exists, the main session spawns one wiki synthesis agent. It reads:

- `references/raw/<basename>_text.md`
- `references/CLAUDE.md`
- project root `CLAUDE.md`
- current `references/wiki/index.md`
- relevant existing wiki pages
- `references/raw/raw_build/<basename>_citation_overlap.json`, if produced
- `~/.claude/skills/wiki-update/wiki_synthesis.md`
- `~/.claude/skills/wiki-update/common.md`

The wiki synthesis agent must not read worker notes or chunk files unless `_text.md` explicitly marks a gap and the main session approves a targeted recovery read.

## Step 4: Copy and classify relevant figures

For each figure listed in `_text.md`:

1. Identify the paper figure number from surrounding caption text.
2. Apply the project-relevance filter. Non-relevant: one-line description + page ref only; do not copy.
3. For relevant figures:
   - Copy with the deterministic helper, not by hand:
     `python3 ~/.claude/skills/wiki-update/scripts/copy_marker_figure.py <cache-dir>/markdown.md <absolute-project-root>/references/wiki/figures --basename <basename> --figure <M>`
   - Use the helper's printed wiki-relative path in markdown. The helper preserves the source image format and uses a byte-matching extension, so destinations may be `.jpg` or `.png`.
   - Verify copied files exist with `ls references/wiki/figures/<basename>_fig<M>.*`.
   - Classify as Tier A (data figure: scatter, line, bar, coefplot, histogram, density, time series, RD/event-study plot) or Tier B (schematic: DAG, conceptual diagram, map, flowchart, theoretical model). Use the `_text.md` figure description and caption; read the image file only if genuinely needed for wiki writing.
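The `ls` verification in the steps above has a direct Python equivalent, sketched here for cases where the check runs inside a script. `copied_figure` is a hypothetical name; the filename pattern `<basename>_fig<M>.<ext>` is taken from the text.

```python
import glob
from pathlib import Path
from typing import Optional


def copied_figure(figures_dir: Path, basename: str, figure: int) -> Optional[Path]:
    """Return the copied file for one figure, or None if nothing matches.

    Mirrors `ls <figures_dir>/<basename>_fig<M>.*`; the extension is left
    open because the helper may write either .jpg or .png.
    """
    # Escape the basename so characters like "[" are matched literally.
    pattern = f"{glob.escape(basename)}_fig{figure}.*"
    matches = sorted(figures_dir.glob(pattern))
    return matches[0] if matches else None
```

The dot in the pattern matters: it keeps figure 2 from matching `_fig21.png`.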

## Step 5: Wiki figure embeds

Use the substantive-change rule and relevance filtering in `common.md`.

For relevant figures embedded in wiki concept pages, use this format regardless of Tier A/B:

```markdown
**Figure N:** <verbatim caption> (p. 12)

![<short description>](<helper-printed-path>)

- Key visual finding: <one sentence — what the eye sees / the point of the figure>
- **Figure notes:** <verbatim notes printed below the figure in the paper, if any>
```

All wiki pages live directly under `references/wiki/`. Figure links must use the helper-printed path, e.g. `figures/<basename>_figN.jpg` or `figures/<basename>_figN.png`, never `../figures/...`.

The Tier A/B distinction lives in `_text.md` only (full optical decomposition for Tier A; schematic one-liner for Tier B). Wiki pages use the same lightweight embed format for all figures.

## Return value additions for Protocol M

```
Figures copied: [list of {source_cache_path, dest_wiki_path, paper_figure_label}]
Equation fallback used: <true/false> (with count and any "[unreadable equation]" instances if true)
```
52 changes: 52 additions & 0 deletions .claude/skills/wiki-update/protocol_s.md
@@ -0,0 +1,52 @@
# Protocol S — Split-PDF Pipeline

*Input:* absolute path to the PDF, absolute path to the splits directory (`references/raw/raw_build/split_<basename>/`). The main session has already run the splitter — the splits directory is populated with `<basename>_pp<X>-<Y>.pdf` chunks before this subagent is spawned. Do not attempt to split the PDF yourself.

## Step 1: Read splits in batches of 3

Read each split sequentially in batches of 3, without pausing or asking for confirmation. After each batch, append findings to `<splits-dir>/notes.md` under the structured-extraction dimensions in `common.md`, preceded by a batch boundary comment:

```
<!-- batch N: pp X-Y -->
```

If `notes.md` already exists (prior interrupted run), read it first and resume from where it left off — do not overwrite earlier content. `notes.md` is append-mostly and permanent; never delete it.
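Resuming from an interrupted run can be sketched by reading the batch boundary comments back out of `notes.md`. This assumes batches are numbered from 1 in split order and that a marker's presence means the batch was recorded; since the marker precedes the findings, a run interrupted mid-batch may need its last batch re-checked by hand. The function names are hypothetical.

```python
import re

BATCH_MARK = re.compile(r"<!-- batch (\d+): pp (\d+)-(\d+) -->")


def batches(splits: list[str], size: int = 3) -> list[list[str]]:
    """Group split files into consecutive batches of `size`."""
    return [splits[i:i + size] for i in range(0, len(splits), size)]


def remaining_batches(
    splits: list[str], notes_text: str, size: int = 3
) -> list[tuple[int, list[str]]]:
    """Return (batch_number, files) pairs with no marker in notes.md yet."""
    done = {int(m.group(1)) for m in BATCH_MARK.finditer(notes_text)}
    return [
        (number, group)
        for number, group in enumerate(batches(splits, size), start=1)
        if number not in done
    ]
```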

## Step 2: Synthesize `_text.md`

After all splits are read, write `references/raw/<basename>_text.md` from the accumulated `notes.md` content. Follow the `_text.md` structure in `common.md` (bib block, plain-English synthesis, 12 dimensions).

For the bib metadata block: scan the first split for the DOI regex `10\.\d{4,}/\S+`. Extract authors, title, year, and venue from the first-split text. Record null for any field not found.
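The DOI scan might look like this in Python. The regex is the one given above; stripping trailing punctuation is an added assumption, because `\S+` will otherwise swallow a sentence-ending period or closing parenthesis. `find_doi` is a hypothetical name.

```python
import re
from typing import Optional

DOI_RE = re.compile(r"10\.\d{4,}/\S+")


def find_doi(first_split_text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None if absent."""
    match = DOI_RE.search(first_split_text)
    if match is None:
        return None  # the bib block records null in this case
    # Assumption: trailing punctuation belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")
```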

`notes.md` is permanent — do not delete it after writing `_text.md`.

## Step 3: Write wiki pages

Use the substantive-change rule and relevance filtering in `common.md`.

For figures: Protocol S does not have extracted figure images. Use CLIP placeholders for all Tier B figures and for any Tier A data figures that cannot be adequately described in text. A structured Tier A block suffices when the data description is complete; use a CLIP placeholder when it isn't.

CLIP placeholder format in `_text.md`:

```
> **Figure N (CLIP):** <verbatim caption> (p. 12)
> One-liner: <what the figure depicts at a glance>
> ACTION: clip from PDF, save to references/wiki/figures/<basename>_fig<N>.png
```

When a wiki page references a CLIP figure, use a broken image link (it renders as a visible TODO):

```markdown
![<short description>](figures/<basename>_figN.png)
*<verbatim caption> ([<basename>](log.md), p. 12)*
```

All wiki pages live directly under `references/wiki/`. For Protocol S CLIP placeholders, use `figures/<basename>_figN.png` in wiki markdown, never `../figures/...`.

Before writing any CLIP placeholder that references the figures directory, ensure it exists: `mkdir -p references/wiki/figures`.
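Generating the placeholder and ensuring the directory exists can be combined in a short sketch. The placeholder text follows the `_text.md` format shown above; `clip_placeholder` and `ensure_figures_dir` are hypothetical helpers, and `Path.mkdir(parents=True, exist_ok=True)` is the Python counterpart of `mkdir -p`.

```python
from pathlib import Path


def ensure_figures_dir(project_root: Path) -> Path:
    """Create references/wiki/figures if needed and return its path."""
    figures_dir = project_root / "references" / "wiki" / "figures"
    figures_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    return figures_dir


def clip_placeholder(
    basename: str, figure: int, caption: str, page: int, one_liner: str
) -> str:
    """Render a CLIP placeholder block in the _text.md format."""
    target = f"references/wiki/figures/{basename}_fig{figure}.png"
    return (
        f"> **Figure {figure} (CLIP):** {caption} (p. {page})\n"
        f"> One-liner: {one_liner}\n"
        f"> ACTION: clip from PDF, save to {target}\n"
    )
```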

## Return value additions for Protocol S

```
Pending CLIPs: [list of {target_path, source_paper, page_number, one_liner}]
```