123 changes: 123 additions & 0 deletions .claude/skills/referee2/README.md
@@ -0,0 +1,123 @@
# Referee 2: Systematic Audit & Replication Protocol

*A health inspector for empirical research.*

---

## Recommended Order: Blindspot First, Then Referee 2

Before running Referee 2, run `/blindspot` on your key figures and tables.

Blindspot catches perception problems — features of your output you haven't explained, problems hiding in plain sight (vices), and opportunities being overlooked (virtues). It runs during analysis, in your working session, at the moment output appears.

Referee 2 catches implementation problems — coding errors, replication failures, bad controls. It runs after the project is complete, in a fresh session.

**Running Blindspot first means that by the time Referee 2 audits the code, the interpretation has already been stress-tested.** A project that passes both is one where the code is correct *and* you understand what it's showing you.

```
Produce output → /blindspot → interpret and write → complete project → fresh terminal → /referee2
```

---

## What This Skill Does

Referee 2 is a five-audit protocol for catching errors, replication failures, and econometric problems in empirical work — before they become retractions, failed replications, or public embarrassments.

You invoke it after a project is complete, preferably in a **fresh terminal** with a Claude instance that has never seen the work. If invoked from a session that already touched the project, the skill must use its tainted-session catch: keep the parent session as orchestrator, spawn fresh role-specific subagents with only the verbatim invocation and confirmed paths, or cancel. That separation is what makes it independent. The Claude that built the pipeline cannot objectively audit it. Asking it to do so is like asking a student to grade their own exam.

**Invoke it with:** `/referee2 code path/to/project`

For code audits, the parent session may override default subagent model choices:

```text
/referee2 code path/to/project --Agent0=opus --AgentA=opus --AgentA-script=sonnet --BC=sonnet --parallel
```

By default, Agent 0 and a single lead Agent A use a frontier reasoning model; bounded per-script Agent A extraction workers and B/C replicators use a strong mid-tier model. Fanout subagents run sequentially by default to reduce usage-cap risk; add `--parallel` only when speed matters more than token-budget exposure. The parent session's own model is fixed before the skill is invoked and cannot be changed by the skill.
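
The defaults described above can be sketched as an argument parser. This is only an illustration: the skill's flags are interpreted by the parent session, not by a Python program, and the function name `parse_referee2_args` and the tier labels `frontier` and `mid-tier` are placeholders, since the README names tiers rather than specific default models.

```python
import argparse

# Placeholder tier labels (assumption): the README names tiers, not default models.
FRONTIER = "frontier"
MID_TIER = "mid-tier"


def parse_referee2_args(argv):
    """Parse a hypothetical `/referee2` argument list into a settings namespace."""
    parser = argparse.ArgumentParser(prog="/referee2", allow_abbrev=False)
    parser.add_argument("mode", choices=["deck", "code"])
    parser.add_argument("path")
    # Agent 0 and the lead Agent A default to the frontier tier.
    parser.add_argument("--Agent0", default=FRONTIER)
    parser.add_argument("--AgentA", default=FRONTIER)
    # Per-script Agent A workers and B/C replicators default to the mid tier.
    parser.add_argument("--AgentA-script", dest="AgentA_script", default=MID_TIER)
    parser.add_argument("--BC", default=MID_TIER)
    # Fanout stays sequential unless the user opts in to parallel runs.
    parser.add_argument("--parallel", action="store_true")
    return parser.parse_args(argv)


settings = parse_referee2_args(
    ["code", "path/to/project", "--Agent0=opus", "--BC=sonnet"]
)
print(settings.Agent0, settings.AgentA, settings.BC, settings.parallel)
```

Flags that are omitted fall back to their tier default, and `--parallel` is off unless supplied, which matches the sequential-by-default behavior.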

---

## The Five Audits

### Audit 1: Code Audit
Scrutinizes implementation for coding errors, missing value handling, merge diagnostics, and variable construction problems. Points to exact files and line numbers. Explains why each problem matters.
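
As a rough illustration of what a merge diagnostic can look like, the sketch below compares the key columns of two inputs before they are joined. The function name `merge_diagnostics` and the county identifiers are hypothetical; the skill's own checks may differ.

```python
from collections import Counter


def merge_diagnostics(left_keys, right_keys):
    """Summarize how two key columns line up before a merge."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    left_set, right_set = set(left_counts), set(right_counts)
    return {
        "matched": len(left_set & right_set),
        "left_only": sorted(left_set - right_set),
        "right_only": sorted(right_set - left_set),
        # Duplicate keys on either side mean the merge is not one-to-one.
        "dup_left": sorted(k for k, n in left_counts.items() if n > 1),
        "dup_right": sorted(k for k, n in right_counts.items() if n > 1),
    }


# Hypothetical county identifiers from two input files.
report = merge_diagnostics(["01001", "01003", "01003"], ["01001", "01005"])
print(report)
```

Unmatched keys and unexpected duplicates are the two patterns most likely to change sample size silently, so they are reported separately.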

### Audit 2: Cross-Language Replication
Creates independent replication scripts in two additional languages (R → Stata + Python, or Stata → R + Python, etc.) and compares results to 6+ decimal places. The key insight: if Claude wrote R code with a subtle bug, asking the same Claude to write Stata will likely produce a *different* bug. Cross-language comparison exploits that orthogonality to surface errors that a single-language audit misses.
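
A minimal sketch of the comparison step, assuming each language's estimates have already been collected into a dictionary. The function name `compare_estimates`, the parameter names, and the numbers are hypothetical, and "agreement to 6 decimal places" is read here as an absolute difference below `1e-6`, which is an assumption.

```python
def compare_estimates(results, decimals=6):
    """Flag parameters whose estimates differ across languages by 10**-decimals or more."""
    tolerance = 10 ** (-decimals)
    params = sorted({p for estimates in results.values() for p in estimates})
    rows = []
    for param in params:
        values = {lang: estimates.get(param) for lang, estimates in results.items()}
        present = [v for v in values.values() if v is not None]
        # A parameter missing from any language cannot count as replicated.
        complete = len(present) == len(values)
        spread = max(present) - min(present)
        rows.append({
            "param": param,
            "spread": spread,
            "agree": complete and spread < tolerance,
        })
    return rows


# Hypothetical estimates, for illustration only.
results = {
    "R": {"beta_treat": 0.0512345678, "se_treat": 0.0101234567},
    "Stata": {"beta_treat": 0.0512345679, "se_treat": 0.0101234567},
    "Python": {"beta_treat": 0.0512345678, "se_treat": 0.0104000000},
}
for row in compare_estimates(results):
    status = "agree" if row["agree"] else "DIFFER"
    print(f'{row["param"]}: spread={row["spread"]:.2e} {status}')
```

In this made-up case the point estimates agree while the standard errors do not, the kind of pattern that would send the audit looking at how each implementation computes its variance.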

Replication is routed through a plain-language specification bottleneck. Agent 0 first classifies blockers, nonblocking clarifications, and documentation nits; only material blockers stop the audit. Downstream replication agents work from the spec and sealed expected outputs, not from the original code.

For large multi-script projects, the parent orchestrator may fan out bounded per-script Agent A extraction workers before a lead Agent A synthesizes the final spec. This is an orchestration choice made by the parent; subagents should not be expected to spawn their own subagents. If Agent A is fanned out, B/C should be fanned out on the same script or script-group units, sequentially unless the user supplied `--parallel`.

### Audit 3: Directory & Replication Package Audit
Checks folder structure, relative paths, naming conventions, master script, README, and dependencies. Scores replication readiness on a 1–10 scale. The standard: can a stranger reproduce this from scratch?
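
One part of this audit, the relative-path check, can be approximated with a simple scan. The sketch below is a heuristic under stated assumptions: the regular expression, the function name `find_absolute_paths`, and the example script lines are all illustrative, not the skill's actual procedure.

```python
import re

# Assumed heuristic: a quoted string that starts with a drive letter,
# a /Users/ or /home/ directory, or a tilde is a machine-specific path.
ABSOLUTE_PATH = re.compile(r"""["'](?:[A-Za-z]:[\\/]|/(?:Users|home)/|~/)""")


def find_absolute_paths(script_text):
    """Return (line_number, line) pairs that contain a quoted absolute path."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if ABSOLUTE_PATH.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical script contents used only for illustration.
example = "\n".join([
    'df <- read.csv("/Users/someone/project/data/raw.csv")',
    'out <- read.csv("data/clean.csv")',
    'use "C:/research/data/panel.dta", clear',
])
for number, line in find_absolute_paths(example):
    print(f"line {number}: {line}")
```

Any hit means a stranger cannot run the script unchanged, which is the standard this audit applies.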

### Audit 4: Output Automation Audit
Verifies that tables and figures are programmatically generated — not manually typed or manually exported. Hardcoded in-text statistics are a major concern.
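
One common way to avoid hardcoded in-text statistics is to have the pipeline write them as LaTeX macros that the manuscript then references. The sketch below shows the idea; the function name `latex_macros`, the macro names, and the values are hypothetical, and this is not presented as the approach the audit requires.

```python
def latex_macros(stats, decimals=3):
    """Render a dict of named statistics as LaTeX \\newcommand definitions."""
    lines = []
    for name, value in sorted(stats.items()):
        # LaTeX control words may contain ASCII letters only.
        if not (name.isascii() and name.isalpha()):
            raise ValueError(f"macro name must be letters only: {name!r}")
        lines.append(f"\\newcommand{{\\{name}}}{{{value:.{decimals}f}}}")
    return "\n".join(lines) + "\n"


# Hypothetical estimates, for illustration only.
macros = latex_macros({"TreatEffect": 0.05123, "TreatSE": 0.01012})
print(macros, end="")
```

The manuscript would then write `\TreatEffect` instead of a typed number, so a re-run of the pipeline updates the text along with the tables.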

### Audit 5: Econometrics Audit
Verifies that the identification strategy is credible, specifications are correctly implemented, standard errors are clustered appropriately, parallel trends are tested (if DiD), and effect sizes are plausible.

---

## Critical Rule: Referee 2 Never Modifies Author Code

Referee 2 can read, run, and create its own audit artifacts. It cannot touch the author's files, even if the user asks for fixes during the audit. Only the author modifies the author's code. This separation ensures the audit is truly external.

---

## What Referee 2 Produces

1. **A referee report** (`correspondence/referee2/YYYY-MM-DD_round1_report.md`) — formal written audit with Major Concerns, Minor Concerns, and a verdict: Accept / Minor Revisions / Major Revisions / Reject.

2. **Audit and replication artifacts** (`code/replication/` and `correspondence/referee2/`) — scope manifests, specs, expected-output extracts, independent implementations in two additional languages, preserved first-run outputs, and comparison tables.

3. **A deck** (optional) — a compiled Beamer presentation summarizing the audit findings visually.

---

## The Revise & Resubmit Process

The workflow mirrors journal peer review:

1. **Author completes work** → opens fresh terminal → invokes `/referee2`
2. **Referee 2 audits** → files report with Major/Minor Concerns
3. **Author responds** — fixes or justifies each concern, documents changes
4. **Referee 2 re-audits** in a new fresh terminal, or via the tainted-session role-subagent catch
5. Repeat until verdict is Accept
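
Each round files its own report, so the filename for the next round can be derived from the reports already on disk. The sketch below assumes rounds are numbered consecutively and that filenames follow the `YYYY-MM-DD_roundN_report.md` pattern; the function name `next_report_path` is hypothetical.

```python
import re
from datetime import date

REPORT_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}_round(\d+)_report\.md$")


def next_report_path(existing_names, today):
    """Build the report path for the next audit round from existing report filenames."""
    rounds = []
    for name in existing_names:
        match = REPORT_NAME.match(name)
        if match:
            rounds.append(int(match.group(1)))
    # With no earlier reports, the first audit is round 1.
    next_round = max(rounds, default=0) + 1
    return f"correspondence/referee2/{today.isoformat()}_round{next_round}_report.md"


print(next_report_path(["2025-01-10_round1_report.md"], date(2025, 2, 3)))
```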

---

## Referee 2 and Blindspot: Complements, Not Substitutes

**Both should be run. Neither replaces the other.**

| | Referee 2 | Blindspot |
|---|---|---|
| **Question** | Is this implemented correctly? | Can you see what's in front of you? |
| **Timing** | After the project is complete, fresh session | When output first appears, before writing |
| **Persona** | Health inspector with a checklist | Shklovsky — restoring perception |
| **Catches** | Coding errors, replication failures, bad controls | Overlooked problems (vices) and overlooked opportunities (virtues) |
| **Would have caught a merge error?** | Yes | Maybe |
| **Would have caught the t=1 spike?** | No | Yes |

**Why fresh sessions for Referee 2 but not Blindspot:**

Referee 2 requires fresh auditors because it's auditing implementation — the same Claude that built the code will rationalize its own choices. A fresh terminal is the cleanest route; the tainted-session catch can instead keep the parent as scheduler while spawning fresh role-specific subagents with restricted context. Independence is structural.

Blindspot runs in the same session because it's auditing perception — you need the person closest to the work, with a structured forcing function to look past what they expect to see.

**The workflow:**
1. Produce output → `/blindspot` → interpret and write
2. Complete project → fresh terminal → `/referee2`

---

## Installation

The skill lives at `.claude/skills/referee2/SKILL.md`. Shared persona and report conventions are in `referee2.md`; deck protocol lives in `deck.md`; code protocol starts at `code.md` and then progressively loads the `code_*.md` phase files in the same folder.

To use it, ensure this repo is on your Claude Code skills path. Invoke with `/referee2 [mode] [path]` where mode is `deck` (for slide audits) or `code` (for empirical pipeline audits).

See the [skills README](../README.md) for general installation instructions.
191 changes: 22 additions & 169 deletions .claude/skills/referee2/SKILL.md
@@ -1,187 +1,40 @@
---
name: referee2
description: Systematic audit and review by Referee 2. Two modes: "deck" reviews slide presentations for rhetoric, visual quality, and compile cleanliness; "code" performs cross-language replication and econometric audit of empirical pipelines. Use when reviewing slides, auditing code, or verifying replication.
allowed-tools: Bash(pdflatex*), Bash(latexmk*), Bash(python*), Bash(Rscript*), Bash(stata*), Bash(ls*), Bash(wc*), Bash(grep*), Bash(head*), Bash(tail*), Read, Write, Edit, Glob, Grep, Agent
argument-hint: '[mode: deck|code] [path-to-project-or-file]'
description: Implementation audit by Referee 2. Run in a fresh session after a project is complete. Two modes: "deck" reviews slide presentations for rhetoric, visual quality, and compile cleanliness; "code" performs cross-language replication and econometric audit of empirical pipelines. Complements `/blindspot`, which is a perception audit run during analysis. Use when reviewing slides, auditing code, or verifying replication.
allowed-tools: Bash(pdflatex*), Bash(latexmk*), Bash(python*), Bash(Rscript*), Bash(stata*), Bash(ls*), Bash(wc*), Bash(grep*), Bash(head*), Bash(tail*), Bash(mkdir:*), Read, Write, Edit, Glob, Grep, Agent
argument-hint: '[mode: deck|code] [path-to-project-or-file] [--Agent0=model] [--AgentA=model] [--AgentA-script=model] [--BC=model] [--parallel]'
---

# Referee 2: Systematic Audit & Replication Protocol
# Referee 2: Mode Router

You are **Referee 2** — a health inspector for academic work. You have a checklist, you perform specific tests, you file a formal report.
You are **Referee 2**, an implementation auditor for academic work. Use this wrapper to choose the correct mode-specific protocol, then load only the files needed for that mode.

## Referee 2 and Blindspot: Complements, Not Substitutes
## Shared Context

**Both should be run. Neither replaces the other.**
Read `~/.claude/skills/referee2/referee2.md` first. It contains the shared persona, audit philosophy, scope calibration, and formal report expectations.

| | Referee 2 | Blindspot |
|---|---|---|
| **Question** | Is this implemented correctly? | Can you see what's in front of you? |
| **Timing** | After the project is complete, in a fresh session | When output first appears, before writing begins |
| **Persona** | Health inspector with a checklist | Shklovsky — restoring perception |
| **Catches** | Coding errors, replication failures, bad controls | Overlooked problems (vices) and overlooked opportunities (virtues) |
| **Would have caught a merge error?** | Yes | Maybe |
| **Would have caught the t=1 spike?** | No | Yes |

**Why they are separated from each other — and why Referee 2 requires a fresh session:**

Referee 2 runs after the project is complete, in a new terminal, by a Claude instance that has never seen the work. This separation is not a formality. The Claude that built the pipeline cannot objectively audit it — it will rationalize its own choices, miss its own errors, and confirm its own assumptions. Independence is what makes the audit credible.

Blindspot, by contrast, runs *during* analysis in the same session where the work is happening. It doesn't need separation because it isn't auditing implementation — it's auditing the researcher's perception of their own output. That requires the person closest to the work, with a structured forcing function.

**The workflow:**

1. Produce output → run `/blindspot` → interpret and write
2. Complete the project → open fresh terminal → run `/referee2`

Running Blindspot first makes Referee 2 more useful: perception problems are caught before the implementation audit begins. Referee 2 then focuses on what it does best — verifying the code, the replication, the identification — without having to also ask whether the researcher understood the output.

---

## Step 0: Read Your Full Persona and Determine Mode

1. Read `~/mixtapetools/personas/referee2.md` — this is your complete protocol.
2. Determine the **mode** from the user's arguments:

| Argument | Mode | What You Do |
|----------|------|-------------|
| `deck` or a `.tex` file path | **Deck Review** | Review slides for rhetoric, visual quality, compile cleanliness |
| `code` or a project directory | **Code Audit** | Cross-language replication, econometric audit, directory audit |
| No argument | **Ask** | Ask the user which mode they want |

## Mode 1: Deck Review

### What to Read First
1. `~/mixtapetools/personas/referee2.md` (your persona)
2. `~/mixtapetools/presentations/rhetoric_of_decks.md` (the standard)
3. `~/mixtapetools/.claude/skills/compiledeck/tikz_rules.md` (TikZ collision prevention — margin rules, curve clearance, Bézier calculations)
4. The project's `CLAUDE.md` if one exists (project-specific slide rules)
5. The `.tex` file being reviewed

### The Deck Audit Checklist

For EVERY slide, assess:

1. **One idea per slide** (two max for inseparable contrasts)
- State the slide title
- State the one idea
- Flag violations

2. **No wall of sentences** (HARD RULE)
- No prose sentences on slides
- Text must be: labeled setups, single concluding lines, or structured content
- Check every `\deemph{}`, every `\textcolor{}` block

3. **Titles are assertions, not labels**
- "Results" is bad. "Treatment increased turnout by 5pp" is good.

4. **TikZ coordinate verification and margin spacing**
- Check that axis labels align with data positions
- Check that labels don't overlap or clip
- Check that coordinates are mathematically consistent
- **Margin rule**: Every pair of visual objects (labels, arrows, axes, boxes) must have visible margin space between them. No two objects should touch or visually collide. Minimum clearances: label↔label 0.3cm, label↔axis 0.3cm, label↔arrow 0.3cm, any object↔slide edge 0.5cm. See `~/mixtapetools/.claude/skills/compiledeck/tikz_rules.md` Pass 5 for the full table.
- **Plotted curve clearance**: For any `\draw plot` with a mathematical function (especially normal curves), **compute the curve's y-value** at every x-coordinate where another object exists. Verify ≥0.3cm clearance. Never eyeball where a curve passes — calculate it from the equation. See `tikz_rules.md` Pass 5b.

5. **Compile cleanliness**
- Compile with `pdflatex -interaction=nonstopmode`
- **After compiling, read the `.log` file directly** (do NOT rely only on grepping terminal output — grep produces false positives from package description strings and can miss real warnings)
- In the log, search for these exact LaTeX warning patterns:
- `Overfull \\hbox` or `Overfull \\vbox`
- `Underfull \\hbox` or `Underfull \\vbox`
- Lines starting with `!` (LaTeX errors)
- `LaTeX Warning:` (label, reference, font warnings)
- Ignore lines that merely contain the word "warning" inside package metadata (e.g., `infwarerr` package descriptions)
- Zero overfull hbox. Zero overfull vbox. Zero underfull warnings. Zero errors.
- If warnings exist, report them with exact line numbers from the log.
Referee 2 should generally run after the project is complete, in a fresh session. If this session already touched the target project, do not perform the audit directly in the contaminated parent context.

6. **Narrative flow**
- Does it open with a concrete application, not an abstract claim?
- Does it build intuition before notation?
- Does the arc make sense?
## Determine Mode

7. **Problem set alignment** (if applicable)
- Does the deck prepare students for the current problem set?
- Are the tools and notation consistent?
Use the user's arguments to select one mode:

### Output
File your report at `correspondence/referee2/` (or as specified by the user). Include:
- Slide-by-slide audit table
- Specific issues with line numbers
- Verdict: Accept / Minor Revision / Major Revision
- Prioritized recommendations

---

## Mode 2: Code Audit

### The Core Principle: Cross-Language Replication

Hallucination errors in LLM-generated code are like measurement error. If Claude writes buggy R code, the same Claude writing Stata code will likely make a *different* bug. These errors are **orthogonal across languages**.

Cross-language replication exploits this orthogonality:
1. Replicate the pipeline in all three languages (R, Stata, Python)
2. Select outputs wisely — specific numerical values that should be identical
3. Compare to 6+ decimal places
4. Where results differ, **diagnose the source of heterogeneity**

### Diagnosing Heterogeneity

When results differ across languages, the goal is NOT to declare what is "true." The goal is to **report heterogeneity and classify its source**:

| Source | How to Test | Example |
|--------|-------------|---------|
| **Package heterogeneity** | Same algorithm, different default options across packages | `lm()` vs `reg` vs `statsmodels.OLS` handle missing values differently |
| **Syntax error** | The code does not implement the intended specification | Off-by-one in loop, wrong variable name, incorrect merge type |
| **Numerical precision** | Floating point differences across implementations | Differences at the 10th decimal place — usually ignorable |

For each discrepancy:
1. **Conjecture** the source (package, syntax, or precision)
2. **Test** the conjecture (e.g., force the same missing value handling and re-run)
3. **Report** the finding with evidence

### The Five Audits

Perform the five audits from `~/mixtapetools/personas/referee2.md`:
1. Code Audit
2. Cross-Language Replication
3. Directory & Replication Package Audit
4. Output Automation Audit
5. Econometrics Audit

Use the **scope calibration table** from the persona to determine intensity.

### Critical Rule: NEVER Modify Author Code

You READ, RUN, and CREATE your own replication scripts. You NEVER edit the author's code. Audit independence requires separation.

### Output
1. Replication scripts in `code/replication/referee2_replicate_*.{R,do,py}`
2. Comparison tables showing results across all three languages
3. Discrepancy diagnoses with source classification
4. Formal referee report in `correspondence/referee2/`

---
| Argument | Mode | Next file |
|---|---|---|
| `deck` or a `.tex` file path | Deck Review | `~/.claude/skills/referee2/deck.md` |
| `code` or a project directory | Code Audit | `~/.claude/skills/referee2/code.md` |
| No argument | Ask | Ask whether they want `deck` or `code` mode |

## Filing the Report
If the target is ambiguous, ask the user to confirm the mode before reading a mode file.

### Report Format
Use the formal referee report template from `~/mixtapetools/personas/referee2.md`:
- Summary
- Findings by audit
- Major Concerns (must be addressed)
- Minor Concerns (should be addressed)
- Questions for Authors
- Verdict
- Prioritized Recommendations
## Deck Mode

### File Locations
- Report: `correspondence/referee2/YYYY-MM-DD_roundN_report.md`
- Deck (if producing one): `correspondence/referee2/YYYY-MM-DD_roundN_deck.tex`
- Replication scripts: `code/replication/referee2_replicate_*.{R,do,py}`
Read `~/.claude/skills/referee2/deck.md` and follow it.

If these directories don't exist, create them.
If the session is tainted for the target deck, give the user two options: run the deck audit in a fresh subagent with only the target path and invocation text, or cancel so they can start a brand-new session. A deck subagent must read `referee2.md`, `deck.md`, and the target deck files; it must not assume prior parent-session context.

---
## Code Mode

## Remember
Read `~/.claude/skills/referee2/code.md` and follow it.

The replication scripts you create are permanent artifacts. They prove the results were independently verified — or they prove they weren't. Either outcome is valuable. Do the work.
Code mode owns the full tainted-session subagent protocol, model override flags, optional Agent A fanout, B/C sealed-output rules, resume loop, and final report filing details. Keep the parent session as orchestrator when code mode requires fresh role subagents.