Add Python evaluation runner: run a skill N times against fixtures, score accuracy + reproducibility, update leaderboard.json

## Summary

Ship a **Python script** (not a Claude Code skill) that takes a skill name, runs it multiple times **against every configured LLM** using fixtures in `AI4RA/evaluation-data-sets`, and produces two distinct measurements per model:

- **Accuracy** — how close each run's output is to ground truth.
- **Reproducibility** — how consistent the skill is across replicate runs of the same input.

All scoring is deterministic Python (no LLM-as-judge). The script appends one run record per model to the component's `leaderboard.json`, so cross-model performance can be compared over time.

## Invocation

```
python tools/evaluate_skill.py <skill-name> [--replicates 10] [--model <model-id>]
```

- `<skill-name>` (positional, required): the component to evaluate. Must match a folder under `components/` and a folder under `AI4RA/evaluation-data-sets/`.
- `--replicates N` (default **10**): number of independent runs per fixture. Needed to separate accuracy from reproducibility.
- `--model <model-id>` (optional): restrict this invocation to a single model. When omitted, evaluates every model in `.evaluator/models.yaml` for which a provider key is present. This flag exists mainly so CI can fan out to one job per model (see "CI wiring" below).
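
The interface above could be sketched with `argparse` roughly as follows. This is a minimal sketch, not the final CLI: `build_parser` and `positive_int` are hypothetical helper names, and rejecting non-positive replicate counts is an assumption the issue does not state.

```python
import argparse


def positive_int(text: str) -> int:
    """Parse a replicate count, rejecting zero and negatives (assumed rule)."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("--replicates must be >= 1")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="evaluate_skill.py",
        description="Run a skill N times against fixtures and score it.",
    )
    # Positional: must match a folder under components/ and in the dataset repo.
    parser.add_argument("skill_name", metavar="<skill-name>")
    # Independent runs per fixture; the default of 10 comes from the issue text.
    parser.add_argument("--replicates", type=positive_int, default=10)
    # Optional: restrict the run to one model id (used by the CI fan-out).
    parser.add_argument("--model", metavar="<model-id>", default=None)
    return parser
```

Calling `build_parser().parse_args(["some-skill"])` yields `replicates=10` and `model=None`, which matches the defaults described above.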


## Multi-model configuration

The script evaluates the skill against every model listed in a repo-root config file (proposed path: `.evaluator/models.yaml`):

```yaml
models:
  - id: claude-opus-4-7
    provider: anthropic
  - id: claude-sonnet-4-6
    provider: anthropic
  - id: gpt-5
    provider: openai
  - id: gemini-2-5-pro
    provider: google
```

Each provider maps to one API client in the script. At runtime the script picks up credentials from environment variables (one per provider, never per model):

- `ANTHROPIC_API_KEY`
- `OPENAI_API_KEY`
- `GOOGLE_API_KEY`
- (extensible — add a new provider by adding a client + env var name)

Any model whose provider env var is unset is **skipped with a warning**, not a hard failure. This lets contributors run the evaluator against only the providers they have keys for, and lets CI run the full matrix when all secrets are configured.
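
The skip-with-warning rule could look something like the sketch below. It assumes the YAML file has already been parsed into a list of `{"id", "provider"}` dicts (for example with a YAML loader); `select_runnable_models` and `PROVIDER_ENV_VARS` are hypothetical names used only for illustration.

```python
import os
import warnings

# Provider -> credential env var, taken from the list above.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def select_runnable_models(models, env=None, only_model=None):
    """Return the models whose provider key is set; warn about the rest.

    `models` is the parsed `models:` list; `only_model` mirrors `--model`.
    """
    env = os.environ if env is None else env
    runnable = []
    for model in models:
        if only_model is not None and model["id"] != only_model:
            continue
        env_var = PROVIDER_ENV_VARS.get(model["provider"])
        if env_var is None or not env.get(env_var):
            # Skipped, not fatal: contributors may hold keys for one provider only.
            warnings.warn(f"Skipping {model['id']}: no credentials for provider "
                          f"{model['provider']!r}")
            continue
        runnable.append(model)
    return runnable
```

Passing `env` explicitly keeps the function testable without touching the real process environment.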

## CI wiring (separate follow-up)

When this runner is wired into GitHub Actions, use a **matrix strategy with one job per model** — not one monolithic job. This matters:

- Each model's job inherits the default 6-hour GitHub Actions timeout independently. A slow provider can't starve the others.
- A provider outage fails only its own matrix job; the rest still produce results.
- Each job can be scoped to receive only its own provider's secret (principle of least privilege).
- Jobs run in parallel, so wall-clock time drops roughly in proportion to the model count.

Sketch:

```yaml
jobs:
  evaluate:
    strategy:
      fail-fast: false
      matrix:
        include:
          - model: claude-opus-4-7
            secret_name: ANTHROPIC_API_KEY
          - model: claude-sonnet-4-6
            secret_name: ANTHROPIC_API_KEY
          - model: gpt-5
            secret_name: OPENAI_API_KEY
          - model: gemini-2-5-pro
            secret_name: GOOGLE_API_KEY
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
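      # Expressions are not allowed as `env:` keys, so pass the variable name and
      # value separately and export under the provider-specific name in the shell.
      - run: |
          export "${SECRET_NAME}=${SECRET_VALUE}"
          python tools/evaluate_skill.py "$SKILL_NAME" --replicates 10 --model "$MODEL"
        env:
          SKILL_NAME: ${{ inputs.skill_name }}
          MODEL: ${{ matrix.model }}
          SECRET_NAME: ${{ matrix.secret_name }}
          SECRET_VALUE: ${{ secrets[matrix.secret_name] }}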
```

### Trigger policy

The evaluator does not run on every commit to a component. It runs only when a component's MAJOR or MINOR version bumps — **never** on PATCH. Rationale: per the versioning policy in the root README, a PATCH is a wording/clarity fix with no expected behavior change, so re-burning 120+ API calls per model to measure "no change expected" is waste. MAJOR (output contract change) and MINOR (new capability) are the cases where scores can legitimately move.

Mechanics:

- On PR merge / push to `main`, detect which `components/<name>/prompt.md` files changed.
- For each changed component, diff old vs. new `version:` in frontmatter:
  - `X.Y.Z → X.Y.Z+1` (patch bump): **skip** evaluation, log reason, exit clean.
  - `X.Y.Z → X.Y+1.0` (minor bump): **evaluate** that component across all configured models.
  - `X.Y.Z → X+1.0.0` (major bump): **evaluate** that component across all configured models.
  - Prompt changed but version did not: that's a lint error already (existing lint script), so the evaluator does not special-case it — lint fails first.
- Manual re-runs (workflow_dispatch) bypass the policy so a contributor can force an evaluation on demand.

Implementation note: the matrix of (changed component × model) should be computed in a setup job and passed to the eval job as a matrix input, so patch-only PRs produce zero eval jobs.
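
The bump classification and matrix computation described above might be sketched as follows. All function names are hypothetical, the versions are assumed to be plain `X.Y.Z` strings already read from frontmatter, and the sketch does not check that a bump increments by exactly one.

```python
def parse_version(text):
    """Split an 'X.Y.Z' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return major, minor, patch


def classify_bump(old, new):
    """Name the most significant version component that changed."""
    old_v, new_v = parse_version(old), parse_version(new)
    if old_v == new_v:
        return "none"
    if old_v[0] != new_v[0]:
        return "major"
    if old_v[1] != new_v[1]:
        return "minor"
    return "patch"


def should_evaluate(old, new, manual=False):
    """MAJOR and MINOR bumps trigger evaluation; manual runs bypass the policy."""
    return manual or classify_bump(old, new) in {"major", "minor"}


def build_matrix(changed, model_ids):
    """Build (component x model) entries for the eval job.

    `changed` maps component name -> (old_version, new_version).
    Patch-only changes contribute no entries, so they produce zero eval jobs.
    """
    return [
        {"component": name, "model": model_id}
        for name, (old, new) in sorted(changed.items())
        if should_evaluate(old, new)
        for model_id in model_ids
    ]
```

A setup job could serialize the returned list as JSON and hand it to the eval job's `matrix.include`.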

Required repo secrets (GitHub → Settings → Secrets → Actions):

- `ANTHROPIC_API_KEY`
- `OPENAI_API_KEY`
- `GOOGLE_API_KEY`
- any additional provider keys added to `.evaluator/models.yaml`

Each matrix job appends its own entry to the component's `leaderboard.json`. A final aggregate step can merge them into a single commit, or each job can commit directly on its own branch and a merge-PR action consolidates. Decision deferred to the CI-wiring follow-up issue.

Cost discipline: each evaluation run makes `n_cases × replicates` calls per model. At 12 cases × 10 replicates that is 120 calls per model per component, which is bounded and predictable. Scheduling policy (nightly / weekly / on-demand) is out of scope here and belongs in a follow-up issue.

## Fixture convention (evaluation-data-sets)

```
<skill-name>/
  cases/
    <case-id>/
      document.md          # input
      ground_truth.json    # expected output
      meta.json            # optional case metadata
  README.md
```

If the folder is missing, the script exits with a clear message. This pairs with AI4RA/evaluation-data-sets#1, which should adopt `document.md` + `ground_truth.json` as the canonical fixture pair.
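
A loader for this layout could be sketched as below. `load_cases` is a hypothetical name, `dataset_root` is assumed to be a local checkout of `AI4RA/evaluation-data-sets`, and treating a case without both required files as a fatal error is an assumption rather than a stated requirement.

```python
import json
import sys
from pathlib import Path


def load_cases(dataset_root, skill_name):
    """Load every case under <dataset_root>/<skill_name>/cases/."""
    cases_dir = Path(dataset_root) / skill_name / "cases"
    if not cases_dir.is_dir():
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"No fixtures found for skill '{skill_name}': expected {cases_dir}")
    cases = []
    for case_dir in sorted(p for p in cases_dir.iterdir() if p.is_dir()):
        document = case_dir / "document.md"
        truth = case_dir / "ground_truth.json"
        if not (document.is_file() and truth.is_file()):
            sys.exit(f"Case '{case_dir.name}' needs document.md and ground_truth.json")
        meta = case_dir / "meta.json"  # optional per the convention above
        cases.append({
            "case_id": case_dir.name,
            "document": document.read_text(encoding="utf-8"),
            "ground_truth": json.loads(truth.read_text(encoding="utf-8")),
            "meta": json.loads(meta.read_text(encoding="utf-8")) if meta.is_file() else {},
        })
    return cases
```

Sorting the case directories keeps run order stable, which makes per-case results easier to compare between runs.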

## Execution model

For each configured model, for each case in the dataset:

1. Run the skill `N` times against `document.md` via that model's provider API (shell out — do **not** score in-LLM).
2. Capture all `N` raw outputs plus parsed JSON where applicable.
3. Compute **accuracy** metrics per replicate against `ground_truth.json`, then aggregate (mean, stdev) across replicates.
4. Compute **reproducibility** metrics pairwise across the `N` replicates (no ground truth involved).
5. Record every raw output for audit.

Total API calls = `n_models × n_cases × replicates`. This is the cost-control knob — both `replicates` and the model list are explicit, so blast radius stays predictable.
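
The loop and the call budget above can be sketched like this. `run_skill` is a hypothetical injected callable standing in for the provider client (it takes a model id and the document text and returns the raw output string); scoring is left out so the sketch only covers steps 1, 2, and 5.

```python
import json


def run_evaluation(models, cases, replicates, run_skill):
    """Collect raw and parsed outputs for every (model, case) pair.

    Makes exactly len(models) * len(cases) * replicates calls to `run_skill`.
    """
    results = []
    for model in models:
        for case in cases:
            raw_outputs, parsed_outputs = [], []
            for _ in range(replicates):
                raw = run_skill(model["id"], case["document"])
                raw_outputs.append(raw)  # kept verbatim for the audit trail
                try:
                    parsed_outputs.append(json.loads(raw))
                except json.JSONDecodeError:
                    parsed_outputs.append(None)  # unparseable replicate
            results.append({
                "model_id": model["id"],
                "case_id": case["case_id"],
                "raw_outputs": raw_outputs,
                "parsed_outputs": parsed_outputs,
            })
    return results
```

Injecting `run_skill` also gives a natural seam for a dry-run mode: a stub that returns canned output exercises the plumbing without any API calls.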

## Metrics — to be finalized before implementation

This is the part that needs the most thought. Starting menu to discuss:

### Accuracy (output vs. ground truth)

- **`schema_valid`** — output parses against component `schema.json`. Binary per replicate; report as pass rate.
- **`exact_match`** — JSON-equal after key-sort + whitespace normalization. Binary per replicate.
- **`field_level_f1`** — precision/recall per leaf field; harmonic mean. Real-valued per replicate.
- **`jaccard`** — for array-valued fields (e.g., `document_requirements`): `|A ∩ B| / |A ∪ B|` on sets of canonical keys.
- **Structured diff / tree edit distance** — Levenshtein-over-JSON-tree, normalized by tree size. More forgiving than exact match.
- **Per-field accuracy vector** — don't collapse to a scalar. A dashboard-style breakdown per schema field is often more actionable than any single summary number.

Most components want a mix — the component declares which metrics are meaningful in `evals/datasets.json` (from #12).
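
To make the menu concrete, here is one possible reading of `exact_match`, `jaccard`, and `field_level_f1`. It is a sketch under assumptions, not a decided design: nested JSON is flattened to dotted paths (one of the options still open below), a leaf counts as correct only if path and value both match, and empty containers contribute no leaves.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested JSON into {path: leaf_value}, e.g. 'a.b[0]'."""
    if isinstance(value, dict):
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{prefix}.{key}" if prefix else str(key)))
        return leaves
    if isinstance(value, list):
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}[{index}]"))
        return leaves
    return {prefix: value}


def exact_match(predicted, truth):
    """JSON equality after key sorting; whitespace never enters the comparison."""
    return json.dumps(predicted, sort_keys=True) == json.dumps(truth, sort_keys=True)


def jaccard(predicted_keys, truth_keys):
    """|A intersect B| / |A union B| on sets of canonical keys."""
    a, b = set(predicted_keys), set(truth_keys)
    if not a and not b:
        return 1.0  # assumption: two empty sets agree perfectly
    return len(a & b) / len(a | b)


def field_level_f1(predicted, truth):
    """Harmonic mean of leaf-level precision and recall."""
    pred, gold = flatten(predicted), flatten(truth)
    if not pred and not gold:
        return 1.0
    correct = sum(1 for path, value in pred.items()
                  if path in gold and gold[path] == value)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that index-based paths make list order significant, so order-insensitive array fields are better served by `jaccard` on canonical keys.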

### Reproducibility (replicate i vs. replicate j)

- **Pairwise exact-match rate** — fraction of the `N·(N-1)/2` replicate pairs that are JSON-equal.
- **Pairwise field-agreement F1** — same as `field_level_f1` but between replicates.
- **Per-field stability** — fraction of replicates that produce the same value per field. Highlights which fields are deterministic vs. sampled.

Reproducibility is independent of ground truth and therefore valid even on unlabeled data. It also shows whether "accuracy" is stable or a coin flip that happened to land right once.
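
Two of the reproducibility metrics could be sketched as below. The function names are hypothetical, and per-field stability is interpreted here as the share of replicates that produce the most common value for each top-level field, which is one reasonable reading of the definition above rather than a settled one.

```python
import json
from collections import Counter
from itertools import combinations


def canonical(value):
    """Key-sorted JSON text, so dict ordering never affects equality."""
    return json.dumps(value, sort_keys=True)


def pairwise_exact_match(replicates):
    """Fraction of the N*(N-1)/2 replicate pairs that are JSON-equal."""
    pairs = list(combinations(replicates, 2))
    if not pairs:
        return None  # undefined for fewer than two replicates
    return sum(canonical(a) == canonical(b) for a, b in pairs) / len(pairs)


def per_field_stability(replicates):
    """For each top-level field, the share of replicates with the modal value."""
    fields = sorted({key for rep in replicates for key in rep})
    stability = {}
    for field in fields:
        # A missing field gets its own bucket, distinct from an explicit null.
        counts = Counter(
            canonical(rep[field]) if field in rep else "<missing>"
            for rep in replicates
        )
        stability[field] = counts.most_common(1)[0][1] / len(replicates)
    return stability
```

Neither function takes ground truth as input, which is what makes these numbers usable on unlabeled data.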

## `leaderboard.json` schema (revised)

Per component, at `components/<skill-name>/leaderboard.json`. Existing records are never edited or removed; each new record is added at the head of `runs`, so the list reads newest first.

Every run record is keyed by `model_id` so per-model trends are queryable without joining across files.

```json
{
  "skill": "<skill-name>",
  "runs": [
    {
      "timestamp": "2026-04-18T17:34:00Z",
      "skill_version": "1.0.0",
      "prompt_commit": "<sha>",
      "dataset_commit": "<sha>",
      "model_id": "claude-opus-4-7",
      "provider": "anthropic",
      "replicates": 10,
      "n_cases": 12,
      "accuracy": {
        "schema_valid": { "mean": 1.0, "stdev": 0.0 },
        "exact_match":  { "mean": 0.32, "stdev": 0.08 },
        "field_level_f1": { "mean": 0.87, "stdev": 0.03 }
      },
      "reproducibility": {
        "pairwise_exact_match": 0.41,
        "pairwise_field_f1":   0.93
      },
      "per_case": [
        {
          "case_id": "nsf-r01-basic",
          "accuracy":       { "schema_valid": 1.0, "exact_match": 0.4, "field_level_f1": 0.88 },
          "reproducibility":{ "pairwise_exact_match": 0.5, "pairwise_field_f1": 0.95 }
        }
      ]
    }
  ]
}
```

Every raw output from every replicate is saved to an audit location (location TBD — likely `components/<name>/evals/results/<timestamp>/` and *not* committed to keep repo lean).
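
Writing a run record into this file could look like the sketch below. `record_run` is a hypothetical helper, and it assumes "newest first" means new records are inserted at index 0 of `runs` while existing records are left untouched.

```python
import json
from pathlib import Path


def record_run(leaderboard_path, skill, run_record):
    """Add one run record to leaderboard.json, creating the file if needed."""
    path = Path(leaderboard_path)
    if path.exists():
        board = json.loads(path.read_text(encoding="utf-8"))
    else:
        board = {"skill": skill, "runs": []}
    # Existing records are never modified; the new one goes to the head.
    board["runs"].insert(0, run_record)
    path.write_text(json.dumps(board, indent=2) + "\n", encoding="utf-8")
    return board
```

This read-modify-write is not safe for concurrent writers, so parallel matrix jobs would still need the aggregation step that the CI-wiring follow-up is meant to decide.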

## Integration with existing work

- #11 (marketplace): unaffected — this is a Python tool, not a plugin.
- #12 (eval pairing): this issue is the concrete runner #12 called for. Metric set here supersedes the starter list in #12.
- #13 (SSOT generator): unaffected. Script lives under `tools/` alongside the generator; neither depends on the other.
- AI4RA/evaluation-data-sets#1: fixture naming (`document.md` + `ground_truth.json`) needs to be aligned with that issue before this runner can be finalized.

## Acceptance criteria

- [ ] `tools/evaluate_skill.py` with `--replicates` (default 10) and `--model` flags.
- [ ] Runs end-to-end against at least one existing component's fixtures.
- [ ] Writes a valid `leaderboard.json` entry containing both accuracy and reproducibility sections with per-case breakdown.
- [ ] Handles missing fixtures folder with a clear message and non-zero exit.
- [ ] Handles unparseable model output (`schema_valid: false`, other metrics recorded as `null`, run still written).
- [ ] `--dry-run` mode that validates plumbing without API calls.
- [ ] README section documenting invocation, metric definitions, and how to read leaderboard.json.
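
The unparseable-output criterion above might be handled per replicate roughly as follows. `score_replicate` is a hypothetical name; `metrics` is assumed to map metric names to scoring callables, and `validate` is an assumed hook for schema validation against the component's `schema.json` (the validator itself is not shown).

```python
import json


def score_replicate(raw_output, ground_truth, metrics, validate=None):
    """Score one replicate; invalid output yields null metrics, not a crash.

    `metrics`: {name: callable(parsed, ground_truth) -> float}.
    `validate`: optional callable(parsed) -> bool for schema validation.
    """
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        parsed = None
        valid = False
    else:
        valid = True if validate is None else bool(validate(parsed))
    if not valid:
        # The replicate is still recorded; every other metric is null.
        return {"schema_valid": False, **{name: None for name in metrics}}
    scores = {"schema_valid": True}
    for name, metric in metrics.items():
        scores[name] = metric(parsed, ground_truth)
    return scores
```

When aggregating, `None` values would need to be excluded from means (or counted as zero), and that choice should be documented alongside the metric definitions.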

## Open questions (need to think through)

- **Which metrics to adopt from the menu above?** Probably `schema_valid` + `field_level_f1` + Jaccard for array fields as the accuracy baseline; pairwise field-F1 as the reproducibility baseline. Need to think through whether different components should declare different metrics.
- **How to compute `field_level_f1` on nested JSON?** Flatten to dotted paths? Recurse? Per-field weights?
- **Controlled-vocabulary fields** (`document_type`, `code`): exact-match is right; F1 hides category errors. Metric should be confusion-matrix-aware.
- **Free-text fields** (`rationale`, `knowledge_notes`): exact-match is meaningless. Skip, or score with ROUGE/semantic similarity?
- **Cost control**: 10 replicates × 12 cases × 5 components = 600 calls per model, multiplied again by the number of configured models. Acceptable for a nightly run, not per-PR. Needs a schedule/opt-in policy.
- **Where do raw per-replicate outputs live?** In-repo under a gitignored (uncommitted) audit folder? External blob store? Not persisted at all?

- **Initial model set**: which models populate the first `.evaluator/models.yaml`? Candidates: Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5, Gemini 2.5 Pro. Pick 2–3 to start or run the full matrix?
- **Provider abstraction shape**: one unified client interface (e.g., `litellm`) or hand-rolled thin wrappers per provider?
- **Determinism settings per provider**: each provider exposes `temperature`, `seed`, `top_p` etc. differently. Need a normalized "deterministic" profile per provider for reproducibility numbers to be meaningful.
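
For the controlled-vocabulary question above, one possible starting point is to tally (expected, predicted) label pairs so that systematic category confusions stay visible instead of being averaged away. The function names and the labels in the usage note are hypothetical.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count every (expected, predicted) label pair, correct ones included."""
    return Counter(pairs)


def top_confusions(pairs, limit=3):
    """Most frequent mismatches, as ((expected, predicted), count) tuples."""
    mismatches = Counter((exp, pred) for exp, pred in pairs if exp != pred)
    return mismatches.most_common(limit)
```

For example, given the pairs `[("A", "A"), ("A", "B"), ("A", "B"), ("B", "A")]`, `top_confusions` returns `[(("A", "B"), 2), (("B", "A"), 1)]`, showing that label `A` is mistaken for `B` more often than the reverse.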

Tagging for discussion; will iterate on metric selection and model set before implementation starts.




