Add Python evaluation runner: run a skill N times against fixtures, score accuracy + reproducibility, update leaderboard.json #14

@nate-layman

Summary

Ship a Python script (not a Claude Code skill) that takes a skill name, runs it multiple times against every configured LLM using fixtures in AI4RA/evaluation-data-sets, and produces two distinct measurements per model:

  • Accuracy — how close each run's output is to ground truth.
  • Reproducibility — how consistent the skill is across replicate runs of the same input.

All scoring is deterministic Python (no LLM-as-judge). The script appends one run record per model to the component's leaderboard.json, so cross-model performance can be compared over time.

Invocation

python tools/evaluate_skill.py <skill-name> [--replicates 10] [--model <model-id>]
  • <skill-name> (positional, required): the component to evaluate. Must match a folder under components/ and a folder under AI4RA/evaluation-data-sets/.
  • --replicates N (default 10): number of independent runs per fixture. Needed to separate accuracy from reproducibility.
  • --model <model-id> (optional): restrict this invocation to a single model. When omitted, evaluates every model in .evaluator/models.yaml for which a provider key is present. This flag exists mainly so CI can fan out to one job per model (see "CI wiring" below).
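The invocation above could be parsed along these lines. This is a minimal sketch using the standard-library argparse module; the helper names build_parser and positive_int are hypothetical, and the choice to reject values below 1 for --replicates is an assumption, not something the issue specifies.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type callable: accept only integers >= 1 (assumed constraint)."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags described in the Invocation section."""
    parser = argparse.ArgumentParser(
        prog="evaluate_skill.py",
        description="Run a skill against its fixtures and score the outputs.",
    )
    # Positional, required: must match a folder under components/.
    parser.add_argument("skill_name", metavar="skill-name",
                        help="component to evaluate")
    # Independent runs per fixture; default taken from the issue text.
    parser.add_argument("--replicates", type=positive_int, default=10, metavar="N",
                        help="number of independent runs per fixture")
    # Optional: restrict the run to one model (used by the CI matrix).
    parser.add_argument("--model", default=None, metavar="model-id",
                        help="evaluate only this model")
    return parser
```

Calling build_parser().parse_args(["my-skill", "--model", "gpt-5"]) would yield skill_name="my-skill", replicates=10, and model="gpt-5".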

Multi-model configuration

The script evaluates the skill against every model listed in a repo-root config file (proposed path: .evaluator/models.yaml):

models:
  - id: claude-opus-4-7
    provider: anthropic
  - id: claude-sonnet-4-6
    provider: anthropic
  - id: gpt-5
    provider: openai
  - id: gemini-2-5-pro
    provider: google

Each provider maps to one API client in the script. At runtime the script picks up credentials from environment variables (one per provider, never per model):

  • ANTHROPIC_API_KEY
  • OPENAI_API_KEY
  • GOOGLE_API_KEY
  • (extensible — add a new provider by adding a client + env var name)

Any model whose provider env var is unset is skipped with a warning, not a hard failure. This lets contributors run the evaluator against only the providers they have keys for, and lets CI run the full matrix when all secrets are configured.
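The skip-with-warning behavior might look roughly like the following. It is a sketch under stated assumptions: the model list is passed in already parsed (a list of dicts shaped like the models.yaml entries above), and the names PROVIDER_ENV_VARS and select_models are hypothetical.

```python
import os
import warnings

# Provider name -> environment variable, as listed in the issue.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def select_models(models, env=None, only_model=None):
    """Return the models that can run with the credentials present in env.

    models: list of {"id": ..., "provider": ...} dicts (parsed config).
    env: mapping to read credentials from; defaults to os.environ.
    only_model: optional model id, mirroring the --model flag.
    """
    env = os.environ if env is None else env
    selected = []
    for model in models:
        if only_model is not None and model["id"] != only_model:
            continue
        env_var = PROVIDER_ENV_VARS.get(model["provider"])
        if env_var is None or not env.get(env_var):
            # Missing key (or unknown provider): warn and skip, do not fail.
            warnings.warn(
                f"Skipping {model['id']}: no credentials for provider "
                f"{model['provider']!r}"
            )
            continue
        selected.append(model)
    return selected
```

Passing env explicitly keeps the function easy to test without touching the real environment.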

CI wiring (separate follow-up)

When this runner is wired into GitHub Actions, use a matrix strategy with one job per model — not one monolithic job. This matters:

  • Each model's job inherits the default 6-hour GitHub Actions timeout independently. A slow provider can't starve the others.
  • A provider outage fails only its own matrix job; the rest still produce results.
  • Each job can be scoped to receive only its own provider's secret (principle of least privilege).
  • Jobs run in parallel, so wall-clock time drops roughly in proportion to the model count.

Sketch:

jobs:
  evaluate:
    strategy:
      fail-fast: false
      matrix:
        include:
          - model: claude-opus-4-7
            secret_name: ANTHROPIC_API_KEY
          - model: claude-sonnet-4-6
            secret_name: ANTHROPIC_API_KEY
          - model: gpt-5
            secret_name: OPENAI_API_KEY
          - model: gemini-2-5-pro
            secret_name: GOOGLE_API_KEY
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
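      - run: |
          # Workflow env keys cannot be expressions, so export the key under its expected name here.
          export "$SECRET_NAME=$SECRET_VALUE"
          python tools/evaluate_skill.py ${{ inputs.skill_name }} --replicates 10 --model ${{ matrix.model }}
        env:
          SECRET_NAME: ${{ matrix.secret_name }}
          SECRET_VALUE: ${{ secrets[matrix.secret_name] }}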

Trigger policy

The evaluator does not run on every commit to a component. It runs only when a component's MAJOR or MINOR version bumps — never on PATCH. Rationale: per the versioning policy in the root README, a PATCH is a wording/clarity fix with no expected behavior change, so re-burning 120+ API calls per model to measure "no change expected" is waste. MAJOR (output contract change) and MINOR (new capability) are the cases where scores can legitimately move.

Mechanics:

  • On PR merge / push to main, detect which components/<name>/prompt.md files changed.
  • For each changed component, compare the old and new version: values in the frontmatter:
    • X.Y.Z → X.Y.Z+1 (patch bump): skip evaluation, log reason, exit clean.
    • X.Y.Z → X.Y+1.0 (minor bump): evaluate that component across all configured models.
    • X.Y.Z → X+1.0.0 (major bump): evaluate that component across all configured models.
    • Prompt changed but version did not: that's a lint error already (existing lint script), so the evaluator does not special-case it — lint fails first.
  • Manual re-runs (workflow_dispatch) bypass the policy so a contributor can force an evaluation on demand.

Implementation note: the matrix of (changed component × model) should be computed in a setup job and passed to the eval job as a matrix input, so patch-only PRs produce zero eval jobs.
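The bump rules above reduce to a small comparison on version triples. A minimal sketch, assuming plain X.Y.Z strings with no pre-release or build suffixes; parse_version, classify_bump, and should_evaluate are hypothetical names.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'X.Y.Z' into a tuple of ints (assumes exactly three numeric parts)."""
    major, minor, patch = (int(part) for part in version.strip().split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for an old -> new change."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v == old_v:
        return "none"
    if new_v < old_v:
        raise ValueError(f"version went backwards: {old} -> {new}")
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


def should_evaluate(old: str, new: str, manual: bool = False) -> bool:
    """Evaluate on MAJOR/MINOR bumps only; manual dispatch bypasses the policy."""
    return manual or classify_bump(old, new) in {"major", "minor"}
```

A setup job could call should_evaluate once per changed component and emit only the components that return True, so patch-only PRs yield an empty matrix.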

Required repo secrets (GitHub → Settings → Secrets → Actions):

  • ANTHROPIC_API_KEY
  • OPENAI_API_KEY
  • GOOGLE_API_KEY
  • any additional provider keys added to .evaluator/models.yaml

Each matrix job appends its own entry to the component's leaderboard.json. A final aggregate step can merge them into a single commit, or each job can commit directly on its own branch and a merge-PR action consolidates. Decision deferred to the CI-wiring follow-up issue.

Cost discipline: per model, n_cases × replicates calls per evaluation run. At 12 cases × 10 replicates that is 120 calls per model per component, bounded and predictable. Scheduling policy (nightly / weekly / on-demand) is out of scope here — follow-up issue.

Fixture convention (evaluation-data-sets)

<skill-name>/
  cases/
    <case-id>/
      document.md          # input
      ground_truth.json    # expected output
      meta.json            # optional case metadata
  README.md

If the folder is missing, the script exits with a clear message and a non-zero status. This pairs with AI4RA/evaluation-data-sets#1; that issue should adopt document.md + ground_truth.json as the canonical fixture pair.
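Loading fixtures under this convention could be sketched as follows. The function name load_cases and the returned dict shape are hypothetical, and treating an incomplete case folder as a hard error is an assumption; the issue only specifies the behavior for a missing skill folder.

```python
import json
from pathlib import Path


def load_cases(dataset_root, skill_name):
    """Load every case under <dataset_root>/<skill_name>/cases/."""
    cases_dir = Path(dataset_root) / skill_name / "cases"
    if not cases_dir.is_dir():
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit(f"No fixtures found for skill '{skill_name}' at {cases_dir}")

    cases = []
    for case_dir in sorted(p for p in cases_dir.iterdir() if p.is_dir()):
        document = case_dir / "document.md"
        truth = case_dir / "ground_truth.json"
        if not (document.is_file() and truth.is_file()):
            # Assumption: an incomplete case is an error, not a silent skip.
            raise SystemExit(f"Case '{case_dir.name}' is missing document.md or ground_truth.json")
        meta = case_dir / "meta.json"  # optional per the convention
        cases.append({
            "case_id": case_dir.name,
            "document": document.read_text(encoding="utf-8"),
            "ground_truth": json.loads(truth.read_text(encoding="utf-8")),
            "meta": json.loads(meta.read_text(encoding="utf-8")) if meta.is_file() else None,
        })
    return cases
```

Sorting the case folders keeps the run order stable between invocations, which makes the per_case section of the leaderboard easier to diff.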

Execution model

For each configured model, for each case in the dataset:

  1. Run the skill N times against document.md via that model's provider API (shell out — do not score in-LLM).
  2. Capture all N raw outputs plus parsed JSON where applicable.
  3. Compute accuracy metrics per replicate against ground_truth.json, then aggregate (mean, stdev) across replicates.
  4. Compute reproducibility metrics pairwise across the N replicates (no ground truth involved).
  5. Record every raw output for audit.

Total API calls = n_models × n_cases × replicates. This is the cost-control knob — both replicates and the model list are explicit, so blast radius stays predictable.
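The loop structure and the call-count formula can be sketched together. Here run_skill is a hypothetical callable standing in for the provider client (it takes a model id and a document and returns raw output), and cases is assumed to be a list of dicts with case_id and document keys; scoring is deliberately left out.

```python
def total_calls(n_models: int, n_cases: int, replicates: int) -> int:
    """Total API calls = n_models x n_cases x replicates."""
    return n_models * n_cases * replicates


def run_evaluation(model_ids, cases, replicates, run_skill):
    """Collect raw outputs for every (model, case) pair.

    run_skill(model_id, document) -> raw output. It is injected so the loop
    can be exercised without network access (for example in a dry run).
    """
    raw_outputs = {}
    for model_id in model_ids:
        for case in cases:
            # N independent replicates of the same input; scoring happens later,
            # in deterministic Python, on these stored outputs.
            raw_outputs[(model_id, case["case_id"])] = [
                run_skill(model_id, case["document"]) for _ in range(replicates)
            ]
    return raw_outputs
```

With the four models in the example config, 12 cases, and 10 replicates, total_calls(4, 12, 10) gives 480 calls for one component.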

Metrics — to be finalized before implementation

This is the part that needs the most thought. Starting menu to discuss:

Accuracy (output vs. ground truth)

  • schema_valid — output parses against component schema.json. Binary per replicate; report as pass rate.
  • exact_match — JSON-equal after key-sort + whitespace normalization. Binary per replicate.
  • field_level_f1 — precision/recall per leaf field; harmonic mean. Real-valued per replicate.
  • jaccard — for array-valued fields (e.g., document_requirements): |A ∩ B| / |A ∪ B| on sets of canonical keys.
  • Structured diff / tree edit distance — Levenshtein-over-JSON-tree, normalized by tree size. More forgiving than exact match.
  • Per-field accuracy vector — don't collapse to a scalar. A dashboard-style breakdown per schema field is often more actionable than any single summary number.

Most components want a mix — the component declares which metrics are meaningful in evals/datasets.json (from #12).
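To make the accuracy menu concrete, here is one possible reading of exact_match, field_level_f1, and jaccard. It is a sketch, not a settled definition: flattening to dotted paths with positional list indices is only one of the options raised in the open questions, and the helper names canonical and flatten are hypothetical.

```python
import json


def canonical(obj) -> str:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def exact_match(output, truth) -> bool:
    """JSON-equal after key-sort + whitespace normalization."""
    return canonical(output) == canonical(truth)


def flatten(obj, prefix=""):
    """Flatten nested JSON to {dotted.path: leaf}; list items keyed by index.

    Note: empty dicts and lists contribute no leaves in this sketch.
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def field_level_f1(output, truth) -> float:
    """Harmonic mean of precision and recall over leaf fields."""
    out, exp = flatten(output), flatten(truth)
    if not out or not exp:
        return 1.0 if out == exp else 0.0
    correct = sum(1 for path, value in out.items() if path in exp and exp[path] == value)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(out), correct / len(exp)
    return 2 * precision * recall / (precision + recall)


def jaccard(a, b) -> float:
    """|A ∩ B| / |A ∪ B| on sets of canonical keys; two empty sets score 1.0."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, an output that gets two of three leaf fields right against a three-field ground truth has precision and recall of 2/3 each, so its F1 is also 2/3.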

Reproducibility (replicate i vs. replicate j)

  • Pairwise exact-match rate — fraction of the N·(N-1)/2 replicate pairs that are JSON-equal.
  • Pairwise field-agreement F1 — same as field_level_f1 but between replicates.
  • Per-field stability — for each field, the fraction of replicates that produce the most common value for that field. Highlights which fields are deterministic vs. sampled.

Reproducibility is independent of ground truth and therefore valid even on unlabeled data. It also shows whether "accuracy" is stable or a coin flip that happened to land right once.
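The pairwise metrics need no ground truth, only the N parsed replicate outputs. A minimal sketch, assuming each replicate is a JSON object and restricting per-field stability to top-level fields for brevity; the function names are hypothetical, and "stability" is read here as the share of replicates agreeing with the most common value.

```python
import json
from collections import Counter
from itertools import combinations


def canonical(obj) -> str:
    """Serialize with sorted keys so equal JSON values compare equal as strings."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def pairwise_exact_match(replicates) -> float:
    """Fraction of the N*(N-1)/2 replicate pairs that are JSON-equal."""
    pairs = list(combinations(replicates, 2))
    if not pairs:
        raise ValueError("need at least two replicates to compare")
    return sum(canonical(a) == canonical(b) for a, b in pairs) / len(pairs)


def per_field_stability(replicates) -> dict:
    """For each top-level field, share of replicates producing the modal value."""
    fields = set().union(*(r.keys() for r in replicates))
    stability = {}
    for field in sorted(fields):
        # A bare "<missing>" marker cannot collide with any canonical JSON string.
        values = Counter(
            canonical(r[field]) if field in r else "<missing>" for r in replicates
        )
        stability[field] = values.most_common(1)[0][1] / len(replicates)
    return stability
```

With three replicates where one differs in a single field, one of the three pairs matches exactly, so pairwise_exact_match is 1/3 while the unaffected fields still report a stability of 1.0.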

leaderboard.json schema (revised)

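Per component, at components/<skill-name>/leaderboard.json. Append-only in the sense that existing records are never edited or removed; new records go at the front of runs, so the list reads newest first.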

Every run record is keyed by model_id so per-model trends are queryable without joining across files.

{
  "skill": "<skill-name>",
  "runs": [
    {
      "timestamp": "2026-04-18T17:34:00Z",
      "skill_version": "1.0.0",
      "prompt_commit": "<sha>",
      "dataset_commit": "<sha>",
      "model_id": "claude-opus-4-7",
      "provider": "anthropic",
      "replicates": 10,
      "n_cases": 12,
      "accuracy": {
        "schema_valid": { "mean": 1.0, "stdev": 0.0 },
        "exact_match":  { "mean": 0.32, "stdev": 0.08 },
        "field_level_f1": { "mean": 0.87, "stdev": 0.03 }
      },
      "reproducibility": {
        "pairwise_exact_match": 0.41,
        "pairwise_field_f1":   0.93
      },
      "per_case": [
        {
          "case_id": "nsf-r01-basic",
          "accuracy":       { "schema_valid": 1.0, "exact_match": 0.4, "field_level_f1": 0.88 },
          "reproducibility":{ "pairwise_exact_match": 0.5, "pairwise_field_f1": 0.95 }
        }
      ]
    }
  ]
}

Every raw output from every replicate is saved to an audit location (location TBD — likely components/<name>/evals/results/<timestamp>/ and not committed to keep repo lean).
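Writing a run record under this schema could be as small as the following. It is a sketch under assumptions: the function name append_run is hypothetical, new records are inserted at index 0 to keep the list newest first, and concurrent writers (the matrix-job question deferred above) are not handled.

```python
import json
from pathlib import Path


def append_run(leaderboard_path, skill, run_record):
    """Insert run_record at the front of runs and rewrite the file."""
    path = Path(leaderboard_path)
    if path.exists():
        board = json.loads(path.read_text(encoding="utf-8"))
    else:
        # First run for this component: start an empty leaderboard.
        board = {"skill": skill, "runs": []}
    if board.get("skill") != skill:
        raise ValueError(f"{path} belongs to skill {board.get('skill')!r}, not {skill!r}")
    # Existing records are left untouched; the newest record goes first.
    board["runs"].insert(0, run_record)
    path.write_text(json.dumps(board, indent=2) + "\n", encoding="utf-8")
    return board
```

Because each record carries its own model_id, one file can hold the history for every model without any further keying.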

Integration with existing work

Acceptance criteria

  • tools/evaluate_skill.py with --replicates (default 10) and --model flags.
  • Runs end-to-end against at least one existing component's fixtures.
  • Writes a valid leaderboard.json entry containing both accuracy and reproducibility sections with per-case breakdown.
  • Handles missing fixtures folder with a clear message and non-zero exit.
  • Handles unparseable model output (schema_valid: false, other metrics recorded as null, run still written).
  • --dry-run mode that validates plumbing without API calls.
  • README section documenting invocation, metric definitions, and how to read leaderboard.json.

Open questions (need to think through)

  • Which metrics to adopt from the menu above? Probably schema_valid + field_level_f1 + Jaccard for array fields as the accuracy baseline; pairwise field-F1 as the reproducibility baseline. Need to think through whether different components should declare different metrics.

  • How to compute field_level_f1 on nested JSON? Flatten to dotted paths? Recurse? Per-field weights?

  • Controlled-vocabulary fields (document_type, code): exact-match is right; F1 hides category errors. Metric should be confusion-matrix-aware.

  • Free-text fields (rationale, knowledge_notes): exact-match is meaningless. Skip, or score with ROUGE/semantic similarity?

  • Cost control: 10 replicates × 12 cases × 5 components = 600 calls per model. Acceptable for a nightly run, not per-PR. Needs a schedule/opt-in policy.

  • Where do raw per-replicate outputs live? In-repo under a gitignored audit folder? External blob store? Not committed at all?

  • Initial model set: which models populate the first .evaluator/models.yaml? Candidates: Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5, Gemini 2.5 Pro. Pick 2–3 to start or run the full matrix?

  • Provider abstraction shape: one unified client interface (e.g., litellm) or hand-rolled thin wrappers per provider?

  • Determinism settings per provider: each provider exposes temperature, seed, top_p etc. differently. Need a normalized "deterministic" profile per provider for reproducibility numbers to be meaningful.

Tagging for discussion; will iterate on metric selection and model set before implementation starts.
