Summary
Ship a Python script (not a Claude Code skill) that takes a skill name, runs it multiple times against every configured LLM using fixtures in AI4RA/evaluation-data-sets, and produces two distinct measurements per model:
- Accuracy — how close each run's output is to ground truth.
- Reproducibility — how consistent the skill is across replicate runs of the same input.
All scoring is deterministic Python (no LLM-as-judge). The script appends one run record per model to the component's leaderboard.json, so cross-model performance can be compared over time.
Invocation
python tools/evaluate_skill.py <skill-name> [--replicates 10] [--model <model-id>]
<skill-name> (positional, required): the component to evaluate. Must match a folder under components/ and a folder under AI4RA/evaluation-data-sets/.
--replicates N (default 10): number of independent runs per fixture. Needed to separate accuracy from reproducibility.
--model <model-id> (optional): restrict this invocation to a single model. When omitted, evaluates every model in .evaluator/models.yaml for which a provider key is present. This flag exists mainly so CI can fan out to one job per model (see "CI wiring" below).
Multi-model configuration
The script evaluates the skill against every model listed in a repo-root config file (proposed path: .evaluator/models.yaml):
models:
- id: claude-opus-4-7
provider: anthropic
- id: claude-sonnet-4-6
provider: anthropic
- id: gpt-5
provider: openai
- id: gemini-2-5-pro
provider: google
Each provider maps to one API client in the script. At runtime the script picks up credentials from environment variables (one per provider, never per model):
ANTHROPIC_API_KEY
OPENAI_API_KEY
GOOGLE_API_KEY
- (extensible — add a new provider by adding a client + env var name)
Any model whose provider env var is unset is skipped with a warning, not a hard failure. This lets contributors run the evaluator against only the providers they have keys for, and lets CI run the full matrix when all secrets are configured.
CI wiring (separate follow-up)
When this runner is wired into GitHub Actions, use a matrix strategy with one job per model — not one monolithic job. This matters:
- Each model's job inherits the default 6-hour GitHub Actions timeout independently. A slow provider can't starve the others.
- A provider outage fails only its own matrix job; the rest still produce results.
- Each job can be scoped to receive only its own provider's secret (principle of least privilege).
- Jobs run in parallel, wall-clock time drops roughly linearly with the model count.
Sketch:
jobs:
evaluate:
strategy:
fail-fast: false
matrix:
include:
- model: claude-opus-4-7
secret_name: ANTHROPIC_API_KEY
- model: claude-sonnet-4-6
secret_name: ANTHROPIC_API_KEY
- model: gpt-5
secret_name: OPENAI_API_KEY
- model: gemini-2-5-pro
secret_name: GOOGLE_API_KEY
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: python tools/evaluate_skill.py ${{ inputs.skill_name }} --replicates 10 --model ${{ matrix.model }}
env:
${{ matrix.secret_name }}: ${{ secrets[matrix.secret_name] }}
Trigger policy
The evaluator does not run on every commit to a component. It runs only when a component's MAJOR or MINOR version bumps — never on PATCH. Rationale: per the versioning policy in the root README, a PATCH is a wording/clarity fix with no expected behavior change, so re-burning 120+ API calls per model to measure "no change expected" is waste. MAJOR (output contract change) and MINOR (new capability) are the cases where scores can legitimately move.
Mechanics:
- On PR merge / push to
main, detect which components/<name>/prompt.md files changed.
- For each changed component, diff old vs. new
version: in frontmatter:
X.Y.Z → X.Y.Z+1 (patch bump): skip evaluation, log reason, exit clean.
X.Y.Z → X.Y+1.0 (minor bump): evaluate that component across all configured models.
X.Y.Z → X+1.0.0 (major bump): evaluate that component across all configured models.
- Prompt changed but version did not: that's a lint error already (existing lint script), so the evaluator does not special-case it — lint fails first.
- Manual re-runs (workflow_dispatch) bypass the policy so a contributor can force an evaluation on demand.
Implementation note: the matrix of (changed component × model) should be computed in a setup job and passed to the eval job as a matrix input, so patch-only PRs produce zero eval jobs.
Required repo secrets (GitHub → Settings → Secrets → Actions):
ANTHROPIC_API_KEY
OPENAI_API_KEY
GOOGLE_API_KEY
- any additional provider keys added to
.evaluator/models.yaml
Each matrix job appends its own entry to the component's leaderboard.json. A final aggregate step can merge them into a single commit, or each job can commit directly on its own branch and a merge-PR action consolidates. Decision deferred to the CI-wiring follow-up issue.
Cost discipline: per model, n_cases × replicates calls per scheduled run. At 12 cases × 10 replicates that is 120 calls per model per component, bounded and predictable. Scheduling policy (nightly / weekly / on-demand) is out of scope here — follow-up issue.
Fixture convention (evaluation-data-sets)
<skill-name>/
cases/
<case-id>/
document.md # input
ground_truth.json # expected output
meta.json # optional case metadata
README.md
If the folder is missing, the script exits with a clear message. Pairs this with AI4RA/evaluation-data-sets#1 — that issue should adopt document.md + ground_truth.json as the canonical fixture pair.
Execution model
For each configured model, for each case in the dataset:
- Run the skill
N times against document.md via that model's provider API (shell out — do not score in-LLM).
- Capture all
N raw outputs plus parsed JSON where applicable.
- Compute accuracy metrics per replicate against
ground_truth.json, then aggregate (mean, stdev) across replicates.
- Compute reproducibility metrics pairwise across the
N replicates (no ground truth involved).
- Record every raw output for audit.
Total API calls = n_models × n_cases × replicates. This is the cost-control knob — both replicates and the model list are explicit, so blast radius stays predictable.
Metrics — to be finalized before implementation
This is the part that needs the most thought. Starting menu to discuss:
Accuracy (output vs. ground truth)
schema_valid — output parses against component schema.json. Binary per replicate; report as pass rate.
exact_match — JSON-equal after key-sort + whitespace normalization. Binary per replicate.
field_level_f1 — precision/recall per leaf field; harmonic mean. Real-valued per replicate.
jaccard — for array-valued fields (e.g., document_requirements): |A ∩ B| / |A ∪ B| on sets of canonical keys.
- Structured diff / tree edit distance — Levenshtein-over-JSON-tree, normalized by tree size. More forgiving than exact match.
- Per-field accuracy vector — don't collapse to a scalar. A dashboard-style breakdown per schema field is often more actionable than any single summary number.
Most components want a mix — the component declares which metrics are meaningful in evals/datasets.json (from #12).
Reproducibility (replicate i vs. replicate j)
- Pairwise exact-match rate — fraction of the
N·(N-1)/2 replicate pairs that are JSON-equal.
- Pairwise field-agreement F1 — same as
field_level_f1 but between replicates.
- Per-field stability — fraction of replicates that produce the same value per field. Highlights which fields are deterministic vs. sampled.
Reproducibility is independent of ground truth and therefore valid even on unlabeled data. It also shows whether "accuracy" is stable or a coin flip that happened to land right once.
leaderboard.json schema (revised)
Per component, at components/<skill-name>/leaderboard.json. Append-only, newest first.
Every run record is keyed by model_id so per-model trends are queryable without joining across files.
{
"skill": "<skill-name>",
"runs": [
{
"timestamp": "2026-04-18T17:34:00Z",
"skill_version": "1.0.0",
"prompt_commit": "<sha>",
"dataset_commit": "<sha>",
"model_id": "claude-opus-4-7",
"provider": "anthropic",
"replicates": 10,
"n_cases": 12,
"accuracy": {
"schema_valid": { "mean": 1.0, "stdev": 0.0 },
"exact_match": { "mean": 0.32, "stdev": 0.08 },
"field_level_f1": { "mean": 0.87, "stdev": 0.03 }
},
"reproducibility": {
"pairwise_exact_match": 0.41,
"pairwise_field_f1": 0.93
},
"per_case": [
{
"case_id": "nsf-r01-basic",
"accuracy": { "schema_valid": 1.0, "exact_match": 0.4, "field_level_f1": 0.88 },
"reproducibility":{ "pairwise_exact_match": 0.5, "pairwise_field_f1": 0.95 }
}
]
}
]
}
Every raw output from every replicate is saved to an audit location (location TBD — likely components/<name>/evals/results/<timestamp>/ and not committed to keep repo lean).
Integration with existing work
Acceptance criteria
Open questions (need to think through)
-
Which metrics to adopt from the menu above? Probably schema_valid + field_level_f1 + Jaccard for array fields as the accuracy baseline; pairwise field-F1 as the reproducibility baseline. Need to think through whether different components should declare different metrics.
-
How to compute field_level_f1 on nested JSON? Flatten to dotted paths? Recurse? Per-field weights?
-
Controlled-vocabulary fields (document_type, code): exact-match is right; F1 hides category errors. Metric should be confusable-matrix-aware.
-
Free-text fields (rationale, knowledge_notes): exact-match is meaningless. Skip, or score with ROUGE/semantic similarity?
-
Cost control: 10 replicates × 12 cases × 5 components = 600 calls. Acceptable for a nightly run, not per-PR. Needs a schedule/opt-in policy.
-
Where do raw per-replicate outputs live? In-repo under a committed-but-gitignored audit folder? External blob store? Not committed at all?
-
Initial model set: which models populate the first .evaluator/models.yaml? Candidates: Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5, Gemini 2.5 Pro. Pick 2–3 to start or run the full matrix?
-
Provider abstraction shape: one unified client interface (e.g., litellm) or hand-rolled thin wrappers per provider?
-
Determinism settings per provider: each provider exposes temperature, seed, top_p etc. differently. Need a normalized "deterministic" profile per provider for reproducibility numbers to be meaningful.
Tagging for discussion; will iterate on metric selection and model set before implementation starts.
Summary
Ship a Python script (not a Claude Code skill) that takes a skill name, runs it multiple times against every configured LLM using fixtures in
AI4RA/evaluation-data-sets, and produces two distinct measurements per model:All scoring is deterministic Python (no LLM-as-judge). The script appends one run record per model to the component's
leaderboard.json, so cross-model performance can be compared over time.Invocation
<skill-name>(positional, required): the component to evaluate. Must match a folder undercomponents/and a folder underAI4RA/evaluation-data-sets/.--replicates N(default 10): number of independent runs per fixture. Needed to separate accuracy from reproducibility.--model <model-id>(optional): restrict this invocation to a single model. When omitted, evaluates every model in.evaluator/models.yamlfor which a provider key is present. This flag exists mainly so CI can fan out to one job per model (see "CI wiring" below).Multi-model configuration
The script evaluates the skill against every model listed in a repo-root config file (proposed path:
.evaluator/models.yaml):Each provider maps to one API client in the script. At runtime the script picks up credentials from environment variables (one per provider, never per model):
ANTHROPIC_API_KEYOPENAI_API_KEYGOOGLE_API_KEYAny model whose provider env var is unset is skipped with a warning, not a hard failure. This lets contributors run the evaluator against only the providers they have keys for, and lets CI run the full matrix when all secrets are configured.
CI wiring (separate follow-up)
When this runner is wired into GitHub Actions, use a matrix strategy with one job per model — not one monolithic job. This matters:
Sketch:
Trigger policy
The evaluator does not run on every commit to a component. It runs only when a component's MAJOR or MINOR version bumps — never on PATCH. Rationale: per the versioning policy in the root README, a PATCH is a wording/clarity fix with no expected behavior change, so re-burning 120+ API calls per model to measure "no change expected" is waste. MAJOR (output contract change) and MINOR (new capability) are the cases where scores can legitimately move.
Mechanics:
main, detect whichcomponents/<name>/prompt.mdfiles changed.version:in frontmatter:X.Y.Z → X.Y.Z+1(patch bump): skip evaluation, log reason, exit clean.X.Y.Z → X.Y+1.0(minor bump): evaluate that component across all configured models.X.Y.Z → X+1.0.0(major bump): evaluate that component across all configured models.Implementation note: the matrix of (changed component × model) should be computed in a setup job and passed to the eval job as a matrix input, so patch-only PRs produce zero eval jobs.
Required repo secrets (GitHub → Settings → Secrets → Actions):
ANTHROPIC_API_KEYOPENAI_API_KEYGOOGLE_API_KEY.evaluator/models.yamlEach matrix job appends its own entry to the component's
leaderboard.json. A final aggregate step can merge them into a single commit, or each job can commit directly on its own branch and a merge-PR action consolidates. Decision deferred to the CI-wiring follow-up issue.Cost discipline: per model,
n_cases × replicatescalls per scheduled run. At 12 cases × 10 replicates that is 120 calls per model per component, bounded and predictable. Scheduling policy (nightly / weekly / on-demand) is out of scope here — follow-up issue.Fixture convention (evaluation-data-sets)
If the folder is missing, the script exits with a clear message. Pairs this with AI4RA/evaluation-data-sets#1 — that issue should adopt
document.md+ground_truth.jsonas the canonical fixture pair.Execution model
For each configured model, for each case in the dataset:
Ntimes againstdocument.mdvia that model's provider API (shell out — do not score in-LLM).Nraw outputs plus parsed JSON where applicable.ground_truth.json, then aggregate (mean, stdev) across replicates.Nreplicates (no ground truth involved).Total API calls =
n_models × n_cases × replicates. This is the cost-control knob — bothreplicatesand the model list are explicit, so blast radius stays predictable.Metrics — to be finalized before implementation
This is the part that needs the most thought. Starting menu to discuss:
Accuracy (output vs. ground truth)
schema_valid— output parses against componentschema.json. Binary per replicate; report as pass rate.exact_match— JSON-equal after key-sort + whitespace normalization. Binary per replicate.field_level_f1— precision/recall per leaf field; harmonic mean. Real-valued per replicate.jaccard— for array-valued fields (e.g.,document_requirements):|A ∩ B| / |A ∪ B|on sets of canonical keys.Most components want a mix — the component declares which metrics are meaningful in
evals/datasets.json(from #12).Reproducibility (replicate i vs. replicate j)
N·(N-1)/2replicate pairs that are JSON-equal.field_level_f1but between replicates.Reproducibility is independent of ground truth and therefore valid even on unlabeled data. It also shows whether "accuracy" is stable or a coin flip that happened to land right once.
leaderboard.jsonschema (revised)Per component, at
components/<skill-name>/leaderboard.json. Append-only, newest first.Every run record is keyed by
model_idso per-model trends are queryable without joining across files.{ "skill": "<skill-name>", "runs": [ { "timestamp": "2026-04-18T17:34:00Z", "skill_version": "1.0.0", "prompt_commit": "<sha>", "dataset_commit": "<sha>", "model_id": "claude-opus-4-7", "provider": "anthropic", "replicates": 10, "n_cases": 12, "accuracy": { "schema_valid": { "mean": 1.0, "stdev": 0.0 }, "exact_match": { "mean": 0.32, "stdev": 0.08 }, "field_level_f1": { "mean": 0.87, "stdev": 0.03 } }, "reproducibility": { "pairwise_exact_match": 0.41, "pairwise_field_f1": 0.93 }, "per_case": [ { "case_id": "nsf-r01-basic", "accuracy": { "schema_valid": 1.0, "exact_match": 0.4, "field_level_f1": 0.88 }, "reproducibility":{ "pairwise_exact_match": 0.5, "pairwise_field_f1": 0.95 } } ] } ] }Every raw output from every replicate is saved to an audit location (location TBD — likely
components/<name>/evals/results/<timestamp>/and not committed to keep repo lean).Integration with existing work
tools/alongside the generator; neither depends on the other.document.md+ground_truth.json) needs to be aligned with that issue before this runner can be finalized.Acceptance criteria
tools/evaluate_skill.pywith--replicates(default 10) and--modelflags.leaderboard.jsonentry containing both accuracy and reproducibility sections with per-case breakdown.schema_valid: false, other metrics recorded asnull, run still written).--dry-runmode that validates plumbing without API calls.Open questions (need to think through)
Which metrics to adopt from the menu above? Probably
schema_valid+field_level_f1+ Jaccard for array fields as the accuracy baseline; pairwise field-F1 as the reproducibility baseline. Need to think through whether different components should declare different metrics.How to compute
field_level_f1on nested JSON? Flatten to dotted paths? Recurse? Per-field weights?Controlled-vocabulary fields (
document_type,code): exact-match is right; F1 hides category errors. Metric should be confusable-matrix-aware.Free-text fields (
rationale,knowledge_notes): exact-match is meaningless. Skip, or score with ROUGE/semantic similarity?Cost control: 10 replicates × 12 cases × 5 components = 600 calls. Acceptable for a nightly run, not per-PR. Needs a schedule/opt-in policy.
Where do raw per-replicate outputs live? In-repo under a committed-but-gitignored audit folder? External blob store? Not committed at all?
Initial model set: which models populate the first
.evaluator/models.yaml? Candidates: Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5, Gemini 2.5 Pro. Pick 2–3 to start or run the full matrix?Provider abstraction shape: one unified client interface (e.g.,
litellm) or hand-rolled thin wrappers per provider?Determinism settings per provider: each provider exposes
temperature,seed,top_petc. differently. Need a normalized "deterministic" profile per provider for reproducibility numbers to be meaningful.Tagging for discussion; will iterate on metric selection and model set before implementation starts.