Background
The Pages site currently renders each component's README, prompt, schema, and changelog from components/<slug>/, plus a list of eval cases under evals/cases/. It does not yet surface actual evaluation results — the data under components/<slug>/evals/reports/<date>-<model>-<runid>/ (the gpt-oss-120b-dots-r5 run on nsf-award-notice-extraction-udm is the first example) is invisible to visitors.
Once we have a few real runs, the site should make it easy to answer:
- Which cases pass / fail for this component, per model?
- How does model X do across the whole library?
- Are a component's eval results still trustworthy, or has the component moved ahead of its last validated version?
Proposed layers
Three additions, rolled in together once the data is stable:
1. Case-by-model matrix on each component detail page
Under the existing Evals section, render a matrix:
| | model-A | model-B | model-C |
| --- | --- | --- | --- |
| case-1 | ✓ | ✗ | ✓ |
| case-2 | ✓ | ✓ | — |
- Cells link to the full run report for that (case, model) pair.
- A freshness chip, derived from validated_against_version vs. the component's current version, flags stale evals ("validated at 1.0.0, component now 1.1.0; re-run recommended").
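A minimal sketch of how the matrix and the freshness chip could be derived. The per-result record shape (case, model, passed, report_url) and the function names are assumptions for illustration, not part of the existing generator. Versions are assumed to be plain dotted integers such as 1.0.0.

```python
from typing import Iterable, Optional


def build_matrix(results: Iterable[dict], missing: str = "-") -> str:
    """Render a Markdown case-by-model table.

    Each result is a hypothetical record with keys
    "case", "model", "passed" (bool) and "report_url".
    Pairs with no run are rendered as `missing`.
    """
    results = list(results)
    models = sorted({r["model"] for r in results})
    cases = sorted({r["case"] for r in results})
    cells = {(r["case"], r["model"]): r for r in results}

    lines = [
        "| case | " + " | ".join(models) + " |",
        "| --- |" + " --- |" * len(models),
    ]
    for case in cases:
        row = []
        for model in models:
            r = cells.get((case, model))
            if r is None:
                row.append(missing)
            else:
                mark = "✓" if r["passed"] else "✗"
                # Each cell links to the full run report for that pair.
                row.append(f"[{mark}]({r['report_url']})")
        lines.append(f"| {case} | " + " | ".join(row) + " |")
    return "\n".join(lines)


def parse_version(version: str) -> tuple:
    """Assumes dotted integers only (no pre-release suffixes)."""
    return tuple(int(part) for part in version.split("."))


def freshness_chip(validated: str, current: str) -> Optional[str]:
    """Return the chip text when evals are stale, else None."""
    if parse_version(validated) < parse_version(current):
        return f"validated at {validated}, component now {current}; re-run recommended"
    return None
```

Comparing parsed tuples rather than raw strings keeps 1.10.0 ordered after 1.9.0.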
2. Run reports embedded inline
components/<slug>/evals/reports/<date>-<model>-<runid>/ already has REPORT.md, summary.json, and charts/*.png. The generator should:
- Render REPORT.md as a collapsible section per run.
- Embed charts/*.png as a gallery.
- Render summary.json as a metrics table (completion tokens, field-agreement scores, coverage, consistency, or whatever the schema settles on).
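The three steps above could be sketched roughly as follows. This assumes the generator emits Markdown with inline HTML, that summary.json exposes its metrics as scalar top-level keys, and that chart links are resolved relative to the run directory. None of that is settled yet, and render_run and metrics_table are hypothetical names.

```python
import json
from html import escape
from pathlib import Path


def metrics_table(summary: dict) -> str:
    """Markdown table of scalar top-level values; nested values are skipped."""
    rows = ["| metric | value |", "| --- | --- |"]
    for key, value in summary.items():
        if isinstance(value, (int, float, str)) and not isinstance(value, bool):
            rows.append(f"| {key} | {value} |")
    return "\n".join(rows)


def render_run(run_dir: Path) -> str:
    """Return one collapsible block for a single run directory."""
    report = (run_dir / "REPORT.md").read_text(encoding="utf-8")
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
    charts = sorted((run_dir / "charts").glob("*.png"))
    # Assumption: image paths are relative to the run directory.
    gallery = "\n".join(f"![{c.stem}](charts/{c.name})" for c in charts)

    parts = [
        f"<details>\n<summary>{escape(run_dir.name)}</summary>",
        report.strip(),
        gallery,
        metrics_table(summary),
        "</details>",
    ]
    # Drop empty parts (for example, a run with no charts).
    return "\n\n".join(p for p in parts if p)
```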
3. Cross-component "By model" page under Browse/
A leaderboard-style page — docs/browse/by-model.md — showing each model's performance across every component it's been evaluated on. Same filter JS treatment as the home page.
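As a rough illustration of the aggregation behind such a page, the sketch below rolls per-run pass counts up by model. The input record keys (model, component, passed, total) are assumptions, and it presumes the caller has already picked one run per model and component; otherwise repeated runs would be double-counted.

```python
from collections import defaultdict


def by_model(runs):
    """Aggregate runs into (model, components, passed, total, pass_rate) rows.

    `runs` is an iterable of hypothetical dicts with keys
    "model", "component", "passed" (int) and "total" (int).
    """
    totals = defaultdict(lambda: {"components": set(), "passed": 0, "total": 0})
    for run in runs:
        entry = totals[run["model"]]
        entry["components"].add(run["component"])
        entry["passed"] += run["passed"]
        entry["total"] += run["total"]

    rows = []
    for model, entry in totals.items():
        rate = entry["passed"] / entry["total"] if entry["total"] else 0.0
        rows.append((model, len(entry["components"]), entry["passed"], entry["total"], rate))
    # Best pass rate first; ties broken by model name for stable output.
    rows.sort(key=lambda r: (-r[4], r[0]))
    return rows


def render_leaderboard(rows) -> str:
    """Render the aggregated rows as a Markdown table."""
    lines = [
        "| model | components | passed | total | pass rate |",
        "| --- | --- | --- | --- | --- |",
    ]
    for model, n_components, passed, total, rate in rows:
        lines.append(f"| {model} | {n_components} | {passed} | {total} | {rate:.0%} |")
    return "\n".join(lines)
```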
Prerequisite (do not skip)
The dashboard is only as good as summary.json's schema stability. We currently have one run's worth of data. If the shape drifts between runs, we will have to rebuild the dashboard.
Suggested sequence:
- Wait for 2–3 real runs across different models on different components.
- Write a canonical summary.json JSON Schema and commit it to the library (e.g., .evals/summary.schema.json or similar). Include model identity, run identity, per-case pass/fail + features-covered, aggregate metrics (coverage, field-agreement, completion tokens, consistency), and timestamp.
- Backfill existing runs to conform.
- Then build the dashboard against that schema.
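To make the second step concrete, here is one possible starting draft of the schema, written as a Python dict so it can be dumped to JSON. Every field name is a placeholder chosen to mirror the list above. The real names should come from the 2–3 real runs, not from this sketch. The helper only checks top-level required keys and is not a full JSON Schema validator.

```python
import json

# Hypothetical draft; field names are placeholders, not the agreed contract.
SUMMARY_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Eval run summary (draft)",
    "type": "object",
    "required": ["model", "run_id", "timestamp", "cases", "metrics"],
    "properties": {
        "model": {"type": "string"},
        "run_id": {"type": "string"},
        "timestamp": {"type": "string", "format": "date-time"},
        "validated_against_version": {"type": "string"},
        "cases": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id", "passed"],
                "properties": {
                    "id": {"type": "string"},
                    "passed": {"type": "boolean"},
                    "features_covered": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                },
            },
        },
        "metrics": {
            "type": "object",
            "properties": {
                "coverage": {"type": "number"},
                "field_agreement": {"type": "number"},
                "completion_tokens": {"type": "integer"},
                "consistency": {"type": "number"},
            },
        },
    },
}


def missing_required(summary: dict, schema: dict = SUMMARY_SCHEMA) -> list:
    """Top-level required keys absent from a summary (a backfill aid only)."""
    return [key for key in schema["required"] if key not in summary]


def schema_as_json() -> str:
    """Serialized form, suitable for committing as a .json file."""
    return json.dumps(SUMMARY_SCHEMA, indent=2)
```

Running missing_required over the existing run's summary.json would give a quick list of what the backfill step needs to add.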
The run-report directory-naming convention (<date>-<model>-<runid>) is already stable; the remaining work is nailing down the machine-readable contract.
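Because the directory layout is stable, the generator can already discover runs by path alone. A small sketch, assuming only the components/<slug>/evals/reports/<run>/ layout described in this issue; find_runs is a hypothetical helper. The run directory name is kept whole and not split into date, model, and run id, since model names can themselves contain hyphens and the delimiting rule is not specified here.

```python
from pathlib import Path


def find_runs(repo_root: Path):
    """Yield (component_slug, run_dir) for every run that has a summary.json."""
    pattern = "components/*/evals/reports/*/summary.json"
    for summary in sorted(repo_root.glob(pattern)):
        run_dir = summary.parent
        # run_dir is components/<slug>/evals/reports/<run>, so the slug
        # is three levels above the run directory.
        slug = run_dir.parents[2].name
        yield slug, run_dir
```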
Open questions
- Should run-report artifacts (PNG charts, REPORT.md) live in the repo, or be pulled from an artifact store / GitHub release? Committing is simplest for a read-only dashboard but adds repo weight per run.
- Should the validated_against_version gate block merges if a component's version moves ahead of its eval coverage, or only warn?
- Charts: stay with pre-rendered PNGs from the eval tool, or have the Pages generator produce interactive charts client-side from summary.json?
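On the gate question, the two options need not be separate implementations. The sketch below is one hypothetical shape for a CI check where a single strict flag switches between warning and blocking. The input tuples and the function name are assumptions, and versions are again assumed to be dotted integers.

```python
import sys


def _version_key(version: str) -> tuple:
    """Assumes dotted integers only, e.g. 1.1.0."""
    return tuple(int(part) for part in version.split("."))


def check_eval_freshness(components, strict: bool = False) -> int:
    """Return a process exit code: 1 only when strict and something is stale.

    `components` is an iterable of hypothetical
    (slug, current_version, validated_against_version_or_None) tuples.
    """
    stale = []
    for slug, current, validated in components:
        never_validated = validated is None
        if never_validated or _version_key(validated) < _version_key(current):
            shown = "never" if never_validated else f"at {validated}"
            stale.append(f"{slug}: validated {shown}, component now {current}")

    label = "ERROR" if strict else "WARNING"
    for line in stale:
        print(f"{label} {line}", file=sys.stderr)
    return 1 if (strict and stale) else 0
```

Starting with strict=False and flipping it later would let the warning run for a while before it begins blocking merges.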
Out of scope
- Running the evals themselves — this issue is strictly about surfacing results that already exist.
- Redesigning the eval harness.
- Cross-repo integration with the canonical UDM (track separately if it comes up).