Surface eval data on the Pages site (case-by-model matrix, run reports, cross-component leaderboard) #23

## Background

The Pages site currently renders each component's README, prompt, schema, and changelog from `components/<slug>/`, plus a list of eval cases under `evals/cases/`. It does **not** yet surface actual evaluation results — the data under `components/<slug>/evals/reports/<date>-<model>-<runid>/` (the gpt-oss-120b-dots-r5 run on `nsf-award-notice-extraction-udm` is the first example) is invisible to visitors.

Once we have a few real runs, the site should make it easy to answer:

- Which cases pass / fail for this component, per model?
- How does model X do across the whole library?
- Are a component's eval results still trustworthy, or has the component moved ahead of its last validated version?

## Proposed layers

Three additions, rolled in together once the data is stable:

### 1. Case-by-model matrix on each component detail page

Under the existing **Evals** section, render a matrix:

|              | model-A | model-B | model-C |
| ------------ | ------- | ------- | ------- |
| case-1       | ✓       | ✗       | ✓       |
| case-2       | ✓       | ✓       | —       |

- Cells link to the full run report for that (case, model) pair.
- A freshness chip derived from `validated_against_version` vs. the component's current version flags stale evals ("validated at 1.0.0, component now 1.1.0 — re-run recommended").
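
A minimal Python sketch of the matrix and the freshness chip, assuming per-case results arrive as a mapping keyed by `(case, model)` and that versions are plain dotted integers such as `1.1.0`. The names `render_matrix`, `parse_version`, and `freshness_chip` are hypothetical and not part of the current generator; cell links are left out, and the dash is assumed to mean "no result for this pair".

```python
from typing import Optional


def render_matrix(results: dict[tuple[str, str], bool]) -> str:
    """Render {(case, model): passed} as a Markdown table, one row per case."""
    cases = sorted({case for case, _ in results})
    models = sorted({model for _, model in results})
    header = "| | " + " | ".join(models) + " |"
    divider = "| --- | " + " | ".join("---" for _ in models) + " |"
    rows = []
    for case in cases:
        cells = []
        for model in models:
            passed = results.get((case, model))
            if passed is None:
                cells.append("\u2014")  # assumption: dash marks a pair with no result
            else:
                cells.append("✓" if passed else "✗")
        rows.append(f"| {case} | " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version such as "1.1.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def freshness_chip(validated_against_version: str, current_version: str) -> Optional[str]:
    """Return chip text when the evals lag the component, else None."""
    if parse_version(validated_against_version) >= parse_version(current_version):
        return None
    return (
        f"validated at {validated_against_version}, "
        f"component now {current_version}: re-run recommended"
    )
```

Comparing integer tuples rather than raw strings keeps `1.10.0` ordered after `1.9.0`. Versions with pre-release suffixes would need a real version parser; this sketch raises `ValueError` on them.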

### 2. Run reports embedded inline

`components/<slug>/evals/reports/<date>-<model>-<runid>/` already has `REPORT.md`, `summary.json`, and `charts/*.png`. The generator should:

- Render `REPORT.md` as a collapsible section per run.
- Embed `charts/*.png` as a gallery.
- Render `summary.json` as a metrics table (completion tokens, field-agreement scores, coverage, consistency — whatever the schema settles on).
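
As a rough illustration of the last bullet, the sketch below turns the top-level scalar entries of a `summary.json` document into a Markdown table. Because the schema is not settled, it makes no assumption about key names and simply skips nested values; `metrics_table` is a hypothetical helper name.

```python
import json


def metrics_table(summary_json: str) -> str:
    """Render top-level scalar entries of a summary.json document as a Markdown table."""
    summary = json.loads(summary_json)
    lines = ["| metric | value |", "| --- | --- |"]
    for key, value in summary.items():
        if isinstance(value, (dict, list)):
            # Nested structures (for example per-case results) need their own view.
            continue
        lines.append(f"| {key} | {value} |")
    return "\n".join(lines)
```

Once the schema is fixed, the generic loop could be replaced by an explicit, ordered list of metrics with units and formatting.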

### 3. Cross-component "By model" page under `Browse/`

A leaderboard-style page — `docs/browse/by-model.md` — showing each model's performance across every component it's been evaluated on. Same filter JS treatment as the home page.
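
One possible aggregation behind such a page, sketched in Python. It assumes "performance" means the overall case pass rate, which this issue does not specify; `RunResult` and `leaderboard` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    component: str
    model: str
    cases_passed: int
    cases_total: int


def leaderboard(runs: list[RunResult]) -> list[tuple[str, int, float]]:
    """Return (model, components evaluated, overall pass rate), best first."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    components: dict[str, set[str]] = defaultdict(set)
    for run in runs:
        passed[run.model] += run.cases_passed
        total[run.model] += run.cases_total
        components[run.model].add(run.component)
    rows = [
        (model, len(components[model]), passed[model] / total[model])
        for model in total
        if total[model] > 0  # skip models whose runs contain no cases
    ]
    # Highest pass rate first; ties broken alphabetically by model name.
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

This version sums every run it is given, so repeated runs of the same model on the same component all count. A real page would probably keep only the latest run per (component, model) pair.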

## Prerequisite (do not skip)

**The dashboard is only as good as `summary.json`'s schema stability.** We currently have one run's worth of data. If the shape drifts between runs, we'll end up rebuilding the dashboard.

Suggested sequence:

1. Wait for 2–3 real runs across different models on different components.
2. Write a canonical `summary.json` JSON Schema and commit it to the library (e.g., `.evals/summary.schema.json`). Include model identity, run identity, per-case pass/fail and features covered, aggregate metrics (coverage, field-agreement, completion tokens, consistency), and timestamp.
3. Backfill existing runs to conform.
4. _Then_ build the dashboard against that schema.
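
To make step 2 concrete, here is one shape the schema might take, written as a Python dict alongside a small standard-library check for missing required keys. Every field name (`run_id`, `features_covered`, `field_agreement`, and so on) is an assumption derived from the list in step 2, not an agreed contract. `missing_required` is a hypothetical helper that checks presence only, not types.

```python
SUMMARY_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["model", "run_id", "timestamp", "cases", "metrics"],
    "properties": {
        "model": {"type": "string"},
        "run_id": {"type": "string"},
        "timestamp": {"type": "string", "format": "date-time"},
        "cases": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id", "passed", "features_covered"],
                "properties": {
                    "id": {"type": "string"},
                    "passed": {"type": "boolean"},
                    "features_covered": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
        "metrics": {
            "type": "object",
            "required": ["coverage", "field_agreement", "completion_tokens", "consistency"],
        },
    },
}


def missing_required(summary: dict, schema: dict = SUMMARY_SCHEMA) -> list[str]:
    """List paths of required keys absent from `summary` (presence only)."""
    missing = [key for key in schema.get("required", []) if key not in summary]
    for key, sub in schema.get("properties", {}).items():
        value = summary.get(key)
        if isinstance(value, dict):
            missing += [f"{key}.{m}" for m in missing_required(value, sub)]
        elif isinstance(value, list) and "items" in sub:
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    missing += [f"{key}[{i}].{m}" for m in missing_required(item, sub["items"])]
    return missing
```

A check like this could run during the backfill in step 3 to list which existing runs fall short. For full validation, including types and formats, a dedicated JSON Schema validator would be the better tool.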

The run-report directory-naming convention (`<date>-<model>-<runid>`) is already stable; the remaining work is nailing down the machine-readable contract.
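
For the generator to rely on that convention, it needs to split a directory name back into its parts. The sketch below makes two assumptions this issue does not state: the date is ISO `YYYY-MM-DD`, and the run id contains no hyphens, so everything between the two is the model name. `parse_run_dir` and the sample name in the usage note are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <YYYY-MM-DD>-<model, may contain hyphens>-<runid, no hyphens>
RUN_DIR = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<model>.+)-(?P<runid>[^-]+)$")


class RunDir(NamedTuple):
    date: str
    model: str
    runid: str


def parse_run_dir(name: str) -> Optional[RunDir]:
    """Split a report directory name into its parts, or return None if it does not match."""
    match = RUN_DIR.match(name)
    if match is None:
        return None
    return RunDir(match["date"], match["model"], match["runid"])
```

For example, `parse_run_dir("2025-01-15-model-a-run1")` yields `RunDir(date="2025-01-15", model="model-a", runid="run1")`. If run ids can contain hyphens, the name alone is ambiguous, and the model and run identity should be read from `summary.json` instead.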

## Open questions

- Should run-report artifacts (PNG charts, REPORT.md) live in the repo, or be pulled from an artifact store / GitHub release? Committing is simplest for a read-only dashboard but adds repo weight per run.
- Should the `validated_against_version` gate block merges if a component's version moves ahead of its eval coverage, or only warn?
- Charts: stay with pre-rendered PNGs from the eval tool, or have the Pages generator produce interactive charts client-side from `summary.json`?

## Out of scope

- Running the evals themselves — this issue is strictly about surfacing results that already exist.
- Redesigning the eval harness.
- Cross-repo integration with the canonical UDM (track separately if it comes up).
