Surface eval data on the Pages site (case-by-model matrix, run reports, cross-component leaderboard) #23

@ProfessorPolymorphic

Background

The Pages site currently renders each component's README, prompt, schema, and changelog from components/<slug>/, plus a list of eval cases under evals/cases/. It does not yet surface actual evaluation results — the data under components/<slug>/evals/reports/<date>-<model>-<runid>/ (the gpt-oss-120b-dots-r5 run on nsf-award-notice-extraction-udm is the first example) is invisible to visitors.

Once we have a few real runs, the site should make it easy to answer:

  • Which cases pass / fail for this component, per model?
  • How does model X do across the whole library?
  • Are a component's eval results still trustworthy, or has the component moved ahead of its last validated version?

Proposed layers

Three additions, rolled in together once the data is stable:

1. Case-by-model matrix on each component detail page

Under the existing Evals section, render a matrix:

|        | model-A | model-B | model-C |
|--------|---------|---------|---------|
| case-1 |         |         |         |
| case-2 |         |         |         |

  • Cells link to the full run report for that (case, model) pair.
  • A freshness chip derived from validated_against_version vs. the component's current version flags stale evals ("validated at 1.0.0, component now 1.1.0 — re-run recommended").
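A minimal sketch of how the generator might build the matrix and the freshness chip. The record keys (`case`, `model`, `passed`, `report`) and both function names are assumptions for illustration, and versions are assumed to be plain dotted integers such as 1.0.0.

```python
from collections import defaultdict


def build_matrix(results):
    """Pivot flat per-run records into {case: {model: cell}}.

    Each record is assumed to look like
    {"case": ..., "model": ..., "passed": bool, "report": path}.
    """
    matrix = defaultdict(dict)
    for r in results:
        matrix[r["case"]][r["model"]] = {
            "passed": r["passed"],
            "report": r["report"],  # link target for the cell
        }
    return dict(matrix)


def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in version.split("."))


def freshness_chip(validated, current):
    """Return chip text; flags the eval as stale when the component moved ahead."""
    if parse_version(validated) < parse_version(current):
        return f"validated at {validated}, component now {current}: re-run recommended"
    return f"validated at {validated} (current)"
```

Comparing parsed tuples rather than raw strings matters once a component reaches a version like 1.10.0, which sorts before 1.9.0 as text.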

2. Run reports embedded inline

components/<slug>/evals/reports/<date>-<model>-<runid>/ already has REPORT.md, summary.json, and charts/*.png. The generator should:

  • Render REPORT.md as a collapsible section per run.
  • Embed charts/*.png as a gallery.
  • Render summary.json as a metrics table (completion tokens, field-agreement scores, coverage, consistency — whatever the schema settles on).
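As a rough illustration of the metrics-table step, the sketch below renders the scalar top-level keys of a parsed summary.json as a Markdown table. Since the schema is not settled yet, it makes no assumption about key names; `metrics_table` is a hypothetical helper, and nested values are simply skipped.

```python
def metrics_table(summary):
    """Render the scalar top-level entries of a parsed summary.json as Markdown."""
    rows = ["| Metric | Value |", "| --- | --- |"]
    for key in sorted(summary):
        value = summary[key]
        if isinstance(value, (dict, list)):
            continue  # nested data (e.g. per-case results) needs its own view
        if isinstance(value, float):
            value = f"{value:.3f}"  # keep score columns a uniform width
        rows.append(f"| {key} | {value} |")
    return "\n".join(rows)
```

Once the schema is fixed, the generic loop could be replaced by an explicit, ordered list of metric names.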

3. Cross-component "By model" page under Browse/

A leaderboard-style page — docs/browse/by-model.md — showing each model's performance across every component it's been evaluated on. Same filter JS treatment as the home page.
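One possible shape for the aggregation behind this page, sketched under assumptions: each run is reduced to a record with `model`, `component`, `passed`, and `total` (hypothetical keys), and models are ranked by overall case pass rate. The issue does not define the ranking metric, so pass rate is only a placeholder.

```python
def by_model(runs):
    """Aggregate per-run pass counts into leaderboard rows, best model first.

    Returns a list of (model, overall_pass_rate, {component: pass_rate}).
    """
    board = {}
    for run in runs:
        if run["total"] == 0:
            continue  # a run with no cases says nothing about the model
        entry = board.setdefault(
            run["model"], {"components": {}, "passed": 0, "total": 0}
        )
        entry["components"][run["component"]] = run["passed"] / run["total"]
        entry["passed"] += run["passed"]
        entry["total"] += run["total"]

    rows = [
        (model, e["passed"] / e["total"], e["components"])
        for model, e in board.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

If a model has several runs on the same component, this sketch keeps the pass rate of the last run seen; choosing the latest or best run deliberately is a decision for the real generator.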

Prerequisite (do not skip)

The dashboard is only as good as summary.json's schema stability. We currently have one run's worth of data. If the shape drifts between runs, we'll end up rebuilding the dashboard each time it changes.

Suggested sequence:

  1. Wait for 2–3 real runs across different models on different components.
  2. Write a canonical summary.json JSON Schema and commit it to the library (e.g., .evals/summary.schema.json or similar). Include model identity, run identity, per-case pass/fail + features-covered, aggregate metrics (coverage, field-agreement, completion tokens, consistency), and timestamp.
  3. Backfill existing runs to conform.
  4. Then build the dashboard against that schema.
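To make step 2 concrete, here is one possible starting point for the schema, written as a Python dict so it can be dumped to a .json file later. Every property name below is a guess based on the fields listed above, not the actual shape of the existing run's summary.json. The `missing_required` helper is a hypothetical stdlib-only check for the backfill step, not a full JSON Schema validator.

```python
# Draft schema; all property names are assumptions pending real runs.
SUMMARY_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["model", "run_id", "timestamp", "cases", "metrics"],
    "properties": {
        "model": {"type": "string"},
        "run_id": {"type": "string"},
        "timestamp": {"type": "string", "format": "date-time"},
        "validated_against_version": {"type": "string"},
        "cases": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id", "passed"],
                "properties": {
                    "id": {"type": "string"},
                    "passed": {"type": "boolean"},
                    "features_covered": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                },
            },
        },
        "metrics": {
            "type": "object",
            "properties": {
                "coverage": {"type": "number"},
                "field_agreement": {"type": "number"},
                "completion_tokens": {"type": "integer"},
                "consistency": {"type": "number"},
            },
        },
    },
}


def missing_required(summary):
    """List required keys absent from a summary (top level and per case)."""
    missing = [k for k in SUMMARY_SCHEMA["required"] if k not in summary]
    case_required = SUMMARY_SCHEMA["properties"]["cases"]["items"]["required"]
    for i, case in enumerate(summary.get("cases", [])):
        missing += [f"cases[{i}].{k}" for k in case_required if k not in case]
    return missing
```

Running `missing_required` over each existing summary.json gives a quick list of what the backfill in step 3 has to add; a proper validator can replace it once the schema is committed.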

The run-report directory-naming convention (<date>-<model>-<runid>) is already stable; the remaining work is nailing down the machine-readable contract.
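Because model names can themselves contain hyphens, splitting a directory name on "-" is ambiguous. The sketch below shows one way a generator could parse the convention, assuming the date is ISO formatted (YYYY-MM-DD) and the run id contains no hyphen. Both are assumptions, not something the issue states, and the directory name in the usage comment is made up.

```python
import re

# Assumed layout: <YYYY-MM-DD>-<model, may contain hyphens>-<runid, no hyphens>
RUN_DIR = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<model>.+)-(?P<runid>[^-]+)$"
)


def parse_run_dir(name):
    """Split a report directory name into its parts, or return None if it does not match."""
    match = RUN_DIR.match(name)
    return match.groupdict() if match else None


# Hypothetical directory name, for illustration only:
# parse_run_dir("2025-01-15-model-a-run1")
```

If either assumption turns out to be wrong, reading model and run identity from summary.json instead of the directory name avoids the ambiguity entirely.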

Open questions

  • Should run-report artifacts (PNG charts, REPORT.md) live in the repo, or be pulled from an artifact store / GitHub release? Committing is simplest for a read-only dashboard but adds repo weight per run.
  • Should the validated_against_version gate block merges if a component's version moves ahead of its eval coverage, or only warn?
  • Charts: stay with pre-rendered PNGs from the eval tool, or have the Pages generator produce interactive charts client-side from summary.json?

Out of scope

  • Running the evals themselves — this issue is strictly about surfacing results that already exist.
  • Redesigning the eval harness.
  • Cross-repo integration with the canonical UDM (track separately if it comes up).
