feat: evaluation harness — assertion dump, diff, telemetry, dev/holdout slices #67

@deanban

Problem

There was no repeatable way to measure whether an L2 prompt change helped or hurt. Quality regressions were caught only by eye during PR review, and per-stage cost and latency were invisible.

Proposed solution

Ship a first-class evaluation harness:

  • sema eval CLI with subcommands to dump assertions per run, diff two runs, and emit per-stage telemetry (calls, tokens, latency, recovery events).
  • eval/dev_slice.yaml — the 12-table cBioPortal dev slice used for iteration.
  • eval/holdout.yaml — a separate holdout slice to detect dev-slice overfitting.
  • Telemetry captured per-stage (A/B/C) so regressions can be localized to a specific prompt.
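To make the diff and telemetry ideas above concrete, here is a minimal sketch of comparing two runs. It is illustrative only and is not the harness's actual code. The names `StageTelemetry`, `diff_assertions`, and `telemetry_delta` are hypothetical, as is the assumption that a run's assertions can be represented as a set of strings. Only the telemetry fields (calls, tokens, latency, recovery events) and the A/B/C stage labels come from this issue.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StageTelemetry:
    """Hypothetical per-stage record; field names mirror the issue text."""
    calls: int
    tokens: int
    latency_ms: float
    recovery_events: int


def diff_assertions(base: set[str], cand: set[str]) -> dict[str, list[str]]:
    """Return assertions gained and lost between a baseline and a candidate run."""
    return {
        "added": sorted(cand - base),
        "removed": sorted(base - cand),
    }


def telemetry_delta(
    base: dict[str, StageTelemetry],
    cand: dict[str, StageTelemetry],
) -> dict[str, dict[str, float]]:
    """Per-stage difference (candidate minus baseline) for stages present in both runs."""
    delta: dict[str, dict[str, float]] = {}
    for stage in sorted(base.keys() & cand.keys()):
        b, c = base[stage], cand[stage]
        delta[stage] = {
            "calls": c.calls - b.calls,
            "tokens": c.tokens - b.tokens,
            "latency_ms": c.latency_ms - b.latency_ms,
            "recovery_events": c.recovery_events - b.recovery_events,
        }
    return delta


if __name__ == "__main__":
    # Made-up example data, not taken from a real run.
    base_run = {"patient.id is a key", "sample.patient_id references patient.id"}
    cand_run = {"patient.id is a key", "sample.type is categorical"}
    print(diff_assertions(base_run, cand_run))

    base_tel = {"A": StageTelemetry(2, 100, 50.0, 0)}
    cand_tel = {"A": StageTelemetry(3, 150, 40.0, 1)}
    print(telemetry_delta(base_tel, cand_tel))
```

Keying the delta by stage is what would let a regression be traced to one prompt: a token increase that shows up only under stage B, for example, points at the stage B prompt rather than the pipeline as a whole.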

Tracked in OpenSpec change source-semantic-hardening, tasks.md §5–6.

Alternatives considered

  • Ad-hoc pytest snapshots — rejected; doesn't surface token/latency telemetry or support diffing arbitrary runs.
  • External eval framework — deferred; internal harness is lighter and reuses sema's own assertion model.

Closed by #63.


Labels: enhancement
