Problem
There was no repeatable way to measure whether an L2 prompt change helped or hurt. Quality regressions were only noticed by eye during PR review, and per-stage cost/latency was invisible.
Proposed solution
Ship a first-class evaluation harness:
- sema eval CLI with subcommands to dump assertions per run, diff two runs, and emit per-stage telemetry (calls, tokens, latency, recovery events).
- eval/dev_slice.yaml — the 12-table cBioPortal dev slice used for iteration.
- eval/holdout.yaml — a separate holdout slice to detect dev-slice overfitting.
- Telemetry captured per-stage (A/B/C) so regressions can be localized to a specific prompt.
Tracked in OpenSpec change source-semantic-hardening, tasks.md §5–6.
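
To illustrate how per-stage telemetry can localize a regression to one prompt, here is a minimal sketch of diffing two runs. It is not sema's actual implementation or schema: `StageTelemetry`, `diff_runs`, the field names, and all sample numbers are hypothetical, chosen only to mirror the telemetry fields listed above (calls, tokens, latency, recovery events) and the A/B/C stage labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTelemetry:
    """Hypothetical per-stage record; field names are illustrative only."""
    calls: int
    tokens: int
    latency_s: float
    recovery_events: int


def diff_runs(base, cand):
    """Return per-stage deltas (candidate minus baseline) for shared stages."""
    deltas = {}
    for stage in sorted(base.keys() & cand.keys()):
        b, c = base[stage], cand[stage]
        deltas[stage] = {
            "calls": c.calls - b.calls,
            "tokens": c.tokens - b.tokens,
            "latency_s": round(c.latency_s - b.latency_s, 3),
            "recovery_events": c.recovery_events - b.recovery_events,
        }
    return deltas


# Invented sample values: only stage B differs between the two runs.
baseline = {
    "A": StageTelemetry(12, 18_000, 21.5, 0),
    "B": StageTelemetry(12, 30_000, 40.0, 1),
    "C": StageTelemetry(12, 9_000, 11.0, 0),
}
candidate = {
    "A": StageTelemetry(12, 18_000, 21.5, 0),
    "B": StageTelemetry(14, 36_500, 47.25, 3),
    "C": StageTelemetry(12, 9_000, 11.0, 0),
}

deltas = diff_runs(baseline, candidate)

# A stage counts as regressed here if any metric increased.
regressed = [s for s, d in deltas.items() if any(v > 0 for v in d.values())]
print(regressed)
```

Treating any increase as a regression is a deliberate simplification; a real harness would likely apply per-metric thresholds and weigh cost changes against assertion quality.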
Alternatives considered
- Ad-hoc pytest snapshots — rejected; doesn't surface token/latency telemetry or support diffing arbitrary runs.
- External eval framework — deferred; internal harness is lighter and reuses sema's own assertion model.
Closed by #63.