feat: evaluation harness — assertion dump, diff, telemetry, dev/holdout slices

**Problem**

There was no repeatable way to measure whether an L2 prompt change helped or hurt. Quality regressions were only noticed by eye during PR review, and per-stage cost/latency was invisible.

**Proposed solution**

Ship a first-class evaluation harness:

- `sema eval` CLI with subcommands to dump assertions per run, diff two runs, and emit per-stage telemetry (calls, tokens, latency, recovery events).
- `eval/dev_slice.yaml` — the 12-table cBioPortal dev slice used for iteration.
- `eval/holdout.yaml` — a separate holdout slice to detect dev-slice overfitting.
- Telemetry captured per-stage (A/B/C) so regressions can be localized to a specific prompt.
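
As a rough illustration of the two core operations above (diffing two runs and rolling telemetry up per stage), here is a minimal Python sketch. The `Assertion` and `StageEvent` shapes and the `diff_runs` and `summarize_by_stage` helpers are hypothetical names chosen for this example; they are not taken from sema's actual assertion model or CLI, which may look quite different.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    # Hypothetical shape; sema's real assertion model may differ.
    subject: str
    predicate: str
    value: str


@dataclass(frozen=True)
class StageEvent:
    # Hypothetical telemetry record for a single model call.
    stage: str          # "A", "B", or "C"
    tokens: int
    latency_ms: float
    recovered: bool     # True if the call went through a recovery path


def diff_runs(base: set, head: set) -> dict:
    """Return the assertions added and removed going from base to head."""
    def key(a: Assertion):
        return (a.subject, a.predicate, a.value)

    return {
        "added": sorted(head - base, key=key),
        "removed": sorted(base - head, key=key),
    }


def summarize_by_stage(events: list) -> dict:
    """Aggregate calls, tokens, latency, and recovery events per stage."""
    summary = defaultdict(
        lambda: {"calls": 0, "tokens": 0, "latency_ms": 0.0, "recoveries": 0}
    )
    for event in events:
        stats = summary[event.stage]
        stats["calls"] += 1
        stats["tokens"] += event.tokens
        stats["latency_ms"] += event.latency_ms
        stats["recoveries"] += int(event.recovered)
    return dict(summary)
```

Keying the summary by stage is what would let a regression be pinned to one prompt: if only stage B's token count or recovery count moves between two runs, the stage B prompt is the first place to look.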

Tracked in OpenSpec change `source-semantic-hardening`, tasks.md §5–6.

**Alternatives considered**

- Ad-hoc pytest snapshots: rejected; they don't surface token/latency telemetry or support diffing arbitrary runs.
- External eval framework: deferred; an internal harness is lighter and reuses sema's own assertion model.

Closed by #63.

feat: evaluation harness — assertion dump, diff, telemetry, dev/holdout slices #67
