eval: record real-provider benchmark numbers (M9 follow-up)

## Context

M9 (PR #12) shipped the evaluation harness, pytest coverage asserted against
fixtures, and a methodology-only `eval/RESULTS.md` marked PENDING. **No
numerical metric is committed to the tree**, per CLAUDE.md Golden Rule #5:
quotable numbers come only from a real-provider run.

This issue tracks running the harness against real providers and committing the
real `eval/RESULTS.md` plus a `PROGRESS.md` "Decision log" entry recording the
numbers, the run date, and the exact provider/model strings.

## What "real-provider" means here (locked in by M9)

- **LLM:** `claude-sonnet-4-6` (Anthropic). Verified against
  https://docs.anthropic.com/en/docs/about-claude/models on 2026-05-29 — the
  4.6-generation IDs use a dateless format that is itself a pinned snapshot,
  not an evergreen pointer.
- **Embeddings:** `text-embedding-3-small` (OpenAI), 1536 dimensions, matching
  `backend.app.models.SCHEMA_EMBEDDING_DIM`.
- **Temperature:** `0.0` per CLAUDE.md house style.
- **k:** `5` (default `Settings.retrieval_top_k`).
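
The locked-in values above can be checked mechanically before a run. The following is a minimal sketch, not part of the harness: the `EXPECTED` keys and the `config_mismatches` helper are hypothetical names chosen for illustration, and the real settings object in `backend/app/config.py` may expose these values under different attributes.

```python
# Hypothetical pre-run check. Key names mirror this issue, not the real
# Settings class, so adapt them to whatever the harness actually exposes.
EXPECTED = {
    "llm_provider": "anthropic",
    "llm_model": "claude-sonnet-4-6",
    "embeddings_provider": "openai",
    "embedding_model": "text-embedding-3-small",
    "embedding_dim": 1536,
    "temperature": 0.0,
    "k": 5,
}


def config_mismatches(actual: dict) -> dict:
    """Return {key: (expected, actual)} for every locked-in value that differs."""
    return {
        key: (want, actual.get(key))
        for key, want in EXPECTED.items()
        if actual.get(key) != want
    }
```

An empty result means the run configuration matches what M9 locked in; anything else is worth fixing before tokens are spent.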

## Recipe

Branch off `main` (e.g. `eval/real-numbers-YYYY-MM-DD`):

```bash
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export LLM_PROVIDER=anthropic
export EMBEDDINGS_PROVIDER=openai
# (defaults already set CLAUDE_MODEL=claude-sonnet-4-6 and
#  OPENAI_EMBEDDING_MODEL=text-embedding-3-small in backend/app/config.py)

# Apply migrations to a fresh DB
make migrate

# Seed the synthetic corpus
make seed

# Run the harness — overwrites eval/RESULTS.md with real numbers
make eval

# Sanity check: every number in the diff should come from the harness run
git diff eval/RESULTS.md
```
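
Before running `make eval`, the exports in the recipe can be verified with a small pre-flight check. This is a sketch under stated assumptions: `env_problems` is a hypothetical helper, not something the repository ships, and it only checks the four variables named in this issue.

```python
import os

# Variables from the recipe. None means "any non-empty value is acceptable".
REQUIRED = {
    "ANTHROPIC_API_KEY": None,
    "OPENAI_API_KEY": None,
    "LLM_PROVIDER": "anthropic",
    "EMBEDDINGS_PROVIDER": "openai",
}


def env_problems(env=None):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    for name, required_value in REQUIRED.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif required_value is not None and value != required_value:
            problems.append(f"{name}={value!r}, expected {required_value!r}")
    return problems
```

Calling `env_problems()` with no argument inspects the current process environment; passing a dict makes the check easy to test without touching real keys.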

Then add a one-line entry to `PROGRESS.md` "Decision log":

```
- 2026-MM-DD (M9 follow-up) — Real-provider evaluation results recorded:
  extraction micro=X.XX macro=X.XX, retrieval p@5=X.XX r@5=X.XX MRR=X.XX,
  RAG citation-validity=X.XX cites-relevant=X.XX. Run with claude-sonnet-4-6
  + text-embedding-3-small at temperature=0; full breakdown in eval/RESULTS.md.
```
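
To avoid retyping numbers by hand, the metric clause of that entry could be generated from the harness output. The sketch below assumes the metrics are available as a flat dict; the key names (`extraction_micro`, `precision_at_5`, and so on) are hypothetical, and the two-decimal formatting simply follows the `X.XX` shape of the template.

```python
# Hypothetical metric keys; the real harness output may name them differently.
FIELDS = [
    ("extraction", [("micro", "extraction_micro"), ("macro", "extraction_macro")]),
    ("retrieval", [("p@5", "precision_at_5"), ("r@5", "recall_at_5"), ("MRR", "mrr")]),
    ("RAG", [("citation-validity", "citation_validity"),
             ("cites-relevant", "cites_relevant")]),
]


def headline_metrics(metrics: dict) -> str:
    """Format the metric clause of the Decision log entry from harness output."""
    groups = []
    for group, pairs in FIELDS:
        # A missing metric raises KeyError on purpose: a gap must never be
        # filled in by hand.
        parts = " ".join(f"{label}={metrics[key]:.2f}" for label, key in pairs)
        groups.append(f"{group} {parts}")
    return ", ".join(groups)
```

The returned string drops into the template in place of the `extraction ... cites-relevant=X.XX` clause; the date and model strings still come from the run metadata.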

Open a small, single-purpose PR titled `eval: record real-provider numbers`
(or similar) with no other changes.

## Acceptance criteria

- [ ] `eval/RESULTS.md` has been overwritten by `make eval` running against
  real `LLM_PROVIDER=anthropic` and `EMBEDDINGS_PROVIDER=openai`.
- [ ] Every committed metric value matches the harness output exactly. No
  hand-edits to numbers.
- [ ] The "Run metadata" section reflects the actual provider, model, embedding
  model, dim, temperature, k, and run timestamp.
- [ ] `PROGRESS.md` "Decision log" gains one entry summarising the headline
  numbers + the provider/model pair + the date.
- [ ] CI passes (existing backend + frontend jobs both green; no eval-job
  changes needed).
- [ ] No API keys committed. Keep them in the local environment; a remote run
  would use GitHub Actions secrets, which is not set up by default for cost
  control.
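
Part of this checklist can be spot-checked automatically. The sketch below scans text for template placeholders left over from the PENDING file or the Decision log template; `leftover_placeholders` is a hypothetical helper, and the pattern is an assumption about what unfinished content looks like, so a clean result does not replace reading the diff.

```python
import re

# Markers that suggest an unfinished file: the PENDING stub, X.XX metric
# placeholders, and an unfilled MM-DD date. This list is an assumption.
PLACEHOLDER = re.compile(r"PENDING|X\.XX|\bMM-DD\b")


def leftover_placeholders(text: str) -> list:
    """Return (line_number, line) pairs that still contain a template placeholder."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]
```

Running it over the contents of `eval/RESULTS.md` and the new `PROGRESS.md` entry should return an empty list before the PR is opened.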

## Out of scope

- Expanding the labeled set beyond the M9 5-invoice / 6-query / 5-question
  shape (separate backlog item, post-M11).
- Adding an LLM-judge faithfulness metric or ROUGE/BLEU (correctly excluded
  in M9; revisit only if the dataset grows).
- Running the eval in CI on every PR (would burn API tokens; stays manual or
  on a manual-dispatch GitHub Actions workflow if and when one is added).

## References

- M9 PR (harness contract): https://github.com/div0rce/sentinel/pull/12
- Methodology defense: `docs/evaluation.md`
- Asserted scorer/writer behaviour: `backend/tests/test_eval_harness.py`
- CLAUDE.md Golden Rule #5: "Never fabricate evaluation numbers."

