Context
M9 (PR #12) shipped the evaluation harness + asserted-fixture pytest +
methodology-only PENDING eval/RESULTS.md. No numerical metric is committed
to the tree per CLAUDE.md Golden Rule #5 — quotable numbers come only from a
real-provider run.
This issue tracks running the harness against real providers and committing the
real eval/RESULTS.md plus a PROGRESS.md "Decision log" entry recording the
numbers, the run date, and the exact provider/model strings.
What "real-provider" means here (locked in by M9)
- LLM: claude-sonnet-4-6 (Anthropic). Verified against
  https://docs.anthropic.com/en/docs/about-claude/models on 2026-05-29: the
  4.6-generation IDs use a dateless format that is itself a pinned snapshot,
  not an evergreen pointer.
- Embeddings: text-embedding-3-small (OpenAI), 1536 dim, matches
  backend.app.models.SCHEMA_EMBEDDING_DIM.
- Temperature: 0.0 per CLAUDE.md house style.
- k: 5 (default Settings.retrieval_top_k).
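A pre-flight check along these lines could catch a misconfigured shell before any API tokens are spent. This is only a sketch: LockedEvalConfig and find_mismatches are hypothetical names, not part of the repo, and the check covers only the environment variables this issue names. It assumes the model variables may be left unset because the defaults in backend/app/config.py already point at the locked models.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LockedEvalConfig:
    """Values this issue pins for a real-provider run (hypothetical helper)."""

    llm_provider: str = "anthropic"
    llm_model: str = "claude-sonnet-4-6"
    embeddings_provider: str = "openai"
    embedding_model: str = "text-embedding-3-small"
    # Recorded for reference only; the issue names no env vars for these,
    # so they are not checked below.
    embedding_dim: int = 1536
    temperature: float = 0.0
    top_k: int = 5


def find_mismatches(
    env: dict[str, str], locked: LockedEvalConfig = LockedEvalConfig()
) -> list[str]:
    """Return human-readable problems; an empty list means the env looks right."""
    problems: list[str] = []

    # Provider switches must be set explicitly to the real providers.
    required = {
        "LLM_PROVIDER": locked.llm_provider,
        "EMBEDDINGS_PROVIDER": locked.embeddings_provider,
    }
    for name, want in required.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")

    # Model overrides are optional, but if present they must match the lock.
    optional = {
        "CLAUDE_MODEL": locked.llm_model,
        "OPENAI_EMBEDDING_MODEL": locked.embedding_model,
    }
    for name, want in optional.items():
        got = env.get(name)
        if got is not None and got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")

    # Both API keys must be present and non-empty.
    for name in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY"):
        if not env.get(name):
            problems.append(f"{name}: not set")

    return problems
```

Calling find_mismatches(dict(os.environ)) before make eval and aborting on a non-empty result would be one way to use it.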
Recipe
Branch off main (e.g. eval/real-numbers-202Y-MM-DD):
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export LLM_PROVIDER=anthropic
export EMBEDDINGS_PROVIDER=openai
# (defaults already set CLAUDE_MODEL=claude-sonnet-4-6 and
# OPENAI_EMBEDDING_MODEL=text-embedding-3-small in backend/app/config.py)
# Apply migrations to a fresh DB
make migrate
# Seed the synthetic corpus
make seed
# Run the harness — overwrites eval/RESULTS.md with real numbers
make eval
# Sanity check: nothing in the file should be a fabricated number
git diff eval/RESULTS.md
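The git diff review above is a manual step. As a rough complement, a small script could flag placeholder text that survived the run. The sketch below assumes the pending file marks itself with the word PENDING and that unfilled values look like X.XX; both markers are guesses based on this issue, not on the actual file format, and leftover_placeholders is a hypothetical name. It cannot tell a real number from a hand-typed one, so it does not replace reading the diff.

```python
import re

# Assumed placeholder markers: "PENDING" (the M9 methodology-only state) and
# the "X.XX" template value used in the decision-log example.
PLACEHOLDER = re.compile(r"\bPENDING\b|\bX\.XX\b")


def leftover_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still contain a placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]
```

Reading eval/RESULTS.md and passing its text to leftover_placeholders should return an empty list after a successful real-provider run, under the assumptions above.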
Then add a one-line entry to PROGRESS.md "Decision log":
- 2026-MM-DD (M9 follow-up) — Real-provider evaluation results recorded:
extraction micro=X.XX macro=X.XX, retrieval p@5=X.XX r@5=X.XX MRR=X.XX,
RAG citation-validity=X.XX cites-relevant=X.XX. Run with claude-sonnet-4-6
+ text-embedding-3-small at temperature=0; full breakdown in eval/RESULTS.md.
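To avoid retyping numbers by hand, the entry could be rendered from the metrics the harness reports. The following is a sketch only: format_decision_log_entry and the metric key names (extraction_micro, p_at_5, and so on) are hypothetical and would need to be mapped to whatever the harness actually emits.

```python
from __future__ import annotations

from datetime import date


def format_decision_log_entry(run_date: date, metrics: dict[str, float]) -> str:
    """Render the one-line PROGRESS.md entry in the shape of the template above.

    The metric keys used here are illustrative, not the harness's real names.
    """
    dash = "\u2014"  # separator character used in the issue's template
    # Format every metric to two decimals, matching the X.XX placeholders.
    m = {name: f"{value:.2f}" for name, value in metrics.items()}
    return (
        f"- {run_date.isoformat()} (M9 follow-up) {dash} "
        "Real-provider evaluation results recorded: "
        f"extraction micro={m['extraction_micro']} macro={m['extraction_macro']}, "
        f"retrieval p@5={m['p_at_5']} r@5={m['r_at_5']} MRR={m['mrr']}, "
        f"RAG citation-validity={m['citation_validity']} "
        f"cites-relevant={m['cites_relevant']}. "
        "Run with claude-sonnet-4-6 + text-embedding-3-small at temperature=0; "
        "full breakdown in eval/RESULTS.md."
    )
```

The metrics passed in should come straight from the real-provider run; a missing key raises KeyError rather than silently leaving a gap in the entry.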
Open a small PR titled eval: record real-provider numbers (or similar; this
is a short, single-purpose PR — no other changes).
Out of scope
- Expanding the labeled set beyond the M9 5-invoice / 6-query / 5-question
shape (separate backlog item, post-M11).
- Adding an LLM-judge faithfulness metric or ROUGE/BLEU (correctly excluded
in M9; revisit only if the dataset grows).
- Running the eval in CI on every PR (would burn API tokens; stays manual or
on a manual-dispatch GitHub Actions workflow if and when one is added).
Acceptance criteria
- eval/RESULTS.md has been overwritten by make eval running against real
  LLM_PROVIDER=anthropic and EMBEDDINGS_PROVIDER=openai.
- [...] hand-edits to numbers.
- [...] model, dim, temperature, k, and run timestamp.
- PROGRESS.md "Decision log" gains one entry summarising the headline
  numbers + the provider/model pair + the date.
- [...] changes needed).
- [...] Actions secrets; not done by default for cost control).
References
- docs/evaluation.md
- backend/tests/test_eval_harness.py