eval: record real-provider benchmark numbers (M9 follow-up) #13

@div0rce

Context

M9 (PR #12) shipped the evaluation harness + asserted-fixture pytest +
methodology-only PENDING eval/RESULTS.md. No numerical metric is committed
to the tree, per CLAUDE.md Golden Rule #5: quotable numbers come only from a
real-provider run.

This issue tracks running the harness against real providers and committing the
real eval/RESULTS.md plus a PROGRESS.md "Decision log" entry recording the
numbers, the run date, and the exact provider/model strings.

What "real-provider" means here (locked in by M9)

  • LLM: claude-sonnet-4-6 (Anthropic). Verified against
    https://docs.anthropic.com/en/docs/about-claude/models on 2026-05-29 — the
    4.6-generation IDs use a dateless format that is itself a pinned snapshot,
    not an evergreen pointer.
  • Embeddings: text-embedding-3-small (OpenAI), 1536 dim, matches
    backend.app.models.SCHEMA_EMBEDDING_DIM.
  • Temperature: 0.0 per CLAUDE.md house style.
  • k: 5 (default Settings.retrieval_top_k).
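Before spending API tokens, it may help to confirm that the environment matches the locked-in configuration. The following is a minimal sketch, not part of the repo: `preflight_problems` is a hypothetical helper, and the variable names are the ones used in the Recipe below. It does not check temperature or k, since this issue does not say how those are overridden.

```python
# Hypothetical pre-flight check; not part of the repo.
# Expected values come from the "real-provider" definition in this issue;
# variable names come from the Recipe's export lines and comments.
EXPECTED = {
    "LLM_PROVIDER": "anthropic",
    "EMBEDDINGS_PROVIDER": "openai",
    "CLAUDE_MODEL": "claude-sonnet-4-6",
    "OPENAI_EMBEDDING_MODEL": "text-embedding-3-small",
}
# Per the Recipe, these two already default correctly in backend/app/config.py,
# so leaving them unset is fine; only a wrong override is a problem.
DEFAULTED = ("CLAUDE_MODEL", "OPENAI_EMBEDDING_MODEL")
REQUIRED_SECRETS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def preflight_problems(env):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for name, want in EXPECTED.items():
        got = env.get(name)
        if got is None and name in DEFAULTED:
            continue
        if got != want:
            problems.append(f"{name}={got!r}, expected {want!r}")
    for name in REQUIRED_SECRETS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    return problems
```

Calling `preflight_problems(os.environ)` before `make eval` and aborting on a non-empty result would catch a run that is accidentally pointed at a non-real provider.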

Recipe

Branch off main (e.g. eval/real-numbers-202Y-MM-DD):

export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export LLM_PROVIDER=anthropic
export EMBEDDINGS_PROVIDER=openai
# (defaults already set CLAUDE_MODEL=claude-sonnet-4-6 and
#  OPENAI_EMBEDDING_MODEL=text-embedding-3-small in backend/app/config.py)

# Apply migrations to a fresh DB
make migrate

# Seed the synthetic corpus
make seed

# Run the harness — overwrites eval/RESULTS.md with real numbers
make eval

# Sanity check: nothing in the file should be a fabricated number
git diff eval/RESULTS.md

Then add a one-line entry to PROGRESS.md "Decision log":

- 2026-MM-DD (M9 follow-up) — Real-provider evaluation results recorded:
  extraction micro=X.XX macro=X.XX, retrieval p@5=X.XX r@5=X.XX MRR=X.XX,
  RAG citation-validity=X.XX cites-relevant=X.XX. Run with claude-sonnet-4-6
  + text-embedding-3-small at temperature=0; full breakdown in eval/RESULTS.md.
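To avoid retyping numbers by hand, the entry above could be assembled from the strings the harness printed. This is a sketch under stated assumptions: `decision_log_entry` and the metric key names are hypothetical, not the harness's real schema, and values are passed as strings so they are copied verbatim rather than re-rounded.

```python
from datetime import date

DASH = "\u2014"  # the dash character used in the entry template above

# Illustrative key names only; the harness's actual output schema may differ.
REQUIRED_METRICS = (
    "micro", "macro", "p_at_5", "r_at_5", "mrr",
    "citation_validity", "cites_relevant",
)


def decision_log_entry(metrics, run_date,
                       llm="claude-sonnet-4-6",
                       embeddings="text-embedding-3-small"):
    """Render the PROGRESS.md entry as a single line.

    `metrics` maps each name in REQUIRED_METRICS to the string exactly as
    it appears in eval/RESULTS.md, so no value is reformatted here.
    """
    missing = [name for name in REQUIRED_METRICS if name not in metrics]
    if missing:
        raise KeyError(f"missing metrics: {', '.join(missing)}")
    m = metrics
    return (
        f"- {run_date.isoformat()} (M9 follow-up) {DASH} Real-provider "
        f"evaluation results recorded: extraction micro={m['micro']} "
        f"macro={m['macro']}, retrieval p@5={m['p_at_5']} "
        f"r@5={m['r_at_5']} MRR={m['mrr']}, RAG "
        f"citation-validity={m['citation_validity']} "
        f"cites-relevant={m['cites_relevant']}. Run with {llm} + "
        f"{embeddings} at temperature=0; full breakdown in eval/RESULTS.md."
    )
```

The function returns one unwrapped line; wrap it to match the surrounding entries in PROGRESS.md. Raising on a missing metric keeps a partial run from producing a plausible-looking entry.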

Open a small PR titled "eval: record real-provider numbers" (or similar).
Keep it short and single-purpose, with no other changes.

Acceptance criteria

  • eval/RESULTS.md has been overwritten by make eval running against
    real LLM_PROVIDER=anthropic and EMBEDDINGS_PROVIDER=openai.
  • Every committed metric value matches the harness output exactly. No
    hand-edits to numbers.
  • The "Run metadata" section reflects the actual provider, model, embedding
    model, dim, temperature, k, and run timestamp.
  • PROGRESS.md "Decision log" gains one entry summarising the headline
    numbers + the provider/model pair + the date.
  • CI passes (existing backend + frontend jobs both green; no eval-job
    changes needed).
  • No API keys committed (use local env or, for an actual remote run, GitHub
    Actions secrets — not done by default for cost control).
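The "no API keys committed" criterion can be spot-checked mechanically before opening the PR. The sketch below is a hypothetical helper, not part of the repo. It assumes that keys from both providers begin with `sk-`, which is a common shape but not a guarantee, so the pattern is deliberately loose and a clean result is not proof that nothing leaked.

```python
import re

# Assumption: provider keys look like "sk-" followed by a long token.
KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")
# Also flag the Recipe's variable names when assigned anything but a placeholder.
ASSIGNMENT_PATTERN = re.compile(
    r"\b(ANTHROPIC_API_KEY|OPENAI_API_KEY)\s*=\s*(\S+)"
)
PLACEHOLDERS = ("...", '""', "''")


def find_possible_secrets(diff_text):
    """Return (line_number, reason) pairs for added lines in a unified diff."""
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect added lines; skip the "+++ b/path" file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        if KEY_PATTERN.search(line):
            hits.append((number, "key-like token"))
            continue
        match = ASSIGNMENT_PATTERN.search(line)
        if match and match.group(2) not in PLACEHOLDERS:
            hits.append((number, f"{match.group(1)} assigned a value"))
    return hits
```

Feeding it the text of `git diff --cached` and refusing to commit on any hit is one way to use it; a dedicated secret scanner would be more thorough.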

Out of scope

  • Expanding the labeled set beyond the M9 5-invoice / 6-query / 5-question
    shape (separate backlog item, post-M11).
  • Adding an LLM-judge faithfulness metric or ROUGE/BLEU (correctly excluded
    in M9; revisit only if the dataset grows).
  • Running the eval in CI on every PR (would burn API tokens; stays manual or
    on a manual-dispatch GitHub Actions workflow if and when one is added).
