refactor(eval): restructure harness to Code Health 10.0 by div0rce · Pull Request #18 · div0rce/sentinel

div0rce · 2026-06-08T15:30:58Z

Milestone

Standalone tech-debt refactor (not a roadmap milestone) — eval/harness.py Code Health

Summary

eval/harness.py carried pre-existing technical debt (CodeScene Code Health 7.52 — Yellow): three ~100-line evaluator functions (evaluate_extraction, evaluate_retrieval, evaluate_rag) with cyclomatic complexity 14–20, a bumpy-road nesting smell, and 5–6-argument signatures. This was flagged during the Gemini PR (#17) but deliberately kept out of that change — it's the eval logic with the honesty gates, so it belongs in its own behaviour-preserving PR. This restructures the module to Code Health 10.0 with no change to eval semantics.

What changed (behaviour-preserving)

EvalContext bundles the shared dependencies (settings, embedder, lazy llm, corpus/label dirs); every evaluator and helper now takes (session, ctx). This is exactly the "missing abstraction" CodeScene named for the argument-count smell. The LLM stays lazy (require_llm()) so a retrieval-only run never forces an LLM provider to resolve — matching the original per-evaluator resolution.
Each scoring loop is extracted into a focused tally — _ExtractionTally / _RetrievalTally / _RagTally — with score/record + to_result methods, plus shared _ingest_relevant_sources / _resolve_relevant_ids helpers and a _citation_validity_contribution helper (removes the duplicated ingest block).
_collect_filenames flattened to comprehensions.
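As a rough illustration of the `EvalContext` idea described above, here is a minimal sketch rather than the code in `eval/harness.py`. The field names and the `llm_factory` callable are assumptions made for the example. Only the bundling of shared dependencies, the `(session, ctx)` signature, the frozen dataclass, and the lazy `require_llm()` that re-resolves on each call come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


@dataclass(frozen=True)
class EvalContext:
    """Shared dependencies for the evaluators (field names are assumptions)."""

    settings: Any
    embedder: Any
    corpus_dir: Path
    labels_dir: Path
    # Hypothetical: a zero-argument factory rather than an LLM instance, so
    # nothing is resolved until an evaluator actually asks for it.
    llm_factory: Callable[[], Any]

    def require_llm(self) -> Any:
        """Resolve the LLM on demand; re-resolves on every call (no caching)."""
        return self.llm_factory()


def evaluate_retrieval(session: Any, ctx: EvalContext) -> str:
    """Stand-in evaluator: uses the embedder only and never touches the LLM."""
    return f"retrieval scored with {ctx.embedder}"


def evaluate_rag(session: Any, ctx: EvalContext) -> str:
    """Stand-in evaluator: the only path here that resolves the LLM."""
    return f"rag answered by {ctx.require_llm()}"
```

With this shape, a retrieval-only run never invokes the factory, which is the property the lazy LLM is meant to protect.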

Definition of Done

make check passes — 222 backend pytest, 7 frontend Vitest + build, ruff/ruff-format/mypy clean
Behaviour preserved — test_eval_harness.py's asserted metric values (micro/macro 0.75 & 1.0, precision@2 0.5, recall 1.0, MRR 1.0, the three RAG rates) are unchanged; only the call sites were updated to pass an EvalContext
Honesty gates intact: the provider == "fake" → quotable=False + None-metric returns are preserved verbatim in each evaluator (Golden Rule #5)
No secrets; no behavioural/API change beyond the internal signatures
CodeScene change-set verdict "improved", quality gates "passed"
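For readers unfamiliar with the honesty gate, the following sketch shows the general pattern under stated assumptions. The `RetrievalResult` fields other than `quotable` and `n_queries`, and the helper name `gate_if_fake`, are hypothetical. The PR only states that a `"fake"` provider yields `quotable=False` with `None` metrics, and that `n_queries` is the total label count in that n/a case.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievalResult:
    """Hypothetical result shape; metric field names are assumptions."""

    quotable: bool
    n_queries: int
    precision_at_k: Optional[float] = None
    recall: Optional[float] = None
    mrr: Optional[float] = None


def gate_if_fake(provider: str, n_labels: int) -> Optional[RetrievalResult]:
    """Return a non-quotable, metric-free result when the provider is fake.

    Returns None for any other provider, meaning "no gate applies, go and
    score the queries for real".
    """
    if provider == "fake":
        # Metrics stay None so a fake run can never be quoted as a real score.
        return RetrievalResult(quotable=False, n_queries=n_labels)
    return None
```

An evaluator would call the gate first and return early when it yields a result, so numbers produced by a fake provider never reach a report as if they were measurements.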

CodeScene before → after

Metric	Before	After
File Code Health	7.52 (Yellow)	10.0
Module mean cyclomatic complexity	6.25	3.13
`evaluate_rag` / `evaluate_extraction` / `evaluate_retrieval` cc	20 / 17 / 14	~3–5 each
Largest evaluator	105 LoC	~30 LoC
Bumpy Road / Excess-Arguments smells	present	cleared

Notes

Decoupled from the Gemini work by design — branched off main after #17 merged. No public API beyond the eval/ package changed; the only external caller, eval/run.py, now builds the context via EvalContext.create(settings).

eval/harness.py carried pre-existing technical debt (CodeScene Code Health 7.52): three ~100-line evaluator functions with cyclomatic complexity 14-20, a bumpy-road nesting smell, and 5-6 argument signatures.

Refactor (behaviour-preserving; the asserted metric values in test_eval_harness.py are unchanged):

- Introduce EvalContext bundling the shared deps (settings, embedder, lazy llm, corpus/label dirs); every evaluator and helper now takes (session, ctx). This is the 'missing abstraction' CodeScene flagged for the argument-count smell.
- Extract each scoring loop into a focused tally (_ExtractionTally / _RetrievalTally / _RagTally) with score/record + to_result methods, plus shared _ingest_relevant_sources / _resolve_relevant_ids helpers and a _citation_validity_contribution helper.
- Flatten _collect_filenames to comprehensions.

Result: every evaluator drops to cc 3-5 / ~25 lines / 2 args; module mean cc 6.25 -> 3.13; Code Health 7.52 -> 10.0. make check green (222 backend pytest, 7 frontend, ruff/mypy clean).
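The tally extraction in the commit message above can be pictured roughly as follows. This is a sketch assuming conventional definitions of precision@k, recall, and MRR, not the private `_RetrievalTally` itself. The class name, fields, and result keys are illustrative, and only the `record` + `to_result` split comes from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RetrievalTally:
    """Accumulates per-query ranking metrics (illustrative, not the real class)."""

    k: int
    precisions: list[float] = field(default_factory=list)
    recalls: list[float] = field(default_factory=list)
    reciprocal_ranks: list[float] = field(default_factory=list)

    def record(self, ranked_ids: list[str], relevant_ids: set[str]) -> None:
        """Score one query's ranked chunk ids against its relevant ids."""
        if not relevant_ids:
            return  # Assumption: unlabeled queries are skipped, not scored as 0.
        top_k = ranked_ids[: self.k]
        hits = {cid for cid in top_k if cid in relevant_ids}
        self.precisions.append(len(hits) / self.k)
        self.recalls.append(len(hits) / len(relevant_ids))
        first_hit = next(
            (rank for rank, cid in enumerate(top_k, start=1) if cid in relevant_ids),
            None,
        )
        self.reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)

    def to_result(self) -> dict[str, float]:
        """Average the recorded per-query metrics."""
        n = len(self.precisions)
        if n == 0:
            return {"n_queries": 0.0, "precision_at_k": 0.0, "recall": 0.0, "mrr": 0.0}
        return {
            "n_queries": float(n),
            "precision_at_k": sum(self.precisions) / n,
            "recall": sum(self.recalls) / n,
            "mrr": sum(self.reciprocal_ranks) / n,
        }
```

Keeping the branching inside `record` is what lets the evaluator body shrink to a loop plus a `to_result()` call, which is where the cyclomatic-complexity drop comes from.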

coderabbitai · 2026-06-08T15:31:12Z

Warning

Review limit reached

@div0rce, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 26 minutes and 16 seconds. Learn how PR review limits work.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7e5b69f5-1fd1-4175-a695-c623af08036d

📥 Commits

Reviewing files that changed from the base of the PR and between 2151556 and 7edd0e2.

📒 Files selected for processing (4)

.coderabbit.yaml
backend/tests/test_eval_harness.py
eval/harness.py
eval/run.py

📝 Walkthrough

This PR refactors the evaluation harness to consolidate scattered parameter passing into a unified EvalContext object that bundles settings, embedding/LLM providers, and corpus/labels directories. Evaluator APIs, orchestration, tests, and entry points are updated to construct and use this context instead of passing individual parameters.

Changes

Evaluation Harness Context Refactoring

Layer / File(s)	Summary
EvalContext definition and refactored helper utilities `eval/harness.py`	Introduced `EvalContext` class with `create` factory and `require_llm` method, plus refactored corpus-loading, label-loading, and ingestion helpers to pull settings/embedder/corpus_dir from the context.
Extraction evaluator with tally system `eval/harness.py`	Introduced `_ExtractionTally` for mutable per-document scoring, refactored ingestion resolution helpers, and updated `evaluate_extraction` to use context-driven control flow with `ctx.require_llm()` and tally aggregation.
Retrieval evaluator with tally system `eval/harness.py`	Introduced `_RetrievalTally` for mutable metric recording, and updated `evaluate_retrieval` to ingest relevant sources once via context, resolve chunk IDs per query, run vector retrieval, and record ranking metrics through tally aggregation.
RAG evaluator with tally system and citation parsing `eval/harness.py`	Introduced `_RagTally` with `record_answer` method and helpers (`_parse_cited_chunk_ids`, `_citation_validity_contribution`) for citation and substring scoring. Updated `evaluate_rag` to resolve LLM via `ctx.require_llm()`, ingest sources once, run answer queries, record answer metrics through tally, and return aggregated result.
Orchestration and utility updates `eval/harness.py`	Updated `run_all` signature to accept `(session, ctx)` and invoke all three evaluators with the shared context. Simplified `_collect_filenames` to comprehension-based concatenation.
Entry point integration `eval/run.py`	Updated `eval/run.py` to import `EvalContext` and construct it via `EvalContext.create(settings=settings)` before passing to `run_all`.
Test adoption of EvalContext pattern `backend/tests/test_eval_harness.py`	Updated all test cases to import `EvalContext` and construct context objects containing settings, providers, and corpus/labels directories before calling `evaluate_extraction`, `evaluate_retrieval`, and `evaluate_rag`. Covers fake provider tests, correctness tests, metric tests, and writer round-trip test.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 64.52% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'refactor(eval): restructure harness to Code Health 10.0' clearly and specifically summarizes the main change—a targeted refactor of eval/harness.py to improve Code Health metrics.
Description check	✅ Passed	The PR description closely follows the template with all major sections completed: Milestone identified, Summary provided (4 sentences covering the technical debt, refactor scope, behavior preservation), Definition of Done with checked items, and detailed Notes explaining CodeScene improvements and API changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d49ebec-24f0-4ba5-b719-f4c1defffab5

📥 Commits

Reviewing files that changed from the base of the PR and between 8fc2eac and 2151556.

📒 Files selected for processing (3)

backend/tests/test_eval_harness.py
eval/harness.py
eval/run.py

📜 Review details

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use type hints everywhere; mypy must pass cleanly with no errors
Write docstrings on public functions; comments should explain why, not what
Use ruff for linting and formatting in Python; maintain ruff check and ruff format --check compliance

Files:

eval/run.py
backend/tests/test_eval_harness.py
eval/harness.py

backend/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/tests/**/*.py: Hard-test all deterministic logic: chunking, the workflow engine, guardrails, and the audit log
Mock LLM and embeddings in tests via fake implementations; CI runs offline with no API keys required
Workflow engine MUST have explicit determinism and idempotency tests: same input produces same output; re-run produces no duplicate side effects
Audit log MUST be append-only; include a test that reconstructs current state by replaying events
Seed any randomness in tests; pin temperatures for LLM calls used in evaluation to ensure determinism

Files:

backend/tests/test_eval_harness.py

🧠 Learnings (1)

📚 Learning: 2026-06-08T15:13:33.301Z

Learnt from: CR
Repo: div0rce/sentinel PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-06-08T15:13:33.301Z
Learning: Applies to backend/app/llm/**/*.py : Implement LLM provider abstraction in `backend/app/llm/` and embeddings provider abstraction in `backend/app/embeddings/` to keep both swappable and mockable for tests

Applied to files:

backend/tests/test_eval_harness.py

🔇 Additional comments (18)

eval/run.py (1)

14-14: LGTM!

Also applies to: 91-91

eval/harness.py (8)

1-25: LGTM!

136-202: LGTM!

207-278: LGTM!

281-317: LGTM!

360-392: LGTM!

398-472: LGTM!

475-515: LGTM!

536-567: LGTM!

backend/tests/test_eval_harness.py (9)

41-41: LGTM!

175-182: LGTM!

205-211: LGTM!

238-245: LGTM!

311-318: LGTM!

349-356: LGTM!

414-420: LGTM!

458-465: LGTM!

506-513: LGTM!

Close the genuine public-API docstring gap in the files this PR touches: the four result dataclasses (ExtractionResult/RetrievalResult/RagResult/HarnessReport) and the eval.run functions. Public API in eval/harness.py and eval/run.py is now 100% documented (private helpers stay per CLAUDE.md's 'docstrings on public functions').

…ries semantics Address CodeRabbit review on PR #18: require_llm's docstring no longer implies caching (it re-resolves each call on a frozen dataclass), and RetrievalResult documents that n_queries is the scored-query count when quotable vs the total label count in the n/a case. Docstring-only; no behaviour change.

CodeRabbit's docstring-coverage pre-merge check is scored on the PR's changed files (PR #18 was at 64.5% < 80% threshold). Document every remaining function/method/class in eval/harness.py and test_eval_harness.py with meaningful one-liners (consistent with the tests that already carried docstrings) — describing what each test verifies and what each helper does, not stub padding. Changed files now 100%; clears the advisory honestly without lowering the threshold. Code Health unchanged (10.0); make-check green.
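To make the coverage figure concrete, here is one way a docstring-coverage percentage could be approximated locally with the standard-library `ast` module. How CodeRabbit actually counts definitions is not stated in this PR, so treating every function, async function, and class as one unit is an assumption of this sketch.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of functions and classes that carry a docstring."""
    tree = ast.parse(source)
    definitions = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not definitions:
        return 100.0  # Nothing to document counts as fully covered here.
    documented = sum(1 for node in definitions if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(definitions)
```

Running it over the changed files and comparing against the 80% threshold gives a quick local signal before pushing, though the bot's own figure remains the authoritative one.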

div0rce · 2026-06-08T15:58:11Z

@coderabbitai review

coderabbitai · 2026-06-08T15:58:18Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Config-as-code in preference to hidden Org-UI settings. Encodes CLAUDE.md's docstring convention (public functions documented; comments explain why; tests self-document) as per-path docstring-generation instructions. Deliberately does NOT lower or disable the docstring-coverage pre-merge check — its threshold stays at the Org default. Side effect: the fresh review this commit triggers recomputes the (previously lagging) docstring-coverage figure at the current head, where the changed files are 100% documented.

codescene-delta-analysis

Code Health Improved (1 file improved in Code Health)

Gates Passed
6 Quality Gates Passed

File	Code Health Impact	Categories Improved
harness.py	7.52 → 10.00	Complex Method, Bumpy Road Ahead, Overall Code Complexity, Excess Number of Function Arguments

Quality Gate Profile: Pay Down Tech Debt

div0rce · 2026-06-08T19:47:17Z

@codex review

chatgpt-codex-connector · 2026-06-08T19:50:18Z

Codex Review: Didn't find any major issues. Swish!

Document all remaining public production definitions (embedding/LLM provider impls, repository getters, the ingest CLI entry, request-id middleware, and dashboard/review response models). Docstring-only; no behaviour change. With #18, the repo's entire public production API is documented.

coderabbitai Bot reviewed Jun 8, 2026

Comment thread eval/harness.py

Comment thread eval/harness.py

div0rce added 3 commits June 8, 2026 11:41

codescene-delta-analysis Bot approved these changes Jun 8, 2026

div0rce mentioned this pull request Jun 8, 2026

docs: backfill docstrings on public backend/app API #19

Merged

4 tasks

div0rce merged commit 5b709ed into main Jun 8, 2026
5 checks passed

div0rce deleted the refactor/eval-harness-health branch June 8, 2026 19:53

div0rce merged 5 commits into main from refactor/eval-harness-health
