refactor(eval): restructure harness to Code Health 10.0 #18
eval/harness.py carried pre-existing technical debt (CodeScene Code Health 7.52): three ~100-line evaluator functions with cyclomatic complexity 14-20, a bumpy-road nesting smell, and 5-6 argument signatures.

Refactor (behaviour-preserving: the asserted metric values in test_eval_harness.py are unchanged):
- Introduce EvalContext bundling the shared deps (settings, embedder, lazy llm, corpus/label dirs); every evaluator and helper now takes (session, ctx). This is the 'missing abstraction' CodeScene flagged for the argument-count smell.
- Extract each scoring loop into a focused tally (_ExtractionTally / _RetrievalTally / _RagTally) with score/record + to_result methods, plus shared _ingest_relevant_sources / _resolve_relevant_ids helpers and a _citation_validity_contribution helper.
- Flatten _collect_filenames to comprehensions.

Result: every evaluator drops to cc 3-5 / ~25 lines / 2 args; module mean cc 6.25 -> 3.13; Code Health 7.52 -> 10.0. make check green (222 backend pytest, 7 frontend, ruff/mypy clean).
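The tally extraction described in this commit message can be illustrated with a minimal sketch. It is not the code in eval/harness.py: the class names `RetrievalTallySketch` and `RetrievalResultSketch`, their fields, and the exact metric definitions are assumptions chosen for illustration. Only the `record` + `to_result` shape comes from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetrievalResultSketch:
    """Hypothetical result shape; the real RetrievalResult may differ."""

    n_queries: int
    precision_at_k: float | None
    recall_at_k: float | None
    mrr: float | None


@dataclass
class RetrievalTallySketch:
    """Accumulates per-query scores so the evaluator loop stays flat."""

    k: int
    precisions: list[float] = field(default_factory=list)
    recalls: list[float] = field(default_factory=list)
    reciprocal_ranks: list[float] = field(default_factory=list)

    def record(self, retrieved: list[str], relevant: set[str]) -> None:
        """Score one query against its relevant ids and store the result."""
        top_k = retrieved[: self.k]
        hits = [doc_id for doc_id in top_k if doc_id in relevant]
        self.precisions.append(len(hits) / self.k)
        self.recalls.append(len(set(hits)) / len(relevant) if relevant else 0.0)
        # Rank (1-based) of the first relevant id in the top k, if any.
        first_hit = next(
            (rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant),
            None,
        )
        self.reciprocal_ranks.append(1.0 / first_hit if first_hit is not None else 0.0)

    def to_result(self) -> RetrievalResultSketch:
        """Average the recorded scores; metrics are None when nothing was scored."""
        n = len(self.precisions)
        if n == 0:
            return RetrievalResultSketch(0, None, None, None)
        return RetrievalResultSketch(
            n_queries=n,
            precision_at_k=sum(self.precisions) / n,
            recall_at_k=sum(self.recalls) / n,
            mrr=sum(self.reciprocal_ranks) / n,
        )


tally = RetrievalTallySketch(k=2)
tally.record(retrieved=["a", "x"], relevant={"a"})
result = tally.to_result()
```

With the scoring held in `record`, an evaluator of this shape reduces to a loop that calls `record` once per query and returns `to_result()`, which is the kind of flattening the commit attributes the complexity drop to.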
📝 Walkthrough
This PR refactors the evaluation harness to consolidate scattered parameter passing into a unified …
Changes
Evaluation Harness Context Refactoring
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1d49ebec-24f0-4ba5-b719-f4c1defffab5
📒 Files selected for processing (3)
backend/tests/test_eval_harness.py
eval/harness.py
eval/run.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use type hints everywhere; `mypy` must pass cleanly with no errors
Write docstrings on public functions; comments should explain why, not what
Use `ruff` for linting and formatting in Python; maintain `ruff check` and `ruff format --check` compliance
Files:
eval/run.py
backend/tests/test_eval_harness.py
eval/harness.py
backend/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
backend/tests/**/*.py: Hard-test all deterministic logic: chunking, the workflow engine, guardrails, and the audit log
Mock LLM and embeddings in tests via `fake` implementations; CI runs offline with no API keys required
Workflow engine MUST have explicit determinism and idempotency tests: same input produces same output; re-run produces no duplicate side effects
Audit log MUST be append-only; include a test that reconstructs current state by replaying events
Seed any randomness in tests; pin temperatures for LLM calls used in evaluation to ensure determinism
Files:
backend/tests/test_eval_harness.py
🧠 Learnings (1)
📚 Learning: 2026-06-08T15:13:33.301Z
Learnt from: CR
Repo: div0rce/sentinel PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-06-08T15:13:33.301Z
Learning: Applies to backend/app/llm/**/*.py : Implement LLM provider abstraction in `backend/app/llm/` and embeddings provider abstraction in `backend/app/embeddings/` to keep both swappable and mockable for tests
Applied to files:
backend/tests/test_eval_harness.py
🔇 Additional comments (18)
eval/run.py (1)
14-14: LGTM! Also applies to: 91-91
eval/harness.py (8)
1-25: LGTM!
136-202: LGTM!
207-278: LGTM!
281-317: LGTM!
360-392: LGTM!
398-472: LGTM!
475-515: LGTM!
536-567: LGTM!
backend/tests/test_eval_harness.py (9)
41-41: LGTM!
175-182: LGTM!
205-211: LGTM!
238-245: LGTM!
311-318: LGTM!
349-356: LGTM!
414-420: LGTM!
458-465: LGTM!
506-513: LGTM!
Close the genuine public-API docstring gap in the files this PR touches: the four result dataclasses (ExtractionResult/RetrievalResult/RagResult/HarnessReport) and the eval.run functions. Public API in eval/harness.py and eval/run.py is now 100% documented (private helpers stay per CLAUDE.md's 'docstrings on public functions').
…ries semantics Address CodeRabbit review on PR #18: require_llm's docstring no longer implies caching (it re-resolves each call on a frozen dataclass), and RetrievalResult documents that n_queries is the scored-query count when quotable vs the total label count in the n/a case. Docstring-only; no behaviour change.
CodeRabbit's docstring-coverage pre-merge check is scored on the PR's changed files (PR #18 was at 64.5% < 80% threshold). Document every remaining function/method/class in eval/harness.py and test_eval_harness.py with meaningful one-liners (consistent with the tests that already carried docstrings) — describing what each test verifies and what each helper does, not stub padding. Changed files now 100%; clears the advisory honestly without lowering the threshold. Code Health unchanged (10.0); make-check green.
@coderabbitai review
✅ Action performed
Review finished.
Config-as-code in preference to hidden Org-UI settings. Encodes CLAUDE.md's docstring convention (public functions documented; comments explain why; tests self-document) as per-path docstring-generation instructions. Deliberately does NOT lower or disable the docstring-coverage pre-merge check — its threshold stays at the Org default. Side effect: the fresh review this commit triggers recomputes the (previously lagging) docstring-coverage figure at the current head, where the changed files are 100% documented.
Code Health Improved
(1 files improve in Code Health)
Gates Passed
6 Quality Gates Passed
View Improvements
| File | Code Health Impact | Categories Improved |
|---|---|---|
| harness.py | 7.52 → 10.00 | Complex Method, Bumpy Road Ahead, Overall Code Complexity, Excess Number of Function Arguments |
Quality Gate Profile: Pay Down Tech Debt
@codex review
Codex Review: Didn't find any major issues. Swish!
Document all remaining public production definitions (embedding/LLM provider impls, repository getters, the ingest CLI entry, request-id middleware, and dashboard/review response models). Docstring-only; no behaviour change. With #18, the repo's entire public production API is documented.
Milestone
Standalone tech-debt refactor (not a roadmap milestone): `eval/harness.py` Code Health
Summary
`eval/harness.py` carried pre-existing technical debt (CodeScene Code Health 7.52, Yellow): three ~100-line evaluator functions (`evaluate_extraction`, `evaluate_retrieval`, `evaluate_rag`) with cyclomatic complexity 14–20, a bumpy-road nesting smell, and 5–6-argument signatures. This was flagged during the Gemini PR (#17) but deliberately kept out of that change: it's the eval logic with the honesty gates, so it belongs in its own behaviour-preserving PR. This restructures the module to Code Health 10.0 with no change to eval semantics.
What changed (behaviour-preserving)
- `EvalContext` bundles the shared dependencies (`settings`, `embedder`, lazy `llm`, corpus/label dirs); every evaluator and helper now takes `(session, ctx)`. This is exactly the "missing abstraction" CodeScene named for the argument-count smell. The LLM stays lazy (`require_llm()`) so a retrieval-only run never forces an LLM provider to resolve, matching the original per-evaluator resolution.
- `_ExtractionTally` / `_RetrievalTally` / `_RagTally`, with `score`/`record` + `to_result` methods, plus shared `_ingest_relevant_sources` / `_resolve_relevant_ids` helpers and a `_citation_validity_contribution` helper (removes the duplicated ingest block).
- `_collect_filenames` flattened to comprehensions.
Definition of Done
- `make check` passes: 222 backend pytest, 7 frontend Vitest + build, `ruff` / `ruff-format` / `mypy` clean
- `test_eval_harness.py`'s asserted metric values (micro/macro 0.75 & 1.0, precision@2 0.5, recall 1.0, MRR 1.0, the three RAG rates) are unchanged; only the call sites were updated to pass an `EvalContext`
- `provider == "fake"` → `quotable=False` + `None`-metric returns are preserved verbatim in each evaluator (Golden Rule #5)
CodeScene before → after
`evaluate_rag` / `evaluate_extraction` / `evaluate_retrieval` cc
Notes
Decoupled from the Gemini work by design: branched off `main` after #17 merged. No public API beyond the `eval/` package changed; the only external caller, `eval/run.py`, now builds the context via `EvalContext.create(settings)`.
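A rough sketch of the context object described under "What changed" may help reviewers picture the shape. Everything below is an assumption for illustration: `SettingsSketch`, `FakeEmbedderSketch`, `FakeLLMSketch`, the `llm_factory` field, and the directory defaults are hypothetical stand-ins, and the real `EvalContext` in `eval/harness.py` may be structured differently. The only details taken from this PR are the frozen dataclass, the `create(settings)` constructor, and a `require_llm()` that resolves on each call rather than caching.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical stand-in for the project's settings object."""

    llm_provider: str = "fake"
    corpus_dir: Path = Path("eval/corpus")  # assumed path, for illustration
    labels_dir: Path = Path("eval/labels")  # assumed path, for illustration


class FakeEmbedderSketch:
    """Hypothetical offline embedder so the sketch needs no API keys."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text))]


class FakeLLMSketch:
    """Hypothetical LLM handle that only remembers its provider name."""

    def __init__(self, provider: str) -> None:
        self.provider = provider


@dataclass(frozen=True)
class EvalContextSketch:
    """Bundles the dependencies every evaluator needs into one argument."""

    settings: SettingsSketch
    embedder: FakeEmbedderSketch
    corpus_dir: Path
    labels_dir: Path
    llm_factory: Callable[[SettingsSketch], FakeLLMSketch]

    @classmethod
    def create(cls, settings: SettingsSketch) -> EvalContextSketch:
        """Build a context from settings without resolving an LLM."""
        return cls(
            settings=settings,
            embedder=FakeEmbedderSketch(),
            corpus_dir=settings.corpus_dir,
            labels_dir=settings.labels_dir,
            llm_factory=lambda s: FakeLLMSketch(s.llm_provider),
        )

    def require_llm(self) -> FakeLLMSketch:
        """Resolve the LLM on demand; re-resolved on every call, not cached."""
        return self.llm_factory(self.settings)


ctx = EvalContextSketch.create(SettingsSketch())
llm = ctx.require_llm()  # only evaluators that need an LLM call this
```

Under these assumptions a retrieval-only run would construct the context and never call `require_llm()`, so no LLM provider is resolved for it.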