refactor(eval): restructure harness to Code Health 10.0#18

Merged
div0rce merged 5 commits into main from refactor/eval-harness-health on Jun 8, 2026
Conversation

@div0rce div0rce commented Jun 8, 2026

Owner

Milestone

Standalone tech-debt refactor (not a roadmap milestone) — eval/harness.py Code Health

Summary

eval/harness.py carried pre-existing technical debt (CodeScene Code Health 7.52 — Yellow): three ~100-line evaluator functions (evaluate_extraction, evaluate_retrieval, evaluate_rag) with cyclomatic complexity 14–20, a bumpy-road nesting smell, and 5–6-argument signatures. This was flagged during the Gemini PR (#17) but deliberately kept out of that change — it's the eval logic with the honesty gates, so it belongs in its own behaviour-preserving PR. This restructures the module to Code Health 10.0 with no change to eval semantics.

What changed (behaviour-preserving)

  • EvalContext bundles the shared dependencies (settings, embedder, lazy llm, corpus/label dirs); every evaluator and helper now takes (session, ctx). This is exactly the "missing abstraction" CodeScene named for the argument-count smell. The LLM stays lazy (require_llm()) so a retrieval-only run never forces an LLM provider to resolve — matching the original per-evaluator resolution.
  • Each scoring loop is extracted into a focused tally — _ExtractionTally / _RetrievalTally / _RagTally — with score/record + to_result methods, plus shared _ingest_relevant_sources / _resolve_relevant_ids helpers and a _citation_validity_contribution helper (removes the duplicated ingest block).
  • _collect_filenames flattened to comprehensions.
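
A minimal sketch of the context-bundling idea in the list above, assuming a frozen dataclass whose LLM is resolved on demand. Only EvalContext, create, and require_llm are names taken from this PR; Settings, FakeEmbedder, resolve_llm, RESOLVE_CALLS, the field names, and the directory defaults are hypothetical stand-ins, so the real module may look quite different.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the project's settings object."""

    llm_provider: str = "fake"


class FakeEmbedder:
    """Hypothetical embedder; the real provider lives elsewhere in the repo."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text))]


RESOLVE_CALLS: list[str] = []  # records each LLM resolution, for illustration only


def resolve_llm(settings: Settings) -> str:
    """Hypothetical resolver; returns a label instead of a real client."""
    RESOLVE_CALLS.append(settings.llm_provider)
    return f"llm:{settings.llm_provider}"


@dataclass(frozen=True)
class EvalContext:
    settings: Settings
    embedder: FakeEmbedder
    corpus_dir: Path
    labels_dir: Path

    @classmethod
    def create(cls, settings: Settings) -> "EvalContext":
        # Directory defaults are assumptions made for this sketch.
        return cls(settings, FakeEmbedder(), Path("eval/corpus"), Path("eval/labels"))

    def require_llm(self) -> str:
        # Resolved on each call, so a retrieval-only run never touches the LLM.
        return resolve_llm(self.settings)


def evaluate_retrieval(session: object, ctx: EvalContext) -> int:
    # Uses only the embedder; no LLM resolution happens on this path.
    return len(ctx.embedder.embed("query"))
```

In this sketch, calling evaluate_retrieval(None, EvalContext.create(Settings())) leaves RESOLVE_CALLS empty, which is the property the lazy require_llm() is meant to preserve.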

Definition of Done

  • make check passes — 222 backend pytest, 7 frontend Vitest + build, ruff/ruff-format/mypy clean
  • Behaviour preserved: test_eval_harness.py's asserted metric values (micro/macro 0.75 & 1.0, precision@2 0.5, recall 1.0, MRR 1.0, the three RAG rates) are unchanged; only the call sites were updated to pass an EvalContext
  • Honesty gates intact: the provider == "fake" → quotable=False + None-metric returns are preserved verbatim in each evaluator (Golden Rule #5)
  • No secrets; no behavioural/API change beyond the internal signatures
  • CodeScene change-set verdict "improved", quality gates "passed"
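
The honesty gate mentioned in the list above can be pictured as an early return. This is a sketch under stated assumptions: the "fake" provider check, quotable, the None metric, and the n_queries field come from this PR's text, while the function name evaluate_retrieval_gated, the mrr field, and passing reciprocal ranks in directly are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievalResult:
    """Field names other than quotable and n_queries are assumptions."""

    quotable: bool
    n_queries: int
    mrr: Optional[float]


def evaluate_retrieval_gated(provider: str, reciprocal_ranks: list[float]) -> RetrievalResult:
    # Honesty gate: scores produced with a fake provider must never look quotable.
    if provider == "fake":
        # n/a case: report the total label count together with a None metric.
        return RetrievalResult(quotable=False, n_queries=len(reciprocal_ranks), mrr=None)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
    return RetrievalResult(quotable=True, n_queries=len(reciprocal_ranks), mrr=mrr)
```

Keeping the gate as the first statement of each evaluator is what makes "preserved verbatim" easy to check in review: the refactor only moves the scoring that follows it.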

CodeScene before → after

| Metric | Before | After |
| --- | --- | --- |
| File Code Health | 7.52 (Yellow) | 10.0 |
| Module mean cyclomatic complexity | 6.25 | 3.13 |
| evaluate_rag / evaluate_extraction / evaluate_retrieval cc | 20 / 17 / 14 | ~3–5 each |
| Largest evaluator | 105 LoC | ~30 LoC |
| Bumpy Road / Excess-Arguments smells | present | cleared |

Notes

Decoupled from the Gemini work by design — branched off main after #17 merged. No public API beyond the eval/ package changed; the only external caller, eval/run.py, now builds the context via EvalContext.create(settings).

eval/harness.py carried pre-existing technical debt (CodeScene Code Health 7.52):
three ~100-line evaluator functions with cyclomatic complexity 14-20, a bumpy-road
nesting smell, and 5-6 argument signatures.

Refactor (behaviour-preserving — the asserted metric values in test_eval_harness.py
are unchanged):
- Introduce EvalContext bundling the shared deps (settings, embedder, lazy llm,
  corpus/label dirs); every evaluator and helper now takes (session, ctx). This is
  the 'missing abstraction' CodeScene flagged for the argument-count smell.
- Extract each scoring loop into a focused tally (_ExtractionTally / _RetrievalTally
  / _RagTally) with score/record + to_result methods, plus shared _ingest_relevant_
  sources / _resolve_relevant_ids helpers and a _citation_validity_contribution helper.
- Flatten _collect_filenames to comprehensions.

Result: every evaluator drops to cc 3-5 / ~25 lines / 2 args; module mean cc 6.25 -> 3.13;
Code Health 7.52 -> 10.0. make check green (222 backend pytest, 7 frontend, ruff/mypy clean).
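
The tally extraction described in this commit message can be sketched as follows. This is an illustrative guess, not the PR's code: the class name ExtractionTally, its fields, the exact-match scoring rule, and the evaluate wrapper are assumptions; only the score / to_result split and the micro/macro metrics are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionTally:
    """Mutable accumulator; the PR's _ExtractionTally may differ in detail."""

    correct: int = 0
    total: int = 0
    per_doc: list[float] = field(default_factory=list)

    def score(self, predicted: dict[str, str], expected: dict[str, str]) -> None:
        # Count exact field matches for one document (assumed scoring rule).
        hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
        self.correct += hits
        self.total += len(expected)
        if expected:
            self.per_doc.append(hits / len(expected))

    def to_result(self) -> dict[str, float]:
        # Micro averages over all fields; macro averages per-document accuracy.
        micro = self.correct / self.total if self.total else 0.0
        macro = sum(self.per_doc) / len(self.per_doc) if self.per_doc else 0.0
        return {"micro": micro, "macro": macro}


def evaluate(pairs: list[tuple[dict[str, str], dict[str, str]]]) -> dict[str, float]:
    # The evaluator body shrinks to a loop plus one aggregation call.
    tally = ExtractionTally()
    for predicted, expected in pairs:
        tally.score(predicted, expected)
    return tally.to_result()
```

Moving the branching into score and to_result is what lets the evaluator itself stay at a low cyclomatic complexity.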

coderabbitai Bot commented Jun 8, 2026

Warning

Review limit reached

@div0rce, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 26 minutes and 16 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7e5b69f5-1fd1-4175-a695-c623af08036d

📥 Commits

Reviewing files that changed from the base of the PR and between 2151556 and 7edd0e2.

📒 Files selected for processing (4)
  • .coderabbit.yaml
  • backend/tests/test_eval_harness.py
  • eval/harness.py
  • eval/run.py
Walkthrough

This PR refactors the evaluation harness to consolidate scattered parameter passing into a unified EvalContext object that bundles settings, embedding/LLM providers, and corpus/labels directories. Evaluator APIs, orchestration, tests, and entry points are updated to construct and use this context instead of passing individual parameters.

Changes

Evaluation Harness Context Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| EvalContext definition and refactored helper utilities (eval/harness.py) | Introduced EvalContext class with create factory and require_llm method, plus refactored corpus-loading, label-loading, and ingestion helpers to pull settings/embedder/corpus_dir from the context. |
| Extraction evaluator with tally system (eval/harness.py) | Introduced _ExtractionTally for mutable per-document scoring, refactored ingestion resolution helpers, and updated evaluate_extraction to use context-driven control flow with ctx.require_llm() and tally aggregation. |
| Retrieval evaluator with tally system (eval/harness.py) | Introduced _RetrievalTally for mutable metric recording, and updated evaluate_retrieval to ingest relevant sources once via context, resolve chunk IDs per query, run vector retrieval, and record ranking metrics through tally aggregation. |
| RAG evaluator with tally system and citation parsing (eval/harness.py) | Introduced _RagTally with record_answer method and helpers (_parse_cited_chunk_ids, _citation_validity_contribution) for citation and substring scoring. Updated evaluate_rag to resolve LLM via ctx.require_llm(), ingest sources once, run answer queries, record answer metrics through tally, and return aggregated result. |
| Orchestration and utility updates (eval/harness.py) | Updated run_all signature to accept (session, ctx) and invoke all three evaluators with the shared context. Simplified _collect_filenames to comprehension-based concatenation. |
| Entry point integration (eval/run.py) | Updated eval/run.py to import EvalContext and construct it via EvalContext.create(settings=settings) before passing to run_all. |
| Test adoption of EvalContext pattern (backend/tests/test_eval_harness.py) | Updated all test cases to import EvalContext and construct context objects containing settings, providers, and corpus/labels directories before calling evaluate_extraction, evaluate_retrieval, and evaluate_rag. Covers fake provider tests, correctness tests, metric tests, and writer round-trip test. |
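
The retrieval row above mentions recording ranking metrics through a tally. A hedged sketch of what such a tally could compute is shown below; the class name RetrievalTally, its fields, the result keys, and the choice to measure recall within the top k are all assumptions, and the PR's _RetrievalTally may be organised differently.

```python
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class RetrievalTally:
    """Hypothetical accumulator for precision@k, recall, and MRR."""

    k: int
    precisions: list[float] = field(default_factory=list)
    recalls: list[float] = field(default_factory=list)
    reciprocal_ranks: list[float] = field(default_factory=list)

    def record(self, ranked_ids: list[str], relevant_ids: set[str]) -> None:
        top_k = ranked_ids[: self.k]
        hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
        self.precisions.append(hits / self.k)
        # Recall is measured within the top k here, which is an assumption.
        self.recalls.append(hits / len(relevant_ids) if relevant_ids else 0.0)
        # Reciprocal rank of the first relevant chunk, 0.0 if none was retrieved.
        rank = next(
            (i for i, chunk_id in enumerate(ranked_ids, start=1) if chunk_id in relevant_ids),
            None,
        )
        self.reciprocal_ranks.append(1.0 / rank if rank is not None else 0.0)

    def to_result(self) -> dict[str, float]:
        if not self.precisions:
            return {"precision_at_k": 0.0, "recall": 0.0, "mrr": 0.0}
        return {
            "precision_at_k": fmean(self.precisions),
            "recall": fmean(self.recalls),
            "mrr": fmean(self.reciprocal_ranks),
        }
```

For example, with k=2, a single relevant chunk ranked first among two results gives precision@2 of 0.5, recall of 1.0, and MRR of 1.0 in this sketch.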

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.52% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'refactor(eval): restructure harness to Code Health 10.0' clearly and specifically summarizes the main change: a targeted refactor of eval/harness.py to improve Code Health metrics. |
| Description check | ✅ Passed | The PR description closely follows the template with all major sections completed: Milestone identified, Summary provided (4 sentences covering the technical debt, refactor scope, behavior preservation), Definition of Done with checked items, and detailed Notes explaining CodeScene improvements and API changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d49ebec-24f0-4ba5-b719-f4c1defffab5

📥 Commits

Reviewing files that changed from the base of the PR and between 8fc2eac and 2151556.

📒 Files selected for processing (3)
  • backend/tests/test_eval_harness.py
  • eval/harness.py
  • eval/run.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use type hints everywhere; mypy must pass cleanly with no errors
Write docstrings on public functions; comments should explain why, not what
Use ruff for linting and formatting in Python; maintain ruff check and ruff format --check compliance

Files:

  • eval/run.py
  • backend/tests/test_eval_harness.py
  • eval/harness.py
backend/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/tests/**/*.py: Hard-test all deterministic logic: chunking, the workflow engine, guardrails, and the audit log
Mock LLM and embeddings in tests via fake implementations; CI runs offline with no API keys required
Workflow engine MUST have explicit determinism and idempotency tests: same input produces same output; re-run produces no duplicate side effects
Audit log MUST be append-only; include a test that reconstructs current state by replaying events
Seed any randomness in tests; pin temperatures for LLM calls used in evaluation to ensure determinism

Files:

  • backend/tests/test_eval_harness.py
🧠 Learnings (1)
📚 Learning: 2026-06-08T15:13:33.301Z
Learnt from: CR
Repo: div0rce/sentinel PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-06-08T15:13:33.301Z
Learning: Applies to backend/app/llm/**/*.py : Implement LLM provider abstraction in `backend/app/llm/` and embeddings provider abstraction in `backend/app/embeddings/` to keep both swappable and mockable for tests

Applied to files:

  • backend/tests/test_eval_harness.py
🔇 Additional comments (18)
eval/run.py (1)

14-14: LGTM!

Also applies to: 91-91

eval/harness.py (8)

1-25: LGTM!


136-202: LGTM!


207-278: LGTM!


281-317: LGTM!


360-392: LGTM!


398-472: LGTM!


475-515: LGTM!


536-567: LGTM!

backend/tests/test_eval_harness.py (9)

41-41: LGTM!


175-182: LGTM!


205-211: LGTM!


238-245: LGTM!


311-318: LGTM!


349-356: LGTM!


414-420: LGTM!


458-465: LGTM!


506-513: LGTM!

Comment thread eval/harness.py
Comment thread eval/harness.py
div0rce added 3 commits June 8, 2026 11:41
Close the genuine public-API docstring gap in the files this PR touches: the four
result dataclasses (ExtractionResult/RetrievalResult/RagResult/HarnessReport) and the
eval.run functions. Public API in eval/harness.py and eval/run.py is now 100%
documented (private helpers stay per CLAUDE.md's 'docstrings on public functions').

…ries semantics

Address CodeRabbit review on PR #18: require_llm's docstring no longer implies
caching (it re-resolves each call on a frozen dataclass), and RetrievalResult
documents that n_queries is the scored-query count when quotable vs the total
label count in the n/a case. Docstring-only; no behaviour change.

CodeRabbit's docstring-coverage pre-merge check is scored on the PR's changed files
(PR #18 was at 64.5% < 80% threshold). Document every remaining function/method/class
in eval/harness.py and test_eval_harness.py with meaningful one-liners (consistent with
the tests that already carried docstrings) — describing what each test verifies and what
each helper does, not stub padding. Changed files now 100%; clears the advisory honestly
without lowering the threshold. Code Health unchanged (10.0); make-check green.
codescene-delta-analysis[bot]

This comment was marked as outdated.

div0rce commented Jun 8, 2026

Owner Author

@coderabbitai review

coderabbitai Bot commented Jun 8, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Config-as-code in preference to hidden Org-UI settings. Encodes CLAUDE.md's
docstring convention (public functions documented; comments explain why; tests
self-document) as per-path docstring-generation instructions. Deliberately does
NOT lower or disable the docstring-coverage pre-merge check — its threshold stays
at the Org default. Side effect: the fresh review this commit triggers recomputes
the (previously lagging) docstring-coverage figure at the current head, where the
changed files are 100% documented.

@codescene-delta-analysis codescene-delta-analysis Bot left a comment

Code Health Improved (1 file improves in Code Health)

Gates Passed
6 Quality Gates Passed

View Improvements

| File | Code Health Impact | Categories Improved |
| --- | --- | --- |
| harness.py | 7.52 → 10.00 | Complex Method, Bumpy Road Ahead, Overall Code Complexity, Excess Number of Function Arguments |

Quality Gate Profile: Pay Down Tech Debt

div0rce commented Jun 8, 2026

Owner Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

@div0rce div0rce merged commit 5b709ed into main Jun 8, 2026
5 checks passed
@div0rce div0rce deleted the refactor/eval-harness-health branch June 8, 2026 19:53
div0rce added a commit that referenced this pull request Jun 8, 2026
Document all remaining public production definitions (embedding/LLM provider impls, repository getters, the ingest CLI entry, request-id middleware, and dashboard/review response models). Docstring-only; no behaviour change. With #18, the repo's entire public production API is documented.

1 participant