Add RAGAS faithfulness gate to agent graph by TylerAnderton · Pull Request #2 · TylerAnderton/Healthcare-Assistant

TylerAnderton · 2026-06-13T21:07:10Z

Summary

Embeds a RAGAS faithfulness gate inside the LangGraph agent and adds an
offline RAGAS evaluation harness. Also fixes several agent data-plumbing
defects the harness surfaced.

RAGAS gate (commit: feat)

- New ragas_evaluator node scores every response (faithfulness,
  relevance, context precision) before it reaches the user.
- Low faithfulness triggers one forced re-retrieval via ragas_rewrite.
- Grader, rewrite, and judge run on a non-thinking JUDGE_MODEL
  (qwen3:4b-instruct-2507): thinking judges burned ~5k hidden tokens
  per verdict (minutes per call), and qwen3 *-2507 thinking variants
  ignore think=false.
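
The routing described above can be sketched as a plain conditional-edge function. This is a minimal illustration, not the PR's actual code: the threshold value, the state keys (`faithfulness`, `rewrite_count`), and the `respond` node name are assumptions; only `ragas_rewrite` and the "one forced re-retrieval" limit come from the description.

```python
from typing import Literal, TypedDict

# Hypothetical values: the PR does not state the actual threshold.
FAITHFULNESS_THRESHOLD = 0.7
MAX_REWRITES = 1  # "one forced re-retrieval" per the description


class GateState(TypedDict):
    """Assumed slice of the agent state written by ragas_evaluator."""
    faithfulness: float
    rewrite_count: int


def route_after_ragas(state: GateState) -> Literal["ragas_rewrite", "respond"]:
    """Pick the next node after the evaluator has scored a response."""
    low_faithfulness = state["faithfulness"] < FAITHFULNESS_THRESHOLD
    can_retry = state["rewrite_count"] < MAX_REWRITES
    # Retry at most once; afterwards the answer goes out regardless of score.
    if low_faithfulness and can_retry:
        return "ragas_rewrite"
    return "respond"
```

In a LangGraph `StateGraph`, a function of this shape would typically be registered with `add_conditional_edges` on the evaluator node. Capping the retry count in state keeps a persistently low score from looping forever.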

Data-plumbing fixes (commit: fix)

Surfaced by the offline suite; the agent was returning wrong or empty data:

- labs_tool: NaN cells (e.g. flag, NaN in 483/519 rows) leaked into
  Optional[str] outputs, causing Pydantic validation failures.
- meds_tool.list_current: date_start came out as pandas Timestamp and
  missing cells as NaN, both rejected by the str output schema.
- whoop_tool.recent: anchored its window to wall-clock now(), so stale
  ingestions returned nothing; now falls back to a window ending at the
  latest date in the data.
- structured_context.load_whoop_recent: iterated DataFrames as lists
  of dicts ("truth value of a DataFrame is ambiguous"); the error was
  silently swallowed, so the WHOOP snapshot block was always empty.
- prompts: instruct the model to discover exact analyte/medication
  names via labs_list_analytes / meds_list_medications before lookups
  (fixes wrong-row matches like ALT vs ALT (SGPT)).
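
Two of the fixes above lend themselves to a short sketch: coercing cells before they hit a str schema, and falling back to a data-anchored window. Both helpers below (`to_optional_str`, `resolve_window`) are hypothetical names written against the standard library only; they are not the PR's implementation. They rely on two facts: `pandas.Timestamp` subclasses `datetime.datetime`, and NumPy's `float64` subclasses `float`. Rendering timestamps as date-only ISO strings is an assumption.

```python
import math
from datetime import date, datetime, timedelta
from typing import Any, Iterable, Optional, Tuple


def to_optional_str(value: Any) -> Optional[str]:
    """Coerce a DataFrame cell into something an Optional[str] field accepts."""
    if value is None:
        return None
    # Missing cells usually arrive as float NaN, which is not a valid str.
    if isinstance(value, float) and math.isnan(value):
        return None
    # Timestamps: emit an ISO date (assumed format, adjust as needed).
    if isinstance(value, datetime):
        return value.date().isoformat()
    return str(value)


def resolve_window(
    data_dates: Iterable[date], days: int, today: date
) -> Tuple[date, date]:
    """Return (start, end) for a 'recent' query.

    Uses the wall-clock window when it can contain data; otherwise
    anchors the window at the latest date present in the data.
    """
    dates = list(data_dates)
    end = today
    start = end - timedelta(days=days)
    if dates and max(dates) < start:
        # Stale ingestion: everything predates the wall-clock window.
        end = max(dates)
        start = end - timedelta(days=days)
    return start, end
```

Passing `today` in explicitly, rather than calling `date.today()` inside the helper, keeps the fallback deterministic under test.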

Offline eval suite (diagnostic, opt-in)

tests/test_rag_eval.py runs the real agent + judge via Ollama, gated
behind --run-eval (never runs in normal CI). It is a diagnostic
harness, not a pass/fail gate: AnswerCorrectness scores are limited by
the small local judge (noisy at temperature=0, penalizes verbose-but-
correct answers).
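
For readers unfamiliar with opt-in suites, a flag like --run-eval is commonly wired up in conftest.py roughly as follows. This is a sketch of the standard pytest pattern, not necessarily how this repository does it; the `eval` marker name is an assumption.

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag; it is off unless passed on the command line.
    parser.addoption(
        "--run-eval",
        action="store_true",
        default=False,
        help="run the live RAGAS evaluation suite (needs Ollama)",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "eval: live agent + judge evaluation")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-eval"):
        return  # flag given: leave eval tests enabled
    skip_eval = pytest.mark.skip(reason="needs --run-eval")
    for item in items:
        # Only tests carrying the (assumed) 'eval' marker are skipped.
        if "eval" in item.keywords:
            item.add_marker(skip_eval)
```

With this in place, tests decorated with `@pytest.mark.eval` are skipped in a normal run and execute only under `pytest --run-eval`.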

Latest --run-eval: 5 passed / 10 failed. Structural fixes verified
working (multi-domain, meds-current, WHOOP cases now pass). Remaining
failures are dominated by judge noise / answer verbosity on otherwise-
correct answers.

Known follow-ups (not in scope)

- Most-recent strain should be sourced from workouts, not recovery
  (recovery's strain is NaN); see ragas_012.
- Small-judge AnswerCorrectness noise; consider a stronger judge or a
  terser-answer constraint if this suite becomes a gate.
- Pre-existing failures unrelated to this PR: 5 meds_dosage_on_date,
  7 whoop_accuracy (fixture/data drift).

Test plan

- Unit tests for RAGAS nodes/routing (test_ragas_nodes.py): pass, mocked
- New unit tests for labs/meds/whoop/structured_context sanitization: pass
- Fast suite (pytest tests --no-chat): 234 passed, 12 pre-existing failures, no new regressions
- Full --run-eval suite run live (1:15:08): 5 pass / 10 fail, documented above
- Live spot-check: ALT query now discovers ALT (SGPT) and returns 38.0 / 2025-03-27 (faithfulness 0.9)

🤖 Generated with Claude Code

Every response now passes through a ragas_evaluator node (faithfulness, relevance, context precision) before reaching the user; low faithfulness triggers one forced re-retrieval loop via ragas_rewrite. Grader, rewrite, and RAGAS judge run on a non-thinking JUDGE_MODEL (qwen3:4b-instruct-2507): thinking judges burned ~5k hidden tokens per verdict (minutes per call), and qwen3 *-2507 thinking variants ignore think=false entirely.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The offline RAGAS suite exposed several defects that made the agent return wrong or empty data:

- labs_tool: missing cells (e.g. flag, NaN in 483/519 rows) leaked float NaN into Optional[str] outputs, failing Pydantic validation.
- meds_tool.list_current: date_start came out as pandas Timestamp and missing cells as NaN, both rejected by the str output schema.
- whoop_tool.recent: anchored its window strictly to wall-clock now(), so stale ingestions returned nothing; now falls back to a window ending at the latest date present in the data.
- structured_context.load_whoop_recent: iterated DataFrames as lists of dicts ("truth value of a DataFrame is ambiguous"), silently swallowed, so the WHOOP snapshot block was always empty.
- prompts: instruct the model to discover exact analyte/medication names via labs_list_analytes / meds_list_medications before lookups instead of guessing (fixed wrong-row matches like ALT vs ALT (SGPT)).

Also fix the offline harness: AnswerCorrectness needs answer_similarity set explicitly when scored via single_turn_ascore (ragas only builds it inside evaluate()), and the judge needs format="json".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

TylerAnderton and others added 2 commits June 11, 2026 23:16

TylerAnderton wants to merge 2 commits into main from eval/ragas
