
Add RAGAS faithfulness gate to agent graph #2

Open
TylerAnderton wants to merge 2 commits into main from eval/ragas

@TylerAnderton (Owner)

Summary

Embeds a RAGAS faithfulness gate inside the LangGraph agent and adds an
offline RAGAS evaluation harness. Also fixes several agent data-plumbing
defects the harness surfaced.

RAGAS gate (commit: feat)

  • New ragas_evaluator node scores every response (faithfulness,
    relevance, context precision) before it reaches the user.
  • Low faithfulness triggers one forced re-retrieval via ragas_rewrite.
  • Grader, rewrite, and judge run on a non-thinking JUDGE_MODEL
    (qwen3:4b-instruct-2507) — thinking judges burned ~5k hidden tokens
    per verdict (minutes per call), and qwen3 *-2507 thinking variants
    ignore think=false.
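
The gate's routing can be pictured as a small decision function. This is a minimal sketch, not the code in this PR: the threshold value, the `GateState` fields, and the `respond` node name are assumptions for illustration. Only `ragas_evaluator`, `ragas_rewrite`, and the single-retry limit come from the description above.

```python
from dataclasses import dataclass

# Assumed cutoff; the PR description does not state the actual threshold.
FAITHFULNESS_THRESHOLD = 0.7
# "One forced re-retrieval" per the description above.
MAX_REWRITES = 1


@dataclass
class GateState:
    """Hypothetical slice of agent state that the gate reads."""
    faithfulness: float
    rewrite_count: int = 0


def route_after_ragas(state: GateState) -> str:
    """Return the name of the node to run after ragas_evaluator."""
    if state.faithfulness >= FAITHFULNESS_THRESHOLD:
        return "respond"  # hypothetical name for the final answer node
    if state.rewrite_count < MAX_REWRITES:
        return "ragas_rewrite"
    # Retry budget spent: release the answer instead of looping forever.
    return "respond"
```

In a LangGraph graph, a function of this shape would typically be wired in with `add_conditional_edges`, adapted to read the graph's real state type. Capping the retry count inside the router keeps the loop bounded even when the judge keeps returning low scores.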

Data-plumbing fixes (commit: fix)

Surfaced by the offline suite; the agent was returning wrong or empty data:

  • labs_tool: missing cells (e.g. flag, NaN in 483/519 rows) leaked
    float NaN into Optional[str] outputs, failing Pydantic validation.
  • meds_tool.list_current: date_start came out as a pandas Timestamp
    and missing cells as NaN, both rejected by the str output schema.
  • whoop_tool.recent: anchored window to wall-clock now(), so stale
    ingestions returned nothing; now falls back to a window ending at the
    latest date in the data.
  • structured_context.load_whoop_recent: iterated DataFrames as lists
    of dicts ("truth value of a DataFrame is ambiguous"); the error was
    swallowed silently, so the WHOOP snapshot block was always empty.
  • prompts: instruct the model to discover exact analyte/medication
    names via labs_list_analytes / meds_list_medications before lookups
    (fixes wrong-row matches like ALT vs ALT (SGPT)).
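
The NaN and Timestamp fixes for labs_tool and meds_tool both amount to coercing raw cells before they reach a str-typed schema. The helper below is a sketch of that idea using only the standard library; `to_optional_str` is a hypothetical name, and the actual sanitization in this PR may be structured differently.

```python
import math
from datetime import date, datetime
from typing import Any, Optional


def to_optional_str(value: Any) -> Optional[str]:
    """Coerce a raw cell into a value an Optional[str] field accepts."""
    if value is None:
        return None
    # Missing cells usually arrive as float NaN (numpy.float64 is a float subclass).
    if isinstance(value, float) and math.isnan(value):
        return None
    # pandas.Timestamp subclasses datetime, so this branch covers it as well.
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return str(value)


print(to_optional_str(float("nan")))          # None
print(to_optional_str(datetime(2025, 3, 27)))  # 2025-03-27T00:00:00
print(to_optional_str("H"))                    # H
```

Applying one helper at the tool boundary, rather than patching each column, keeps the output schema strict while making missing data explicit as None.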

Offline eval suite (diagnostic, opt-in)

tests/test_rag_eval.py runs the real agent + judge via Ollama, gated
behind --run-eval (never runs in normal CI). It is a diagnostic
harness, not a pass/fail gate: AnswerCorrectness scores are limited by
the small local judge (noisy at temperature=0, penalizes
verbose-but-correct answers).
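
For readers unfamiliar with opt-in test flags, the usual pytest pattern looks like the conftest.py sketch below. It is an assumption about how --run-eval could be wired, not a copy of this repository's conftest; in particular the `eval` marker name is hypothetical.

```python
import pytest


def pytest_addoption(parser):
    """Register the opt-in flag so plain `pytest` never runs the live suite."""
    parser.addoption(
        "--run-eval",
        action="store_true",
        default=False,
        help="run the live RAGAS evaluation suite (requires Ollama)",
    )


def pytest_collection_modifyitems(config, items):
    """Skip eval-marked tests unless --run-eval was passed."""
    if config.getoption("--run-eval"):
        return
    skip_eval = pytest.mark.skip(reason="needs --run-eval")
    for item in items:
        if "eval" in item.keywords:  # hypothetical marker name
            item.add_marker(skip_eval)
```

With this in place, tests decorated with `@pytest.mark.eval` are collected but skipped by default, and `pytest tests --run-eval` enables them.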

Latest --run-eval: 5 passed / 10 failed. Structural fixes verified
working (multi-domain, meds-current, WHOOP cases now pass). Remaining
failures are dominated by judge noise / answer verbosity on otherwise-
correct answers.

Known follow-ups (not in scope)

  • most recent strain should source from workouts, not recovery
    (recovery's strain is NaN) — ragas_012.
  • Small-judge AnswerCorrectness noise; consider a stronger judge or
    terser-answer constraint if this suite becomes a gate.
  • Pre-existing failures unrelated to this PR: 5 meds_dosage_on_date,
    7 whoop_accuracy (fixture/data drift).

Test plan

  • Unit tests for RAGAS nodes/routing (test_ragas_nodes.py) — pass, mocked
  • New unit tests: labs/meds/whoop/structured_context sanitization — pass
  • Fast suite (pytest tests --no-chat) — 234 passed, 12 pre-existing failures, no new regressions
  • Full --run-eval suite run live (1:15:08) — 5 pass / 10 fail, documented above
  • Live spot-check: ALT query now discovers ALT (SGPT) and returns 38.0 / 2025-03-27 (faithfulness 0.9)

🤖 Generated with Claude Code

TylerAnderton and others added 2 commits June 11, 2026 23:16
Every response now passes through a ragas_evaluator node (faithfulness,
relevance, context precision) before reaching the user; low faithfulness
triggers one forced re-retrieval loop via ragas_rewrite.

Grader, rewrite, and RAGAS judge run on a non-thinking JUDGE_MODEL
(qwen3:4b-instruct-2507): thinking judges burned ~5k hidden tokens per
verdict (minutes per call), and qwen3 *-2507 thinking variants ignore
think=false entirely.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The offline RAGAS suite exposed several defects that made the agent
return wrong or empty data:

- labs_tool: missing cells (e.g. flag, NaN in 483/519 rows) leaked
  float NaN into Optional[str] outputs, failing Pydantic validation.
- meds_tool.list_current: date_start came out as pandas Timestamp and
  missing cells as NaN, both rejected by the str output schema.
- whoop_tool.recent: anchored its window strictly to wall-clock now(),
  so stale ingestions returned nothing; now falls back to a window
  ending at the latest date present in the data.
- structured_context.load_whoop_recent: iterated DataFrames as lists of
  dicts ("truth value of a DataFrame is ambiguous"), silently swallowed,
  so the WHOOP snapshot block was always empty.
- prompts: instruct the model to discover exact analyte/medication names
  via labs_list_analytes / meds_list_medications before lookups instead
  of guessing (fixed wrong-row matches like ALT vs ALT (SGPT)).
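
The whoop_tool.recent fallback described in the list above can be sketched as follows. This is an illustrative stand-in that works on plain dates rather than the real DataFrame; `recent_window` and its parameters are hypothetical names.

```python
from datetime import date, timedelta
from typing import Iterable, List, Optional


def recent_window(
    dates: Iterable[date], days: int, today: Optional[date] = None
) -> List[date]:
    """Return dates in the last `days` days, re-anchoring if the data is stale."""
    today = today or date.today()
    all_dates = sorted(dates)
    if not all_dates:
        return []

    # First try the window anchored to wall-clock time.
    start = today - timedelta(days=days)
    recent = [d for d in all_dates if start < d <= today]
    if recent:
        return recent

    # Stale ingestion: fall back to a window ending at the latest date in the data.
    anchor = all_dates[-1]
    start = anchor - timedelta(days=days)
    return [d for d in all_dates if start < d <= anchor]
```

Callers may want to surface which anchor was used, so that an answer built on stale data is not presented as current.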

Also fix the offline harness: AnswerCorrectness needs answer_similarity
set explicitly when scored via single_turn_ascore (ragas only builds it
inside evaluate()), and the judge needs format="json".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>