Add RAGAS faithfulness gate to agent graph #2
Open
TylerAnderton wants to merge 2 commits into
Every response now passes through a ragas_evaluator node (faithfulness, relevance, context precision) before reaching the user; low faithfulness triggers one forced re-retrieval loop via ragas_rewrite. Grader, rewrite, and RAGAS judge run on a non-thinking JUDGE_MODEL (qwen3:4b-instruct-2507): thinking judges burned ~5k hidden tokens per verdict (minutes per call), and qwen3 *-2507 thinking variants ignore think=false entirely.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
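The gate's routing decision can be sketched as a plain function. This is a minimal illustration, not the PR's code: the state keys (`faithfulness`, `rewrite_count`), the `0.7` threshold, and the `"end"` label are assumptions. Only the `ragas_rewrite` node name and the single-retry limit come from the commit message.

```python
from typing import TypedDict

# Assumed cutoff for illustration; the PR does not state its threshold.
FAITHFULNESS_THRESHOLD = 0.7
# From the commit message: low faithfulness triggers ONE forced re-retrieval loop.
MAX_REWRITES = 1


class GateState(TypedDict):
    """Hypothetical slice of the agent state that the gate reads."""
    faithfulness: float   # score written by the ragas_evaluator node
    rewrite_count: int    # how many forced re-retrievals have already run


def route_after_ragas(state: GateState) -> str:
    """Return the name of the next node after the evaluator has scored a response."""
    low_score = state["faithfulness"] < FAITHFULNESS_THRESHOLD
    can_retry = state["rewrite_count"] < MAX_REWRITES
    if low_score and can_retry:
        return "ragas_rewrite"
    # Either the answer is faithful enough or the single retry is used up.
    return "end"
```

In a LangGraph graph, a function of this shape would typically be registered with `add_conditional_edges` on the evaluator node, so the retry budget is enforced by the router rather than by the judge.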
The offline RAGAS suite exposed several defects that made the agent
return wrong or empty data:
- labs_tool: missing cells (e.g. flag, NaN in 483/519 rows) leaked
float NaN into Optional[str] outputs, failing Pydantic validation.
- meds_tool.list_current: date_start came out as pandas Timestamp and
missing cells as NaN, both rejected by the str output schema.
- whoop_tool.recent: anchored its window strictly to wall-clock now(),
so stale ingestions returned nothing; now falls back to a window
ending at the latest date present in the data.
- structured_context.load_whoop_recent: iterated DataFrames as lists of
dicts ("truth value of a DataFrame is ambiguous"), silently swallowed,
so the WHOOP snapshot block was always empty.
- prompts: instruct the model to discover exact analyte/medication names
via labs_list_analytes / meds_list_medications before lookups instead
of guessing (fixed wrong-row matches like ALT vs ALT (SGPT)).
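The NaN and Timestamp leaks in the first two bullets come down to normalizing each cell before it reaches a `str` or `Optional[str]` field. A minimal sketch of such a normalizer follows, using only the standard library; the helper name `to_optional_str` is hypothetical and this is not the PR's implementation. It relies on the fact that pandas `Timestamp` subclasses `datetime.datetime`.

```python
import math
from datetime import date, datetime
from typing import Any, Optional


def to_optional_str(value: Any) -> Optional[str]:
    """Normalize one table cell so it fits an Optional[str] output field."""
    if value is None:
        return None
    # Missing cells surface as float NaN; map them to None.
    if isinstance(value, float) and math.isnan(value):
        return None
    # datetime (and subclasses such as pandas Timestamp) becomes an ISO date string.
    if isinstance(value, datetime):
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    return str(value)
```

Pandas-specific missing values such as `pd.NaT` are not covered here and would need their own check, for example with `pd.isna`.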
Also fix the offline harness: AnswerCorrectness needs answer_similarity
set explicitly when scored via single_turn_ascore (ragas only builds it
inside evaluate()), and the judge needs format="json".
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
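The `whoop_tool.recent` fallback listed in this commit can be illustrated with a small window-selection function. This is a sketch under assumptions: the function name `recent_window`, the 7-day default, and the injectable `today` parameter are hypothetical and chosen so the behavior is easy to test.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple


def recent_window(
    data_dates: Iterable[date],
    days: int = 7,
    today: Optional[date] = None,
) -> Tuple[date, date]:
    """Pick a (start, end) window, falling back to the newest data if it is stale."""
    dates = list(data_dates)
    end = today or date.today()
    start = end - timedelta(days=days)
    # Normal case: some data falls inside the wall-clock window (or there is no data at all).
    if not dates or any(start < d <= end for d in dates):
        return start, end
    # Stale ingestion: anchor the window to the latest date present in the data.
    latest = max(dates)
    return latest - timedelta(days=days), latest
```

With `today=date(2025, 6, 1)` and data ending on 2025-03-27, the wall-clock window is empty, so the function returns a window ending on 2025-03-27 instead of returning nothing.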
Summary
Embeds a RAGAS faithfulness gate inside the LangGraph agent and adds an
offline RAGAS evaluation harness. Also fixes several agent data-plumbing
defects the harness surfaced.
RAGAS gate (commit: feat)
- `ragas_evaluator` node scores every response (faithfulness, relevance, context precision) before it reaches the user.
- Low faithfulness triggers one forced re-retrieval loop via `ragas_rewrite`.
- Grader, rewrite, and RAGAS judge run on a non-thinking `JUDGE_MODEL` (`qwen3:4b-instruct-2507`): thinking judges burned ~5k hidden tokens per verdict (minutes per call), and qwen3 *-2507 thinking variants ignore `think=false`.
Data-plumbing fixes (commit: fix)
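Surfaced by the offline suite; the agent was returning wrong or empty data:
- `labs_tool`: missing cells (e.g. `flag`, NaN in 483/519 rows) leaked float NaN into `Optional[str]` outputs, causing Pydantic validation failures.
- `meds_tool.list_current`: `date_start` came out as a pandas `Timestamp` and missing cells as NaN, both rejected by the `str` output schema.
- `whoop_tool.recent`: anchored its window strictly to wall-clock `now()`, so stale ingestions returned nothing; it now falls back to a window ending at the latest date in the data.
- `structured_context.load_whoop_recent`: iterated DataFrames as lists of dicts ("truth value of a DataFrame is ambiguous"), swallowed silently, so the WHOOP snapshot block was always empty.
- prompts: instruct the model to discover exact analyte/medication names via `labs_list_analytes` / `meds_list_medications` before lookups (fixes wrong-row matches like `ALT` vs `ALT (SGPT)`).
Offline eval suite (diagnostic, opt-in)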
`tests/test_rag_eval.py` runs the real agent + judge via Ollama, gated behind `--run-eval` (never runs in normal CI). It is a diagnostic harness, not a pass/fail gate: `AnswerCorrectness` scores are limited by the small local judge (noisy at temperature=0, penalizes verbose-but-correct answers).
Latest `--run-eval`: 5 passed / 10 failed. Structural fixes verified working (multi-domain, meds-current, WHOOP cases now pass). Remaining failures are dominated by judge noise / answer verbosity on otherwise-correct answers.
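An opt-in flag like `--run-eval` is commonly wired up in `conftest.py` with pytest's `pytest_addoption` and `pytest_collection_modifyitems` hooks. The sketch below shows that standard pattern; the marker name `ragas_eval` is hypothetical, and the repository may gate its eval tests differently.

```python
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag; it is off by default so normal CI skips the suite.
    parser.addoption(
        "--run-eval",
        action="store_true",
        default=False,
        help="run the live RAGAS evaluation suite (needs a local Ollama judge)",
    )


def pytest_configure(config):
    # Declare the (hypothetical) marker so pytest does not warn about it.
    config.addinivalue_line("markers", "ragas_eval: live RAGAS evaluation test")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-eval"):
        return  # flag given: leave the eval tests enabled
    skip_eval = pytest.mark.skip(reason="needs --run-eval")
    for item in items:
        if "ragas_eval" in item.keywords:
            item.add_marker(skip_eval)
```

Tests decorated with `@pytest.mark.ragas_eval` are then skipped unless the suite is invoked as `pytest tests --run-eval`, which keeps a run of over an hour out of the default test path.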
Known follow-ups (not in scope)
- `ragas_012`: most recent strain should source from workouts, not recovery (recovery's `strain` is NaN).
- `AnswerCorrectness` noise; consider a stronger judge or terser-answer constraint if this suite becomes a gate.
- [...] `meds_dosage_on_date`, 7 `whoop_accuracy` (fixture/data drift).
Test plan
- [...] (`test_ragas_nodes.py`): pass, mocked
- [...] (`pytest tests --no-chat`): 234 passed, 12 pre-existing failures, no new regressions
- `--run-eval` suite run live (1:15:08): 5 pass / 10 fail, documented above
- [...] `ALT (SGPT)` and returns 38.0 / 2025-03-27 (faithfulness 0.9)
🤖 Generated with Claude Code