feat: add evaluations for RAG quickst#147
mhdawson
commented
Feb 24, 2026
- add framework to run conversations against the RAG quickstart UI and capture the user request and agent response along with the RAG results shown in the UI
- add framework to evaluate captured conversations using deep_eval: evaluate the response given the "expected" RAG results, and evaluate the actual RAG chunks against the expected answer and the "expected" RAG results
- add initial set of conversations for the hr and legal databases using the questions suggested in the UI
- see evaluations/README.md for how to set up and run
Signed-off-by: Michael Dawson <midawson@redhat.com>
Dev02 is not working right now, so I can't do final testing, but I wanted to submit this as I'll be out Wednesday through Friday. If I can get a working deployment of the RAG quickstart tomorrow (Tuesday), I'll do the final testing.
A few things I have learned so far running/building the evals:
This is an example of running just 2 of the test conversations, one from hr and one from legal:
[hr_benefits_eap_direct_20260219_163309.json]
✅ Response Accuracy [Conversational GEval] (score: 1.00, threshold: 0.7)
✅ Response Completeness [Conversational GEval] (score: 1.00, threshold: 0.6)
✅ Answer Relevance [Conversational GEval] (score: 1.00, threshold: 0.7)
[legal_liability_clause_direct_20260219_164109.json]
✅ Response Accuracy [Conversational GEval] (score: 1.00, threshold: 0.7)
❌ Response Completeness [Conversational GEval] (score: 0.50, threshold: 0.6)
Reason: The assistant's response mentions the liability clause as one of the key areas of review, specifically under 'Risk Assessment'. It also provides a general explanation of what the liability clause typi...
❌ Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
Reason: The assistant's response attempts to address the user's specific question about the liability clause by mentioning its relevance in the contract review process and providing context about its typical ...
Retrieval Results (Limited Chunks):
[hr_benefits_eap_direct_20260219_163309.json_turn_1]
✅ Chunk Count Limit (score: 1.00, threshold: 1.0)
✅ Chunk Deduplication (score: 1.00, threshold: 1.0)
❌ Chunk Alignment [GEval] (score: 0.20, threshold: 0.7)
Reason: The retrieval context contains information about various employee benefits, perks, and company culture at FantaCo, but it does not specifically mention the 'Cry Closet' described in the context. The '...
❌ Contextual Precision (score: 0.33, threshold: 0.7)
Reason: The score is 0.33 because the relevant node, ranked 3rd, is outranked by 2 irrelevant nodes, and there are 7 more irrelevant nodes ranked lower than it, indicating that the relevant node is not ranked...
❌ Contextual Relevancy (score: 0.00, threshold: 0.7)
Reason: The score is 0.00 because the retrieval context does not mention an employee assistance program and instead describes unrelated office perks, benefits, and company policies, such as catering for dieta...
❌ Faithfulness (score: 0.14, threshold: 0.7)
Reason: The score is 0.14 because the actual output likely mentions a 'Cry Closet' or 'Champagne, Compliments & Catharsis Chamber' which is not present in the retrieval context, indicating a significant devia...
[legal_liability_clause_direct_20260219_164109.json_turn_1]
✅ Chunk Count Limit (score: 1.00, threshold: 1.0)
✅ Chunk Deduplication (score: 1.00, threshold: 1.0)
✅ Chunk Alignment [GEval] (score: 0.80, threshold: 0.7)
✅ Contextual Precision (score: 0.76, threshold: 0.7)
❌ Contextual Relevancy (score: 0.06, threshold: 0.7)
Reason: The score is 0.06 because the retrieval context is largely irrelevant to the input question about the liability clause. As stated, 'The provided context does not contain information about the liabilit...
✅ Faithfulness (score: 0.83, threshold: 0.7)
================================================================================
EVALUATION SUMMARY
================================================================================
✗ FAIL hr_benefits_eap_direct_20260219_163309.json
FAILURES:
• [Retrieval] Chunk Alignment [GEval] (score: 0.20, threshold: 0.7)
The retrieval context contains information about various employee
benefits, perks, and company culture at FantaCo, but i...
• [Retrieval] Contextual Precision (score: 0.33, threshold: 0.7)
The score is 0.33 because the relevant node, ranked 3rd, is outranked
by 2 irrelevant nodes, and there are 7 more irrele...
• [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
The score is 0.00 because the retrieval context does not mention an
employee assistance program and instead describes un...
• [Retrieval] Faithfulness (score: 0.14, threshold: 0.7)
The score is 0.14 because the actual output likely mentions a 'Cry
Closet' or 'Champagne, Compliments & Catharsis Chambe...
✗ FAIL legal_liability_clause_direct_20260219_164109.json
FAILURES:
• [Conversational] Response Completeness [Conversational GEval] (score: 0.50, threshold: 0.6)
The assistant's response mentions the liability clause as one of the
key areas of review, specifically under 'Risk Asses...
• [Conversational] Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
The assistant's response attempts to address the user's specific
question about the liability clause by mentioning its r...
• [Retrieval] Contextual Relevancy (score: 0.06, threshold: 0.7)
The score is 0.06 because the retrieval context is largely irrelevant
to the input question about the liability clause. ...
================================================================================
================================================================================
Evaluation Complete!
================================================================================
Conversational evaluations: 2
Retrieval evaluations: 2
Results saved to: results/deep_eval_results/evaluation_results_20260223_202954.json
Token Usage:
LLM API requests: 42
Input tokens: 67,146
Output tokens: 15,622
Total tokens: 82,768
================================================================================
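The Contextual Precision scores in this output line up with a rank-weighted precision: 0.33 for a single relevant chunk ranked 3rd. Below is a minimal sketch of that calculation, assuming the weighted cumulative precision formula described in the deep_eval documentation. The function name and the boolean-list input are hypothetical, for illustration only; in the real metric the per-chunk relevance verdicts come from the judge LLM, not from hand-written flags.

```python
from typing import Sequence


def contextual_precision(relevance: Sequence[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in rank order.

    Assumed formula: for every rank k that holds a relevant chunk, add
    (relevant chunks seen up to k) / k, then divide the sum by the total
    number of relevant chunks.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank
    # No relevant chunk at all: treat the score as 0.0.
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# Hypothetical reconstruction of the hr EAP case described in the log:
# 2 irrelevant chunks, the relevant one at rank 3, then 7 more irrelevant.
eap_flags = [False, False, True] + [False] * 7
print(round(contextual_precision(eap_flags), 2))
```

Under this reading, a single relevant chunk at rank k scores 1/k, so moving the relevant chunk up in the ranking is what raises the score; the number of irrelevant chunks ranked below it has no effect.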
❌ Faithfulness (score: 0.14, threshold: 0.7)
Reason: The score is 0.14 because the actual output likely mentions a 'Cry Closet' or 'Champagne, Compliments & Catharsis Chamber' which is not present in the retrieval context, indicating a significant devia...
This is an example of where the "fanciful" nature of the content gives the built-in deep_eval Faithfulness metric trouble. Despite the "Cry Closet" being in the retrieved chunks, the metric seems to discard it when creating facts to cross-check the claims in the agent response. (At least that is Claude's assessment, and it aligns with our having to tell the LLM not to discard facts that don't seem realistic in the evals we created ourselves.)
@echo -e "$(GREEN)Command-Line Parameters (override values file):$(NC)"
@echo -e " LLM - Enable specific LLM model (e.g., llama-3-2-3b-instruct)"
@echo -e " SAFETY - Enable specific safety model (e.g., llama-guard-3-8b)"
@echo -e " LLM_ID - Model ID for LLM (required for remote models)"
These allow you to use external models when setting them through environment variables. We had to add them in the it-self-service-agent quickstart as well.
Managed to get the quickstart to run and ran generation/evals on the existing conversations/tests. This is with scout17b as the model for the RAG quickstart, and scout17b as the LLM used with deep_eval. These were the results:
================================================================================
EVALUATION SUMMARY
================================================================================
✗ FAIL hr_benefits_eap_direct_20260224_095612.json
FAILURES:
• [Retrieval] Chunk Alignment [GEval] (score: 0.60, threshold: 0.7)
The retrieval context contains information about various employee
benefits, perks, and company culture at FantaCo, but i...
• [Retrieval] Contextual Precision (score: 0.17, threshold: 0.7)
The score is 0.17 because the relevant node, which mentions the
'Champagne, Compliments & Catharsis Chamber' (also known...
• [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
The score is 0.00 because the retrieval context is completely
unrelated to the input, with all provided context discussi...
✗ FAIL hr_benefits_enrollment_direct_20260224_095639.json
FAILURES:
• [Conversational] Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
The assistant's response attempts to address the user's specific
question about enrolling in benefits by suggesting a po...
• [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
The score is 0.00 because all nodes in retrieval contexts, which are
irrelevant to enrolling in benefits, are ranked abo...
• [Retrieval] Contextual Relevancy (score: 0.02, threshold: 0.7)
The score is 0.02 because the retrieval context is largely irrelevant
to enrolling in benefits, mentioning items such as...
• [Retrieval] Faithfulness (score: 0.67, threshold: 0.7)
The score is 0.67 because the actual output seems to be missing key
information, such as the FantaCo Human Resources, Ha...
✗ FAIL hr_benefits_health_insurance_direct_20260224_095703.json
FAILURES:
• [Retrieval] Chunk Alignment [GEval] (score: 0.60, threshold: 0.7)
The retrieval context contains a significant portion of the expected
content, including details about the 'Fountain of Y...
• [Retrieval] Contextual Relevancy (score: 0.04, threshold: 0.7)
The score is 0.04 because the retrieval context is largely irrelevant
to the input question about health insurance benef...
• [Retrieval] Faithfulness (score: 0.56, threshold: 0.7)
The score is 0.56 because the actual output likely overpromises the
benefits of FantaCo's health insurance, which accord...
✗ FAIL hr_benefits_parental_leave_direct_20260224_095730.json
FAILURES:
• [Retrieval] Chunk Alignment [GEval] (score: 0.50, threshold: 0.7)
The context is empty, and the retrieval context contains a large,
duplicate-free chunk of text describing the benefits a...
• [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
The score is 0.00 because all nodes in retrieval contexts are
irrelevant to the input and are ranked together, with no r...
• [Retrieval] Contextual Relevancy (score: 0.03, threshold: 0.7)
The score is 0.03 because the retrieval context is largely irrelevant
to the input question about parental leave policy....
✗ FAIL hr_benefits_retirement_direct_20260224_095753.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
The score is 0.17 because the retrieval context is largely irrelevant
to the input question about retirement benefits. M...
✗ FAIL hr_benefits_test_agent_20260224_095819.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.31, threshold: 0.7)
The score is 0.31 because the retrieval context, while describing
various FantaCo benefits like the "Midas Touch & Beyon...
• [Retrieval] Faithfulness (score: 0.62, threshold: 0.7)
The score is 0.62 because the actual output appears to be a
fantastical and humorous description of a workplace or healt...
✗ FAIL hr_benefits_test_agent_all_collections_20260224_095849.json
FAILURES:
• [Retrieval] Chunk Count Limit (score: 0.00, threshold: 1.0)
Retrieved 36 chunks, which exceeds the limit of 10
• [Retrieval] Chunk Alignment [GEval] (score: 0.00, threshold: 0.7)
The retrieval context does not contain any information related to the
context provided. The context discusses employee b...
• [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
The score is 0.00 because all nodes in retrieval contexts, which are
ranked in order, are irrelevant to the input and me...
• [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
The score is 0.00 because the retrieval context is completely
unrelated to the input query about 'HR benefits for FantaC...
✗ FAIL hr_benefits_test_direct_20260224_095931.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.34, threshold: 0.7)
The score is 0.34 because the retrieval context, while describing
various benefits and a unique work environment at Fant...
✗ FAIL hr_benefits_test_fail_20260224_095959.json
FAILURES:
• [Conversational] Answer Relevance [Conversational GEval] (score: 0.40, threshold: 0.7)
The assistant's response provides general information about common HR
benefits that companies might offer, but does not ...
• [Retrieval] Chunk Alignment [GEval] (score: 0.00, threshold: 0.7)
The retrieval context is empty, containing no chunks to evaluate
against the provided context. As a result, there is no ...
• [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
The score is 0.00 because there are no nodes in the retrieval
contexts, making it impossible to assess the ranking of re...
• [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
The score is 0.00 because there are no relevant statements in the
retrieval context to support the input query about HR ...
✗ FAIL hr_benefits_vacation_days_direct_20260224_100024.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.26, threshold: 0.7)
The score is 0.26 because the retrieval context, although creative and
describing various employee benefits, does not di...
✗ FAIL legal_compliance_requirements_direct_20260224_100048.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.67, threshold: 0.7)
The score is 0.67 because although the retrieval context contains some
relevant information about compliance requirement...
✗ FAIL legal_dispute_resolution_direct_20260224_100113.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.31, threshold: 0.7)
The score is 0.31 because the retrieval context is largely irrelevant
to the dispute resolution process, with most state...
✗ FAIL legal_intellectual_property_direct_20260224_100138.json
FAILURES:
• [Retrieval] Contextual Precision (score: 0.10, threshold: 0.7)
The score is 0.10 because the relevant node, ranked 10th, mentions
intellectual property rights in the context of Legal ...
• [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
The score is 0.11 because the retrieval context is largely irrelevant
to intellectual property rights, primarily focusin...
• [Retrieval] Faithfulness (score: 0.67, threshold: 0.7)
The score is 0.67 because the actual output likely introduced an
extraneous element, 'FantaCo Legal Guidelines for Contr...
✗ FAIL legal_key_contract_terms_direct_20260224_100204.json
FAILURES:
• [Retrieval] Chunk Alignment [GEval] (score: 0.40, threshold: 0.7)
The retrieval context contains detailed policy and procedure sections
related to contract review, including key areas of...
• [Retrieval] Contextual Precision (score: 0.33, threshold: 0.7)
The score is 0.33 because although the relevant node, which clearly
addresses the question by identifying 'six key areas...
• [Retrieval] Contextual Relevancy (score: 0.60, threshold: 0.7)
The score is 0.60 because although the retrieval context contains some
relevant information about contract review proces...
✗ FAIL legal_liability_clause_direct_20260224_100229.json
FAILURES:
• [Retrieval] Contextual Relevancy (score: 0.04, threshold: 0.7)
The score is 0.04 because the retrieval context is largely irrelevant
to the input question about the liability clause. ...
✗ FAIL legal_termination_conditions_direct_20260224_100254.json
FAILURES:
• [Conversational] Response Completeness [Conversational GEval] (score: 0.20, threshold: 0.6)
The assistant's response mentions some key facts from the retrieval
context, specifically 'clear start and end dates', '...
• [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
The score is 0.11 because the retrieval context is largely irrelevant
to the input question about termination conditions...
================================================================================
================================================================================
Evaluation Complete!
================================================================================
Conversational evaluations: 16
Retrieval evaluations: 16
Results saved to: results/deep_eval_results/evaluation_results_20260224_101133.json
Token Usage:
LLM API requests: 326
Input tokens: 543,758
Output tokens: 120,554
Total tokens: 664,312
================================================================================
So I think this PR is ready for review. Further work will be needed to get the tests to pass, either by improving the precision of the RAG lookup, adjusting the thresholds that we consider a pass, or possibly tweaking the questions we ask. However, I think that should be done in a follow-on PR, with this PR adding the framework and initial evaluations.
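To get a rough feel for how far threshold adjustment alone could go, the sketch below counts how many of the 16 conversations would pass Contextual Relevancy at a few thresholds. The scores are copied from the summary above. The `passing` helper and the "score >= threshold" pass rule are assumptions for illustration (the rule is inferred from the log, where a score of 1.00 passes a threshold of 1.0); this is not the framework's own code.

```python
# Contextual Relevancy scores copied from the evaluation summary above
# (file names shortened by dropping the timestamp suffix).
CONTEXTUAL_RELEVANCY = {
    "hr_benefits_eap_direct": 0.00,
    "hr_benefits_enrollment_direct": 0.02,
    "hr_benefits_health_insurance_direct": 0.04,
    "hr_benefits_parental_leave_direct": 0.03,
    "hr_benefits_retirement_direct": 0.17,
    "hr_benefits_test_agent": 0.31,
    "hr_benefits_test_agent_all_collections": 0.00,
    "hr_benefits_test_direct": 0.34,
    "hr_benefits_test_fail": 0.00,
    "hr_benefits_vacation_days_direct": 0.26,
    "legal_compliance_requirements_direct": 0.67,
    "legal_dispute_resolution_direct": 0.31,
    "legal_intellectual_property_direct": 0.11,
    "legal_key_contract_terms_direct": 0.60,
    "legal_liability_clause_direct": 0.04,
    "legal_termination_conditions_direct": 0.11,
}


def passing(scores: dict[str, float], threshold: float) -> list[str]:
    """Return the conversations whose score meets the threshold.

    Assumes a metric passes when score >= threshold.
    """
    return sorted(name for name, score in scores.items() if score >= threshold)


for threshold in (0.7, 0.6, 0.3):
    names = passing(CONTEXTUAL_RELEVANCY, threshold)
    print(f"threshold {threshold}: {len(names)}/{len(CONTEXTUAL_RELEVANCY)} pass")
```

With these scores, the counts are 0 of 16 at 0.7, 2 of 16 at 0.6, and 5 of 16 at 0.3, which suggests that for this metric threshold changes would need to be combined with retrieval or question changes.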