feat: add evaluations for RAG quickst by mhdawson · Pull Request #147 · rh-ai-quickstart/RAG

mhdawson · 2026-02-24T01:06:18Z

- Add a framework to run conversations against the RAG quickstart UI and capture the user request, the agent response, and the RAG results shown in the UI.
- Add a framework to evaluate captured conversations using deep_eval. It evaluates the response against the "expected" RAG results, and evaluates the actual RAG chunks against the expected answer and the "expected" RAG results.
- Add an initial set of conversations for the hr and legal databases, using the questions suggested in the UI.
- See evaluations/README.md for how to set up and run.

Signed-off-by: Michael Dawson <midawson@redhat.com>

mhdawson · 2026-02-24T01:07:35Z

Dev02 is not working right now, so I can't do final testing, but I wanted to submit this as I'll be out Wednesday through Friday.

If I can get a working deployment of the RAG quickstart tomorrow (Tuesday), I'll do the final testing.

mhdawson · 2026-02-24T01:31:58Z

A few things I have learned so far running/building the evals:

1. By default, llamastack seems to return the top 10 chunks when searching a collection.
2. If you specify multiple collections, it returns the top 10 for each collection by default and does not re-order them by similarity. The first 10 results are the top 10 for the first collection, the next 10 are the top 10 for the second collection, and so on.
3. Running the built-in deep_eval RAG evals takes a larger context window than we have with our 70b model, so the best model I've been able to test with is scout17b.
4. Generally, the RAG lookups return lots of chunks that are not really related, because there is no similarity cutoff by default. It will return the top 10 even if none of the 10 is very similar to the query. You can see this when you run the evals: the metrics that check how well the chunks relate to the expected answer report a low score. For many of the questions, at most 1 or 2 chunks contain the information that is needed, and the other 8 have unrelated content.
5. Despite 4), scout seems to be able to generate reasonable answers, finding the right info in the chunks that are returned. The metrics that check how good the answer is based on the "ideal" RAG chunks from the source documents (as configured for the test) pass.
6. LLMs have a harder time with "fanciful" content. The pre-built deep_eval metrics struggle on the hr content: because it is not grounded in what they know as "reality", they discard some facts that are in the context or retrieved chunks. The metric results on the hr content are poorer than the results on the legal content, which is more concrete. For the metrics we constructed, we had to tell the LLM to consider what we passed in as the "ground truth" even if it did not seem realistic.
7. For the hr content, the similarity results seem to be closely clustered even for questions that are different. That may again be partly because of their "fanciful" nature. I've not had time to see if the same is true for the legal content.
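One way to address the per-collection ordering and the missing similarity cutoff described above is to post-process the results before they reach the model. The following is a minimal sketch, not what llamastack or this PR does: the `Chunk` type, the `merge_and_filter` helper, and the assumption that each chunk carries a similarity score where higher means closer are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    collection: str  # hypothetical: name of the collection the chunk came from
    text: str
    score: float     # assumed: similarity to the query, higher is closer


def merge_and_filter(per_collection, top_k=10, min_score=0.0):
    """Flatten per-collection results, drop weak matches, and re-rank globally."""
    merged = [
        chunk
        for chunks in per_collection
        for chunk in chunks
        if chunk.score >= min_score
    ]
    # Sort across all collections so the best matches come first,
    # regardless of which collection they were retrieved from.
    merged.sort(key=lambda chunk: chunk.score, reverse=True)
    return merged[:top_k]


# Made-up example data, only to show the re-ordering.
hr = [Chunk("hr", "EAP details", 0.82), Chunk("hr", "Office snacks", 0.41)]
legal = [Chunk("legal", "Liability clause", 0.77), Chunk("legal", "Signature page", 0.30)]

ranked = merge_and_filter([hr, legal], top_k=3, min_score=0.4)
print([(c.collection, c.text) for c in ranked])
```

The cutoff value would need tuning against real scores, especially given the closely clustered similarity results noted for the hr content.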

mhdawson · 2026-02-24T01:34:15Z

This is an example of running just 2 of the test conversations, one from hr and one from legal:

 [hr_benefits_eap_direct_20260219_163309.json]
    ✅ Response Accuracy [Conversational GEval] (score: 1.00, threshold: 0.7)
    ✅ Response Completeness [Conversational GEval] (score: 1.00, threshold: 0.6)
    ✅ Answer Relevance [Conversational GEval] (score: 1.00, threshold: 0.7)

  [legal_liability_clause_direct_20260219_164109.json]
    ✅ Response Accuracy [Conversational GEval] (score: 1.00, threshold: 0.7)
    ❌ Response Completeness [Conversational GEval] (score: 0.50, threshold: 0.6)
       Reason: The assistant's response mentions the liability clause as one of the key areas of review, specifically under 'Risk Assessment'. It also provides a general explanation of what the liability clause typi...
    ❌ Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
       Reason: The assistant's response attempts to address the user's specific question about the liability clause by mentioning its relevance in the contract review process and providing context about its typical ...



Retrieval Results (Limited Chunks):

  [hr_benefits_eap_direct_20260219_163309.json_turn_1]
    ✅ Chunk Count Limit (score: 1.00, threshold: 1.0)
    ✅ Chunk Deduplication (score: 1.00, threshold: 1.0)
    ❌ Chunk Alignment [GEval] (score: 0.20, threshold: 0.7)
       Reason: The retrieval context contains information about various employee benefits, perks, and company culture at FantaCo, but it does not specifically mention the 'Cry Closet' described in the context. The '...
    ❌ Contextual Precision (score: 0.33, threshold: 0.7)
       Reason: The score is 0.33 because the relevant node, ranked 3rd, is outranked by 2 irrelevant nodes, and there are 7 more irrelevant nodes ranked lower than it, indicating that the relevant node is not ranked...
    ❌ Contextual Relevancy (score: 0.00, threshold: 0.7)
       Reason: The score is 0.00 because the retrieval context does not mention an employee assistance program and instead describes unrelated office perks, benefits, and company policies, such as catering for dieta...
    ❌ Faithfulness (score: 0.14, threshold: 0.7)
       Reason: The score is 0.14 because the actual output likely mentions a 'Cry Closet' or 'Champagne, Compliments & Catharsis Chamber' which is not present in the retrieval context, indicating a significant devia...

  [legal_liability_clause_direct_20260219_164109.json_turn_1]
    ✅ Chunk Count Limit (score: 1.00, threshold: 1.0)
    ✅ Chunk Deduplication (score: 1.00, threshold: 1.0)
    ✅ Chunk Alignment [GEval] (score: 0.80, threshold: 0.7)
    ✅ Contextual Precision (score: 0.76, threshold: 0.7)
    ❌ Contextual Relevancy (score: 0.06, threshold: 0.7)
       Reason: The score is 0.06 because the retrieval context is largely irrelevant to the input question about the liability clause. As stated, 'The provided context does not contain information about the liabilit...
    ✅ Faithfulness (score: 0.83, threshold: 0.7)

================================================================================
EVALUATION SUMMARY
================================================================================

✗ FAIL hr_benefits_eap_direct_20260219_163309.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.20, threshold: 0.7)
      The retrieval context contains information about various employee
      benefits, perks, and company culture at FantaCo, but i...
    • [Retrieval] Contextual Precision (score: 0.33, threshold: 0.7)
      The score is 0.33 because the relevant node, ranked 3rd, is outranked
      by 2 irrelevant nodes, and there are 7 more irrele...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because the retrieval context does not mention an
      employee assistance program and instead describes un...
    • [Retrieval] Faithfulness (score: 0.14, threshold: 0.7)
      The score is 0.14 because the actual output likely mentions a 'Cry
      Closet' or 'Champagne, Compliments & Catharsis Chambe...

✗ FAIL legal_liability_clause_direct_20260219_164109.json
  FAILURES:
    • [Conversational] Response Completeness [Conversational GEval] (score: 0.50, threshold: 0.6)
      The assistant's response mentions the liability clause as one of the
      key areas of review, specifically under 'Risk Asses...
    • [Conversational] Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
      The assistant's response attempts to address the user's specific
      question about the liability clause by mentioning its r...
    • [Retrieval] Contextual Relevancy (score: 0.06, threshold: 0.7)
      The score is 0.06 because the retrieval context is largely irrelevant
      to the input question about the liability clause. ...

================================================================================


================================================================================
Evaluation Complete!
================================================================================
Conversational evaluations: 2
Retrieval evaluations: 2
Results saved to: results/deep_eval_results/evaluation_results_20260223_202954.json

Token Usage:
  LLM API requests: 42
  Input tokens:     67,146
  Output tokens:    15,622
  Total tokens:     82,768
================================================================================
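The Contextual Precision score of 0.33 in the output above is consistent with a rank-weighted precision: one relevant node at rank 3 yields 1/3. The sketch below assumes the metric follows the weighted cumulative precision formula described in the deep_eval documentation. In the real metric an LLM judges which nodes are relevant; here the 0/1 relevance flags are supplied by hand, and the function name is illustrative.

```python
def contextual_precision(relevance):
    """Rank-weighted precision over an ordered list of 0/1 relevance flags."""
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    relevant_seen = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # Precision at this rank, counted only where a relevant node sits.
            score += relevant_seen / rank
    return score / relevant_total


# One relevant node ranked 3rd out of 10, as in the hr example above.
flags = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(contextual_precision(flags), 2))  # 0.33
```

Under this formula, a single relevant chunk scores 1 divided by its rank, so it has to be ranked first to clear a 0.7 threshold.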

mhdawson · 2026-02-24T01:39:08Z

    ❌ Faithfulness (score: 0.14, threshold: 0.7)
       Reason: The score is 0.14 because the actual output likely mentions a 'Cry Closet' or 'Champagne, Compliments & Catharsis Chamber' which is not present in the retrieval context, indicating a significant devia...

This is an example of where the "fanciful" nature of the content gives the built-in deep_eval Faithfulness metric trouble. Despite the "Cry Closet" being in the retrieved chunks, the metric seems to discard it when creating the facts used to cross-check the claims in the agent response. At least that is Claude's assessment, and it aligns with our having to tell the LLM not to discard facts that don't seem realistic in the evals we created ourselves.
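For custom metrics, the workaround amounts to prepending a ground-truth instruction to the evaluation criteria. The sketch below shows the idea in plain Python. The wording of the preamble and the `build_criteria` helper are hypothetical, not the text used in this PR; the resulting string is the kind of value that could be passed as the criteria of a GEval-style metric.

```python
# Hypothetical wording; the actual instruction used in the evals may differ.
GROUND_TRUTH_PREAMBLE = (
    "Treat the supplied context as the ground truth. "
    "Do not discard or discount a fact because it seems unrealistic or fanciful."
)


def build_criteria(task_description: str) -> str:
    """Prefix metric criteria with the ground-truth instruction."""
    return f"{GROUND_TRUTH_PREAMBLE} {task_description}"


criteria = build_criteria(
    "Check that every claim in the response is supported by the context."
)
print(criteria)
```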

mhdawson · 2026-02-24T01:40:34Z

 	@echo -e "$(GREEN)Command-Line Parameters (override values file):$(NC)"
 	@echo -e "  LLM           - Enable specific LLM model (e.g., llama-3-2-3b-instruct)"
 	@echo -e "  SAFETY        - Enable specific safety model (e.g., llama-guard-3-8b)"
+	@echo -e "  LLM_ID        - Model ID for LLM (required for remote models)"


These allow you to use external models when setting them through environment variables. We had to add them in the it-self-service-agent quickstart as well.

mhdawson · 2026-02-24T15:30:17Z

Managed to get the quickstart to run and ran generation/evals on existing conversations/tests.

This is with scout17b as the model for the RAG quickstart, and scout17b for the LLM used with deep_eval.

These were the results:

================================================================================
EVALUATION SUMMARY
================================================================================

✗ FAIL hr_benefits_eap_direct_20260224_095612.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.60, threshold: 0.7)
      The retrieval context contains information about various employee
      benefits, perks, and company culture at FantaCo, but i...
    • [Retrieval] Contextual Precision (score: 0.17, threshold: 0.7)
      The score is 0.17 because the relevant node, which mentions the
      'Champagne, Compliments & Catharsis Chamber' (also known...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because the retrieval context is completely
      unrelated to the input, with all provided context discussi...

✗ FAIL hr_benefits_enrollment_direct_20260224_095639.json
  FAILURES:
    • [Conversational] Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
      The assistant's response attempts to address the user's specific
      question about enrolling in benefits by suggesting a po...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because all nodes in retrieval contexts, which are
      irrelevant to enrolling in benefits, are ranked abo...
    • [Retrieval] Contextual Relevancy (score: 0.02, threshold: 0.7)
      The score is 0.02 because the retrieval context is largely irrelevant
      to enrolling in benefits, mentioning items such as...
    • [Retrieval] Faithfulness (score: 0.67, threshold: 0.7)
      The score is 0.67 because the actual output seems to be missing key
      information, such as the FantaCo Human Resources, Ha...

✗ FAIL hr_benefits_health_insurance_direct_20260224_095703.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.60, threshold: 0.7)
      The retrieval context contains a significant portion of the expected
      content, including details about the 'Fountain of Y...
    • [Retrieval] Contextual Relevancy (score: 0.04, threshold: 0.7)
      The score is 0.04 because the retrieval context is largely irrelevant
      to the input question about health insurance benef...
    • [Retrieval] Faithfulness (score: 0.56, threshold: 0.7)
      The score is 0.56 because the actual output likely overpromises the
      benefits of FantaCo's health insurance, which accord...

✗ FAIL hr_benefits_parental_leave_direct_20260224_095730.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.50, threshold: 0.7)
      The context is empty, and the retrieval context contains a large,
      duplicate-free chunk of text describing the benefits a...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because all nodes in retrieval contexts are
      irrelevant to the input and are ranked together, with no r...
    • [Retrieval] Contextual Relevancy (score: 0.03, threshold: 0.7)
      The score is 0.03 because the retrieval context is largely irrelevant
      to the input question about parental leave policy....

✗ FAIL hr_benefits_retirement_direct_20260224_095753.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
      The score is 0.17 because the retrieval context is largely irrelevant
      to the input question about retirement benefits. M...

✗ FAIL hr_benefits_test_agent_20260224_095819.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.31, threshold: 0.7)
      The score is 0.31 because the retrieval context, while describing
      various FantaCo benefits like the "Midas Touch & Beyon...
    • [Retrieval] Faithfulness (score: 0.62, threshold: 0.7)
      The score is 0.62 because the actual output appears to be a
      fantastical and humorous description of a workplace or healt...

✗ FAIL hr_benefits_test_agent_all_collections_20260224_095849.json
  FAILURES:
    • [Retrieval] Chunk Count Limit (score: 0.00, threshold: 1.0)
      Retrieved 36 chunks, which exceeds the limit of 10
    • [Retrieval] Chunk Alignment [GEval] (score: 0.00, threshold: 0.7)
      The retrieval context does not contain any information related to the
      context provided. The context discusses employee b...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because all nodes in retrieval contexts, which are
      ranked in order, are irrelevant to the input and me...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because the retrieval context is completely
      unrelated to the input query about 'HR benefits for FantaC...

✗ FAIL hr_benefits_test_direct_20260224_095931.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.34, threshold: 0.7)
      The score is 0.34 because the retrieval context, while describing
      various benefits and a unique work environment at Fant...

✗ FAIL hr_benefits_test_fail_20260224_095959.json
  FAILURES:
    • [Conversational] Answer Relevance [Conversational GEval] (score: 0.40, threshold: 0.7)
      The assistant's response provides general information about common HR
      benefits that companies might offer, but does not ...
    • [Retrieval] Chunk Alignment [GEval] (score: 0.00, threshold: 0.7)
      The retrieval context is empty, containing no chunks to evaluate
      against the provided context. As a result, there is no ...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because there are no nodes in the retrieval
      contexts, making it impossible to assess the ranking of re...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because there are no relevant statements in the
      retrieval context to support the input query about HR ...

✗ FAIL hr_benefits_vacation_days_direct_20260224_100024.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.26, threshold: 0.7)
      The score is 0.26 because the retrieval context, although creative and
      describing various employee benefits, does not di...

✗ FAIL legal_compliance_requirements_direct_20260224_100048.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.67, threshold: 0.7)
      The score is 0.67 because although the retrieval context contains some
      relevant information about compliance requirement...

✗ FAIL legal_dispute_resolution_direct_20260224_100113.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.31, threshold: 0.7)
      The score is 0.31 because the retrieval context is largely irrelevant
      to the dispute resolution process, with most state...

✗ FAIL legal_intellectual_property_direct_20260224_100138.json
  FAILURES:
    • [Retrieval] Contextual Precision (score: 0.10, threshold: 0.7)
      The score is 0.10 because the relevant node, ranked 10th, mentions
      intellectual property rights in the context of Legal ...
    • [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
      The score is 0.11 because the retrieval context is largely irrelevant
      to intellectual property rights, primarily focusin...
    • [Retrieval] Faithfulness (score: 0.67, threshold: 0.7)
      The score is 0.67 because the actual output likely introduced an
      extraneous element, 'FantaCo Legal Guidelines for Contr...

✗ FAIL legal_key_contract_terms_direct_20260224_100204.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.40, threshold: 0.7)
      The retrieval context contains detailed policy and procedure sections
      related to contract review, including key areas of...
    • [Retrieval] Contextual Precision (score: 0.33, threshold: 0.7)
      The score is 0.33 because although the relevant node, which clearly
      addresses the question by identifying 'six key areas...
    • [Retrieval] Contextual Relevancy (score: 0.60, threshold: 0.7)
      The score is 0.60 because although the retrieval context contains some
      relevant information about contract review proces...

✗ FAIL legal_liability_clause_direct_20260224_100229.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.04, threshold: 0.7)
      The score is 0.04 because the retrieval context is largely irrelevant
      to the input question about the liability clause. ...

✗ FAIL legal_termination_conditions_direct_20260224_100254.json
  FAILURES:
    • [Conversational] Response Completeness [Conversational GEval] (score: 0.20, threshold: 0.6)
      The assistant's response mentions some key facts from the retrieval
      context, specifically 'clear start and end dates', '...
    • [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
      The score is 0.11 because the retrieval context is largely irrelevant
      to the input question about termination conditions...

================================================================================


================================================================================
Evaluation Complete!
================================================================================
Conversational evaluations: 16
Retrieval evaluations: 16
Results saved to: results/deep_eval_results/evaluation_results_20260224_101133.json

Token Usage:
  LLM API requests: 326
  Input tokens:     543,758
  Output tokens:    120,554
  Total tokens:     664,312
================================================================================
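The Chunk Count Limit and Chunk Deduplication results above look like deterministic pass/fail checks (scores of 0.00 or 1.00 against a threshold of 1.0). A minimal sketch of such checks follows. It assumes binary scoring and whitespace- and case-insensitive duplicate detection; the function names are hypothetical and the implementation in this PR may differ.

```python
def chunk_count_limit(chunks, limit=10):
    """Return 1.0 when the number of retrieved chunks is within the limit."""
    return 1.0 if len(chunks) <= limit else 0.0


def chunk_deduplication(chunks):
    """Return 1.0 when no two chunks match after whitespace and case folding."""
    normalized = [" ".join(chunk.split()).lower() for chunk in chunks]
    return 1.0 if len(set(normalized)) == len(normalized) else 0.0


# 36 chunks against a limit of 10, as in the all-collections run above.
retrieved = [f"chunk {i}" for i in range(36)]
print(chunk_count_limit(retrieved))    # 0.0
print(chunk_deduplication(retrieved))  # 1.0
```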

So I think this PR is ready for review. Further work will be needed to get the tests to pass, by improving the precision of the RAG lookup, adjusting the thresholds that we consider a pass, or possibly tweaking the questions we ask.

However, I think that should be done in a follow-on PR, with this PR adding the framework and initial evaluations.
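To decide which thresholds or lookups to revisit first, it can help to tally failures by metric. The sketch below parses the text of the EVALUATION SUMMARY as printed above. It relies only on the line format visible in that output; the regular expression and helper names are illustrative and not part of the framework.

```python
import re
from collections import Counter

# Matches lines such as:
#   • [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
FAILURE_RE = re.compile(
    r"•\s+\[(?P<kind>\w+)\]\s+(?P<metric>.+?)\s+"
    r"\(score:\s*(?P<score>[\d.]+),\s*threshold:\s*(?P<threshold>[\d.]+)\)"
)


def parse_failures(summary_text):
    """Return (kind, metric, score, threshold) for each failure line."""
    return [
        (m["kind"], m["metric"], float(m["score"]), float(m["threshold"]))
        for m in FAILURE_RE.finditer(summary_text)
    ]


def tally_failures(summary_text):
    """Count failures per metric name."""
    return Counter(metric for _, metric, _, _ in parse_failures(summary_text))


sample = """\
✗ FAIL hr_benefits_retirement_direct_20260224_095753.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
✗ FAIL legal_intellectual_property_direct_20260224_100138.json
  FAILURES:
    • [Retrieval] Contextual Precision (score: 0.10, threshold: 0.7)
    • [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
"""
print(tally_failures(sample).most_common())
```

Run against the full summary above, this kind of tally would show Contextual Relevancy failing in most conversations, which fits the observation that there is no similarity cutoff by default.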

mhdawson marked this pull request as draft February 24, 2026 01:06

mhdawson marked this pull request as ready for review February 24, 2026 15:30

sauagarwa merged commit 6b6a527 into rh-ai-quickstart:main Feb 25, 2026
5 checks passed

sauagarwa merged 1 commit into rh-ai-quickstart:main from mhdawson:rag-evals
