
feat: add evaluations for RAG quickst#147

Merged
sauagarwa merged 1 commit into
rh-ai-quickstart:mainfrom
mhdawson:rag-evals
Feb 25, 2026


@mhdawson
Member

  • Add a framework that runs conversations against the RAG quickstart UI and captures the user request, the agent response, and the RAG results shown in the UI.
  • Add a framework that evaluates captured conversations with deep_eval: it evaluates the response against the "expected" RAG results, and evaluates the actual RAG chunks against both the expected answer and the "expected" RAG results.
  • Add an initial set of conversations for the hr and legal databases, using the questions suggested in the UI.
  • See evaluations/README.md for how to set up and run.

Signed-off-by: Michael Dawson <midawson@redhat.com>
@mhdawson mhdawson marked this pull request as draft February 24, 2026 01:06
@mhdawson
Member Author

Dev02 is not working right now, so I can't do final testing, but I wanted to submit this because I'll be out Wednesday through Friday.

If I can get a working deployment of the RAG quickstart tomorrow (Tuesday), I'll do the final testing.

@mhdawson
Member Author

A few things I have learned so far running/building the evals:

  1. By default, llamastack appears to return the top 10 chunks when searching a collection.
  2. If you specify multiple collections, it returns the top 10 for each collection by default and does not re-order the combined results by similarity, so the first 10 chunks are the top 10 for the first collection, the next 10 are the top 10 for the second collection, and so on.
  3. Running the built-in deep_eval evals related to RAG takes a larger context window than we have with our 70b model, so the best model I've been able to test with is scout17b.
  4. Generally, the RAG lookups return lots of chunks that are not really related, because there is no similarity cutoff by default: the top 10 are returned even if none of them is very similar to the query. You can see this when you run the evals, where the metrics that check how well the chunks relate to the expected answer report low scores. For many of the questions, at most 1 or 2 chunks contain the information that is needed, and the other 8 have unrelated content.
  5. Despite 4), scout seems able to generate reasonable answers, finding the right info in the chunks that are returned. The metrics that check how good the answer is based on the "ideal" RAG chunks from the source documents (as configured for the test) pass.
  6. LLMs have a harder time with "fanciful" content. The pre-built deep_eval metrics struggle on the hr content: because that content is not grounded in what the LLMs know as "reality", they discard some facts that are in the context or retrieved chunks. The metric results on the hr content are poorer than those on the legal content, which is more concrete. For the metrics we constructed, we had to tell the LLM to treat what we passed in as the "ground truth" even if it did not seem realistic.
  7. For the hr content, the similarity results seem to be closely clustered even for questions that are different. That may again be partly because of their "fanciful" nature. I've not had time to see if the same is true for the legal content.
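
Points 2 and 4 suggest a post-processing step on the retrieval side: merge the per-collection results, rank them globally by similarity, and drop anything below a cutoff. The following is a minimal sketch of that idea and not something this PR implements. The `Chunk` fields, the `merge_and_filter` helper, and the sample scores are all hypothetical, and it assumes similarity scores are comparable across collections (higher meaning more similar).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved chunk; the field names here are hypothetical."""
    collection: str
    text: str
    score: float  # similarity score, assumed higher = more similar


def merge_and_filter(per_collection, top_k=10, min_score=0.0):
    """Merge per-collection results, rank globally, apply a cutoff and top_k."""
    merged = [chunk for chunks in per_collection for chunk in chunks]
    relevant = [c for c in merged if c.score >= min_score]
    # sorted() is stable, so ties keep their original per-collection order.
    ranked = sorted(relevant, key=lambda c: c.score, reverse=True)
    return ranked[:top_k]


# Made-up results for two collections, each already sorted by score.
hr = [Chunk("hr", "EAP overview", 0.82), Chunk("hr", "Office snacks", 0.31)]
legal = [Chunk("legal", "Liability clause", 0.64), Chunk("legal", "Signature page", 0.12)]

for chunk in merge_and_filter([hr, legal], top_k=3, min_score=0.3):
    print(f"{chunk.score:.2f} {chunk.collection}: {chunk.text}")
```

With a cutoff in place, a query with only 1 or 2 matching chunks would return fewer than 10 results rather than padding the context with unrelated content. A suitable `min_score` would have to be found empirically.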

@mhdawson
Member Author

mhdawson commented Feb 24, 2026

This is an example of running just 2 of the test conversations, one from hr and one from legal:

 [hr_benefits_eap_direct_20260219_163309.json]
    ✅ Response Accuracy [Conversational GEval] (score: 1.00, threshold: 0.7)
    ✅ Response Completeness [Conversational GEval] (score: 1.00, threshold: 0.6)
    ✅ Answer Relevance [Conversational GEval] (score: 1.00, threshold: 0.7)

  [legal_liability_clause_direct_20260219_164109.json]
    ✅ Response Accuracy [Conversational GEval] (score: 1.00, threshold: 0.7)
    ❌ Response Completeness [Conversational GEval] (score: 0.50, threshold: 0.6)
       Reason: The assistant's response mentions the liability clause as one of the key areas of review, specifically under 'Risk Assessment'. It also provides a general explanation of what the liability clause typi...
    ❌ Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
       Reason: The assistant's response attempts to address the user's specific question about the liability clause by mentioning its relevance in the contract review process and providing context about its typical ...



Retrieval Results (Limited Chunks):

  [hr_benefits_eap_direct_20260219_163309.json_turn_1]
    ✅ Chunk Count Limit (score: 1.00, threshold: 1.0)
    ✅ Chunk Deduplication (score: 1.00, threshold: 1.0)
    ❌ Chunk Alignment [GEval] (score: 0.20, threshold: 0.7)
       Reason: The retrieval context contains information about various employee benefits, perks, and company culture at FantaCo, but it does not specifically mention the 'Cry Closet' described in the context. The '...
    ❌ Contextual Precision (score: 0.33, threshold: 0.7)
       Reason: The score is 0.33 because the relevant node, ranked 3rd, is outranked by 2 irrelevant nodes, and there are 7 more irrelevant nodes ranked lower than it, indicating that the relevant node is not ranked...
    ❌ Contextual Relevancy (score: 0.00, threshold: 0.7)
       Reason: The score is 0.00 because the retrieval context does not mention an employee assistance program and instead describes unrelated office perks, benefits, and company policies, such as catering for dieta...
    ❌ Faithfulness (score: 0.14, threshold: 0.7)
       Reason: The score is 0.14 because the actual output likely mentions a 'Cry Closet' or 'Champagne, Compliments & Catharsis Chamber' which is not present in the retrieval context, indicating a significant devia...

  [legal_liability_clause_direct_20260219_164109.json_turn_1]
    ✅ Chunk Count Limit (score: 1.00, threshold: 1.0)
    ✅ Chunk Deduplication (score: 1.00, threshold: 1.0)
    ✅ Chunk Alignment [GEval] (score: 0.80, threshold: 0.7)
    ✅ Contextual Precision (score: 0.76, threshold: 0.7)
    ❌ Contextual Relevancy (score: 0.06, threshold: 0.7)
       Reason: The score is 0.06 because the retrieval context is largely irrelevant to the input question about the liability clause. As stated, 'The provided context does not contain information about the liabilit...
    ✅ Faithfulness (score: 0.83, threshold: 0.7)

================================================================================
EVALUATION SUMMARY
================================================================================

✗ FAIL hr_benefits_eap_direct_20260219_163309.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.20, threshold: 0.7)
      The retrieval context contains information about various employee
      benefits, perks, and company culture at FantaCo, but i...
    • [Retrieval] Contextual Precision (score: 0.33, threshold: 0.7)
      The score is 0.33 because the relevant node, ranked 3rd, is outranked
      by 2 irrelevant nodes, and there are 7 more irrele...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because the retrieval context does not mention an
      employee assistance program and instead describes un...
    • [Retrieval] Faithfulness (score: 0.14, threshold: 0.7)
      The score is 0.14 because the actual output likely mentions a 'Cry
      Closet' or 'Champagne, Compliments & Catharsis Chambe...

✗ FAIL legal_liability_clause_direct_20260219_164109.json
  FAILURES:
    • [Conversational] Response Completeness [Conversational GEval] (score: 0.50, threshold: 0.6)
      The assistant's response mentions the liability clause as one of the
      key areas of review, specifically under 'Risk Asses...
    • [Conversational] Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
      The assistant's response attempts to address the user's specific
      question about the liability clause by mentioning its r...
    • [Retrieval] Contextual Relevancy (score: 0.06, threshold: 0.7)
      The score is 0.06 because the retrieval context is largely irrelevant
      to the input question about the liability clause. ...

================================================================================


================================================================================
Evaluation Complete!
================================================================================
Conversational evaluations: 2
Retrieval evaluations: 2
Results saved to: results/deep_eval_results/evaluation_results_20260223_202954.json

Token Usage:
  LLM API requests: 42
  Input tokens:     67,146
  Output tokens:    15,622
  Total tokens:     82,768
================================================================================
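
The Contextual Precision score of 0.33 above can be reproduced by hand. This sketch assumes the metric follows the weighted cumulative precision formula described in the DeepEval documentation: for each rank k that holds a relevant node, add (relevant nodes seen up to k) / k, then divide by the total number of relevant nodes. In the real metric the relevance flags come from the LLM judge; here they are hard-coded, and the function name is hypothetical.

```python
def contextual_precision(relevance):
    """Weighted cumulative precision over a ranked list of 0/1 relevance flags."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    seen_relevant = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            seen_relevant += 1
            # Precision at rank k, counted only where rank k is relevant.
            score += seen_relevant / k
    return score / total_relevant


# Ten chunks with the single relevant one ranked 3rd, as in the hr run above.
ranked_third = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(contextual_precision(ranked_third), 2))  # 0.33
```

Under this formula, the same single relevant chunk ranked first would score 1.00, so the metric is sensitive to ordering as well as to what is retrieved.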

@mhdawson
Member Author

    ❌ Faithfulness (score: 0.14, threshold: 0.7)
       Reason: The score is 0.14 because the actual output likely mentions a 'Cry Closet' or 'Champagne, Compliments & Catharsis Chamber' which is not present in the retrieval context, indicating a significant devia...

This is an example of where the "fanciful" nature of the content gives the built-in deep_eval Faithfulness metric trouble. Despite the "Cry Closet" being in the chunks retrieved, the metric seems to discard it when creating facts to cross-check the claims in the agent response. At least, that is Claude's assessment, and it aligns with our having to tell the LLM not to discard facts that don't seem realistic in the evals we created ourselves.
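
One way to apply the "treat the context as ground truth" instruction consistently is to prepend it to each custom metric's criteria text. This is a minimal sketch: the preamble wording and the `with_ground_truth` helper are hypothetical and are not taken from the metrics in this PR.

```python
# Hypothetical wording; the actual instruction used in the evals may differ.
GROUND_TRUTH_PREAMBLE = (
    "Treat the supplied context as the ground truth, even where it seems "
    "unrealistic or fanciful. Do not discard a fact because it conflicts "
    "with real-world knowledge."
)


def with_ground_truth(criteria: str) -> str:
    """Prefix a metric's criteria text with the ground-truth instruction."""
    return f"{GROUND_TRUTH_PREAMBLE}\n\n{criteria.strip()}"


criteria = with_ground_truth(
    "Check that every claim in the response is supported by the context."
)
print(criteria)
```

The resulting string could then be passed as the criteria of a custom LLM-judged metric. The built-in Faithfulness metric would not pick it up, which is consistent with the failure shown above.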

Comment thread deploy/helm/Makefile
@echo -e "$(GREEN)Command-Line Parameters (override values file):$(NC)"
@echo -e " LLM - Enable specific LLM model (e.g., llama-3-2-3b-instruct)"
@echo -e " SAFETY - Enable specific safety model (e.g., llama-guard-3-8b)"
@echo -e " LLM_ID - Model ID for LLM (required for remote models)"
Member Author

These parameters let you use external models by setting them through environment variables. We had to add them in the it-self-service-agent quickstart as well.

@mhdawson
Member Author

mhdawson commented Feb 24, 2026

I managed to get the quickstart running and ran generation and evals on the existing conversations/tests.

This is with scout17b as the model for the RAG quickstart, and scout17b for the LLM used with deep_eval.

These were the results:

================================================================================
EVALUATION SUMMARY
================================================================================

✗ FAIL hr_benefits_eap_direct_20260224_095612.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.60, threshold: 0.7)
      The retrieval context contains information about various employee
      benefits, perks, and company culture at FantaCo, but i...
    • [Retrieval] Contextual Precision (score: 0.17, threshold: 0.7)
      The score is 0.17 because the relevant node, which mentions the
      'Champagne, Compliments & Catharsis Chamber' (also known...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because the retrieval context is completely
      unrelated to the input, with all provided context discussi...

✗ FAIL hr_benefits_enrollment_direct_20260224_095639.json
  FAILURES:
    • [Conversational] Answer Relevance [Conversational GEval] (score: 0.60, threshold: 0.7)
      The assistant's response attempts to address the user's specific
      question about enrolling in benefits by suggesting a po...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because all nodes in retrieval contexts, which are
      irrelevant to enrolling in benefits, are ranked abo...
    • [Retrieval] Contextual Relevancy (score: 0.02, threshold: 0.7)
      The score is 0.02 because the retrieval context is largely irrelevant
      to enrolling in benefits, mentioning items such as...
    • [Retrieval] Faithfulness (score: 0.67, threshold: 0.7)
      The score is 0.67 because the actual output seems to be missing key
      information, such as the FantaCo Human Resources, Ha...

✗ FAIL hr_benefits_health_insurance_direct_20260224_095703.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.60, threshold: 0.7)
      The retrieval context contains a significant portion of the expected
      content, including details about the 'Fountain of Y...
    • [Retrieval] Contextual Relevancy (score: 0.04, threshold: 0.7)
      The score is 0.04 because the retrieval context is largely irrelevant
      to the input question about health insurance benef...
    • [Retrieval] Faithfulness (score: 0.56, threshold: 0.7)
      The score is 0.56 because the actual output likely overpromises the
      benefits of FantaCo's health insurance, which accord...

✗ FAIL hr_benefits_parental_leave_direct_20260224_095730.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.50, threshold: 0.7)
      The context is empty, and the retrieval context contains a large,
      duplicate-free chunk of text describing the benefits a...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because all nodes in retrieval contexts are
      irrelevant to the input and are ranked together, with no r...
    • [Retrieval] Contextual Relevancy (score: 0.03, threshold: 0.7)
      The score is 0.03 because the retrieval context is largely irrelevant
      to the input question about parental leave policy....

✗ FAIL hr_benefits_retirement_direct_20260224_095753.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
      The score is 0.17 because the retrieval context is largely irrelevant
      to the input question about retirement benefits. M...

✗ FAIL hr_benefits_test_agent_20260224_095819.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.31, threshold: 0.7)
      The score is 0.31 because the retrieval context, while describing
      various FantaCo benefits like the "Midas Touch & Beyon...
    • [Retrieval] Faithfulness (score: 0.62, threshold: 0.7)
      The score is 0.62 because the actual output appears to be a
      fantastical and humorous description of a workplace or healt...

✗ FAIL hr_benefits_test_agent_all_collections_20260224_095849.json
  FAILURES:
    • [Retrieval] Chunk Count Limit (score: 0.00, threshold: 1.0)
      Retrieved 36 chunks, which exceeds the limit of 10
    • [Retrieval] Chunk Alignment [GEval] (score: 0.00, threshold: 0.7)
      The retrieval context does not contain any information related to the
      context provided. The context discusses employee b...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because all nodes in retrieval contexts, which are
      ranked in order, are irrelevant to the input and me...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because the retrieval context is completely
      unrelated to the input query about 'HR benefits for FantaC...

✗ FAIL hr_benefits_test_direct_20260224_095931.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.34, threshold: 0.7)
      The score is 0.34 because the retrieval context, while describing
      various benefits and a unique work environment at Fant...

✗ FAIL hr_benefits_test_fail_20260224_095959.json
  FAILURES:
    • [Conversational] Answer Relevance [Conversational GEval] (score: 0.40, threshold: 0.7)
      The assistant's response provides general information about common HR
      benefits that companies might offer, but does not ...
    • [Retrieval] Chunk Alignment [GEval] (score: 0.00, threshold: 0.7)
      The retrieval context is empty, containing no chunks to evaluate
      against the provided context. As a result, there is no ...
    • [Retrieval] Contextual Precision (score: 0.00, threshold: 0.7)
      The score is 0.00 because there are no nodes in the retrieval
      contexts, making it impossible to assess the ranking of re...
    • [Retrieval] Contextual Relevancy (score: 0.00, threshold: 0.7)
      The score is 0.00 because there are no relevant statements in the
      retrieval context to support the input query about HR ...

✗ FAIL hr_benefits_vacation_days_direct_20260224_100024.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.26, threshold: 0.7)
      The score is 0.26 because the retrieval context, although creative and
      describing various employee benefits, does not di...

✗ FAIL legal_compliance_requirements_direct_20260224_100048.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.67, threshold: 0.7)
      The score is 0.67 because although the retrieval context contains some
      relevant information about compliance requirement...

✗ FAIL legal_dispute_resolution_direct_20260224_100113.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.31, threshold: 0.7)
      The score is 0.31 because the retrieval context is largely irrelevant
      to the dispute resolution process, with most state...

✗ FAIL legal_intellectual_property_direct_20260224_100138.json
  FAILURES:
    • [Retrieval] Contextual Precision (score: 0.10, threshold: 0.7)
      The score is 0.10 because the relevant node, ranked 10th, mentions
      intellectual property rights in the context of Legal ...
    • [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
      The score is 0.11 because the retrieval context is largely irrelevant
      to intellectual property rights, primarily focusin...
    • [Retrieval] Faithfulness (score: 0.67, threshold: 0.7)
      The score is 0.67 because the actual output likely introduced an
      extraneous element, 'FantaCo Legal Guidelines for Contr...

✗ FAIL legal_key_contract_terms_direct_20260224_100204.json
  FAILURES:
    • [Retrieval] Chunk Alignment [GEval] (score: 0.40, threshold: 0.7)
      The retrieval context contains detailed policy and procedure sections
      related to contract review, including key areas of...
    • [Retrieval] Contextual Precision (score: 0.33, threshold: 0.7)
      The score is 0.33 because although the relevant node, which clearly
      addresses the question by identifying 'six key areas...
    • [Retrieval] Contextual Relevancy (score: 0.60, threshold: 0.7)
      The score is 0.60 because although the retrieval context contains some
      relevant information about contract review proces...

✗ FAIL legal_liability_clause_direct_20260224_100229.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.04, threshold: 0.7)
      The score is 0.04 because the retrieval context is largely irrelevant
      to the input question about the liability clause. ...

✗ FAIL legal_termination_conditions_direct_20260224_100254.json
  FAILURES:
    • [Conversational] Response Completeness [Conversational GEval] (score: 0.20, threshold: 0.6)
      The assistant's response mentions some key facts from the retrieval
      context, specifically 'clear start and end dates', '...
    • [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
      The score is 0.11 because the retrieval context is largely irrelevant
      to the input question about termination conditions...

================================================================================


================================================================================
Evaluation Complete!
================================================================================
Conversational evaluations: 16
Retrieval evaluations: 16
Results saved to: results/deep_eval_results/evaluation_results_20260224_101133.json

Token Usage:
  LLM API requests: 326
  Input tokens:     543,758
  Output tokens:    120,554
  Total tokens:     664,312
================================================================================

So I think this PR is ready for review. Further work will be needed to get the tests to pass, whether by improving the precision of the RAG lookup, adjusting the thresholds that we consider a pass, or possibly tweaking the questions we ask.

However, I think that should be done in a follow-on PR, with this PR adding the framework and initial evaluations.
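
To decide where that follow-on work should focus, it may help to tally which metrics fail most often. The sketch below parses failure lines in the format of the EVALUATION SUMMARY printed above and counts them per metric. The regular expression and the `tally_failures` helper are assumptions based only on that printed format, not on the structure of the saved JSON results.

```python
import re
from collections import Counter

# Matches lines such as:
#   • [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
FAILURE_LINE = re.compile(
    r"•\s*\[(?P<kind>\w+)\]\s*(?P<metric>.+?)\s*"
    r"\(score:\s*(?P<score>[\d.]+),\s*threshold:\s*(?P<threshold>[\d.]+)\)"
)


def tally_failures(summary_text):
    """Count failures per (kind, metric) pair in the printed summary text."""
    counts = Counter()
    for match in FAILURE_LINE.finditer(summary_text):
        counts[(match["kind"], match["metric"])] += 1
    return counts


# A few lines copied from the summary above.
sample = """\
✗ FAIL hr_benefits_retirement_direct_20260224_095753.json
  FAILURES:
    • [Retrieval] Contextual Relevancy (score: 0.17, threshold: 0.7)
✗ FAIL legal_termination_conditions_direct_20260224_100254.json
  FAILURES:
    • [Conversational] Response Completeness [Conversational GEval] (score: 0.20, threshold: 0.6)
    • [Retrieval] Contextual Relevancy (score: 0.11, threshold: 0.7)
"""

for (kind, metric), count in tally_failures(sample).most_common():
    print(f"{count:2d}  [{kind}] {metric}")
```

Run over the full summary, a tally like this would show whether a single metric accounts for most of the failures, which would point toward a retrieval-side fix (such as a similarity cutoff) rather than per-test threshold changes.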

@mhdawson mhdawson marked this pull request as ready for review February 24, 2026 15:30
@sauagarwa sauagarwa merged commit 6b6a527 into rh-ai-quickstart:main Feb 25, 2026
5 checks passed
2 participants