31 changes: 31 additions & 0 deletions docs/source/extend/custom-components/custom-evaluator.md

### Using Ragas Multi-turn Metrics with ATIF Trajectories

The built-in `ragas` evaluator supports both single-turn and multi-turn evaluation modes. Single-turn mode (`sample_type: single_turn`, the default) is suitable for metrics like `AnswerAccuracy` and `ContextRelevance`. Multi-turn mode (`sample_type: multi_turn`) is required for trajectory-aware metrics such as `AgentGoalAccuracyWithoutReference` and `ToolCallAccuracy`.

In multi-turn mode, the ATIF trajectory is converted into a sequence of Ragas message types:

- Steps with `source: user` become `HumanMessage`
- Steps with `source: agent` become `AIMessage`, with ATIF `tool_calls` mapped to Ragas `ToolCall` objects
- Tool observation results become `ToolMessage`

To use multi-turn metrics, add an evaluator with `sample_type: multi_turn` and `enable_atif_evaluator: true`:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
tool_call_accuracy:
_type: ragas
metric: ToolCallAccuracy
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

For `ToolCallAccuracy`, you can optionally supply expected tool calls through the `reference_tool_calls` key in `AtifEvalSample.metadata` when using the Python API.
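As a sketch of that Python API usage (a plain dict stands in for the `AtifEvalSample` object here; the tool names and arguments are made up for illustration, and Ragas expects `ToolCall`-shaped entries with `name` and `args` fields):

```python
# Sketch: attaching expected tool calls for ToolCallAccuracy.
# The "reference_tool_calls" metadata key is read by the ragas evaluator;
# the name/args shape mirrors Ragas ToolCall fields.
reference_tool_calls = [
    {"name": "web_search", "args": {"query": "current GPU driver version"}},
    {"name": "summarize", "args": {"max_tokens": 128}},
]

# A plain dict stands in for AtifEvalSample in this sketch.
sample = {"metadata": {}}
sample["metadata"]["reference_tool_calls"] = reference_tool_calls
```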

### Display all evaluators
To display all evaluators, run the following command:
61 changes: 61 additions & 0 deletions docs/source/improve-workflows/evaluate.md
For a complete list of up-to-date judge LLMs, refer to the Ragas NV metrics leaderboard.

For more information on the prompt used by the judge LLM, refer to the [Ragas NV metrics](https://github.com/explodinggradients/ragas/blob/v0.2.14/src/ragas/metrics/_nv_metrics.py). The prompt for these metrics is not configurable. If you need a custom prompt, you can use the [Tunable RAG Evaluator](#tunable-rag-evaluator) or implement your own evaluator using the [Custom Evaluator](../extend/custom-components/custom-evaluator.md) documentation.

#### Multi-turn Agent Evaluation with Ragas

The metrics listed above use `SingleTurnSample` and evaluate single question-answer pairs. For agent workflows with tool calls, Ragas provides trajectory-aware metrics that require `MultiTurnSample`. These metrics evaluate the full conversation sequence, including tool calls and tool results.

To enable multi-turn evaluation, set `sample_type: multi_turn` and `enable_atif_evaluator: true` in the evaluator config:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
tool_call_accuracy:
_type: ragas
metric: ToolCallAccuracy
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

The multi-turn conversion maps ATIF trajectory steps to Ragas message types:

- ATIF steps with `source: user` become `HumanMessage`
- ATIF steps with `source: agent` become `AIMessage`, with `tool_calls` converted to Ragas `ToolCall` objects
- Observation results from tool executions become `ToolMessage`
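The mapping above can be sketched in plain Python (dicts stand in for the Ragas `HumanMessage`/`AIMessage`/`ToolMessage` classes, and the example trajectory is invented for illustration):

```python
# Simplified sketch of the ATIF-step -> Ragas-message mapping described above.
def step_to_messages(step: dict) -> list[dict]:
    messages = []
    if step["source"] == "user":
        messages.append({"type": "human", "content": step["message"]})
        return messages
    # source == "agent": emit an AIMessage, then one ToolMessage per tool result
    tool_calls = [{"name": tc["function_name"], "args": tc["arguments"]}
                  for tc in step.get("tool_calls", [])]
    messages.append({"type": "ai", "content": step["message"], "tool_calls": tool_calls})
    if tool_calls:  # ToolMessage is only valid after an AIMessage with tool calls
        for result in step.get("observation_results", []):
            messages.append({"type": "tool", "content": result})
    return messages

trajectory = [
    {"source": "user", "message": "What is 2 + 2?"},
    {
        "source": "agent",
        "message": "Let me compute that.",
        "tool_calls": [{"function_name": "calculator", "arguments": {"expr": "2 + 2"}}],
        "observation_results": ["4"],
    },
    {"source": "agent", "message": "The answer is 4."},
]

conversation = [m for step in trajectory for m in step_to_messages(step)]
print([m["type"] for m in conversation])  # → ['human', 'ai', 'tool', 'ai']
```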

The following Ragas metrics are recommended for agent evaluation with multi-turn trajectories:

`AgentGoalAccuracyWithoutReference`: Evaluates whether the agent achieved the user's goal based on the conversation trajectory, without requiring a reference answer.

`ToolCallAccuracy`: Evaluates whether the agent made the correct tool calls with the right arguments. When using this metric, you can provide expected tool calls through the `reference_tool_calls` key in the `AtifEvalSample.metadata` field.

:::{note}
Multi-turn evaluation requires `enable_atif_evaluator: true` since it operates on ATIF trajectories. The default `sample_type` is `single_turn` for backward compatibility.
:::

### Trajectory Evaluator
This evaluator uses the intermediate steps generated by the workflow to evaluate the workflow trajectory. The evaluator configuration includes the evaluator type and any additional parameters required by the evaluator.

The following `ragas` metrics are recommended for RAG-like workflows:
`ContextRelevance`: Evaluates the relevance of the context retrieved by the workflow against the question.
`ResponseGroundedness`: Evaluates how well the response generated by the workflow is grounded in the retrieved context.

##### Multi-turn Ragas Metrics for Agent Evaluation

For agent workflows that involve tool calls and multi-step interactions, you can use Ragas multi-turn metrics. These metrics operate on the full conversation trajectory rather than a single question-answer pair.

Set `sample_type: multi_turn` and `enable_atif_evaluator: true` to activate this mode:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

The `sample_type` parameter controls how ATIF trajectories are converted for Ragas:

- `single_turn` (default): Converts to `SingleTurnSample` with `user_input`, `response`, and `retrieved_contexts`. Suitable for `AnswerAccuracy`, `ContextRelevance`, and `ResponseGroundedness`.
- `multi_turn`: Converts to `MultiTurnSample` with a full `HumanMessage`/`AIMessage`/`ToolMessage` conversation sequence. Required for `AgentGoalAccuracyWithoutReference`, `ToolCallAccuracy`, and other trajectory-aware metrics.
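The selection can be sketched as a simple dispatch (plain dicts stand in for the two Ragas sample classes, and the inputs are invented for illustration):

```python
# Sketch: how sample_type selects the Ragas sample shape.
def build_ragas_sample(sample_type, user_input, response, contexts, conversation):
    if sample_type == "multi_turn":
        # MultiTurnSample: user_input carries the whole message sequence
        return {"user_input": conversation}
    # SingleTurnSample: flat question/answer/context fields
    return {"user_input": user_input, "response": response, "retrieved_contexts": contexts}

single = build_ragas_sample("single_turn", "What is 2 + 2?", "4", ["math context"], None)
multi = build_ragas_sample("multi_turn", None, None, None,
                           [{"type": "human", "content": "What is 2 + 2?"}])
```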

#### Agent Trajectory Evaluator
The `trajectory` evaluator uses LangChain/LangGraph agent trajectory evaluation to evaluate the workflow. To use the `trajectory` evaluator, add the following configuration to the `config.yml` file.
Notes:
- The custom evaluator is ATIF-only and registered from `nat_simple_web_query_eval`.
- It scores using token cosine similarity and includes trajectory metadata (`trajectory_tool_call_count`) in reasoning.

## 3) Test ATIF multi-turn Ragas evaluators

For agent workflows with tool calls, you can use Ragas multi-turn metrics such as `AgentGoalAccuracyWithoutReference`
and `ToolCallAccuracy`. These require `sample_type: multi_turn` and `enable_atif_evaluator: true` in the evaluator config:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

Multi-turn mode converts ATIF trajectory steps into a Ragas `MultiTurnSample` with `HumanMessage`,
`AIMessage` (with `ToolCall`), and `ToolMessage` sequences, preserving the full agent interaction history.

## 4) Optional quick compare

Compare two run directories:

import logging
import typing
from collections.abc import Sequence
from typing import Literal

from tqdm import tqdm

from nat.data_models.atif import ATIFObservationResult
from nat.data_models.atif import ATIFStep
from nat.data_models.atif import ATIFTrajectory
from nat.data_models.evaluator import EvalOutput
from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample
from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSampleList
from nat.plugins.eval.utils.tqdm_position_registry import TqdmPositionRegistry
from nat.utils.atif_message_utils import message_to_text
if typing.TYPE_CHECKING:
    from ragas import EvaluationDataset
    from ragas.llms import LangchainLLMWrapper
    from ragas.messages import AIMessage
    from ragas.messages import HumanMessage
    from ragas.messages import ToolMessage
    from ragas.metrics import Metric

logger = logging.getLogger(__name__)

SampleType = Literal["single_turn", "multi_turn"]


def _observation_result_to_text(result: ATIFObservationResult) -> str:
    return message_to_text(result.content)
def _trajectory_to_retrieved_contexts(trajectory: ATIFTrajectory) -> list[str]:
    return contexts


def _join_non_empty(parts: Sequence[str], separator: str = "\n\n") -> str:
    return separator.join([part for part in parts if part])


def _atif_step_to_ragas_messages(step: ATIFStep) -> list["HumanMessage | AIMessage | ToolMessage"]:
    """Convert a single ATIF step into one or more RAGAS message objects.

    Mapping:

    * source="user" → HumanMessage
    * source="agent" → AIMessage (with optional ToolCall list),
      followed by ToolMessage per observation result
    * source="system" → HumanMessage (best-effort; RAGAS has no SystemMessage)
    """
    from ragas.messages import AIMessage as RagasAIMessage
    from ragas.messages import HumanMessage as RagasHumanMessage
    from ragas.messages import ToolCall as RagasToolCall
    from ragas.messages import ToolMessage as RagasToolMessage

    messages: list[RagasHumanMessage | RagasAIMessage | RagasToolMessage] = []
    text = message_to_text(step.message)

    if step.source == "user":
        messages.append(RagasHumanMessage(content=text))
        return messages

    if step.source == "system":
        messages.append(RagasHumanMessage(content=text))
        return messages

    ragas_tool_calls = []
    if step.tool_calls:
        for tc in step.tool_calls:
            ragas_tool_calls.append(RagasToolCall(name=tc.function_name, args=tc.arguments))

    observation_texts = []
    if step.observation and step.observation.results:
        for result in step.observation.results:
            tool_content = _observation_result_to_text(result)
            if tool_content:
                observation_texts.append(tool_content)

    # RAGAS only allows ToolMessage after an AIMessage with tool_calls.
    # If ATIF contains observation results without explicit tool_calls,
    # fold them into the AI content as a best-effort representation.
    ai_content = text
    if observation_texts and not ragas_tool_calls:
        ai_content = _join_non_empty([text, *observation_texts])

    messages.append(RagasAIMessage(content=ai_content, tool_calls=ragas_tool_calls or None))

    if ragas_tool_calls:
        for observation_text in observation_texts:
            messages.append(RagasToolMessage(content=observation_text))

    return messages


def _atif_trajectory_to_multi_turn_messages(
        trajectory: ATIFTrajectory) -> list["HumanMessage | AIMessage | ToolMessage"]:
    """Convert an entire ATIF trajectory into a RAGAS multi-turn message sequence."""
    messages: list = []
    for step in trajectory.steps:
        messages.extend(_atif_step_to_ragas_messages(step))
    return messages


class RAGAtifEvaluator:

    def __init__(
        self,
        evaluator_llm: "LangchainLLMWrapper",
        metrics: Sequence["Metric"],
        max_concurrency: int = 8,
        sample_type: SampleType = "single_turn",
    ):
        self.evaluator_llm = evaluator_llm
        self.metrics = metrics
        self.max_concurrency = max_concurrency
        self.sample_type = sample_type

    def _build_single_turn_sample(self, sample: AtifEvalSample):
        """Build a RAGAS SingleTurnSample from an ATIF eval sample."""
        from ragas import SingleTurnSample

        user_input = trajectory_to_user_input(sample.trajectory)
        reference = sample.expected_output_obj
        response = sample.output_obj
        reference_contexts = [""]
        retrieved_contexts = _trajectory_to_retrieved_contexts(sample.trajectory)
        return SingleTurnSample(
            user_input=user_input,
            reference=reference,
            response=response,
            reference_contexts=reference_contexts,
            retrieved_contexts=retrieved_contexts,
        )

    def _build_multi_turn_sample(self, sample: AtifEvalSample):
        """Build a RAGAS MultiTurnSample from an ATIF eval sample."""
        from ragas import MultiTurnSample

        conversation = _atif_trajectory_to_multi_turn_messages(sample.trajectory)
        reference = sample.expected_output_obj
        kwargs: dict[str, typing.Any] = {"user_input": conversation}
        if reference is not None:
            kwargs["reference"] = str(reference)

        reference_tool_calls = sample.metadata.get("reference_tool_calls")
        if reference_tool_calls is not None:
            kwargs["reference_tool_calls"] = reference_tool_calls

        return MultiTurnSample(**kwargs)

    def atif_samples_to_ragas(self, atif_samples: AtifEvalSampleList) -> "EvaluationDataset":
        """Converts ATIF-native samples into a Ragas-compatible EvaluationDataset.

        Uses SingleTurnSample or MultiTurnSample depending on ``self.sample_type``.
        """
        from ragas import EvaluationDataset

        samples = []
        for sample in atif_samples:
            if self.sample_type == "multi_turn":
                ragas_sample = self._build_multi_turn_sample(sample)
            else:
                ragas_sample = self._build_single_turn_sample(sample)
            samples.append(ragas_sample)
        return EvaluationDataset(samples=samples)

# limitations under the License.

import logging
from typing import Literal

from pydantic import BaseModel
from pydantic import Field
class RagasEvaluatorConfig(EvaluatorLLMConfig, name="ragas"):

    enable_atif_evaluator: bool = Field(
        default=False,
        description="Enable ATIF-native RAGAS evaluator lane. Disabled by default until rollout stabilization.",
    )
    sample_type: Literal["single_turn", "multi_turn"] = Field(
        default="single_turn",
        description=(
            "RAGAS sample type for ATIF evaluation. "
            "'single_turn' uses SingleTurnSample (for AnswerAccuracy, ContextRelevance, etc.). "
            "'multi_turn' uses MultiTurnSample (for AgentGoalAccuracyWithoutReference, ToolCallAccuracy, etc.)."),
    )

    @model_validator(mode="before")
    @classmethod
    def validate_metric(cls, values):
        elif not isinstance(metric, str):
            raise ValueError("Metric must be either a string or a single-item dictionary.")

        if values.get("sample_type") == "multi_turn" and not values.get("enable_atif_evaluator"):
            raise ValueError("sample_type='multi_turn' requires enable_atif_evaluator=True.")

        return values

@property
async def evaluate_atif_fn(atif_samples: AtifEvalSampleList) -> EvalOutput:
metrics=metrics,
max_concurrency=builder.get_max_concurrency(),
input_obj_field=config.input_obj_field) if metrics else None
atif_evaluator = RAGAtifEvaluator(evaluator_llm=llm,
                                  metrics=metrics,
                                  max_concurrency=builder.get_max_concurrency(),
                                  sample_type=config.sample_type) if (metrics
                                                                      and config.enable_atif_evaluator) else None

evaluator_info = EvaluatorInfo(config=config, evaluate_fn=evaluate_fn, description="Evaluator for RAGAS metrics")
if config.enable_atif_evaluator: