diff --git a/docs/source/extend/custom-components/custom-evaluator.md b/docs/source/extend/custom-components/custom-evaluator.md index e132763731..ec93fd3842 100644 --- a/docs/source/extend/custom-components/custom-evaluator.md +++ b/docs/source/extend/custom-components/custom-evaluator.md @@ -241,6 +241,37 @@ eval: normalize_case: true ``` +### Using Ragas Multi-turn Metrics with ATIF Trajectories + +The built-in `ragas` evaluator supports both single-turn and multi-turn evaluation modes. Single-turn mode (`sample_type: single_turn`, the default) is suitable for metrics like `AnswerAccuracy` and `ContextRelevance`. Multi-turn mode (`sample_type: multi_turn`) is required for trajectory-aware metrics such as `AgentGoalAccuracyWithoutReference` and `ToolCallAccuracy`. + +In multi-turn mode, the ATIF trajectory is converted into a sequence of Ragas message types: + +- Steps with `source: user` become `HumanMessage` +- Steps with `source: agent` become `AIMessage`, with ATIF `tool_calls` mapped to Ragas `ToolCall` objects +- Tool observation results become `ToolMessage` + +To use multi-turn metrics, add an evaluator with `sample_type: multi_turn` and `enable_atif_evaluator: true`: + +```yaml +eval: + evaluators: + agent_goal: + _type: ragas + metric: AgentGoalAccuracyWithoutReference + llm_name: nim_rag_eval_llm + enable_atif_evaluator: true + sample_type: multi_turn + tool_call_accuracy: + _type: ragas + metric: ToolCallAccuracy + llm_name: nim_rag_eval_llm + enable_atif_evaluator: true + sample_type: multi_turn +``` + +For `ToolCallAccuracy`, you can optionally supply expected tool calls through the `reference_tool_calls` key in `AtifEvalSample.metadata` when using the Python API. 
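The step-to-message mapping described above can be sketched in plain Python. This is a simplified illustration that uses dicts in place of the actual ATIF and Ragas classes; `step_to_messages` and the dict shapes are hypothetical, not library API:

```python
def step_to_messages(step: dict) -> list[dict]:
    """Illustrative ATIF-step -> Ragas-message mapping (simplified)."""
    if step["source"] in ("user", "system"):
        return [{"type": "human", "content": step["message"]}]
    # Agent step: one AI message carrying the tool calls...
    messages = [{
        "type": "ai",
        "content": step["message"],
        "tool_calls": [{"name": tc["function_name"], "args": tc["arguments"]}
                       for tc in step.get("tool_calls", [])],
    }]
    # ...followed by one tool message per observation result.
    for result in step.get("observation", {}).get("results", []):
        messages.append({"type": "tool", "content": result["content"]})
    return messages


step = {
    "source": "agent",
    "message": "Let me calculate that.",
    "tool_calls": [{"function_name": "calculator__multiply",
                    "arguments": {"numbers": [12, 15]}}],
    "observation": {"results": [{"content": "180.0"}]},
}
msgs = step_to_messages(step)
```

The real evaluator builds `AIMessage`/`ToolMessage` objects instead of dicts, but the ordering shown here (AI message first, then its tool results) matches the documented mapping.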
+ ### Display all evaluators To display all evaluators, run the following command: ```bash diff --git a/docs/source/improve-workflows/evaluate.md b/docs/source/improve-workflows/evaluate.md index 37d3743f2b..f3fd4cd659 100644 --- a/docs/source/improve-workflows/evaluate.md +++ b/docs/source/improve-workflows/evaluate.md @@ -267,6 +267,45 @@ For a complete list of up-to-date judge LLMs, refer to the [Ragas NV metrics lea For more information on the prompt used by the judge LLM, refer to the [Ragas NV metrics](https://github.com/explodinggradients/ragas/blob/v0.2.14/src/ragas/metrics/_nv_metrics.py). The prompt for these metrics is not configurable. If you need a custom prompt, you can use the [Tunable RAG Evaluator](#tunable-rag-evaluator) or implement your own evaluator using the [Custom Evaluator](../extend/custom-components/custom-evaluator.md) documentation. +#### Multi-turn Agent Evaluation with Ragas + +The metrics listed above use `SingleTurnSample` and evaluate single question-answer pairs. For agent workflows with tool calls, Ragas provides trajectory-aware metrics that require `MultiTurnSample`. These metrics evaluate the full conversation sequence, including tool calls and tool results. 
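For intuition about what a trajectory-aware metric such as `ToolCallAccuracy` checks, the following sketch scores the fraction of expected tool calls that the agent reproduced in order. It is a simplified stand-in for the metric, not the actual Ragas implementation, and `tool_call_score` is a hypothetical name:

```python
def tool_call_score(actual: list[tuple[str, dict]],
                    expected: list[tuple[str, dict]]) -> float:
    """Fraction of expected (name, args) tool calls matched in order."""
    if not expected:
        return 1.0
    # Compare position-by-position; extra or missing calls reduce the score.
    matched = sum(1 for a, e in zip(actual, expected) if a == e)
    return matched / len(expected)


actual = [("calculator__multiply", {"numbers": [12, 15]}),
          ("calculator__add", {"numbers": [180, 8]})]
expected = [("calculator__multiply", {"numbers": [12, 15]}),
            ("calculator__add", {"numbers": [180, 8]})]
score = tool_call_score(actual, expected)
```

In the built-in evaluator, the expected side of this comparison comes from the `reference_tool_calls` key described below, while the actual side is extracted from the trajectory's tool calls.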
+ +To enable multi-turn evaluation, set `sample_type: multi_turn` and `enable_atif_evaluator: true` in the evaluator config: + +```yaml +eval: + evaluators: + agent_goal: + _type: ragas + metric: AgentGoalAccuracyWithoutReference + llm_name: nim_rag_eval_llm + enable_atif_evaluator: true + sample_type: multi_turn + tool_call_accuracy: + _type: ragas + metric: ToolCallAccuracy + llm_name: nim_rag_eval_llm + enable_atif_evaluator: true + sample_type: multi_turn +``` + +The multi-turn conversion maps ATIF trajectory steps to Ragas message types: + +- ATIF steps with `source: user` become `HumanMessage` +- ATIF steps with `source: agent` become `AIMessage`, with `tool_calls` converted to Ragas `ToolCall` objects +- Observation results from tool executions become `ToolMessage` + +The following Ragas metrics are recommended for agent evaluation with multi-turn trajectories: + +- `AgentGoalAccuracyWithoutReference`: Evaluates whether the agent achieved the user's goal based on the conversation trajectory, without requiring a reference answer. + +- `ToolCallAccuracy`: Evaluates whether the agent made the correct tool calls with the right arguments. When using this metric, you can provide expected tool calls through the `reference_tool_calls` key in the `AtifEvalSample.metadata` field. + +:::{note} +Multi-turn evaluation requires `enable_atif_evaluator: true` because it operates on ATIF trajectories. The default `sample_type` is `single_turn` for backward compatibility. +::: + ### Trajectory Evaluator This evaluator uses the intermediate steps generated by the workflow to evaluate the workflow trajectory. The evaluator configuration includes the evaluator type and any additional parameters required by the evaluator. @@ -791,6 +830,28 @@ The following `ragas` metrics are recommended for RAG-like workflows - `ContextRelevance`: Evaluates the relevance of the context retrieved by the workflow against the question. 
- `ResponseGroundedness`: Evaluates the groundedness of the response generated by the workflow based on the context retrieved by the workflow. +##### Multi-turn Ragas Metrics for Agent Evaluation + +For agent workflows that involve tool calls and multi-step interactions, you can use Ragas multi-turn metrics. These metrics operate on the full conversation trajectory rather than a single question-answer pair. + +Set `sample_type: multi_turn` and `enable_atif_evaluator: true` to activate this mode: + +```yaml +eval: + evaluators: + agent_goal: + _type: ragas + metric: AgentGoalAccuracyWithoutReference + llm_name: nim_rag_eval_llm + enable_atif_evaluator: true + sample_type: multi_turn +``` + +The `sample_type` parameter controls how ATIF trajectories are converted for Ragas: + +- `single_turn` (default): Converts to `SingleTurnSample` with `user_input`, `response`, and `retrieved_contexts`. Suitable for `AnswerAccuracy`, `ContextRelevance`, and `ResponseGroundedness`. +- `multi_turn`: Converts to `MultiTurnSample` with a full `HumanMessage`/`AIMessage`/`ToolMessage` conversation sequence. Required for `AgentGoalAccuracyWithoutReference`, `ToolCallAccuracy`, and other trajectory-aware metrics. + #### Agent Trajectory Evaluator The `trajectory` evaluator uses LangChain/LangGraph agent trajectory evaluation to evaluate the workflow. To use the `trajectory` evaluator, add the following configuration to the `config.yml` file. ```yaml diff --git a/examples/evaluation_and_profiling/simple_web_query_eval/atif-eval-readme.md b/examples/evaluation_and_profiling/simple_web_query_eval/atif-eval-readme.md index c5352b2abd..2e7e6da741 100644 --- a/examples/evaluation_and_profiling/simple_web_query_eval/atif-eval-readme.md +++ b/examples/evaluation_and_profiling/simple_web_query_eval/atif-eval-readme.md @@ -84,7 +84,26 @@ Notes: - The custom evaluator is ATIF-only and registered from `nat_simple_web_query_eval`. 
- It scores using token cosine similarity and includes trajectory metadata (`trajectory_tool_call_count`) in reasoning. -## 3) Optional quick compare +## 3) Test ATIF multi-turn Ragas evaluators + +For agent workflows with tool calls, you can use Ragas multi-turn metrics such as `AgentGoalAccuracyWithoutReference` +and `ToolCallAccuracy`. These require `sample_type: multi_turn` and `enable_atif_evaluator: true` in the evaluator config: + +```yaml +eval: + evaluators: + agent_goal: + _type: ragas + metric: AgentGoalAccuracyWithoutReference + llm_name: nim_rag_eval_llm + enable_atif_evaluator: true + sample_type: multi_turn +``` + +Multi-turn mode converts ATIF trajectory steps into a Ragas `MultiTurnSample` with `HumanMessage`, +`AIMessage` (with `ToolCall`), and `ToolMessage` sequences, preserving the full agent interaction history. + +## 4) Optional quick compare Compare two run directories: diff --git a/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/atif_evaluate.py b/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/atif_evaluate.py index 44a727061e..e7da27cdda 100644 --- a/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/atif_evaluate.py +++ b/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/atif_evaluate.py @@ -16,12 +16,15 @@ import logging import typing from collections.abc import Sequence +from typing import Literal from tqdm import tqdm from nat.data_models.atif import ATIFObservationResult +from nat.data_models.atif import ATIFStep from nat.data_models.atif import ATIFTrajectory from nat.data_models.evaluator import EvalOutput +from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSampleList from nat.plugins.eval.utils.tqdm_position_registry import TqdmPositionRegistry from nat.utils.atif_message_utils import message_to_text @@ -32,10 +35,15 @@ if typing.TYPE_CHECKING: from ragas import EvaluationDataset from ragas.llms import 
LangchainLLMWrapper + from ragas.messages import AIMessage + from ragas.messages import HumanMessage + from ragas.messages import ToolMessage from ragas.metrics import Metric logger = logging.getLogger(__name__) +SampleType = Literal["single_turn", "multi_turn"] + def _observation_result_to_text(result: ATIFObservationResult) -> str: return message_to_text(result.content) @@ -53,32 +61,132 @@ def _trajectory_to_retrieved_contexts(trajectory: ATIFTrajectory) -> list[str]: return contexts +def _join_non_empty(parts: Sequence[str], separator: str = "\n\n") -> str: + return separator.join([part for part in parts if part]) + + +def _atif_step_to_ragas_messages(step: ATIFStep, ) -> list["HumanMessage | AIMessage | ToolMessage"]: + """Convert a single ATIF step into one or more RAGAS message objects. + + Mapping: + source="user" → HumanMessage + source="agent" → AIMessage (with optional ToolCall list) + followed by ToolMessage per observation result + source="system" → HumanMessage (best-effort; RAGAS has no SystemMessage) + """ + from ragas.messages import AIMessage as RagasAIMessage + from ragas.messages import HumanMessage as RagasHumanMessage + from ragas.messages import ToolCall as RagasToolCall + from ragas.messages import ToolMessage as RagasToolMessage + + messages: list[RagasHumanMessage | RagasAIMessage | RagasToolMessage] = [] + text = message_to_text(step.message) + + if step.source == "user": + messages.append(RagasHumanMessage(content=text)) + return messages + + if step.source == "system": + messages.append(RagasHumanMessage(content=text)) + return messages + + ragas_tool_calls = [] + if step.tool_calls: + for tc in step.tool_calls: + ragas_tool_calls.append(RagasToolCall(name=tc.function_name, args=tc.arguments)) + + observation_texts = [] + if step.observation and step.observation.results: + for result in step.observation.results: + tool_content = _observation_result_to_text(result) + if tool_content: + observation_texts.append(tool_content) + + # RAGAS 
only allows ToolMessage after an AIMessage with tool_calls. + # If ATIF contains observation results without explicit tool_calls, + # fold them into the AI content as a best-effort representation. + ai_content = text + if observation_texts and not ragas_tool_calls: + ai_content = _join_non_empty([text, *observation_texts]) + + messages.append(RagasAIMessage(content=ai_content, tool_calls=ragas_tool_calls or None)) + + if ragas_tool_calls: + for observation_text in observation_texts: + messages.append(RagasToolMessage(content=observation_text)) + + return messages + + +def _atif_trajectory_to_multi_turn_messages( + trajectory: ATIFTrajectory, ) -> list["HumanMessage | AIMessage | ToolMessage"]: + """Convert an entire ATIF trajectory into a RAGAS multi-turn message sequence.""" + messages: list = [] + for step in trajectory.steps: + messages.extend(_atif_step_to_ragas_messages(step)) + return messages + + class RAGAtifEvaluator: - def __init__(self, evaluator_llm: "LangchainLLMWrapper", metrics: Sequence["Metric"], max_concurrency=8): + def __init__( + self, + evaluator_llm: "LangchainLLMWrapper", + metrics: Sequence["Metric"], + max_concurrency: int = 8, + sample_type: SampleType = "single_turn", + ): self.evaluator_llm = evaluator_llm self.metrics = metrics self.max_concurrency = max_concurrency + self.sample_type = sample_type + + def _build_single_turn_sample(self, sample: AtifEvalSample): + """Build a RAGAS SingleTurnSample from an ATIF eval sample.""" + from ragas import SingleTurnSample + + user_input = trajectory_to_user_input(sample.trajectory) + reference = sample.expected_output_obj + response = sample.output_obj + reference_contexts = [""] + retrieved_contexts = _trajectory_to_retrieved_contexts(sample.trajectory) + return SingleTurnSample( + user_input=user_input, + reference=reference, + response=response, + reference_contexts=reference_contexts, + retrieved_contexts=retrieved_contexts, + ) + + def _build_multi_turn_sample(self, sample: AtifEvalSample): 
+ """Build a RAGAS MultiTurnSample from an ATIF eval sample.""" + from ragas import MultiTurnSample + + conversation = _atif_trajectory_to_multi_turn_messages(sample.trajectory) + reference = sample.expected_output_obj + kwargs: dict[str, typing.Any] = {"user_input": conversation} + if reference is not None: + kwargs["reference"] = str(reference) + + reference_tool_calls = sample.metadata.get("reference_tool_calls") + if reference_tool_calls is not None: + kwargs["reference_tool_calls"] = reference_tool_calls + + return MultiTurnSample(**kwargs) def atif_samples_to_ragas(self, atif_samples: AtifEvalSampleList) -> "EvaluationDataset": - """Converts ATIF-native samples into a Ragas-compatible EvaluationDataset.""" + """Converts ATIF-native samples into a Ragas-compatible EvaluationDataset. + + Uses SingleTurnSample or MultiTurnSample depending on ``self.sample_type``. + """ from ragas import EvaluationDataset - from ragas import SingleTurnSample samples = [] for sample in atif_samples: - user_input = trajectory_to_user_input(sample.trajectory) - reference = sample.expected_output_obj - response = sample.output_obj - reference_contexts = [""] - retrieved_contexts = _trajectory_to_retrieved_contexts(sample.trajectory) - ragas_sample = SingleTurnSample( - user_input=user_input, - reference=reference, - response=response, - reference_contexts=reference_contexts, - retrieved_contexts=retrieved_contexts, - ) + if self.sample_type == "multi_turn": + ragas_sample = self._build_multi_turn_sample(sample) + else: + ragas_sample = self._build_single_turn_sample(sample) samples.append(ragas_sample) return EvaluationDataset(samples=samples) diff --git a/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/register.py b/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/register.py index 9f4b75ab41..7d60e1d591 100644 --- a/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/register.py +++ 
b/packages/nvidia_nat_ragas/src/nat/plugins/ragas/rag_evaluator/register.py @@ -14,6 +14,7 @@ # limitations under the License. import logging +from typing import Literal from pydantic import BaseModel from pydantic import Field @@ -54,6 +55,13 @@ class RagasEvaluatorConfig(EvaluatorLLMConfig, name="ragas"): default=False, description="Enable ATIF-native RAGAS evaluator lane. Disabled by default until rollout stabilization.", ) + sample_type: Literal["single_turn", "multi_turn"] = Field( + default="single_turn", + description=( + "RAGAS sample type for ATIF evaluation. " + "'single_turn' uses SingleTurnSample (for AnswerAccuracy, ContextRelevance, etc.). " + "'multi_turn' uses MultiTurnSample (for AgentGoalAccuracyWithoutReference, ToolCallAccuracy, etc.)."), + ) @model_validator(mode="before") @classmethod @@ -70,6 +78,9 @@ def validate_metric(cls, values): elif not isinstance(metric, str): raise ValueError("Metric must be either a string or a single-item dictionary.") + if values.get("sample_type") == "multi_turn" and not values.get("enable_atif_evaluator"): + raise ValueError("sample_type='multi_turn' requires enable_atif_evaluator=True.") + return values @property @@ -149,9 +160,11 @@ async def evaluate_atif_fn(atif_samples: AtifEvalSampleList) -> EvalOutput: metrics=metrics, max_concurrency=builder.get_max_concurrency(), input_obj_field=config.input_obj_field) if metrics else None - atif_evaluator = RAGAtifEvaluator( - evaluator_llm=llm, metrics=metrics, - max_concurrency=builder.get_max_concurrency()) if (metrics and config.enable_atif_evaluator) else None + atif_evaluator = RAGAtifEvaluator(evaluator_llm=llm, + metrics=metrics, + max_concurrency=builder.get_max_concurrency(), + sample_type=config.sample_type) if (metrics + and config.enable_atif_evaluator) else None evaluator_info = EvaluatorInfo(config=config, evaluate_fn=evaluate_fn, description="Evaluator for RAGAS metrics") if config.enable_atif_evaluator: diff --git 
a/packages/nvidia_nat_ragas/tests/test_rag_evaluate.py b/packages/nvidia_nat_ragas/tests/test_rag_evaluate.py index 1bc7579b16..22f36f9a82 100644 --- a/packages/nvidia_nat_ragas/tests/test_rag_evaluate.py +++ b/packages/nvidia_nat_ragas/tests/test_rag_evaluate.py @@ -614,3 +614,353 @@ async def test_register_ragas_evaluator_atif_lane_enabled(): assert callable(getattr(evaluator_info, "evaluate_atif_fn", None)) builder.get_llm.assert_awaited_once() + + +# --------------------------------------------------------------------------- +# MultiTurnSample conversion tests +# --------------------------------------------------------------------------- + + +@pytest.fixture(name="multi_turn_atif_samples") +def fixture_multi_turn_atif_samples(): + """ATIF samples with tool calls suitable for multi-turn conversion.""" + from nat.data_models.atif import ATIFAgentConfig + from nat.data_models.atif import ATIFObservation + from nat.data_models.atif import ATIFObservationResult + from nat.data_models.atif import ATIFStep + from nat.data_models.atif import ATIFToolCall + from nat.data_models.atif import ATIFTrajectory + from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample + + trajectory = ATIFTrajectory( + session_id="multi-turn-1", + agent=ATIFAgentConfig(name="nat-agent", version="0.0.0"), + steps=[ + ATIFStep(step_id=1, source="user", message="What is 12 * 15 + 8?"), + ATIFStep( + step_id=2, + source="agent", + message="Let me calculate that.", + tool_calls=[ + ATIFToolCall( + tool_call_id="call-1", + function_name="calculator__multiply", + arguments={"numbers": [12, 15]}, + ), + ], + observation=ATIFObservation(results=[ATIFObservationResult(source_call_id="call-1", + content="180.0")], ), + ), + ATIFStep( + step_id=3, + source="agent", + message="Now adding 8.", + tool_calls=[ + ATIFToolCall( + tool_call_id="call-2", + function_name="calculator__add", + arguments={"numbers": [180.0, 8]}, + ), + ], + 
observation=ATIFObservation(results=[ATIFObservationResult(source_call_id="call-2", + content="188.0")], ), + ), + ATIFStep(step_id=4, source="agent", message="The answer is 188.0"), + ], + ) + return [ + AtifEvalSample( + item_id=1, + trajectory=trajectory, + expected_output_obj="188.0", + output_obj="The answer is 188.0", + metadata={}, + ) + ] + + +def test_atif_samples_to_ragas_multi_turn(ragas_judge_llm, ragas_metrics, multi_turn_atif_samples): + """Verify ATIF → MultiTurnSample conversion produces correct message sequence.""" + from ragas import MultiTurnSample + from ragas.messages import AIMessage as RagasAIMessage + from ragas.messages import HumanMessage as RagasHumanMessage + from ragas.messages import ToolMessage as RagasToolMessage + + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + evaluator = RAGAtifEvaluator( + evaluator_llm=ragas_judge_llm, + metrics=ragas_metrics, + sample_type="multi_turn", + ) + dataset = evaluator.atif_samples_to_ragas(multi_turn_atif_samples) + + assert len(dataset.samples) == 1 + sample = dataset.samples[0] + assert isinstance(sample, MultiTurnSample) + + messages = sample.user_input + assert isinstance(messages[0], RagasHumanMessage) + assert messages[0].content == "What is 12 * 15 + 8?" + + assert isinstance(messages[1], RagasAIMessage) + assert messages[1].content == "Let me calculate that." + assert len(messages[1].tool_calls) == 1 + assert messages[1].tool_calls[0].name == "calculator__multiply" + assert messages[1].tool_calls[0].args == {"numbers": [12, 15]} + + assert isinstance(messages[2], RagasToolMessage) + assert messages[2].content == "180.0" + + assert isinstance(messages[3], RagasAIMessage) + assert messages[3].content == "Now adding 8." 
+ assert len(messages[3].tool_calls) == 1 + assert messages[3].tool_calls[0].name == "calculator__add" + + assert isinstance(messages[4], RagasToolMessage) + assert messages[4].content == "188.0" + + assert isinstance(messages[5], RagasAIMessage) + assert messages[5].content == "The answer is 188.0" + assert messages[5].tool_calls is None + + assert sample.reference == "188.0" + + +def test_atif_samples_to_ragas_multi_turn_no_tools(ragas_judge_llm, ragas_metrics): + """Multi-turn conversion handles trajectories without tool calls.""" + from ragas import MultiTurnSample + from ragas.messages import AIMessage as RagasAIMessage + from ragas.messages import HumanMessage as RagasHumanMessage + + from nat.data_models.atif import ATIFAgentConfig + from nat.data_models.atif import ATIFStep + from nat.data_models.atif import ATIFTrajectory + from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + trajectory = ATIFTrajectory( + session_id="no-tools", + agent=ATIFAgentConfig(name="nat-agent", version="0.0.0"), + steps=[ + ATIFStep(step_id=1, source="user", message="Hello"), + ATIFStep(step_id=2, source="agent", message="Hi there!"), + ], + ) + samples = [ + AtifEvalSample(item_id=1, trajectory=trajectory, expected_output_obj="Hi", output_obj="Hi there!", metadata={}) + ] + evaluator = RAGAtifEvaluator(evaluator_llm=ragas_judge_llm, metrics=ragas_metrics, sample_type="multi_turn") + dataset = evaluator.atif_samples_to_ragas(samples) + + assert len(dataset.samples) == 1 + sample = dataset.samples[0] + assert isinstance(sample, MultiTurnSample) + assert len(sample.user_input) == 2 + assert isinstance(sample.user_input[0], RagasHumanMessage) + assert isinstance(sample.user_input[1], RagasAIMessage) + + +def test_atif_samples_to_ragas_multi_turn_empty_trajectory(ragas_judge_llm, ragas_metrics): + """Multi-turn conversion handles empty trajectory gracefully.""" + from ragas import 
MultiTurnSample + + from nat.data_models.atif import ATIFAgentConfig + from nat.data_models.atif import ATIFTrajectory + from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + trajectory = ATIFTrajectory( + session_id="empty", + agent=ATIFAgentConfig(name="nat-agent", version="0.0.0"), + steps=[], + ) + samples = [AtifEvalSample(item_id=1, trajectory=trajectory, expected_output_obj=None, output_obj=None, metadata={})] + evaluator = RAGAtifEvaluator(evaluator_llm=ragas_judge_llm, metrics=ragas_metrics, sample_type="multi_turn") + dataset = evaluator.atif_samples_to_ragas(samples) + + assert len(dataset.samples) == 1 + sample = dataset.samples[0] + assert isinstance(sample, MultiTurnSample) + assert len(sample.user_input) == 0 + + +def test_atif_samples_to_ragas_multi_turn_multiple_tool_calls(ragas_judge_llm, ragas_metrics): + """Multi-turn conversion handles a step with multiple parallel tool calls.""" + from ragas import MultiTurnSample + from ragas.messages import AIMessage as RagasAIMessage + from ragas.messages import ToolMessage as RagasToolMessage + + from nat.data_models.atif import ATIFAgentConfig + from nat.data_models.atif import ATIFObservation + from nat.data_models.atif import ATIFObservationResult + from nat.data_models.atif import ATIFStep + from nat.data_models.atif import ATIFToolCall + from nat.data_models.atif import ATIFTrajectory + from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + trajectory = ATIFTrajectory( + session_id="multi-tool", + agent=ATIFAgentConfig(name="nat-agent", version="0.0.0"), + steps=[ + ATIFStep(step_id=1, source="user", message="query"), + ATIFStep( + step_id=2, + source="agent", + message="Running two tools.", + tool_calls=[ + ATIFToolCall(tool_call_id="c1", function_name="tool_a", arguments={"x": 1}), + 
ATIFToolCall(tool_call_id="c2", function_name="tool_b", arguments={"y": 2}), + ], + observation=ATIFObservation(results=[ + ATIFObservationResult(source_call_id="c1", content="result_a"), + ATIFObservationResult(source_call_id="c2", content="result_b"), + ], ), + ), + ], + ) + samples = [AtifEvalSample(item_id=1, trajectory=trajectory, expected_output_obj=None, output_obj=None, metadata={})] + evaluator = RAGAtifEvaluator(evaluator_llm=ragas_judge_llm, metrics=ragas_metrics, sample_type="multi_turn") + dataset = evaluator.atif_samples_to_ragas(samples) + + sample = dataset.samples[0] + assert isinstance(sample, MultiTurnSample) + + ai_msg = sample.user_input[1] + assert isinstance(ai_msg, RagasAIMessage) + assert len(ai_msg.tool_calls) == 2 + assert ai_msg.tool_calls[0].name == "tool_a" + assert ai_msg.tool_calls[1].name == "tool_b" + + assert isinstance(sample.user_input[2], RagasToolMessage) + assert sample.user_input[2].content == "result_a" + assert isinstance(sample.user_input[3], RagasToolMessage) + assert sample.user_input[3].content == "result_b" + + +def test_atif_samples_to_ragas_multi_turn_observation_without_tool_calls(ragas_judge_llm, ragas_metrics): + """Observation-only ATIF steps should not emit invalid orphan ToolMessages.""" + from ragas import MultiTurnSample + from ragas.messages import AIMessage as RagasAIMessage + from ragas.messages import HumanMessage as RagasHumanMessage + + from nat.data_models.atif import ATIFAgentConfig + from nat.data_models.atif import ATIFObservation + from nat.data_models.atif import ATIFObservationResult + from nat.data_models.atif import ATIFStep + from nat.data_models.atif import ATIFTrajectory + from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + trajectory = ATIFTrajectory( + session_id="observation-no-tools", + agent=ATIFAgentConfig(name="nat-agent", version="0.0.0"), + steps=[ + ATIFStep(step_id=1, source="user", 
message="Summarize the latest result."), + ATIFStep( + step_id=2, + source="agent", + message="Here is what I found.", + observation=ATIFObservation(results=[ATIFObservationResult(content="The latest result is 188.0")], ), + ), + ], + ) + samples = [AtifEvalSample(item_id=1, trajectory=trajectory, expected_output_obj=None, output_obj=None, metadata={})] + + evaluator = RAGAtifEvaluator(evaluator_llm=ragas_judge_llm, metrics=ragas_metrics, sample_type="multi_turn") + dataset = evaluator.atif_samples_to_ragas(samples) + + sample = dataset.samples[0] + assert isinstance(sample, MultiTurnSample) + assert len(sample.user_input) == 2 + assert isinstance(sample.user_input[0], RagasHumanMessage) + assert isinstance(sample.user_input[1], RagasAIMessage) + assert sample.user_input[1].tool_calls is None + assert sample.user_input[1].content == "Here is what I found.\n\nThe latest result is 188.0" + + +def test_atif_samples_to_ragas_multi_turn_with_reference_tool_calls(ragas_judge_llm, ragas_metrics): + """Multi-turn conversion passes reference_tool_calls from metadata.""" + from ragas import MultiTurnSample + from ragas.messages import ToolCall as RagasToolCall + + from nat.data_models.atif import ATIFAgentConfig + from nat.data_models.atif import ATIFStep + from nat.data_models.atif import ATIFTrajectory + from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + trajectory = ATIFTrajectory( + session_id="ref-tc", + agent=ATIFAgentConfig(name="nat-agent", version="0.0.0"), + steps=[ + ATIFStep(step_id=1, source="user", message="compute"), + ATIFStep(step_id=2, source="agent", message="done"), + ], + ) + ref_tool_calls = [RagasToolCall(name="calculator__add", args={"numbers": [1, 2]})] + samples = [ + AtifEvalSample( + item_id=1, + trajectory=trajectory, + expected_output_obj="3", + output_obj="done", + metadata={"reference_tool_calls": ref_tool_calls}, + ) + ] + evaluator = 
RAGAtifEvaluator(evaluator_llm=ragas_judge_llm, metrics=ragas_metrics, sample_type="multi_turn") + dataset = evaluator.atif_samples_to_ragas(samples) + + sample = dataset.samples[0] + assert isinstance(sample, MultiTurnSample) + assert sample.reference_tool_calls == ref_tool_calls + + +def test_default_sample_type_is_single_turn(ragas_judge_llm, ragas_metrics, atif_samples): + """When sample_type is not specified, SingleTurnSample is used (backward compat).""" + from ragas import SingleTurnSample + + from nat.plugins.ragas.rag_evaluator.atif_evaluate import RAGAtifEvaluator + + evaluator = RAGAtifEvaluator(evaluator_llm=ragas_judge_llm, metrics=ragas_metrics) + dataset = evaluator.atif_samples_to_ragas(atif_samples) + + for sample in dataset.samples: + assert isinstance(sample, SingleTurnSample) + + +def test_ragas_evaluator_config_multi_turn_requires_atif_lane(): + """Multi-turn sample conversion is only valid on the ATIF evaluator lane.""" + from nat.plugins.ragas.rag_evaluator.register import RagasEvaluatorConfig + + with pytest.raises(ValueError, match="sample_type='multi_turn' requires enable_atif_evaluator=True"): + RagasEvaluatorConfig( + llm_name="judge", + metric="AgentGoalAccuracyWithoutReference", + sample_type="multi_turn", + enable_atif_evaluator=False, + ) + + +async def test_register_ragas_evaluator_multi_turn_sample_type(): + """Ensure sample_type=multi_turn is wired through registration.""" + from nat.plugins.ragas.rag_evaluator.register import RagasEvaluatorConfig + from nat.plugins.ragas.rag_evaluator.register import register_ragas_evaluator + + builder = MagicMock() + builder.get_llm = AsyncMock(return_value=MagicMock()) + builder.get_max_concurrency = MagicMock(return_value=1) + + config = RagasEvaluatorConfig( + llm_name="judge", + metric={"AgentGoalAccuracyWithoutReference": { + "skip": True + }}, + enable_atif_evaluator=True, + sample_type="multi_turn", + ) + async with register_ragas_evaluator(config=config, builder=builder) as 
evaluator_info: + assert callable(getattr(evaluator_info, "evaluate_atif_fn", None)) + + builder.get_llm.assert_awaited_once()