31 changes: 31 additions & 0 deletions docs/source/extend/custom-components/custom-evaluator.md

### Using Ragas Multi-turn Metrics with ATIF Trajectories

The built-in `ragas` evaluator supports both single-turn and multi-turn evaluation modes. Single-turn mode (`sample_type: single_turn`, the default) is suitable for metrics like `AnswerAccuracy` and `ContextRelevance`. Multi-turn mode (`sample_type: multi_turn`) is required for trajectory-aware metrics such as `AgentGoalAccuracyWithoutReference` and `ToolCallAccuracy`.

In multi-turn mode, the ATIF trajectory is converted into a sequence of Ragas message types:

- Steps with `source: user` become `HumanMessage`
- Steps with `source: agent` become `AIMessage`, with ATIF `tool_calls` mapped to Ragas `ToolCall` objects
- Tool observation results become `ToolMessage`

To use multi-turn metrics, add an evaluator with `sample_type: multi_turn` and `enable_atif_evaluator: true`:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
tool_call_accuracy:
_type: ragas
metric: ToolCallAccuracy
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

For `ToolCallAccuracy`, you can optionally supply expected tool calls through the `reference_tool_calls` key in `AtifEvalSample.metadata` when using the Python API.
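As a sketch of that Python API usage (a plain dict stands in for the `AtifEvalSample` object here; the tool names and arguments are made up for illustration, and Ragas expects `ToolCall`-shaped entries with `name` and `args` fields):

```python
# Sketch: attaching expected tool calls for ToolCallAccuracy.
# The "reference_tool_calls" metadata key is read by the ragas evaluator;
# the name/args shape mirrors Ragas ToolCall fields.
reference_tool_calls = [
    {"name": "web_search", "args": {"query": "current GPU driver version"}},
    {"name": "summarize", "args": {"max_tokens": 128}},
]

# A plain dict stands in for AtifEvalSample in this sketch.
sample = {"metadata": {}}
sample["metadata"]["reference_tool_calls"] = reference_tool_calls
```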

### Display all evaluators
To display all evaluators, run the following command:
61 changes: 61 additions & 0 deletions docs/source/improve-workflows/evaluate.md
For a complete list of up-to-date judge LLMs, refer to the Ragas NV metrics leaderboard.

For more information on the prompt used by the judge LLM, refer to the [Ragas NV metrics](https://github.com/explodinggradients/ragas/blob/v0.2.14/src/ragas/metrics/_nv_metrics.py). The prompt for these metrics is not configurable. If you need a custom prompt, you can use the [Tunable RAG Evaluator](#tunable-rag-evaluator) or implement your own evaluator using the [Custom Evaluator](../extend/custom-components/custom-evaluator.md) documentation.

#### Multi-turn Agent Evaluation with Ragas

The metrics listed above use `SingleTurnSample` and evaluate single question-answer pairs. For agent workflows with tool calls, Ragas provides trajectory-aware metrics that require `MultiTurnSample`. These metrics evaluate the full conversation sequence, including tool calls and tool results.

To enable multi-turn evaluation, set `sample_type: multi_turn` and `enable_atif_evaluator: true` in the evaluator config:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
tool_call_accuracy:
_type: ragas
metric: ToolCallAccuracy
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

The multi-turn conversion maps ATIF trajectory steps to Ragas message types:

- ATIF steps with `source: user` become `HumanMessage`
- ATIF steps with `source: agent` become `AIMessage`, with `tool_calls` converted to Ragas `ToolCall` objects
- Observation results from tool executions become `ToolMessage`
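The mapping above can be sketched in plain Python (dicts stand in for the Ragas `HumanMessage`/`AIMessage`/`ToolMessage` classes, and the example trajectory is invented for illustration):

```python
# Simplified sketch of the ATIF-step -> Ragas-message mapping described above.
def step_to_messages(step: dict) -> list[dict]:
    messages = []
    if step["source"] == "user":
        messages.append({"type": "human", "content": step["message"]})
        return messages
    # source == "agent": emit an AIMessage, then one ToolMessage per tool result
    tool_calls = [{"name": tc["function_name"], "args": tc["arguments"]}
                  for tc in step.get("tool_calls", [])]
    messages.append({"type": "ai", "content": step["message"], "tool_calls": tool_calls})
    if tool_calls:  # ToolMessage is only valid after an AIMessage with tool calls
        for result in step.get("observation_results", []):
            messages.append({"type": "tool", "content": result})
    return messages

trajectory = [
    {"source": "user", "message": "What is 2 + 2?"},
    {
        "source": "agent",
        "message": "Let me compute that.",
        "tool_calls": [{"function_name": "calculator", "arguments": {"expr": "2 + 2"}}],
        "observation_results": ["4"],
    },
    {"source": "agent", "message": "The answer is 4."},
]

conversation = [m for step in trajectory for m in step_to_messages(step)]
print([m["type"] for m in conversation])  # → ['human', 'ai', 'tool', 'ai']
```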

The following Ragas metrics are recommended for agent evaluation with multi-turn trajectories:

`AgentGoalAccuracyWithoutReference`: Evaluates whether the agent achieved the user's goal based on the conversation trajectory, without requiring a reference answer.

`ToolCallAccuracy`: Evaluates whether the agent made the correct tool calls with the right arguments. When using this metric, you can provide expected tool calls through the `reference_tool_calls` key in the `AtifEvalSample.metadata` field.

:::{note}
Multi-turn evaluation requires `enable_atif_evaluator: true` since it operates on ATIF trajectories. The default `sample_type` is `single_turn` for backward compatibility.
:::

### Trajectory Evaluator
This evaluator uses the intermediate steps generated by the workflow to evaluate the workflow trajectory. The evaluator configuration includes the evaluator type and any additional parameters required by the evaluator.

The following `ragas` metrics are recommended for RAG-like workflows:
`ContextRelevance`: Evaluates the relevance of the context retrieved by the workflow against the question.
`ResponseGroundedness`: Evaluates how well the response generated by the workflow is grounded in the retrieved context.

##### Multi-turn Ragas Metrics for Agent Evaluation

For agent workflows that involve tool calls and multi-step interactions, you can use Ragas multi-turn metrics. These metrics operate on the full conversation trajectory rather than a single question-answer pair.

Set `sample_type: multi_turn` and `enable_atif_evaluator: true` to activate this mode:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

The `sample_type` parameter controls how ATIF trajectories are converted for Ragas:

- `single_turn` (default): Converts to `SingleTurnSample` with `user_input`, `response`, and `retrieved_contexts`. Suitable for `AnswerAccuracy`, `ContextRelevance`, and `ResponseGroundedness`.
- `multi_turn`: Converts to `MultiTurnSample` with a full `HumanMessage`/`AIMessage`/`ToolMessage` conversation sequence. Required for `AgentGoalAccuracyWithoutReference`, `ToolCallAccuracy`, and other trajectory-aware metrics.
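The selection can be sketched as a simple dispatch (plain dicts stand in for the two Ragas sample classes, and the inputs are invented for illustration):

```python
# Sketch: how sample_type selects the Ragas sample shape.
def build_ragas_sample(sample_type, user_input, response, contexts, conversation):
    if sample_type == "multi_turn":
        # MultiTurnSample: user_input carries the whole message sequence
        return {"user_input": conversation}
    # SingleTurnSample: flat question/answer/context fields
    return {"user_input": user_input, "response": response, "retrieved_contexts": contexts}

single = build_ragas_sample("single_turn", "What is 2 + 2?", "4", ["math context"], None)
multi = build_ragas_sample("multi_turn", None, None, None,
                           [{"type": "human", "content": "What is 2 + 2?"}])
```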

#### Agent Trajectory Evaluator
The `trajectory` evaluator uses LangChain/LangGraph agent trajectory evaluation to evaluate the workflow. To use the `trajectory` evaluator, add the following configuration to the `config.yml` file.
Notes:
- The custom evaluator is ATIF-only and registered from `nat_simple_web_query_eval`.
- It scores using token cosine similarity and includes trajectory metadata (`trajectory_tool_call_count`) in reasoning.

## 3) Test ATIF multi-turn Ragas evaluators

For agent workflows with tool calls, you can use Ragas multi-turn metrics such as `AgentGoalAccuracyWithoutReference`
and `ToolCallAccuracy`. These require `sample_type: multi_turn` and `enable_atif_evaluator: true` in the evaluator config:

```yaml
eval:
evaluators:
agent_goal:
_type: ragas
metric: AgentGoalAccuracyWithoutReference
llm_name: nim_rag_eval_llm
enable_atif_evaluator: true
sample_type: multi_turn
```

Multi-turn mode converts ATIF trajectory steps into a Ragas `MultiTurnSample` with `HumanMessage`,
`AIMessage` (with `ToolCall`), and `ToolMessage` sequences, preserving the full agent interaction history.

## 4) Optional quick compare

Compare two run directories:

import logging
import typing
from collections.abc import Sequence
from typing import Literal

from tqdm import tqdm

from nat.data_models.atif import ATIFObservationResult
from nat.data_models.atif import ATIFStep
from nat.data_models.atif import ATIFTrajectory
from nat.data_models.evaluator import EvalOutput
from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample
from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSampleList
from nat.plugins.eval.utils.tqdm_position_registry import TqdmPositionRegistry
from nat.utils.atif_message_utils import message_to_text
if typing.TYPE_CHECKING:
    from ragas import EvaluationDataset
    from ragas.llms import LangchainLLMWrapper
    from ragas.messages import AIMessage
    from ragas.messages import HumanMessage
    from ragas.messages import ToolMessage
    from ragas.metrics import Metric

logger = logging.getLogger(__name__)

SampleType = Literal["single_turn", "multi_turn"]


def _observation_result_to_text(result: ATIFObservationResult) -> str:
    return message_to_text(result.content)
def _trajectory_to_retrieved_contexts(trajectory: ATIFTrajectory) -> list[str]:
    return contexts


def _join_non_empty(parts: Sequence[str], separator: str = "\n\n") -> str:
    return separator.join([part for part in parts if part])


def _atif_step_to_ragas_messages(step: ATIFStep) -> list["HumanMessage | AIMessage | ToolMessage"]:
    """Convert a single ATIF step into one or more RAGAS message objects.

    Mapping:

    * source="user" → HumanMessage
    * source="agent" → AIMessage (with optional ToolCall list),
      followed by ToolMessage per observation result
    * source="system" → HumanMessage (best-effort; RAGAS has no SystemMessage)
    """
    from ragas.messages import AIMessage as RagasAIMessage
    from ragas.messages import HumanMessage as RagasHumanMessage
    from ragas.messages import ToolCall as RagasToolCall
    from ragas.messages import ToolMessage as RagasToolMessage

    messages: list[RagasHumanMessage | RagasAIMessage | RagasToolMessage] = []
    text = message_to_text(step.message)

    if step.source == "user":
        messages.append(RagasHumanMessage(content=text))
        return messages

    if step.source == "system":
        messages.append(RagasHumanMessage(content=text))
        return messages

    ragas_tool_calls = []
    if step.tool_calls:
        for tc in step.tool_calls:
            ragas_tool_calls.append(RagasToolCall(name=tc.function_name, args=tc.arguments))

    observation_texts = []
    if step.observation and step.observation.results:
        for result in step.observation.results:
            tool_content = _observation_result_to_text(result)
            if tool_content:
                observation_texts.append(tool_content)

    # RAGAS only allows ToolMessage after an AIMessage with tool_calls.
    # If ATIF contains observation results without explicit tool_calls,
    # fold them into the AI content as a best-effort representation.
    ai_content = text
    if observation_texts and not ragas_tool_calls:
        ai_content = _join_non_empty([text, *observation_texts])

    messages.append(RagasAIMessage(content=ai_content, tool_calls=ragas_tool_calls or None))

    if ragas_tool_calls:
        for observation_text in observation_texts:
            messages.append(RagasToolMessage(content=observation_text))

    return messages


def _atif_trajectory_to_multi_turn_messages(
        trajectory: ATIFTrajectory) -> list["HumanMessage | AIMessage | ToolMessage"]:
    """Convert an entire ATIF trajectory into a RAGAS multi-turn message sequence."""
    messages: list = []
    for step in trajectory.steps:
        messages.extend(_atif_step_to_ragas_messages(step))
    return messages


class RAGAtifEvaluator:

    def __init__(
        self,
        evaluator_llm: "LangchainLLMWrapper",
        metrics: Sequence["Metric"],
        max_concurrency: int = 8,
        sample_type: SampleType = "single_turn",
    ):
        self.evaluator_llm = evaluator_llm
        self.metrics = metrics
        self.max_concurrency = max_concurrency
        self.sample_type = sample_type

    def _build_single_turn_sample(self, sample: AtifEvalSample):
        """Build a RAGAS SingleTurnSample from an ATIF eval sample."""
        from ragas import SingleTurnSample

        user_input = trajectory_to_user_input(sample.trajectory)
        reference = sample.expected_output_obj
        response = sample.output_obj
        reference_contexts = [""]
        retrieved_contexts = _trajectory_to_retrieved_contexts(sample.trajectory)
        return SingleTurnSample(
            user_input=user_input,
            reference=reference,
            response=response,
            reference_contexts=reference_contexts,
            retrieved_contexts=retrieved_contexts,
        )

    def _build_multi_turn_sample(self, sample: AtifEvalSample):
        """Build a RAGAS MultiTurnSample from an ATIF eval sample."""
        from ragas import MultiTurnSample

        conversation = _atif_trajectory_to_multi_turn_messages(sample.trajectory)
        reference = sample.expected_output_obj
        kwargs: dict[str, typing.Any] = {"user_input": conversation}
        if reference is not None:
            kwargs["reference"] = str(reference)

        reference_tool_calls = sample.metadata.get("reference_tool_calls")
        if reference_tool_calls is not None:
            kwargs["reference_tool_calls"] = reference_tool_calls

        return MultiTurnSample(**kwargs)

    def atif_samples_to_ragas(self, atif_samples: AtifEvalSampleList) -> "EvaluationDataset":
        """Converts ATIF-native samples into a Ragas-compatible EvaluationDataset.

        Uses SingleTurnSample or MultiTurnSample depending on ``self.sample_type``.
        """
        from ragas import EvaluationDataset

        samples = []
        for sample in atif_samples:
            if self.sample_type == "multi_turn":
                ragas_sample = self._build_multi_turn_sample(sample)
            else:
                ragas_sample = self._build_single_turn_sample(sample)
            samples.append(ragas_sample)
        return EvaluationDataset(samples=samples)

# limitations under the License.

import logging
from typing import Literal

from pydantic import BaseModel
from pydantic import Field
class RagasEvaluatorConfig(EvaluatorLLMConfig, name="ragas"):

    enable_atif_evaluator: bool = Field(
        default=False,
        description="Enable ATIF-native RAGAS evaluator lane. Disabled by default until rollout stabilization.",
    )
    sample_type: Literal["single_turn", "multi_turn"] = Field(
        default="single_turn",
        description=(
            "RAGAS sample type for ATIF evaluation. "
            "'single_turn' uses SingleTurnSample (for AnswerAccuracy, ContextRelevance, etc.). "
            "'multi_turn' uses MultiTurnSample (for AgentGoalAccuracyWithoutReference, ToolCallAccuracy, etc.)."),
    )

    @model_validator(mode="before")
    @classmethod
    def validate_metric(cls, values):
        elif not isinstance(metric, str):
            raise ValueError("Metric must be either a string or a single-item dictionary.")

        if values.get("sample_type") == "multi_turn" and not values.get("enable_atif_evaluator"):
            raise ValueError("sample_type='multi_turn' requires enable_atif_evaluator=True.")

        return values

@property
async def evaluate_atif_fn(atif_samples: AtifEvalSampleList) -> EvalOutput:
metrics=metrics,
max_concurrency=builder.get_max_concurrency(),
input_obj_field=config.input_obj_field) if metrics else None
atif_evaluator = RAGAtifEvaluator(evaluator_llm=llm,
                                  metrics=metrics,
                                  max_concurrency=builder.get_max_concurrency(),
                                  sample_type=config.sample_type) if (metrics
                                                                      and config.enable_atif_evaluator) else None

evaluator_info = EvaluatorInfo(config=config, evaluate_fn=evaluate_fn, description="Evaluator for RAGAS metrics")
if config.enable_atif_evaluator: