
Add Tool Failure Evaluator for IntermediateSteps#1816

Open
ericevans-nv wants to merge 4 commits into NVIDIA:develop from ericevans-nv:feat/tool-error-observability

Conversation

Contributor

@ericevans-nv ericevans-nv commented Mar 20, 2026

Description

Summary

Adds a tool_failure evaluator to the nat eval pipeline that computes per-item tool call success rates and surfaces structured error details through IntermediateStep trajectories.

Changes

1. Framework Callback Handlers

  • LangChain (nvidia_nat_langchain): Added on_tool_error callback to capture tool exceptions and emit TOOL_END events with ToolErrorData payloads. Added _run_id_to_tool_name mapping for tool name resolution.
  • ADK (nvidia_nat_adk): Enhanced error handling to emit ToolErrorData in TOOL_END events.
  • Autogen (nvidia_nat_autogen): Enhanced error handling to emit ToolErrorData in TOOL_END events.
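
A stripped-down sketch of the start/error bookkeeping described above. The class and method shapes here are hypothetical illustrations — the actual handlers plug into each framework's callback API and NAT's step manager rather than a plain event list:

```python
from __future__ import annotations


class ToolObservabilityHandler:
    """Hypothetical handler showing run_id -> tool name resolution."""

    def __init__(self) -> None:
        self._run_id_to_tool_name: dict[str, str] = {}
        self.events: list[dict] = []

    def on_tool_start(self, tool_name: str, run_id: str) -> None:
        # Remember which tool this run belongs to so the error callback,
        # which only receives the run_id, can resolve the name later.
        self._run_id_to_tool_name[run_id] = tool_name

    def on_tool_error(self, error: BaseException, run_id: str) -> None:
        tool_name = self._run_id_to_tool_name.pop(run_id, "<unknown>")
        # Emit a TOOL_END-style event carrying a structured error payload
        # instead of silently dropping the failed invocation.
        self.events.append({
            "event_type": "TOOL_END",
            "tool": tool_name,
            "output": {
                "error_type": type(error).__name__,
                "error_message": str(error),
            },
        })
```

The key point is that the error path emits the same terminal event type as the success path, so downstream consumers see every invocation exactly once.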

2. ToolErrorData Model (nvidia_nat_core)

  • Added ToolErrorData to intermediate_step.py with error_type and error_message fields for structured error capture.
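
A minimal sketch of such a model, with field names taken from the PR description (the `content` field appears in the walkthrough below); defaults and the construction pattern are illustrative, not the actual source:

```python
from pydantic import BaseModel


class ToolErrorData(BaseModel):
    """Structured payload describing a failed tool call (illustrative)."""

    content: str = ""        # raw output associated with the failure
    error_type: str = ""     # e.g. "ValueError"
    error_message: str = ""  # human-readable message


# Typical construction from a caught tool exception:
try:
    raise ValueError("bad input")
except Exception as exc:
    err = ToolErrorData(
        content=str(exc),
        error_type=type(exc).__name__,
        error_message=str(exc),
    )
```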

3. Tool Failure Evaluator (nvidia_nat_eval)

  • New tool_failure evaluator plugin:
    • evaluator.py — ToolFailureEvaluator implementing BaseEvaluator for IntermediateStep trajectory evaluation. Detects errors via ToolErrorData instances or dict-based error payloads.
    • models.py — Pydantic models with None defaults for empty lists (cleaner JSON output).
    • register.py — Plugin registration with ToolFailureEvaluatorConfig.
  • Reasoning output:
    • per_tool_summary and failed_tools are None when no failures occur
    • per_tool_summary only includes tools with failures
    • score is (total - failed) / total success rate
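
The scoring rules above can be sketched as follows. The flat record shape is an assumption for illustration — the real evaluator walks IntermediateStep trajectories and detects errors via ToolErrorData or dict payloads:

```python
from __future__ import annotations

from collections import Counter


def tool_failure_score(tool_ends: list[dict]) -> tuple[float, dict[str, int] | None]:
    """Compute the per-item success rate and a per-tool failure breakdown.

    Each record is assumed to look like {"tool": <name>, "error": <bool>}.
    """
    total = len(tool_ends)
    failures = Counter(t["tool"] for t in tool_ends if t["error"])
    failed = sum(failures.values())
    score = 1.0 if total == 0 else (total - failed) / total
    # The summary only includes tools with failures, and is None when
    # nothing failed (keeps the JSON output clean).
    per_tool_summary = dict(failures) or None
    return score, per_tool_summary
```

For example, four tool calls with one failure yield a score of 0.75 and a summary listing only the failing tool.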

Tests

  • Unit tests — Comprehensive tests for error detection logic, model population, and edge cases
  • Framework validation — Verified across ADK, Autogen, and LangChain frameworks with multiple error types

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a tool failure evaluator to measure tool call success rates and failure breakdowns.
    • Tool errors are now captured with structured data including error type, message, and full content.
  • Bug Fixes

    • Improved tool error handling and propagation across multiple callback handler integrations.

…r and ATIF converter

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
@ericevans-nv ericevans-nv self-assigned this Mar 20, 2026
@ericevans-nv ericevans-nv requested a review from a team as a code owner March 20, 2026 23:41
@ericevans-nv ericevans-nv added feature request New feature or request non-breaking Non-breaking change labels Mar 20, 2026

coderabbitai bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This change introduces structured tool error handling across the framework. A new ToolErrorData model captures error details, which are then detected and propagated through the ATIF converter into step metadata. A new tool failure evaluator plugin computes tool success metrics, while callback handlers in LangChain, ADK, and AutoGen now emit structured error data when tools fail.

Changes

  • Core Tool Error Data Model — packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py: Added ToolErrorData Pydantic model with content, error_type, and error_message fields to represent structured tool failure output.
  • ATIF Converter Tool Error Handling — packages/nvidia_nat_core/src/nat/utils/atif_converter.py, packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py: Modified the converter to detect ToolErrorData in raw output, construct tool_error structures, and propagate them into ATIF step "extra" metadata under a tool_errors array. Added a test fixture and a TestToolErrorATIFConversion class validating error detection and parity between batch and streaming conversion.
  • Tool Failure Evaluator Plugin — packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/*, packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/*: Added a new evaluator plugin computing tool success rate and per-tool failure breakdown. Includes the ToolFailureEvaluator implementation, ToolFailureEvaluatorConfig registration, _ToolCall/_ToolSummary/_ToolFailureReasoning output models, and comprehensive test coverage validating error aggregation across mixed success/failure trajectories.
  • LangChain Callback Handler Tool Error Support — packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py, packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py: Added an on_tool_error method to LangchainProfilerHandler emitting a TOOL_END payload with ToolErrorData output. Changed tool schema parsing failures from exception-level to debug-level logging. Added test assertions validating error fields and tool context inheritance from a preceding on_tool_start.
  • ADK Callback Handler Tool Error Support — packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py, packages/nvidia_nat_adk/tests/test_adk_callback_handler.py: Modified the _tool_use_monkey_patch wrapper to emit a TOOL_END payload with ToolErrorData on exception. Changed exception logging from exception-level to error-level. Updated the test to validate ToolErrorData fields and strict event ordering (TOOL_START then TOOL_END).
  • AutoGen Callback Handler Tool Error Support — packages/nvidia_nat_autogen/src/nat/plugins/autogen/callback_handler.py, packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py: Updated the tool wrapper to emit ToolErrorData instead of a plain string on exception and removed the TraceMetadata.error fallback. Refactored tool input extraction to use the args[1] dict instead of kwargs serialization. Added a test validating structured error output and a new test for run_json-style input extraction.
  • Plugin Registration — packages/nvidia_nat_eval/src/nat/plugins/eval/register.py, packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py: Added an import of register_tool_failure_evaluator to the plugin registry and defined a package-level __init__.py exporting ToolFailureEvaluator and ToolFailureEvaluatorConfig.

Sequence Diagram

sequenceDiagram
    participant User
    participant Handler as Callback Handler<br/>(LangChain/ADK/AutoGen)
    participant StepMgr as Step Manager
    participant ATIFConv as ATIF Converter
    participant Evaluator as Tool Failure Evaluator

    User->>Handler: Execute Tool (on_tool_start)
    Handler->>StepMgr: Push TOOL_START event
    StepMgr->>ATIFConv: Queue intermediate step
    
    User->>Handler: Tool Fails (on_tool_error)
    Handler->>Handler: Create ToolErrorData<br/>(content, error_type, error_message)
    Handler->>StepMgr: Push TOOL_END with ToolErrorData output
    StepMgr->>ATIFConv: Queue TOOL_END step with error
    
    ATIFConv->>ATIFConv: Detect ToolErrorData in output
    ATIFConv->>ATIFConv: Append tool_error to step.extra["tool_errors"]
    
    User->>Evaluator: Evaluate trajectory
    Evaluator->>Evaluator: Iterate TOOL_END steps
    Evaluator->>Evaluator: Classify as error if ToolErrorData present
    Evaluator->>Evaluator: Aggregate per-tool failure counts
    Evaluator->>Evaluator: Compute success_rate = (total - failed) / total
    Evaluator-->>User: Return EvalOutputItem with score & tool summaries

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed — The title is concise (48 characters), descriptive, and clearly summarizes the main objective of adding a tool failure evaluator for intermediate steps.
  • Docstring Coverage — ✅ Passed — Docstring coverage is 88.24%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py (1)

20-20: Use a dict or stub in the test instead of importing ToolMessage to reduce test coupling.

atif_converter.py only accesses status and content attributes, so a simple dict will exercise the same code path. Note that langchain-core is not explicitly declared in packages/nvidia_nat_core/pyproject.toml despite being imported throughout the source code (via transitive dependencies). The test should avoid hardwiring this dependency; use a minimal test object instead.
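
A minimal test double along the suggested lines (the class name is hypothetical; the attribute names come from the comment above):

```python
class FakeToolMessage:
    """Test double exposing only the attributes the converter reads."""

    def __init__(self, status: str, content: str) -> None:
        self.status = status
        self.content = content


# The converter's error path only cares about these two attributes:
msg = FakeToolMessage(status="error", content="division by zero")
```

Passing such a stub (or an equivalent plain dict) into the function under test exercises the same code path without pulling in langchain-core.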

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py` at line 20,
Replace the direct import of ToolMessage in the test with a minimal test double
(either a plain dict or a small stub object) that exposes the two attributes the
production code uses: status and content; update the test in
test_atif_converter.py to construct and pass this dict/stub into the
atif_converter call (the same function/class under test in atif_converter.py)
instead of a ToolMessage instance so the test no longer depends on
langchain_core via ToolMessage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nvidia_nat_core/src/nat/utils/atif_converter.py`:
- Around line 104-106: Remove the literal "TODO" from the docstring for
_extract_tool_error; instead rephrase the note to a neutral statement such as
"Return a model instead of a plain dict once ATIF spec adds error support" or
move that implementation note to an external tracker, so the docstring no longer
contains TODO/FIXME placeholder text and continues to describe the current
behavior of _extract_tool_error.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 101-103: The loop that currently skips agent steps with empty
step.tool_calls (in the evaluator iterating "for step in steps") must be changed
to also inspect step.extra["tool_errors"] and treat each orphaned tool_error as
a failed tool call; specifically, where the code checks "if not step.tool_calls
or step.source != 'agent': continue", remove or adjust the early continue to
allow processing when step.extra.get('tool_errors') exists, count each entry in
step.extra["tool_errors"] toward the failure total, and add the same
regression/recording logic you use for normal tool_calls (e.g., increment
failure counters and mark the step as regressed) so unmatched tool_errors are
not ignored.
- Around line 105-123: The current logic pairs observations and tool_errors by
list position and tool name (step.tool_calls / observations index and
tool_error_entry["tool"]), which misattributes failures when the same tool is
called multiple times or results arrive out-of-order; change to correlate at the
call level by matching atif_tool_call.tool_call_id (or similar call id on the
tool call object) against ObservationResult.source_call_id for observations and
against tool_error_entry["tool_call_id"] (or add/expect that key) for structured
errors, build a lookup map from source_call_id->ObservationResult to get each
call's observation regardless of position, and when matching a tool_errors entry
consume it exactly once (e.g., remove or mark it as used) so entries aren’t
reused across multiple atif_tool_call instances; update handling around
observation_content, is_error, matching_extra, failed_tool_calls, and
per_tool_summary to use these per-call matches.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py`:
- Around line 39-42: The Field "attempts" in models.py is documented as "Ordered
list of every call to this tool" but ToolFailureEvaluator builds attempts = [a
for a in attempts if a.error is not None], so update the mismatch: either modify
ToolFailureEvaluator (the logic that filters attempts) to include successful
calls as well, or change the Field description for attempts (and any related
docstrings) to state explicitly that it contains only failed attempts; make this
change where attempts: list[_ToolCall] = Field(...) is declared and adjust
tests/docs referencing ToolFailureEvaluator accordingly so schema and
implementation match.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py`:
- Around line 30-34: The max_concurrency Pydantic field currently allows
zero/negative values and must be validated at the config boundary: update the
Field definition for max_concurrency in register.py (the max_concurrency: int =
Field(...)) to enforce a positive bound (e.g. add gt=0) so invalid CLI/user
configs are rejected immediately while keeping the default 8; no runtime
scheduling changes needed—just add the Pydantic constraint to the existing Field
(or add an equivalent `@validator` on max_concurrency) to enforce >0.
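
The suggested constraint is a one-line change in Pydantic. A minimal sketch — the real config declares more fields than shown here:

```python
from pydantic import BaseModel, Field, ValidationError


class ToolFailureEvaluatorConfig(BaseModel):
    """Illustrative fragment: reject non-positive concurrency at config time."""

    max_concurrency: int = Field(default=8, gt=0)


# The default still works unchanged:
cfg = ToolFailureEvaluatorConfig()

# Zero/negative values are rejected at the config boundary:
try:
    ToolFailureEvaluatorConfig(max_concurrency=0)
    rejected = False
except ValidationError:
    rejected = True
```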

In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`:
- Around line 316-318: Replace the temporary list comprehension in _get_tool_end
with a generator passed to next to avoid creating an intermediate list and
satisfy Ruff: inside function _get_tool_end(stats:
list[IntermediateStepPayload]) use next(s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END) (or next((s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END), None) if you want to return None when missing)
so you still return the first IntermediateStepPayload matching
IntermediateStepType.TOOL_END without building a list.

---

Nitpick comments:
In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py`:
- Line 20: Replace the direct import of ToolMessage in the test with a minimal
test double (either a plain dict or a small stub object) that exposes the two
attributes the production code uses: status and content; update the test in
test_atif_converter.py to construct and pass this dict/stub into the
atif_converter call (the same function/class under test in atif_converter.py)
instead of a ToolMessage instance so the test no longer depends on
langchain_core via ToolMessage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bdb0bc8d-0b5f-4330-a44a-4660a278e89e

📥 Commits

Reviewing files that changed from the base of the PR and between 7629460 and 13dc6bc.

📒 Files selected for processing (10)
  • packages/nvidia_nat_core/src/nat/utils/atif_converter.py
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/register.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
  • packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
  • packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py

…ailed_attempts

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py`:
- Around line 389-410: The code is currently streaming raw exception strings
into ToolMessage.content and trace payloads (error_content -> ToolMessage,
TraceMetadata.tool_outputs, StreamEventData.output/payload), which may leak PII;
change the behavior in the error handling block that builds
error_content/ToolMessage/IntermediateStepPayload so that you do NOT store
str(error) directly—replace error_content with a redacted summary (e.g., only
type(error).__name__ plus a fixed message like "<redacted error details>"), or a
safely truncated/hashed version, and only include the full error string behind
an explicit debug/opt-in flag; update the creation of ToolMessage,
TraceMetadata(tool_outputs=...), and StreamEventData(... output/payload ...) to
use the redacted summary variable instead of the raw error string (referencing
error_content, ToolMessage, IntermediateStepPayload, TraceMetadata,
StreamEventData, and the run_id_str mapping).
- Around line 386-395: The tool_call_id fallback should be the per-invocation
run_id_str rather than tool_name to avoid collisions across invocations; update
the code that sets tool_call_id (currently: tool_call_id: str =
kwargs.get("tool_call_id") or tool_name) to use run_id_str as the final fallback
(e.g., kwargs.get("tool_call_id") or run_id_str) so ToolMessage instances
(created in ToolMessage(...)) always have a unique per-invocation identifier;
adjust any related logic that relies on tool_call_id to accept this change.
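
The redaction suggestion above can be sketched as a small helper (the function name and opt-in flag are hypothetical):

```python
def summarize_error(error: BaseException, include_details: bool = False) -> str:
    """Return a safe summary of an exception.

    By default only the exception type is exposed; the raw message (which
    may contain PII) is included only behind an explicit opt-in flag.
    """
    if include_details:
        return f"{type(error).__name__}: {error}"
    return f"{type(error).__name__}: <redacted error details>"
```

The redacted string would then be what flows into ToolMessage content and trace payloads, with the full message reserved for explicit debug runs.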

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e28972f-4112-403e-8173-46ba928580f2

📥 Commits

Reviewing files that changed from the base of the PR and between 13dc6bc and 69e78d0.

📒 Files selected for processing (7)
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
  • packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
  • packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
✅ Files skipped from review due to trivial changes (4)
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
  • packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py (1)

107-123: Index-based pairing may misattribute errors when the same tool is called multiple times.

The current logic pairs observations and tool_errors by list position (index) and by matching tool_error_entry["tool"] to function_name. This approach has limitations:

  1. If one step has two calls to the same tool and only the second fails, the first matching tool_errors entry is applied to the first call
  2. Each tool_errors entry can be consumed multiple times if the tool is called more than once

Consider correlating at the call level using tool_call_id and ObservationResult.source_call_id to ensure each error is matched to exactly one call.
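
A sketch of that id-based correlation. The input shapes are assumptions — plain dicts standing in for the ATIF tool-call, observation, and tool_errors objects:

```python
from __future__ import annotations


def pair_calls(
    tool_calls: list[dict],
    observations: list[dict],
    tool_errors: list[dict],
) -> list[tuple[dict, dict | None, dict | None]]:
    """Correlate each tool call with its observation and error by call id.

    Matching by id instead of list position avoids misattributing a failure
    when the same tool is called several times in one step. Each error
    entry is consumed at most once.
    """
    obs_by_id = {o["source_call_id"]: o for o in observations}
    unused_errors = list(tool_errors)
    paired = []
    for call in tool_calls:
        obs = obs_by_id.get(call["tool_call_id"])
        err = next(
            (e for e in unused_errors
             if e.get("tool_call_id") == call["tool_call_id"]),
            None,
        )
        if err is not None:
            unused_errors.remove(err)  # consume exactly once
        paired.append((call, obs, err))
    return paired
```

With this shape, two calls to the same tool where only one fails get distinct pairings, and an error entry can never be applied to more than one call.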

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`
around lines 107 - 123, The pairing logic currently matches observations and
tool_errors by list index and tool name (in the loop over step.tool_calls, using
observations[index] and matching tool_error_entry.get("tool") ==
atif_tool_call.function_name), which can misattribute errors for repeated calls;
change it to match at the call level by using the call identifiers: for each
atif_tool_call use atif_tool_call.tool_call_id to locate the corresponding
ObservationResult whose source_call_id equals that id (instead of
observations[index]), then search step.extra["tool_errors"] for an entry whose
tool matches atif_tool_call.function_name and whose call id field (e.g.
tool_error_entry.get("call_id") or tool_error_entry.get("tool_call_id")) equals
atif_tool_call.tool_call_id; if tool_error entries lack an explicit call id,
maintain a consumed set of matched tool_error_entry objects so each entry is
applied only once per call; update uses of is_error, matching_extra,
observation_content accordingly so each atif_tool_call is correlated to its
exact ObservationResult and tool_error_entry and entries are not reused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 101-103: The loop that currently skips any step with no
step.tool_calls or step.source != "agent" misses orphan errors in
step.extra["tool_errors"]; update the iteration inside the evaluator (the for
steps loop) to also read step.extra.get("tool_errors", []) and treat those as
failure tool calls when present (i.e., accumulate/process them even if
step.tool_calls is empty), and only bypass steps that have neither tool_calls
nor extra.get("tool_errors", []) or are not from the agent; ensure you reference
step.tool_calls, step.extra.get("tool_errors", []), and the existing
failure-accumulation logic so orphan tool_errors are counted toward the failure
score.

In
`@packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py`:
- Around line 248-257: The test function
test_dict_without_error_type_not_treated_as_error currently asserts
result.reasoning.failed_tools is None but the Pydantic model returns an empty
list for no failures; update the assertion in
test_dict_without_error_type_not_treated_as_error to expect an empty list
(assert result.reasoning.failed_tools == []) or alternatively change the model
default for the failed_tools field to Optional[List[...] ] = None so the
evaluator (ToolFailureEvaluator and its evaluate_item path) returns None—pick
one consistent approach and update the test or model accordingly.
- Around line 109-120: The test asserts that _ToolFailureReasoning.failed_tools
and .per_tool_summary are None for an empty trajectory, but the
_ToolFailureReasoning model uses default_factory=list so those fields are [] —
update the test function test_empty_trajectory_produces_default_reasoning to
assert empty lists (e.g., assert reasoning.failed_tools == [] and assert
reasoning.per_tool_summary == []) instead of checking is None, leaving the model
defaults unchanged; ensure you reference _ToolFailureReasoning and the
attributes failed_tools and per_tool_summary when making the change.

---

Nitpick comments:
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 107-123: The pairing logic currently matches observations and
tool_errors by list index and tool name (in the loop over step.tool_calls, using
observations[index] and matching tool_error_entry.get("tool") ==
atif_tool_call.function_name), which can misattribute errors for repeated calls;
change it to match at the call level by using the call identifiers: for each
atif_tool_call use atif_tool_call.tool_call_id to locate the corresponding
ObservationResult whose source_call_id equals that id (instead of
observations[index]), then search step.extra["tool_errors"] for an entry whose
tool matches atif_tool_call.function_name and whose call id field (e.g.
tool_error_entry.get("call_id") or tool_error_entry.get("tool_call_id")) equals
atif_tool_call.tool_call_id; if tool_error entries lack an explicit call id,
maintain a consumed set of matched tool_error_entry objects so each entry is
applied only once per call; update uses of is_error, matching_extra,
observation_content accordingly so each atif_tool_call is correlated to its
exact ObservationResult and tool_error_entry and entries are not reused.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8e81ca4-dff4-4428-84f5-137d794a0c79

📥 Commits

Reviewing files that changed from the base of the PR and between 69e78d0 and e5998a2.

📒 Files selected for processing (11)
  • packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py
  • packages/nvidia_nat_adk/tests/test_adk_callback_handler.py
  • packages/nvidia_nat_autogen/src/nat/plugins/autogen/callback_handler.py
  • packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py
  • packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py
  • packages/nvidia_nat_core/src/nat/utils/atif_converter.py
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
  • packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
  • packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
✅ Files skipped from review due to trivial changes (1)
  • packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
  • packages/nvidia_nat_core/src/nat/utils/atif_converter.py

@ericevans-nv ericevans-nv changed the title Add tool failure evaluator for IntermediateStep and ATIF formats Add ool failure evaluator for IntermediateStep and ATIF formats Mar 22, 2026
@ericevans-nv ericevans-nv changed the title Add ool failure evaluator for IntermediateStep and ATIF formats Add Tool Failure valuator for IntermediateStep and ATIF formats Mar 22, 2026
@ericevans-nv ericevans-nv changed the title Add Tool Failure valuator for IntermediateStep and ATIF formats Add Tool Failure valuator for IntermediateStep format Mar 22, 2026
@ericevans-nv ericevans-nv force-pushed the feat/tool-error-observability branch from e5998a2 to a3a777f Compare March 22, 2026 23:57
@ericevans-nv ericevans-nv changed the title Add Tool Failure valuator for IntermediateStep format Add Tool Failure Evaluator for IntermediateSteps Mar 23, 2026
@ericevans-nv ericevans-nv force-pushed the feat/tool-error-observability branch from a3a777f to e31de73 Compare March 23, 2026 00:58

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py (1)

311-313: Use next(...) instead of list comprehension with single element slice.

The static analysis tool (Ruff RUF015) flags this pattern. Using next() is more idiomatic and avoids creating an intermediate list.

♻️ Suggested fix
 def _get_tool_end(stats: list[IntermediateStepPayload]) -> IntermediateStepPayload:
     """Return the first TOOL_END payload from collected stats."""
-    return [s for s in stats if s.event_type == IntermediateStepType.TOOL_END][0]
+    return next(s for s in stats if s.event_type == IntermediateStepType.TOOL_END)

As per coding guidelines: "Use ruff via ruff check --fix as configured in pyproject.toml; fix warnings unless explicitly ignored in pyproject.toml."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`
around lines 311 - 313, Replace the list comprehension + [0] in _get_tool_end
with a generator passed to next to avoid creating an intermediate list: locate
the function _get_tool_end and change the return to use next(s for s in stats if
s.event_type == IntermediateStepType.TOOL_END), preserving the current semantics
(let it raise if no match) or optionally provide a default to next(...) if you
prefer a different error handling policy; ensure types remain
IntermediateStepPayload and IntermediateStepType as referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`:
- Around line 311-313: Replace the list comprehension + [0] in _get_tool_end
with a generator passed to next to avoid creating an intermediate list: locate
the function _get_tool_end and change the return to use next(s for s in stats if
s.event_type == IntermediateStepType.TOOL_END), preserving the current semantics
(let it raise if no match) or optionally provide a default to next(...) if you
prefer a different error handling policy; ensure types remain
IntermediateStepPayload and IntermediateStepType as referenced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b992bc03-b51a-4860-9323-07798e4238b0

📥 Commits

Reviewing files that changed from the base of the PR and between a3a777f and e31de73.

📒 Files selected for processing (14)
  • packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py
  • packages/nvidia_nat_adk/tests/test_adk_callback_handler.py
  • packages/nvidia_nat_autogen/src/nat/plugins/autogen/callback_handler.py
  • packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py
  • packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py
  • packages/nvidia_nat_core/src/nat/utils/atif_converter.py
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
  • packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/__init__.py
  • packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/test_tool_failure_evaluator.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
  • packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
✅ Files skipped from review due to trivial changes (4)
  • packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/__init__.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py
  • packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
  • packages/nvidia_nat_adk/tests/test_adk_callback_handler.py

@ericevans-nv ericevans-nv force-pushed the feat/tool-error-observability branch from e31de73 to e5dcb29 Compare March 23, 2026 04:35
…ndlers

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
@ericevans-nv ericevans-nv force-pushed the feat/tool-error-observability branch from e5dcb29 to d1c6a13 Compare March 23, 2026 17:27
Contributor


nvidia-nat-eval is only meant to host the eval harness. evaluators/metrics don't belong here.

Contributor Author


I am open to moving it wherever you think is best.

