# Add Tool Failure Evaluator for IntermediateSteps #1816

ericevans-nv wants to merge 4 commits into NVIDIA:develop from
Conversation
…r and ATIF converter Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
…eat/tool-error-observability
> **Note: Reviews paused.** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
## Walkthrough

This change introduces structured tool error handling across the framework.

## Changes
## Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Handler as Callback Handler<br/>(LangChain/ADK/AutoGen)
    participant StepMgr as Step Manager
    participant ATIFConv as ATIF Converter
    participant Evaluator as Tool Failure Evaluator
    User->>Handler: Execute Tool (on_tool_start)
    Handler->>StepMgr: Push TOOL_START event
    StepMgr->>ATIFConv: Queue intermediate step
    User->>Handler: Tool Fails (on_tool_error)
    Handler->>Handler: Create ToolErrorData<br/>(content, error_type, error_message)
    Handler->>StepMgr: Push TOOL_END with ToolErrorData output
    StepMgr->>ATIFConv: Queue TOOL_END step with error
    ATIFConv->>ATIFConv: Detect ToolErrorData in output
    ATIFConv->>ATIFConv: Append tool_error to step.extra["tool_errors"]
    User->>Evaluator: Evaluate trajectory
    Evaluator->>Evaluator: Iterate TOOL_END steps
    Evaluator->>Evaluator: Classify as error if ToolErrorData present
    Evaluator->>Evaluator: Aggregate per-tool failure counts
    Evaluator->>Evaluator: Compute success_rate = (total - failed) / total
    Evaluator-->>User: Return EvalOutputItem with score & tool summaries
```
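The classification step in the diagram hinges on the `ToolErrorData` payload the PR adds to `intermediate_step.py`. A minimal sketch of the idea, using a plain dataclass for self-containment (the real model is a Pydantic model and may carry more fields; `is_failed_tool_end` is a hypothetical helper illustrating the check, not the evaluator's actual function):

```python
from dataclasses import dataclass

@dataclass
class ToolErrorData:
    # Field names follow the PR description (error_type, error_message);
    # the real model in intermediate_step.py is Pydantic-based.
    error_type: str
    error_message: str

def is_failed_tool_end(output: object) -> bool:
    # Mirrors the evaluator's classification rule: a TOOL_END step counts
    # as a failure when its output payload is a ToolErrorData instance.
    return isinstance(output, ToolErrorData)

err = ToolErrorData(error_type="TimeoutError", error_message="tool timed out")
print(is_failed_tool_end(err))                # a failed call
print(is_failed_tool_end("normal tool output"))  # a successful call
```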
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 7
🧹 Nitpick comments (1)
packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py (1)
**20-20**: Use a dict or stub in the test instead of importing `ToolMessage` to reduce test coupling.

`atif_converter.py` only accesses the `status` and `content` attributes, so a simple dict will exercise the same code path. Note that `langchain-core` is not explicitly declared in `packages/nvidia_nat_core/pyproject.toml` despite being imported throughout the source code (via transitive dependencies). The test should avoid hardwiring this dependency; use a minimal test object instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py` at line 20, Replace the direct import of ToolMessage in the test with a minimal test double (either a plain dict or a small stub object) that exposes the two attributes the production code uses: status and content; update the test in test_atif_converter.py to construct and pass this dict/stub into the atif_converter call (the same function/class under test in atif_converter.py) instead of a ToolMessage instance so the test no longer depends on langchain_core via ToolMessage.
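The suggested test double only needs the two attributes the converter reads. A sketch of such a stub (`FakeToolMessage` is a hypothetical name for the test-local class):

```python
class FakeToolMessage:
    """Minimal stand-in for langchain_core's ToolMessage in tests.

    atif_converter.py only reads .status and .content, so this stub
    exercises the same code path without the langchain_core dependency.
    """
    def __init__(self, content: str, status: str = "error"):
        self.content = content
        self.status = status

stub = FakeToolMessage(content="boom")
print(stub.status, stub.content)  # error boom
```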
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/nvidia_nat_core/src/nat/utils/atif_converter.py`:
- Around line 104-106: Remove the literal "TODO" from the docstring for
_extract_tool_error; instead rephrase the note to a neutral statement such as
"Return a model instead of a plain dict once ATIF spec adds error support" or
move that implementation note to an external tracker, so the docstring no longer
contains TODO/FIXME placeholder text and continues to describe the current
behavior of _extract_tool_error.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 101-103: The loop that currently skips agent steps with empty
step.tool_calls (in the evaluator iterating "for step in steps") must be changed
to also inspect step.extra["tool_errors"] and treat each orphaned tool_error as
a failed tool call; specifically, where the code checks "if not step.tool_calls
or step.source != 'agent': continue", remove or adjust the early continue to
allow processing when step.extra.get('tool_errors') exists, count each entry in
step.extra["tool_errors"] toward the failure total, and add the same
regression/recording logic you use for normal tool_calls (e.g., increment
failure counters and mark the step as regressed) so unmatched tool_errors are
not ignored.
- Around line 105-123: The current logic pairs observations and tool_errors by
list position and tool name (step.tool_calls / observations index and
tool_error_entry["tool"]), which misattributes failures when the same tool is
called multiple times or results arrive out-of-order; change to correlate at the
call level by matching atif_tool_call.tool_call_id (or similar call id on the
tool call object) against ObservationResult.source_call_id for observations and
against tool_error_entry["tool_call_id"] (or add/expect that key) for structured
errors, build a lookup map from source_call_id->ObservationResult to get each
call's observation regardless of position, and when matching a tool_errors entry
consume it exactly once (e.g., remove or mark it as used) so entries aren’t
reused across multiple atif_tool_call instances; update handling around
observation_content, is_error, matching_extra, failed_tool_calls, and
per_tool_summary to use these per-call matches.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py`:
- Around line 39-42: The Field "attempts" in models.py is documented as "Ordered
list of every call to this tool" but ToolFailureEvaluator builds attempts = [a
for a in attempts if a.error is not None], so update the mismatch: either modify
ToolFailureEvaluator (the logic that filters attempts) to include successful
calls as well, or change the Field description for attempts (and any related
docstrings) to state explicitly that it contains only failed attempts; make this
change where attempts: list[_ToolCall] = Field(...) is declared and adjust
tests/docs referencing ToolFailureEvaluator accordingly so schema and
implementation match.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py`:
- Around line 30-34: The max_concurrency Pydantic field currently allows
zero/negative values and must be validated at the config boundary: update the
Field definition for max_concurrency in register.py (the max_concurrency: int =
Field(...)) to enforce a positive bound (e.g. add gt=0) so invalid CLI/user
configs are rejected immediately while keeping the default 8; no runtime
scheduling changes needed—just add the Pydantic constraint to the existing Field
(or add an equivalent `@validator` on max_concurrency) to enforce >0.
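The suggested constraint is a one-line change to the Field definition. A sketch of the validated config (only `max_concurrency` is shown; the real `ToolFailureEvaluatorConfig` in register.py has other fields and a different base class):

```python
from pydantic import BaseModel, Field, ValidationError

class ToolFailureEvaluatorConfig(BaseModel):
    # gt=0 rejects zero/negative values when the config is parsed,
    # while keeping the existing default of 8.
    max_concurrency: int = Field(default=8, gt=0)

print(ToolFailureEvaluatorConfig().max_concurrency)  # 8
try:
    ToolFailureEvaluatorConfig(max_concurrency=0)
except ValidationError:
    print("rejected")  # invalid configs fail at the config boundary
```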
In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`:
- Around line 316-318: Replace the temporary list comprehension in _get_tool_end
with a generator passed to next to avoid creating an intermediate list and
satisfy Ruff: inside function _get_tool_end(stats:
list[IntermediateStepPayload]) use next(s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END) (or next((s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END), None) if you want to return None when missing)
so you still return the first IntermediateStepPayload matching
IntermediateStepType.TOOL_END without building a list.
---
Nitpick comments:
In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py`:
- Line 20: Replace the direct import of ToolMessage in the test with a minimal
test double (either a plain dict or a small stub object) that exposes the two
attributes the production code uses: status and content; update the test in
test_atif_converter.py to construct and pass this dict/stub into the
atif_converter call (the same function/class under test in atif_converter.py)
instead of a ToolMessage instance so the test no longer depends on
langchain_core via ToolMessage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bdb0bc8d-0b5f-4330-a44a-4660a278e89e
📒 Files selected for processing (10)
- packages/nvidia_nat_core/src/nat/utils/atif_converter.py
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/register.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
- packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
…ailed_attempts Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py`:
- Around line 389-410: The code is currently streaming raw exception strings
into ToolMessage.content and trace payloads (error_content -> ToolMessage,
TraceMetadata.tool_outputs, StreamEventData.output/payload), which may leak PII;
change the behavior in the error handling block that builds
error_content/ToolMessage/IntermediateStepPayload so that you do NOT store
str(error) directly—replace error_content with a redacted summary (e.g., only
type(error).__name__ plus a fixed message like "<redacted error details>"), or a
safely truncated/hashed version, and only include the full error string behind
an explicit debug/opt-in flag; update the creation of ToolMessage,
TraceMetadata(tool_outputs=...), and StreamEventData(... output/payload ...) to
use the redacted summary variable instead of the raw error string (referencing
error_content, ToolMessage, IntermediateStepPayload, TraceMetadata,
StreamEventData, and the run_id_str mapping).
- Around line 386-395: The tool_call_id fallback should be the per-invocation
run_id_str rather than tool_name to avoid collisions across invocations; update
the code that sets tool_call_id (currently: tool_call_id: str =
kwargs.get("tool_call_id") or tool_name) to use run_id_str as the final fallback
(e.g., kwargs.get("tool_call_id") or run_id_str) so ToolMessage instances
(created in ToolMessage(...)) always have a unique per-invocation identifier;
adjust any related logic that relies on tool_call_id to accept this change.
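The redaction suggested above can be isolated in a small helper. A sketch, assuming the handler calls something like this before building `ToolMessage` and the trace payloads (`redact_tool_error` is a hypothetical function name, not an existing API):

```python
def redact_tool_error(error: BaseException, debug: bool = False) -> str:
    """Summarise an exception without leaking its message, which may
    contain PII. The full string is only exposed behind an explicit
    debug/opt-in flag, as the review suggests."""
    if debug:
        return f"{type(error).__name__}: {error}"
    return f"{type(error).__name__}: <redacted error details>"

err = ValueError("user email is a@b.com")
print(redact_tool_error(err))             # safe summary for traces
print(redact_tool_error(err, debug=True)) # full detail, opt-in only
```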
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3e28972f-4112-403e-8173-46ba928580f2
📒 Files selected for processing (7)
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
✅ Files skipped from review due to trivial changes (4)
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/init.py
- packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
Actionable comments posted: 3
🧹 Nitpick comments (1)
packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py (1)
**107-123**: Index-based pairing may misattribute errors when the same tool is called multiple times.

The current logic pairs observations and `tool_errors` by list position (index) and by matching `tool_error_entry["tool"]` to `function_name`. This approach has limitations:

- If one step has two calls to the same tool and only the second fails, the first matching `tool_errors` entry is applied to the first call
- Each `tool_errors` entry can be consumed multiple times if the tool is called more than once

Consider correlating at the call level using `tool_call_id` and `ObservationResult.source_call_id` to ensure each error is matched to exactly one call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py` around lines 107 - 123, The pairing logic currently matches observations and tool_errors by list index and tool name (in the loop over step.tool_calls, using observations[index] and matching tool_error_entry.get("tool") == atif_tool_call.function_name), which can misattribute errors for repeated calls; change it to match at the call level by using the call identifiers: for each atif_tool_call use atif_tool_call.tool_call_id to locate the corresponding ObservationResult whose source_call_id equals that id (instead of observations[index]), then search step.extra["tool_errors"] for an entry whose tool matches atif_tool_call.function_name and whose call id field (e.g. tool_error_entry.get("call_id") or tool_error_entry.get("tool_call_id")) equals atif_tool_call.tool_call_id; if tool_error entries lack an explicit call id, maintain a consumed set of matched tool_error_entry objects so each entry is applied only once per call; update uses of is_error, matching_extra, observation_content accordingly so each atif_tool_call is correlated to its exact ObservationResult and tool_error_entry and entries are not reused.
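The call-level correlation the review suggests can be sketched with plain dicts standing in for `atif_tool_call`, `ObservationResult`, and the `tool_errors` entries (field shapes here are illustrative; only `tool_call_id` / `source_call_id` matter):

```python
# Two calls to the same tool; observations arrive out of order and only
# the second call failed.
tool_calls = [{"function_name": "search", "tool_call_id": "call-1"},
              {"function_name": "search", "tool_call_id": "call-2"}]
observations = [{"source_call_id": "call-2", "content": "late result"},
                {"source_call_id": "call-1", "content": "first result"}]
tool_errors = [{"tool": "search", "tool_call_id": "call-2",
                "error_type": "Timeout"}]

# Lookup map gives each call its observation regardless of list position.
obs_by_id = {o["source_call_id"]: o for o in observations}
consumed: set[int] = set()  # each tool_errors entry is applied at most once

for call in tool_calls:
    obs = obs_by_id.get(call["tool_call_id"])
    match = next((i for i, e in enumerate(tool_errors)
                  if i not in consumed
                  and e.get("tool_call_id") == call["tool_call_id"]), None)
    if match is not None:
        consumed.add(match)
    print(call["tool_call_id"], obs["content"], match is not None)
```

With index-based pairing, `call-1` would have been handed the `call-2` observation and could have absorbed the error entry; keyed lookup plus the consumed set attributes both correctly.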
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 101-103: The loop that currently skips any step with no
step.tool_calls or step.source != "agent" misses orphan errors in
step.extra["tool_errors"]; update the iteration inside the evaluator (the for
steps loop) to also read step.extra.get("tool_errors", []) and treat those as
failure tool calls when present (i.e., accumulate/process them even if
step.tool_calls is empty), and only bypass steps that have neither tool_calls
nor extra.get("tool_errors", []) or are not from the agent; ensure you reference
step.tool_calls, step.extra.get("tool_errors", []), and the existing
failure-accumulation logic so orphan tool_errors are counted toward the failure
score.
In
`@packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py`:
- Around line 248-257: The test function
test_dict_without_error_type_not_treated_as_error currently asserts
result.reasoning.failed_tools is None but the Pydantic model returns an empty
list for no failures; update the assertion in
test_dict_without_error_type_not_treated_as_error to expect an empty list
(assert result.reasoning.failed_tools == []) or alternatively change the model
default for the failed_tools field to Optional[List[...] ] = None so the
evaluator (ToolFailureEvaluator and its evaluate_item path) returns None—pick
one consistent approach and update the test or model accordingly.
- Around line 109-120: The test asserts that _ToolFailureReasoning.failed_tools
and .per_tool_summary are None for an empty trajectory, but the
_ToolFailureReasoning model uses default_factory=list so those fields are [] —
update the test function test_empty_trajectory_produces_default_reasoning to
assert empty lists (e.g., assert reasoning.failed_tools == [] and assert
reasoning.per_tool_summary == []) instead of checking is None, leaving the model
defaults unchanged; ensure you reference _ToolFailureReasoning and the
attributes failed_tools and per_tool_summary when making the change.
---
Nitpick comments:
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 107-123: The pairing logic currently matches observations and
tool_errors by list index and tool name (in the loop over step.tool_calls, using
observations[index] and matching tool_error_entry.get("tool") ==
atif_tool_call.function_name), which can misattribute errors for repeated calls;
change it to match at the call level by using the call identifiers: for each
atif_tool_call use atif_tool_call.tool_call_id to locate the corresponding
ObservationResult whose source_call_id equals that id (instead of
observations[index]), then search step.extra["tool_errors"] for an entry whose
tool matches atif_tool_call.function_name and whose call id field (e.g.
tool_error_entry.get("call_id") or tool_error_entry.get("tool_call_id")) equals
atif_tool_call.tool_call_id; if tool_error entries lack an explicit call id,
maintain a consumed set of matched tool_error_entry objects so each entry is
applied only once per call; update uses of is_error, matching_extra,
observation_content accordingly so each atif_tool_call is correlated to its
exact ObservationResult and tool_error_entry and entries are not reused.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e8e81ca4-dff4-4428-84f5-137d794a0c79
📒 Files selected for processing (11)
- packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py
- packages/nvidia_nat_adk/tests/test_adk_callback_handler.py
- packages/nvidia_nat_autogen/src/nat/plugins/autogen/callback_handler.py
- packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py
- packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py
- packages/nvidia_nat_core/src/nat/utils/atif_converter.py
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
- packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
✅ Files skipped from review due to trivial changes (1)
- packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_core/src/nat/utils/atif_converter.py
e5998a2 to a3a777f (Compare)
a3a777f to e31de73 (Compare)
🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py (1)
**311-313**: Use `next(...)` instead of a list comprehension with a single-element index.

The static analysis tool (Ruff RUF015) flags this pattern. Using `next()` is more idiomatic and avoids creating an intermediate list.

♻️ Suggested fix

```diff
 def _get_tool_end(stats: list[IntermediateStepPayload]) -> IntermediateStepPayload:
     """Return the first TOOL_END payload from collected stats."""
-    return [s for s in stats if s.event_type == IntermediateStepType.TOOL_END][0]
+    return next(s for s in stats if s.event_type == IntermediateStepType.TOOL_END)
```

As per coding guidelines: "Use ruff via ruff check --fix as configured in pyproject.toml; fix warnings unless explicitly ignored in pyproject.toml."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py` around lines 311 - 313, Replace the list comprehension + [0] in _get_tool_end with a generator passed to next to avoid creating an intermediate list: locate the function _get_tool_end and change the return to use next(s for s in stats if s.event_type == IntermediateStepType.TOOL_END), preserving the current semantics (let it raise if no match) or optionally provide a default to next(...) if you prefer a different error handling policy; ensure types remain IntermediateStepPayload and IntermediateStepType as referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`:
- Around line 311-313: Replace the list comprehension + [0] in _get_tool_end
with a generator passed to next to avoid creating an intermediate list: locate
the function _get_tool_end and change the return to use next(s for s in stats if
s.event_type == IntermediateStepType.TOOL_END), preserving the current semantics
(let it raise if no match) or optionally provide a default to next(...) if you
prefer a different error handling policy; ensure types remain
IntermediateStepPayload and IntermediateStepType as referenced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b992bc03-b51a-4860-9323-07798e4238b0
📒 Files selected for processing (14)
- packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py
- packages/nvidia_nat_adk/tests/test_adk_callback_handler.py
- packages/nvidia_nat_autogen/src/nat/plugins/autogen/callback_handler.py
- packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py
- packages/nvidia_nat_core/src/nat/data_models/intermediate_step.py
- packages/nvidia_nat_core/src/nat/utils/atif_converter.py
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
- packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/__init__.py
- packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
✅ Files skipped from review due to trivial changes (4)
- packages/nvidia_nat_eval/tests/eval/tool_failure_evaluator/init.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
🚧 Files skipped from review as they are similar to previous changes (4)
- packages/nvidia_nat_adk/src/nat/plugins/adk/callback_handler.py
- packages/nvidia_nat_autogen/tests/test_callback_handler_autogen.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
- packages/nvidia_nat_adk/tests/test_adk_callback_handler.py
e31de73 to e5dcb29 (Compare)
…ndlers Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
e5dcb29 to d1c6a13 (Compare)
nvidia-nat-eval is only meant to host the eval harness. evaluators/metrics don't belong here.
I am open to moving it wherever you think is best.
## Description

### Summary

Adds a `tool_failure` evaluator to the `nat eval` pipeline that computes per-item tool call success rates and surfaces structured error details through IntermediateStep trajectories.

### Changes

1. Framework Callback Handlers
   - LangChain (`nvidia_nat_langchain`): Added `on_tool_error` callback to capture tool exceptions and emit `TOOL_END` events with `ToolErrorData` payloads. Added `_run_id_to_tool_name` mapping for tool name resolution.
   - ADK (`nvidia_nat_adk`): Enhanced error handling to emit `ToolErrorData` in `TOOL_END` events.
   - AutoGen (`nvidia_nat_autogen`): Enhanced error handling to emit `ToolErrorData` in `TOOL_END` events.
2. ToolErrorData Model (`nvidia_nat_core`)
   - Added `ToolErrorData` to `intermediate_step.py` with `error_type` and `error_message` fields for structured error capture.
3. Tool Failure Evaluator (`nvidia_nat_eval`)
   - New `tool_failure` evaluator plugin:
     - `evaluator.py` — `ToolFailureEvaluator` implementing `BaseEvaluator` for IntermediateStep trajectory evaluation. Detects errors via `ToolErrorData` instances or dict-based error payloads.
     - `models.py` — Pydantic models with `None` defaults for empty lists (cleaner JSON output).
     - `register.py` — Plugin registration with `ToolFailureEvaluatorConfig`.
   - `per_tool_summary` and `failed_tools` are `None` when no failures occur
   - `per_tool_summary` only includes tools with failures
   - `score` is the `(total - failed) / total` success rate

### Tests
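As a sanity check on the scoring rule above, the `(total - failed) / total` formula can be exercised directly (the empty-trajectory result of 1.0 is an assumed convention here, not stated in the PR):

```python
def success_rate(total_calls: int, failed_calls: int) -> float:
    """score = (total - failed) / total, per the PR description.

    Returns 1.0 when there were no tool calls at all (assumption:
    an empty trajectory counts as fully successful).
    """
    if total_calls == 0:
        return 1.0
    return (total_calls - failed_calls) / total_calls

print(success_rate(5, 2))  # 0.6
print(success_rate(0, 0))  # 1.0
```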
By Submitting this PR I confirm:
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes