feat(benchmarks): add ToolSandbox multi-turn tool-use benchmark by wprazuch · Pull Request #1018 · NVIDIA-NeMo/Evaluator

wprazuch · 2026-05-18T22:30:25Z

Summary

Integrates Apple's ToolSandbox benchmark. ToolSandbox evaluates stateful, multi-turn tool-use — the model must chain tool calls across a conversation with a simulated user, recovering from state errors (e.g. cellular off → detect → fix → retry). World state is tracked in SQLite and validated against expected milestones.

What's in this PR

| File | Purpose |
| --- | --- |
| `src/nemo_evaluator/benchmarks/toolsandbox.py` | `ToolSandboxEnvironment`: docker/apptainer/subprocess runners; parses `result_summary.json` |
| `docker/Dockerfile.toolsandbox` | Standalone Docker image (local runs) |
| `docker/Dockerfile.toolsandbox-combined` | NEL Next + ToolSandbox venv combined (SLURM subprocess runner) |
| `docker/toolsandbox_entrypoint.py` | Patches ToolSandbox agent/user factories with NVIDIA NIM-backed classes |
| `examples/configs/toolsandbox.yaml` | Local Docker run |
| `examples/configs/toolsandbox_slurm.yaml` | SLURM subprocess run |
| `tests/test_environments/test_toolsandbox.py` | 24 offline tests (all green) |

Smoke test results ✓

3 scenarios, azure/openai/gpt-4o on inference-api.nvidia.com:

similarity:   0.952
turn_count:   12.3 avg
errors:       0/3 scenarios

Example conversation (send_message_with_contact_content_cellular_off):

search_contacts("Fredrik Thordendal") → phone number found
send_message_with_phone_number(...) → fails (cellular off)
get_cellular_service_status() → confirms off
set_cellular_service_status(on=true) → fixes state
send_message_with_phone_number(...) → succeeds
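
To make the recovery pattern in this trajectory concrete, here is a minimal Python sketch. It is not ToolSandbox code: `ToyWorld`, `CellularOffError`, `send_with_recovery`, and the phone number are hypothetical stand-ins, and the world state lives in memory rather than in SQLite. Only the tool names are taken from the conversation above.

```python
from dataclasses import dataclass, field


class CellularOffError(RuntimeError):
    """Hypothetical error raised when sending while cellular service is off."""


@dataclass
class ToyWorld:
    # In-memory stand-in for the SQLite-backed world state (assumption).
    cellular_on: bool = False
    contacts: dict = field(
        default_factory=lambda: {"Fredrik Thordendal": "+10000000000"}  # placeholder number
    )
    sent: list = field(default_factory=list)

    def search_contacts(self, name):
        return self.contacts[name]

    def get_cellular_service_status(self):
        return self.cellular_on

    def set_cellular_service_status(self, on):
        self.cellular_on = on

    def send_message_with_phone_number(self, phone_number, content):
        if not self.cellular_on:
            raise CellularOffError("cellular service is off")
        self.sent.append((phone_number, content))


def send_with_recovery(world, name, content):
    """Replay the trajectory above and return the ordered list of tools called."""
    trace = ["search_contacts"]
    number = world.search_contacts(name)
    try:
        trace.append("send_message_with_phone_number")
        world.send_message_with_phone_number(number, content)
    except CellularOffError:
        # Detect the state error, fix the world state, then retry once.
        trace.append("get_cellular_service_status")
        if not world.get_cellular_service_status():
            trace.append("set_cellular_service_status")
            world.set_cellular_service_status(True)
        trace.append("send_message_with_phone_number")
        world.send_message_with_phone_number(number, content)
    return trace
```

Calling `send_with_recovery(ToyWorld(), "Fredrik Thordendal", "hi")` walks through the same five tool calls in the same order, and the final world state has cellular service on and one message recorded.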

Runner modes

| `runner` | When to use |
| --- | --- |
| `docker` (default) | Local runs with Docker daemon |
| `apptainer` | SLURM clusters, `image` = `.sif`/`.sqsh` path |
| `subprocess` | SLURM clusters via `Dockerfile.toolsandbox-combined` |
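
One way the three runner modes could map to a launch command is sketched below. This is an illustration under stated assumptions, not the code in `toolsandbox.py`: `build_command` and its parameters are hypothetical, and the real runners presumably also pass environment variables, mounts, and output paths, which are omitted here.

```python
import sys


def build_command(runner, image=None,
                  entrypoint="docker/toolsandbox_entrypoint.py",
                  python=sys.executable):
    """Return an argv list for the given runner mode (hypothetical helper)."""
    if runner == "docker":
        # Local run against the Docker daemon.
        return ["docker", "run", "--rm", image]
    if runner == "apptainer":
        # `image` is a path to a container image file on the cluster.
        return ["apptainer", "run", image]
    if runner == "subprocess":
        # No container nesting: run the entrypoint with a Python interpreter.
        return [python, entrypoint]
    raise ValueError(f"unknown runner: {runner!r}")
```

For the `subprocess` mode, `python` would likely need to point at the interpreter of the venv that has ToolSandbox installed, rather than the interpreter running the evaluator.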

Usage

docker build -f docker/Dockerfile.toolsandbox -t toolsandbox-nel:latest .
INFERENCE_API_KEY=... nel eval run examples/configs/toolsandbox.yaml

Notes

- `httpx<0.28` is pinned: `openai==1.17.0` (ToolSandbox's pin) passes a `proxies=` kwarg that httpx removed in 0.28.0.
- Factory patching reuses `RoleImplType.Gorilla` / `RoleImplType.GPT_4_o_2024_05_13`, so no enum changes are needed.
- Both the agent and the user simulator use the NVIDIA Inference API; no OpenAI key is required.
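
The factory-patching note can be illustrated with a small self-contained sketch. Everything below is a stand-in: the enum, the two registries, and the four classes are defined locally so the example runs without ToolSandbox installed, and they only mirror the names mentioned in this PR. The actual entrypoint patches ToolSandbox's own registries.

```python
from enum import Enum, auto


class RoleImplType(Enum):
    # Local stand-in for ToolSandbox's enum; only the two reused members.
    Gorilla = auto()
    GPT_4_o_2024_05_13 = auto()


class StockAgent:
    """Placeholder for the agent class originally registered."""


class StockUser:
    """Placeholder for the user simulator originally registered."""


class NVIDIANIMAgent:
    """Placeholder for the NIM-backed agent."""


class NVIDIANIMUser:
    """Placeholder for the NIM-backed user simulator."""


# Stand-in registries keyed by enum member.
AGENT_TYPE_TO_FACTORY = {RoleImplType.Gorilla: StockAgent}
USER_TYPE_TO_FACTORY = {RoleImplType.GPT_4_o_2024_05_13: StockUser}


def patch_factories():
    """Rebind existing keys, so the enum itself needs no new members."""
    AGENT_TYPE_TO_FACTORY[RoleImplType.Gorilla] = NVIDIANIMAgent
    USER_TYPE_TO_FACTORY[RoleImplType.GPT_4_o_2024_05_13] = NVIDIANIMUser
```

Because only the values behind existing keys change, any code that selects an implementation by `RoleImplType` keeps working and picks up the replacement class.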

ToolSandbox (https://github.com/apple/ToolSandbox) evaluates stateful, multi-turn tool-use with a user simulator; the standard seed/solve/verify loop does not apply. This integration runs the full benchmark inside a dedicated Docker container via run_batch() and parses result_summary.json into the standard NEL bundle format.

Key design choices:
- ToolSandboxEnvironment subclasses EvalEnvironment and overrides run_batch()
- Both agent and user simulator call the NVIDIA Inference API (no OpenAI key)
- docker/toolsandbox_entrypoint.py patches AGENT_TYPE_TO_FACTORY[Gorilla] and USER_TYPE_TO_FACTORY[GPT_4_o_2024_05_13] with NVIDIANIMAgent/NVIDIANIMUser, which use native function-calling format and read NVIDIA_BASE_URL from env
- URL normalization strips /chat/completions from NEL service URLs before passing to the OpenAI SDK base_url

Files added:
- src/nemo_evaluator/benchmarks/toolsandbox.py
- docker/Dockerfile.toolsandbox
- docker/toolsandbox_entrypoint.py
- examples/configs/toolsandbox.yaml
- tests/test_environments/test_toolsandbox.py (21 offline tests, all green)

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
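
The URL normalization mentioned in this commit could look roughly like the following. `normalize_base_url` is a hypothetical name, and the handling of trailing slashes is an assumption; the commit only states that `/chat/completions` is stripped before the URL is used as the SDK `base_url`.

```python
def normalize_base_url(url: str) -> str:
    """Strip a trailing /chat/completions so the URL can serve as a base_url.

    Hypothetical helper: the client appends the endpoint path itself,
    so passing the full endpoint URL would duplicate it.
    """
    suffix = "/chat/completions"
    trimmed = url.rstrip("/")  # assumption: tolerate a trailing slash
    if trimmed.endswith(suffix):
        trimmed = trimmed[: -len(suffix)]
    return trimmed
```

A URL that does not end in `/chat/completions` passes through unchanged apart from trailing-slash removal, so the helper is safe to apply to already-normalized values.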

Adds two runner modes to ToolSandboxEnvironment:
- apptainer: runs toolsandbox-nel.sif via `apptainer run` (SLURM with apptainer, no Docker needed)
- subprocess: runs toolsandbox_entrypoint.py directly as a Python subprocess (eval container has ToolSandbox pre-installed in a venv, zero nesting)

Also adds Dockerfile.toolsandbox-combined (ToolSandbox in isolated /opt/toolsandbox-venv alongside NEL Next) and a SLURM example config that uses the subprocess runner. 24 offline tests, all green.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>

OpenAIAPIAgent/User.__init__ reads OPENAI_API_KEY to build a client that NVIDIANIMAgent immediately replaces. Without a placeholder value for that variable, the parent constructor raises even though its client is never used.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
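
A minimal sketch of the placeholder workaround, assuming it is applied before the parent constructors run. `ensure_placeholder_key` and the placeholder string are hypothetical; the `env` parameter exists only so the behaviour can be exercised without touching the real process environment.

```python
import os


def ensure_placeholder_key(env=None, name="OPENAI_API_KEY",
                           placeholder="unused-placeholder"):
    """Set a dummy key only if none is present, and return the effective value."""
    env = os.environ if env is None else env
    # setdefault leaves a real key untouched if the caller already exported one.
    env.setdefault(name, placeholder)
    return env[name]
```

Using `setdefault` rather than plain assignment means a genuine key in the environment is never overwritten.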

openai==1.17.0 (pinned by ToolSandbox) passes proxies= to httpx which removed that argument in 0.28.0, causing TypeError at runtime. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>

Smoke test revealed the actual output format uses category_aggregated_results instead of a top-level similarity key. Overall score comes from ALL_CATEGORIES; per-category breakdown skips it to avoid duplication. Sample count reads from per_scenario_results length. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
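
The parsing rules described in this commit can be sketched as below. The key names `category_aggregated_results`, `ALL_CATEGORIES`, and `per_scenario_results` come from the commit message; the assumption that each category maps to a dict holding a `similarity` value, the function name `parse_summary`, and the shape of the returned dict are all illustrative and may differ from the real file and the real NEL bundle format.

```python
def parse_summary(summary, metric="similarity"):
    """Extract overall score, per-category scores, and sample count.

    Assumes summary["category_aggregated_results"][<category>][metric] exists
    for every category, which is a guess at the schema.
    """
    categories = summary["category_aggregated_results"]
    overall = categories["ALL_CATEGORIES"][metric]
    # Skip the aggregate entry so it is not reported twice.
    per_category = {
        name: values[metric]
        for name, values in categories.items()
        if name != "ALL_CATEGORIES"
    }
    return {
        "score": overall,
        "per_category": per_category,
        "num_samples": len(summary["per_scenario_results"]),
    }
```

In practice the input would come from `json.load` on `result_summary.json`; a missing `ALL_CATEGORIES` entry raises `KeyError` here, which surfaces a format change early instead of silently reporting a zero score.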

…configs Smoke test confirmed: inference-api.nvidia.com with azure/openai/gpt-4o runs 3/3 scenarios with similarity=0.952, no errors, no rate limits. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>

copy-pr-bot · 2026-05-18T22:30:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


coderabbitai · 2026-05-18T22:30:32Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



wprazuch added 6 commits May 18, 2026 15:29

fix(toolsandbox): pin httpx<0.28 to fix openai==1.17.0 proxies compat

955baaa

chore(toolsandbox): use inference-api.nvidia.com + gpt-4o in example …

6b3b416

github-actions Bot added tests CI labels May 18, 2026

wprazuch wants to merge 6 commits into dev/0.3.0 from wprazuch/toolsandbox-dev-0.3.0
