feat(benchmarks): add ToolSandbox multi-turn tool-use benchmark #1018

Open: wprazuch wants to merge 6 commits into dev/0.3.0 from wprazuch/toolsandbox-dev-0.3.0

@wprazuch
Contributor

Summary

Integrates Apple's ToolSandbox benchmark. ToolSandbox evaluates stateful, multi-turn tool-use — the model must chain tool calls across a conversation with a simulated user, recovering from state errors (e.g. cellular off → detect → fix → retry). World state is tracked in SQLite and validated against expected milestones.

What's in this PR

| File | Purpose |
|---|---|
| src/nemo_evaluator/benchmarks/toolsandbox.py | ToolSandboxEnvironment: docker/apptainer/subprocess runners; parses result_summary.json |
| docker/Dockerfile.toolsandbox | Standalone Docker image (local runs) |
| docker/Dockerfile.toolsandbox-combined | NEL Next + ToolSandbox venv combined (SLURM subprocess runner) |
| docker/toolsandbox_entrypoint.py | Patches ToolSandbox agent/user factories with NVIDIA NIM-backed classes |
| examples/configs/toolsandbox.yaml | Local Docker run |
| examples/configs/toolsandbox_slurm.yaml | SLURM subprocess run |
| tests/test_environments/test_toolsandbox.py | 24 offline tests (all green) |

Smoke test results ✓

3 scenarios, azure/openai/gpt-4o on inference-api.nvidia.com:

similarity:   0.952
turn_count:   12.3 avg
errors:       0/3 scenarios

Example conversation (send_message_with_contact_content_cellular_off):

  1. search_contacts("Fredrik Thordendal") → phone number found
  2. send_message_with_phone_number(...) → fails (cellular off)
  3. get_cellular_service_status() → confirms off
  4. set_cellular_service_status(on=true) → fixes state
  5. send_message_with_phone_number(...) → succeeds
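
The five-step trajectory above can be sketched as a toy simulation. This is only an illustration of the detect, fix, retry pattern: `ToyWorld`, `CellularOffError`, `send_with_recovery`, the method signatures, and the phone number are hypothetical stand-ins, not ToolSandbox's actual API or its SQLite-backed state.

```python
from dataclasses import dataclass, field


class CellularOffError(RuntimeError):
    """Raised by the toy send tool when cellular service is disabled."""


@dataclass
class ToyWorld:
    # Hypothetical in-memory stand-in for the benchmark's world state.
    cellular_on: bool = False
    contacts: dict = field(
        default_factory=lambda: {"Fredrik Thordendal": "+10000000000"}  # made-up number
    )
    sent: list = field(default_factory=list)

    def search_contacts(self, name):
        return self.contacts[name]

    def get_cellular_service_status(self):
        return self.cellular_on

    def set_cellular_service_status(self, on):
        self.cellular_on = on

    def send_message_with_phone_number(self, phone, content):
        if not self.cellular_on:
            raise CellularOffError("cellular service is off")
        self.sent.append((phone, content))


def send_with_recovery(world, name, content):
    """Replay the trajectory and return the ordered list of tool calls."""
    trace = ["search_contacts"]
    phone = world.search_contacts(name)
    try:
        trace.append("send_message_with_phone_number")
        world.send_message_with_phone_number(phone, content)
    except CellularOffError:
        # Detect the state error, fix it, then retry the failed call.
        trace.append("get_cellular_service_status")
        if not world.get_cellular_service_status():
            trace.append("set_cellular_service_status")
            world.set_cellular_service_status(True)
        trace.append("send_message_with_phone_number")
        world.send_message_with_phone_number(phone, content)
    return trace
```

In the real benchmark the model decides each call itself; the hard-coded control flow here only mirrors the sequence the smoke test observed.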

Runner modes

| Runner | When to use |
|---|---|
| docker (default) | Local runs with Docker daemon |
| apptainer | SLURM clusters; image is a .sif/.sqsh path |
| subprocess | SLURM clusters via Dockerfile.toolsandbox-combined |
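
As a rough sketch of how the three runner modes might map to a launch command: the helper `build_runner_command`, its parameters, the `bin/python` path under the venv, and the entrypoint location are assumptions for illustration. The actual ToolSandboxEnvironment likely also passes environment variables, mounts, and output paths that are omitted here.

```python
def build_runner_command(
    runner,
    image=None,
    entrypoint="docker/toolsandbox_entrypoint.py",     # assumed location at run time
    venv_python="/opt/toolsandbox-venv/bin/python",    # assumed interpreter path
):
    """Return an argv list for the given runner mode (hypothetical helper)."""
    if runner == "docker":
        if not image:
            raise ValueError("docker runner needs an image tag")
        return ["docker", "run", "--rm", image]
    if runner == "apptainer":
        if not image:
            raise ValueError("apptainer runner needs a .sif/.sqsh path")
        return ["apptainer", "run", image]
    if runner == "subprocess":
        # No container nesting: call the entrypoint with the venv interpreter.
        return [venv_python, entrypoint]
    raise ValueError(f"unknown runner: {runner!r}")
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; returning argv lists rather than shell strings avoids quoting issues with image paths.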

Usage

docker build -f docker/Dockerfile.toolsandbox -t toolsandbox-nel:latest .
INFERENCE_API_KEY=... nel eval run examples/configs/toolsandbox.yaml

Notes

  • httpx<0.28 pinned: openai==1.17.0 (ToolSandbox's pin) passes the proxies= kwarg to httpx, which removed it in 0.28.0
  • Factory patching reuses RoleImplType.Gorilla / RoleImplType.GPT_4_o_2024_05_13, so no enum changes are needed
  • Both the agent and the user simulator use the NVIDIA Inference API, so no OpenAI key is required
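
The factory patching mentioned in the notes can be illustrated with a minimal sketch. Everything below is a stand-in: the enum, the two registries, and the classes are re-declared locally so the block runs on its own, and `StockAgent`/`StockUser` are hypothetical placeholders for whatever ToolSandbox registers by default. Only the registry and enum member names come from this PR.

```python
from enum import Enum


class RoleImplType(Enum):
    # Stand-in enum with only the two members this PR reuses.
    Gorilla = "Gorilla"
    GPT_4_o_2024_05_13 = "GPT_4_o_2024_05_13"


class StockAgent:  # hypothetical default agent implementation
    pass


class StockUser:  # hypothetical default user simulator
    pass


class NVIDIANIMAgent:  # placeholder for the NIM-backed agent
    pass


class NVIDIANIMUser:  # placeholder for the NIM-backed user simulator
    pass


# Stand-in registries keyed by the existing enum members.
AGENT_TYPE_TO_FACTORY = {RoleImplType.Gorilla: StockAgent}
USER_TYPE_TO_FACTORY = {RoleImplType.GPT_4_o_2024_05_13: StockUser}


def patch_factories():
    """Swap registry entries in place so the enum itself stays untouched."""
    AGENT_TYPE_TO_FACTORY[RoleImplType.Gorilla] = NVIDIANIMAgent
    USER_TYPE_TO_FACTORY[RoleImplType.GPT_4_o_2024_05_13] = NVIDIANIMUser
```

Because only dictionary values change, any code that selects an implementation by enum member keeps working without a new member being added.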

wprazuch added 6 commits May 18, 2026 15:29
ToolSandbox (https://github.com/apple/ToolSandbox) evaluates stateful,
multi-turn tool-use with a user simulator — the standard seed/solve/verify
loop does not apply.  This integration runs the full benchmark inside a
dedicated Docker container via run_batch() and parses result_summary.json
into the standard NEL bundle format.

Key design choices:
- ToolSandboxEnvironment subclasses EvalEnvironment and overrides run_batch()
- Both agent and user simulator call the NVIDIA Inference API (no OpenAI key)
- docker/toolsandbox_entrypoint.py patches AGENT_TYPE_TO_FACTORY[Gorilla] and
  USER_TYPE_TO_FACTORY[GPT_4_o_2024_05_13] with NVIDIANIMAgent/NVIDIANIMUser,
  which use native function-calling format and read NVIDIA_BASE_URL from env
- URL normalization strips /chat/completions from NEL service URLs before
  passing to the OpenAI SDK base_url
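
The URL normalization in the last bullet might look roughly like the following. The function name `normalize_base_url` and the `/v1` path in the usage comment are assumptions; the PR only states that the `/chat/completions` suffix is stripped before the value is used as the SDK's `base_url`.

```python
def normalize_base_url(url: str) -> str:
    """Strip a trailing /chat/completions so the URL can serve as a base_url."""
    suffix = "/chat/completions"
    trimmed = url.rstrip("/")
    if trimmed.endswith(suffix):
        trimmed = trimmed[: -len(suffix)]
    return trimmed


# Example with an assumed /v1 prefix:
# normalize_base_url("https://inference-api.nvidia.com/v1/chat/completions")
```

Without this step the OpenAI SDK would append its own `/chat/completions` path to an endpoint that already contains it.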

Files added:
  src/nemo_evaluator/benchmarks/toolsandbox.py
  docker/Dockerfile.toolsandbox
  docker/toolsandbox_entrypoint.py
  examples/configs/toolsandbox.yaml
  tests/test_environments/test_toolsandbox.py (21 offline tests, all green)

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Adds two runner modes to ToolSandboxEnvironment:
- apptainer: runs toolsandbox-nel.sif via `apptainer run` (SLURM with
  apptainer, no Docker needed)
- subprocess: runs toolsandbox_entrypoint.py directly as a Python
  subprocess (eval container has ToolSandbox pre-installed in a venv,
  zero nesting)

Also adds Dockerfile.toolsandbox-combined (ToolSandbox in isolated
/opt/toolsandbox-venv alongside NEL Next) and a SLURM example config
that uses the subprocess runner.

24 offline tests, all green.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
OpenAIAPIAgent/User.__init__ reads OPENAI_API_KEY to build a client that
NVIDIANIMAgent immediately replaces — without the placeholder it raises
even though we never use the parent's client.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
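
A minimal sketch of the OPENAI_API_KEY placeholder workaround described in the commit above, assuming it is applied before the parent class constructor runs. The helper name and the placeholder string are hypothetical; the PR does not show the exact value used.

```python
import os


def ensure_placeholder_key(env=os.environ, placeholder="placeholder-not-used"):
    """Set OPENAI_API_KEY only if it is missing, so the parent __init__ can build its client."""
    # setdefault leaves a real key untouched if the caller exported one.
    env.setdefault("OPENAI_API_KEY", placeholder)
    return env["OPENAI_API_KEY"]
```

The placeholder never reaches a real endpoint in this design, since the NIM-backed subclass replaces the parent's client right after construction.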
openai==1.17.0 (pinned by ToolSandbox) passes proxies= to httpx which
removed that argument in 0.28.0, causing TypeError at runtime.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
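
To make the httpx pin from the commit above concrete, here is a small version check. It is a sketch only: `httpx_accepts_proxies` is a hypothetical helper, and it assumes the first two dot-separated version components are plain integers.

```python
def httpx_accepts_proxies(version: str) -> bool:
    """Return True for httpx versions older than 0.28, which still accept proxies=."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (0, 28)
```

At runtime the installed version could be read with `importlib.metadata.version("httpx")` and passed to this function, for example to fail fast with a clear message instead of the TypeError.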
Smoke test revealed the actual output format uses category_aggregated_results
instead of a top-level similarity key.  Overall score comes from
ALL_CATEGORIES; per-category breakdown skips it to avoid duplication.
Sample count reads from per_scenario_results length.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
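
The parsing change described in the commit above could be sketched as follows. The keys `category_aggregated_results`, `ALL_CATEGORIES`, and `per_scenario_results` come from the commit message; the nested layout (a dict per category holding a `similarity` field), the function name, and the output keys are assumptions, since the PR does not show the file itself.

```python
def summarize_results(summary: dict) -> dict:
    """Extract the overall score, per-category scores, and sample count."""
    categories = summary["category_aggregated_results"]
    # The overall score lives under the ALL_CATEGORIES entry (assumed layout).
    overall = categories["ALL_CATEGORIES"]["similarity"]
    # Skip ALL_CATEGORIES in the breakdown to avoid reporting it twice.
    per_category = {
        name: stats["similarity"]
        for name, stats in categories.items()
        if name != "ALL_CATEGORIES"
    }
    return {
        "similarity": overall,
        "per_category": per_category,
        "num_samples": len(summary["per_scenario_results"]),
    }
```

In practice the input would come from `json.load` on result_summary.json; taking a plain dict keeps the function easy to test offline.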
…configs

Smoke test confirmed: inference-api.nvidia.com with azure/openai/gpt-4o
runs 3/3 scenarios with similarity=0.952, no errors, no rate limits.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5998b099-ce11-474f-a974-1d8a310cdac9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.
