feat(benchmarks): add ToolSandbox multi-turn tool-use benchmark #1018
Open
wprazuch wants to merge 6 commits into
ToolSandbox (https://github.com/apple/ToolSandbox) evaluates stateful, multi-turn tool-use with a user simulator; the standard seed/solve/verify loop does not apply. This integration runs the full benchmark inside a dedicated Docker container via run_batch() and parses result_summary.json into the standard NEL bundle format.

Key design choices:
- ToolSandboxEnvironment subclasses EvalEnvironment and overrides run_batch()
- Both agent and user simulator call the NVIDIA Inference API (no OpenAI key)
- docker/toolsandbox_entrypoint.py patches AGENT_TYPE_TO_FACTORY[Gorilla] and USER_TYPE_TO_FACTORY[GPT_4_o_2024_05_13] with NVIDIANIMAgent/NVIDIANIMUser, which use native function-calling format and read NVIDIA_BASE_URL from env
- URL normalization strips /chat/completions from NEL service URLs before passing to the OpenAI SDK base_url

Files added:
- src/nemo_evaluator/benchmarks/toolsandbox.py
- docker/Dockerfile.toolsandbox
- docker/toolsandbox_entrypoint.py
- examples/configs/toolsandbox.yaml
- tests/test_environments/test_toolsandbox.py (21 offline tests, all green)

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
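The URL normalization mentioned in the commit message above could look roughly like the sketch below. The function name `normalize_base_url` and the example URL are hypothetical, not taken from the PR; the only behavior assumed from the source is that a trailing `/chat/completions` is removed before the value is used as an OpenAI SDK `base_url`.

```python
def normalize_base_url(url: str) -> str:
    """Strip a trailing /chat/completions so the URL can serve as an SDK base_url.

    Hypothetical helper: the PR's actual implementation may differ.
    """
    suffix = "/chat/completions"
    trimmed = url.rstrip("/")  # tolerate a trailing slash
    if trimmed.endswith(suffix):
        trimmed = trimmed[: -len(suffix)]
    return trimmed


# Illustrative URL only; example.invalid is a reserved placeholder domain.
print(normalize_base_url("https://example.invalid/v1/chat/completions"))
```

URLs that do not end in `/chat/completions` pass through unchanged apart from the trailing-slash trim, so the helper is safe to apply unconditionally.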
Adds two runner modes to ToolSandboxEnvironment:
- apptainer: runs toolsandbox-nel.sif via `apptainer run` (SLURM with apptainer, no Docker needed)
- subprocess: runs toolsandbox_entrypoint.py directly as a Python subprocess (eval container has ToolSandbox pre-installed in a venv, zero nesting)

Also adds Dockerfile.toolsandbox-combined (ToolSandbox in isolated /opt/toolsandbox-venv alongside NEL Next) and a SLURM example config that uses the subprocess runner. 24 offline tests, all green.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
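To make the three runner modes concrete, here is a minimal sketch of how a command line might be chosen per runner. The function `build_command`, its parameters, the `--rm` flag choice, and the default interpreter name are assumptions for illustration; the PR's ToolSandboxEnvironment may assemble its commands differently (for example with environment variables, bind mounts, or the venv's own interpreter).

```python
def build_command(
    runner: str,
    image: str,
    python: str = "python",  # assumed interpreter; the PR uses a venv under /opt/toolsandbox-venv
    entrypoint: str = "docker/toolsandbox_entrypoint.py",
) -> list[str]:
    """Return an argv list for the given runner mode (hypothetical helper)."""
    if runner == "docker":
        # Run the benchmark in a throwaway container.
        return ["docker", "run", "--rm", image]
    if runner == "apptainer":
        # `image` is a path to a .sif file such as toolsandbox-nel.sif.
        return ["apptainer", "run", image]
    if runner == "subprocess":
        # No container nesting: call the entrypoint script directly.
        return [python, entrypoint]
    raise ValueError(f"unknown runner: {runner!r}")


print(build_command("apptainer", "toolsandbox-nel.sif"))
```

Returning an argv list rather than a shell string keeps the result suitable for `subprocess.run` without shell quoting concerns.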
OpenAIAPIAgent/User.__init__ reads OPENAI_API_KEY to build a client that NVIDIANIMAgent immediately replaces — without the placeholder it raises even though we never use the parent's client. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
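A minimal sketch of the placeholder workaround described above, assuming it is enough for `OPENAI_API_KEY` to be set to any non-empty string before the parent class constructor runs. The helper name and the placeholder value are hypothetical.

```python
import os
from typing import MutableMapping


def ensure_placeholder_key(env: MutableMapping[str, str] = os.environ) -> str:
    """Set OPENAI_API_KEY to a dummy value unless one is already present.

    Hypothetical helper: the key is never used for real requests because the
    parent-built client is replaced right after construction.
    """
    return env.setdefault("OPENAI_API_KEY", "placeholder-not-used")


# Demonstrate on a plain dict so the real process environment is untouched.
demo_env: dict[str, str] = {}
print(ensure_placeholder_key(demo_env))
```

Using `setdefault` means a real key supplied by the caller is left intact.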
openai==1.17.0 (pinned by ToolSandbox) passes proxies= to httpx which removed that argument in 0.28.0, causing TypeError at runtime. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Smoke test revealed the actual output format uses category_aggregated_results instead of a top-level similarity key. Overall score comes from ALL_CATEGORIES; per-category breakdown skips it to avoid duplication. Sample count reads from per_scenario_results length. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
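The parsing rules in this commit message can be sketched as follows. The key names `category_aggregated_results`, `ALL_CATEGORIES`, and `per_scenario_results` come from the message; everything else is an assumption, including the idea that each category entry is a dict holding a `similarity` value, the output field names, and the sample numbers, which are made up for illustration.

```python
from typing import Any


def parse_summary(summary: dict[str, Any]) -> dict[str, Any]:
    """Extract overall score, per-category scores, and sample count.

    Assumes each category maps to a dict with a "similarity" entry; the real
    result_summary.json layout may differ.
    """
    aggregated = summary.get("category_aggregated_results", {})
    overall = aggregated.get("ALL_CATEGORIES", {}).get("similarity")
    per_category = {
        name: stats.get("similarity")
        for name, stats in aggregated.items()
        if name != "ALL_CATEGORIES"  # skip the aggregate to avoid duplication
    }
    return {
        "score": overall,
        "per_category": per_category,
        "num_samples": len(summary.get("per_scenario_results", [])),
    }


# Made-up input; category name and numbers are illustrative only.
example = {
    "category_aggregated_results": {
        "ALL_CATEGORIES": {"similarity": 0.5},
        "EXAMPLE_CATEGORY": {"similarity": 0.75},
    },
    "per_scenario_results": [{}, {}, {}],
}
print(parse_summary(example))
```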
…configs Smoke test confirmed: inference-api.nvidia.com with azure/openai/gpt-4o runs 3/3 scenarios with similarity=0.952, no errors, no rate limits. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Summary
Integrates Apple's ToolSandbox benchmark. ToolSandbox evaluates stateful, multi-turn tool-use — the model must chain tool calls across a conversation with a simulated user, recovering from state errors (e.g. cellular off → detect → fix → retry). World state is tracked in SQLite and validated against expected milestones.
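The recovery pattern described above (cellular off, detect, fix, retry) can be illustrated with a toy model. This is not ToolSandbox code: `ToyWorld` and `send_with_recovery` are hypothetical stand-ins that keep state in a plain Python object rather than SQLite, and they only borrow the tool names for readability.

```python
class ToyWorld:
    """Tiny stand-in for stateful tools; not ToolSandbox's implementation."""

    def __init__(self) -> None:
        self.cellular_on = False
        self.sent: list[tuple[str, str]] = []

    def get_cellular_service_status(self) -> bool:
        return self.cellular_on

    def set_cellular_service_status(self, on: bool) -> None:
        self.cellular_on = on

    def send_message_with_phone_number(self, number: str, text: str) -> None:
        if not self.cellular_on:
            raise RuntimeError("cellular service is off")
        self.sent.append((number, text))


def send_with_recovery(world: ToyWorld, number: str, text: str) -> list[str]:
    """Try to send; on failure, inspect state, repair it, and retry once."""
    trace: list[str] = []
    try:
        world.send_message_with_phone_number(number, text)
        trace.append("send: ok")
    except RuntimeError:
        trace.append("send: failed")
        if not world.get_cellular_service_status():
            trace.append("status: off")
            world.set_cellular_service_status(True)
            trace.append("set: on")
            world.send_message_with_phone_number(number, text)
            trace.append("send: ok")
    return trace


world = ToyWorld()
print(send_with_recovery(world, "+10000000000", "hello"))
```

In the benchmark itself the model, not hand-written control flow, has to discover this sequence from tool errors; the sketch only shows the state dependency being exercised.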
What's in this PR
- src/nemo_evaluator/benchmarks/toolsandbox.py: ToolSandboxEnvironment with docker/apptainer/subprocess runners; parses result_summary.json
- docker/Dockerfile.toolsandbox
- docker/Dockerfile.toolsandbox-combined
- docker/toolsandbox_entrypoint.py
- examples/configs/toolsandbox.yaml
- examples/configs/toolsandbox_slurm.yaml
- tests/test_environments/test_toolsandbox.py

Smoke test results ✓
3 scenarios, azure/openai/gpt-4o on inference-api.nvidia.com: 3/3 scenarios ran with similarity=0.952, no errors, no rate limits.

Example conversation (send_message_with_contact_content_cellular_off):
1. search_contacts("Fredrik Thordendal") → phone number found
2. send_message_with_phone_number(...) → fails (cellular off)
3. get_cellular_service_status() → confirms off
4. set_cellular_service_status(on=true) → fixes state
5. send_message_with_phone_number(...) → succeeds

Runner modes
runner:
- docker (default)
- apptainer: image= points to a .sif/.sqsh path
- subprocess: Dockerfile.toolsandbox-combined

Usage
Notes
- httpx<0.28 is pinned: openai==1.17.0 (ToolSandbox's pin) passes the proxies= kwarg, which httpx removed in 0.28.0
- The entrypoint reuses RoleImplType.Gorilla/RoleImplType.GPT_4_o_2024_05_13, so no enum changes are needed