Trove baseline #1
Open
mathuryash5 wants to merge 24 commits into
Specifies the architecture for adapting TroVE's IMPORT mode to use native OpenAI tool calling with vLLM-served gpt-oss models. CREATE and SKIP remain text-based; reward selection, K-sampling, and trimming stay faithful to the paper. Includes telemetry plan, vLLM version requirements (>= v0.16.0 for PR #28729), defensive sanitizer for the open Harmony control-token leakage bug (PR #35906), and the smoke-run done criteria.
Made-with: Cursor
Step-by-step plan for implementing the design spec (2026-04-25-trove-native-tool-calling-design.md). Eleven tasks covering infra patches, tools_api module, chat_with_tools, controller branch, prompts, CLI flags, vLLM launcher, analyzer script, deviations doc, and the 50-task PBEBench-Lite smoke run + report.
Made-with: Cursor
- toolbox.trim default C=1.0 (matches original TroVE)
- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
- llm._call_openai falls back to message.reasoning_content when message.content is empty (gpt-oss Harmony channel split)
Made-with: Cursor
…esponse
- imported_callsites(solution, tools, names) -> set: AST-walks Solution code and returns names from the candidate set that are actually called. Handles bare Name and Attribute (toolbox.foo) callees.
- parse_response(text, task_family="default"): when task_family="pbebench" the parser does not fall back to the first python block when **Solution** is missing. Prevents CoT scratchpad from being promoted to the answer.
Made-with: Cursor
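The call-site walk described above can be sketched as follows. This is a minimal illustration, not the PR's code: it takes a simplified two-argument signature (the commit's version also accepts `tools`), and it assumes that a bare `foo(...)` and an attribute call such as `toolbox.foo(...)` should both count as a call to `foo`.

```python
import ast


def imported_callsites(solution: str, names: set[str]) -> set[str]:
    """Return the subset of `names` that the solution code actually calls."""
    called = set()
    for node in ast.walk(ast.parse(solution)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # foo(...)
            callee = func.id
        elif isinstance(func, ast.Attribute):  # toolbox.foo(...)
            callee = func.attr
        else:                                  # e.g. (lambda: 0)() or fns[0]()
            continue
        if callee in names:
            called.add(callee)
    return called
```

A name that is only mentioned (assigned, passed around, or imported) without being called is not credited, which is the point of walking `ast.Call` nodes rather than searching the text.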
- Add task_family parameter to all build_* prompt builders.
- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating replace()-chain solutions and a find_replace_chain helper.
- Add build_import_with_tools_prompt for native tool calling: no **Toolbox** markdown block (toolbox is conveyed via tools=[...]).
- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example models the desired format directly).
Made-with: Cursor
- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox functions into OpenAI Chat Completions tool schemas. Infers parameter types from inspect.signature; functions with *args/**kwargs are silently excluded.
- dispatch_tool_call(toolbox, tool_call): runs the requested function in the sandbox executor, returns stdout truncated to 4096 chars or a JSON error string. Sanitizes Harmony control-token contamination in tool names (defensive vs. open vLLM PR #35906).
Made-with: Cursor
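A rough sketch of the schema conversion and the name sanitizer, under several assumptions that are not confirmed by the commit: the toolbox is modelled as a plain `dict` of name to callable already ordered by rank, unannotated parameters default to `"string"`, and the sanitizer simply truncates at the first `<|`. `function_to_tool` and `sanitize_tool_name` are hypothetical helper names.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema types.
_JSON_TYPES = {int: "integer", float: "number", bool: "boolean", str: "string"}


def function_to_tool(name, fn):
    """Build one Chat Completions tool schema, or None if fn is variadic."""
    properties, required = {}, []
    for param in inspect.signature(fn).parameters.values():
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            return None  # *args / **kwargs cannot be described; skip the function
        properties[param.name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(param.name)
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object",
                           "properties": properties,
                           "required": required},
        },
    }


def toolbox_to_openai_tools(toolbox, topk=10):
    """Convert up to `topk` describable functions, in the toolbox's own order."""
    schemas = (function_to_tool(name, fn) for name, fn in toolbox.items())
    return [s for s in schemas if s is not None][:topk]


def sanitize_tool_name(name):
    """Drop anything from the first leaked control token ('<|...') onward."""
    return name.split("<|", 1)[0].strip()
```

Whether the top-k cut is applied before or after variadic functions are excluded is not stated in the commit; this sketch excludes first.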
executor.run_solution returns proc.stdout.strip(), not stderr. Rename the JSON error key from 'stderr' to 'stdout' so the field name matches what is actually being returned. Caught in code-quality review for Task 4.
Made-with: Cursor
Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM:
appends assistant message + tool result messages until the model returns
no tool_calls or max_tool_iters is reached. Records each call as
{name, args_preview, result_preview, ok} for downstream telemetry.
Reuses the existing 3-attempt retry, debug logging, and token accounting.
Anthropic backend raises NotImplementedError as a defensive guard;
controllers branch on self.backend == "openai" before calling.
Made-with: Cursor
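The loop described in this commit might look roughly like the sketch below. It is not the PR's implementation: `complete` and `run_tool` are hypothetical injected callables (standing in for the OpenAI client call and the sandboxed dispatcher), messages are plain dicts in Chat Completions shape, the stop-reason strings are invented, and the retry, logging, and token accounting mentioned above are omitted.

```python
import json


def chat_with_tools(complete, run_tool, messages, tools, max_tool_iters=4):
    """Run a multi-turn tool loop; return (final_text, call_records, stop_reason)."""
    messages = list(messages)  # do not mutate the caller's history
    records = []
    for _ in range(max_tool_iters):
        reply = complete(messages, tools)  # assistant message as a dict
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content") or "", records, "final_answer"
        for call in calls:
            name = call["function"]["name"]
            raw_args = call["function"]["arguments"]  # JSON-encoded string
            try:
                result = str(run_tool(name, json.loads(raw_args)))
                ok = True
            except Exception as exc:  # bad JSON or tool failure
                result = json.dumps({"error": str(exc)})
                ok = False
            # Each tool result is fed back keyed by the originating call id.
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
            records.append({"name": name,
                            "args_preview": raw_args[:200],
                            "result_preview": result[:200],
                            "ok": ok})
    return "", records, "max_tool_iters"
```

Injecting `complete` keeps the loop testable without a running vLLM server: a scripted fake can return one tool call followed by a final answer.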
- Add task_family and selection params to TroVEController.__init__.
- IMPORT branch dispatches to _generate_import_with_tools when toolbox is non-empty and backend is openai; otherwise falls back to legacy text-based IMPORT.
- _generate_import_with_tools builds K multi-turn trajectories via TroVELLMClient.chat_with_tools, parses **Solution** strictly for pbebench, and runs the result through the executor.
- _update_library credits frequency by unique tool_call.function.name for the native path; legacy path still credits parsed functions.
- _make_result emits won_mode, import_eligible, import_was_winner, tool_calls, tool_call_count, tools_called, actually_called, trove_stopped_reason as passive telemetry.
- _select_best honors selection="consistency" or "reward" (default).
Made-with: Cursor
Update the class-level Parameters block to:
- Reflect trim_C default of 1.0 (matches __init__).
- Document task_family, selection, max_tool_iters, tool_schema_topk.
- Note that base_url governs which backend is used and that native tool-calling IMPORT requires the openai backend.
Made-with: Cursor
- --trove-selection {reward,consistency} (default: reward).
- --trove-task-family {default,pbebench} (default: default). Plumbed
through to TroVEController; PBEBench runs should pass --trove-task-family
pbebench to enable PBEBench-shaped few-shots and strict **Solution**
parsing.
Made-with: Cursor
Add three flags required for OpenAI-compatible tool calling on gpt-oss served by vLLM >= v0.16.0:
  --enable-auto-tool-choice
  --tool-call-parser openai
  --reasoning-parser openai_gptoss
Without these the controller's chat_with_tools loop sees no tool_calls in the response and degrades to no-tool behavior.
Made-with: Cursor
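For reference, a launch command carrying these three flags could be assembled as below. The helper name `build_vllm_command`, the Hugging Face model id `openai/gpt-oss-20b`, and the default port and tensor-parallel values are assumptions for illustration; the actual launcher script in this PR may differ.

```python
def build_vllm_command(model="openai/gpt-oss-20b", port=8000, tensor_parallel=1):
    """Return an argv list for `vllm serve` with native tool calling enabled."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel),
        # The three flags named in the commit message:
        "--enable-auto-tool-choice",
        "--tool-call-parser", "openai",
        "--reasoning-parser", "openai_gptoss",
    ]
```

The list can be handed to `subprocess.Popen` as-is, which avoids shell quoting issues.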
Reads a TroVE JSONL output and reports overall accuracy, final toolbox size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate, mean calls/task, success rate), and the top-10 most-called toolbox functions. Sanitizes Harmony control-token contamination in tool names when aggregating.
Made-with: Cursor
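A condensed sketch of the aggregation, assuming each JSONL row carries `won_mode`, `import_eligible`, and a `tool_calls` list of `{name, ok, ...}` records as described in the earlier commits. Accuracy and toolbox size are left out because their field names are not given here, and `summarize` is a hypothetical name rather than the script's actual entry point.

```python
import json
from collections import Counter


def summarize(jsonl_lines):
    """Aggregate per-mode wins and IMPORT tool-use statistics from JSONL rows."""
    wins, tool_counts = Counter(), Counter()
    eligible = used = total_calls = ok_calls = 0
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        wins[row.get("won_mode", "?")] += 1
        calls = row.get("tool_calls") or []
        if row.get("import_eligible"):
            eligible += 1
            used += 1 if calls else 0
            total_calls += len(calls)
        for call in calls:
            ok_calls += 1 if call.get("ok") else 0
            # Strip leaked control tokens so variants aggregate under one name.
            tool_counts[call["name"].split("<|", 1)[0].strip()] += 1
    n_calls = sum(tool_counts.values())
    return {
        "wins": dict(wins),
        "at_least_one_call_rate": used / eligible if eligible else None,
        "mean_calls_per_task": total_calls / eligible if eligible else None,
        "call_success_rate": ok_calls / n_calls if n_calls else None,
        "top_tools": tool_counts.most_common(10),
    }
```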
Document algorithmic deviations (native OpenAI tool calling for IMPORT, reward-based selection by default, PBEBench-shaped few-shots, strict **Solution** parsing for pbebench), faithful elements (3-mode generation, K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and infrastructural patches (JSONL checkpointing, reasoning_content fallback, 60s executor timeout, defensive <|-truncation sanitizer). Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the backend coverage caveat (smoke run is vLLM-served gpt-oss only).
Made-with: Cursor
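To make the C*log_20(n) trimming rule concrete: with C=1.0 and n=400 the threshold is log_20(400) = 2. The sketch below assumes n is the number of tasks seen so far and that a function is kept when its usage count is at least the threshold; both the meaning of n and the inclusive comparison are assumptions here, and the function names are hypothetical.

```python
import math


def trim_threshold(n_seen, C=1.0):
    """Usage-count threshold C * log_20(n_seen)."""
    if n_seen < 1:
        raise ValueError("n_seen must be at least 1")
    return C * math.log(n_seen, 20)


def trim_toolbox(usage_counts, n_seen, C=1.0):
    """Keep only functions whose usage count reaches the threshold."""
    threshold = trim_threshold(n_seen, C)
    return {name: count for name, count in usage_counts.items()
            if count >= threshold}
```

Because the threshold grows logarithmically, a function used a handful of times survives for a long while, whereas a never-used function is dropped as soon as n exceeds 1.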
The TroVE controller emits passive telemetry (won_mode, import_eligible, import_was_winner, tool_calls, tool_call_count, tools_called, actually_called, trove_stopped_reason, library_snapshot) on the in-memory result dict, but main._append_task_output was dropping all of it before the JSONL was written. scripts/analyze_trove_run.py would then read empty/missing fields and report misleading numbers (e.g. all won_mode as '?', 'No IMPORT-eligible tasks' on healthy runs). Pass these keys through verbatim when present. Keys are absent on non-TroVE runs, so other frameworks (ssl_bcr, regal, react_mem, etc.) are unaffected.
Made-with: Cursor
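The pass-through amounts to copying a fixed set of keys only when they exist. A minimal sketch, with `with_trove_telemetry` as a hypothetical helper name rather than the actual change to `main._append_task_output`:

```python
TROVE_TELEMETRY_KEYS = (
    "won_mode", "import_eligible", "import_was_winner", "tool_calls",
    "tool_call_count", "tools_called", "actually_called",
    "trove_stopped_reason", "library_snapshot",
)


def with_trove_telemetry(record, result):
    """Return a copy of `record` plus any TroVE telemetry keys found in `result`."""
    out = dict(record)
    for key in TROVE_TELEMETRY_KEYS:
        if key in result:  # absent on non-TroVE runs, so nothing is added
            out[key] = result[key]
    return out
```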
End-to-end Jupyter notebook for the PBEBench-Lite smoke run on RunPod: launches vLLM with the native tool-calling flags, polls /v1/models until ready, runs main.py with --framework trove --trove-task-family pbebench --trove-selection reward against the 50-task lite_pilot_tasks.jsonl split, then invokes scripts/analyze_trove_run.py and previews telemetry. Defaults to gpt-oss-20b on a single A100/H100; flip MODEL and TENSOR_PARALLEL for 120b.
Made-with: Cursor
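The readiness poll against /v1/models can be sketched with the standard library alone. The function names, timeouts, and the injectable `probe`, `sleep`, and `clock` parameters are illustrative assumptions; the notebook's own cell may be written differently.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ready(base_url):
    """Return True once GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet (connection refused, timeout, ...)


def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` every `interval_s` seconds until it succeeds or time runs out."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

A typical call would be `wait_until_ready(lambda: models_endpoint_ready("http://localhost:8000"))`, assuming the server listens on port 8000.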
We are only running the TroVE PBEBench smoke on gpt-oss-20b. Add a 20b-specific vLLM launcher (TP=1 + the three tool-calling flags), retarget scripts/run_trove_vllm.sh at 20b + the new TroVE flags (--trove-task-family pbebench, --trove-selection reward, --max-programs 5, lite_pilot_tasks.jsonl, port 8000), and simplify the runpod notebook to a 20b-only configuration. The 120b launcher remains in place for the other (non-TroVE) baselines that still use it.
Made-with: Cursor
… to disk
Adds a tail_vllm_log() cell so the latest vllm_logs/vllm_*.log can be spot-checked during a long run, and tees the TroVE run cell's stdout into outputs/trove_pbebench_lite_smoke_<ts>.log so logs survive a disconnected browser session.
Made-with: Cursor
vLLM exposes gpt-oss text in message.reasoning when content is empty, so TroVE was parsing empty generations and producing blank solutions. Add a shared extractor and regression test for the OpenAI/vLLM response shape.
Made-with: Cursor
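A shared extractor of this kind could be as small as the sketch below. The fallback order (`content`, then `reasoning_content`, then `reasoning`) and the name `extract_message_text` are assumptions; the commits only state that both reasoning fields are consulted when `content` is empty.

```python
from types import SimpleNamespace


def extract_message_text(message):
    """Return the first non-blank text field of a chat message, else ''."""
    for field in ("content", "reasoning_content", "reasoning"):
        if isinstance(message, dict):
            value = message.get(field)
        else:
            value = getattr(message, field, None)
        if isinstance(value, str) and value.strip():
            return value
    return ""


# Usage: a vLLM-style message whose text landed in `reasoning`.
msg = SimpleNamespace(content="", reasoning="**Solution** ...")
text = extract_message_text(msg)
```

Accepting both dicts and attribute-style objects lets the same helper serve raw JSON responses and SDK message objects.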
PBEBench rewards parse stdout as a list of replace() call strings, but the TroVE few-shots were demonstrating transformed output strings. Update PBEBench CREATE, SKIP, and IMPORT-with-tools examples and add prompt regression tests.
Made-with: Cursor
Reward-based PBEBench selection was choosing tiny direct solutions over equally correct CREATE/IMPORT candidates that populate or use the toolbox. Prefer reusable functions and tool calls on reward ties, then fall back to smallest AST, and make PBEBench CREATE prompts require a helper in **Tools**.
Made-with: Cursor
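The tie-break order (highest reward, then toolbox use, then smallest AST) can be expressed as a single sort key. In this sketch the candidate fields `reward`, `new_tools`, `tool_call_count`, and `solution` are hypothetical names, and AST size is taken as the node count from `ast.walk`, which may differ from the controller's own measure.

```python
import ast


def ast_size(code):
    """Number of AST nodes in `code`."""
    return sum(1 for _ in ast.walk(ast.parse(code)))


def select_best(candidates):
    """Pick by reward, then prefer toolbox use, then the smallest AST."""
    def key(cand):
        uses_toolbox = bool(cand.get("new_tools")) or cand.get("tool_call_count", 0) > 0
        return (-cand["reward"],            # higher reward first
                0 if uses_toolbox else 1,   # reusable helper or tool call wins ties
                ast_size(cand["solution"])) # then the smaller program
    return min(candidates, key=key)
```

With this key a correct CREATE candidate that defines a helper beats an equally rewarded one-line direct solution, while a lower-reward candidate never wins regardless of how many tools it used.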
No description provided.