Trove baseline by mathuryash5 · Pull Request #1 · cmu-llab/ReaComp

mathuryash5 · 2026-04-30T17:41:45Z

No description provided.

Specifies the architecture for adapting TroVE's IMPORT mode to use native OpenAI tool calling with vLLM-served gpt-oss models. CREATE and SKIP remain text-based; reward selection, K-sampling, and trimming stay faithful to the paper. Includes telemetry plan, vLLM version requirements (>= v0.16.0 for PR #28729), defensive sanitizer for the open Harmony control-token leakage bug (PR #35906), and the smoke-run done criteria. Made-with: Cursor

Step-by-step plan for implementing the design spec (2026-04-25-trove-native-tool-calling-design.md). Eleven tasks covering infra patches, tools_api module, chat_with_tools, controller branch, prompts, CLI flags, vLLM launcher, analyzer script, deviations doc, and the 50-task PBEBench-Lite smoke run + report. Made-with: Cursor

- toolbox.trim default C=1.0 (matches original TroVE)
- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
- llm._call_openai falls back to message.reasoning_content when message.content is empty (gpt-oss Harmony channel split)
Made-with: Cursor

…esponse
- imported_callsites(solution, tools, names) -> set: AST-walks Solution code and returns names from the candidate set that are actually called. Handles bare Name and Attribute (toolbox.foo) callees.
- parse_response(text, task_family="default"): when task_family="pbebench" the parser does not fall back to the first python block when **Solution** is missing. Prevents CoT scratchpad from being promoted to the answer.
Made-with: Cursor
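The call-site detection described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `imported_callsites_sketch` is hypothetical, and the `tools` argument from the real signature is left out because the commit message does not say how it is used.

```python
import ast


def imported_callsites_sketch(solution: str, names: set) -> set:
    """Return the subset of `names` that the solution code actually calls."""
    called = set()
    for node in ast.walk(ast.parse(solution)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # bare callee: foo(...)
            callee = func.id
        elif isinstance(func, ast.Attribute):  # attribute callee: toolbox.foo(...)
            callee = func.attr
        else:
            continue
        if callee in names:
            called.add(callee)
    return called
```

A name that is only referenced, for example assigned to a variable without being called, is not counted.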

- Add task_family parameter to all build_* prompt builders.
- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating replace()-chain solutions and a find_replace_chain helper.
- Add build_import_with_tools_prompt for native tool calling: no **Toolbox** markdown block (toolbox is conveyed via tools=[...]).
- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example models the desired format directly).
Made-with: Cursor

- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox functions into OpenAI Chat Completions tool schemas. Infers parameter types from inspect.signature; functions with *args/**kwargs are silently excluded.
- dispatch_tool_call(toolbox, tool_call): runs the requested function in the sandbox executor, returns stdout truncated to 4096 chars or a JSON error string. Sanitizes Harmony control-token contamination in tool names (defensive vs. open vLLM PR #35906).
Made-with: Cursor
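A minimal sketch of the schema conversion for a single function is shown below. The helper name `function_to_tool_schema`, the annotation-to-JSON-type table, and the fallback to `"string"` for unannotated parameters are all assumptions made for illustration; the actual module may infer types differently.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string",
               bool: "boolean", list: "array", dict: "object"}


def function_to_tool_schema(fn):
    """Build a Chat Completions tool schema, or None for *args/**kwargs functions."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            return None  # variadic functions are excluded
        # Unknown or missing annotations fall back to "string" (assumption).
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }
```

A top-k wrapper would then apply this to the k most frequently used toolbox functions and drop the `None` results.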

executor.run_solution returns proc.stdout.strip(), not stderr. Rename the JSON error key from 'stderr' to 'stdout' so the field name matches what is actually being returned. Caught in code-quality review for Task 4. Made-with: Cursor

Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM: appends assistant message + tool result messages until the model returns no tool_calls or max_tool_iters is reached. Records each call as {name, args_preview, result_preview, ok} for downstream telemetry. Reuses the existing 3-attempt retry, debug logging, and token accounting. Anthropic backend raises NotImplementedError as a defensive guard; controllers branch on self.backend == "openai" before calling. Made-with: Cursor
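The loop described above might look roughly like the sketch below. It works on plain dict messages so it does not depend on any SDK; `complete` and `dispatch` are hypothetical callables standing in for the model call and the sandboxed tool dispatch, and the stop-reason strings and the 200-character preview length are assumptions. Retries, debug logging, and token accounting are omitted.

```python
import json


def chat_with_tools_sketch(complete, dispatch, messages, max_tool_iters=4):
    """Run a tool-calling loop.

    complete(messages) -> assistant message dict
    dispatch(name, args) -> (ok, result_str)
    Returns (final_text, call_records, stop_reason).
    """
    messages = list(messages)
    records = []
    for _ in range(max_tool_iters):
        reply = complete(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # The model answered without requesting a tool: done.
            return reply.get("content") or "", records, "no_tool_calls"
        for call in tool_calls:
            name = call["function"]["name"]
            raw_args = call["function"]["arguments"]
            try:
                ok, result = dispatch(name, json.loads(raw_args))
            except json.JSONDecodeError as exc:
                ok, result = False, json.dumps({"error": str(exc)})
            # Feed the tool result back to the model on the next turn.
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": result})
            records.append({"name": name, "args_preview": raw_args[:200],
                            "result_preview": result[:200], "ok": ok})
    return "", records, "max_tool_iters"
```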

- Add task_family and selection params to TroVEController.__init__.
- IMPORT branch dispatches to _generate_import_with_tools when toolbox is non-empty and backend is openai; otherwise falls back to legacy text-based IMPORT.
- _generate_import_with_tools builds K multi-turn trajectories via TroVELLMClient.chat_with_tools, parses **Solution** strictly for pbebench, and runs the result through the executor.
- _update_library credits frequency by unique tool_call.function.name for the native path; legacy path still credits parsed functions.
- _make_result emits won_mode, import_eligible, import_was_winner, tool_calls, tool_call_count, tools_called, actually_called, trove_stopped_reason as passive telemetry.
- _select_best honors selection="consistency" or "reward" (default).
Made-with: Cursor

Update the class-level Parameters block to:
- Reflect trim_C default of 1.0 (matches __init__).
- Document task_family, selection, max_tool_iters, tool_schema_topk.
- Note that base_url governs which backend is used and that native tool-calling IMPORT requires the openai backend.
Made-with: Cursor

- --trove-selection {reward,consistency} (default: reward).
- --trove-task-family {default,pbebench} (default: default).
Plumbed through to TroVEController; PBEBench runs should pass --trove-task-family pbebench to enable PBEBench-shaped few-shots and strict **Solution** parsing.
Made-with: Cursor

Add three flags required for OpenAI-compatible tool calling on gpt-oss served by vLLM >= v0.16.0:
- --enable-auto-tool-choice
- --tool-call-parser openai
- --reasoning-parser openai_gptoss
Without these the controller's chat_with_tools loop sees no tool_calls in the response and degrades to no-tool behavior.
Made-with: Cursor
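For illustration, a launcher could assemble the server command as below. The helper name `build_vllm_argv` and the default model id `openai/gpt-oss-20b` are assumptions; the three tool-calling flags are the ones named in this commit, and the port and tensor-parallel defaults mirror the 20b configuration mentioned elsewhere in this PR.

```python
def build_vllm_argv(model="openai/gpt-oss-20b", port=8000, tensor_parallel=1):
    """Assemble a `vllm serve` command line with native tool calling enabled."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel),
        # The three flags required for tool_calls to appear in responses.
        "--enable-auto-tool-choice",
        "--tool-call-parser", "openai",
        "--reasoning-parser", "openai_gptoss",
    ]
```

The resulting list can be passed to `subprocess.Popen` as-is, which avoids shell quoting issues.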

Reads a TroVE JSONL output and reports overall accuracy, final toolbox size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate, mean calls/task, success rate), and the top-10 most-called toolbox functions. Sanitizes Harmony control-token contamination in tool names when aggregating. Made-with: Cursor
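The tool-name aggregation could be sketched as follows. The truncate-at-`<|` rule follows the "<|-truncation sanitizer" wording used elsewhere in this PR; the function names, the record layout (a `tool_calls` list of dicts with a `name` key), and the contaminated example name are assumptions for illustration.

```python
from collections import Counter


def sanitize_tool_name(name: str) -> str:
    """Drop everything from the first Harmony control-token marker onward."""
    return name.split("<|", 1)[0].strip()


def top_tools(records, k=10):
    """Count sanitized tool names across task records and return the top k."""
    counts = Counter()
    for rec in records:
        for call in rec.get("tool_calls") or []:
            counts[sanitize_tool_name(call["name"])] += 1
    return counts.most_common(k)
```

Sanitizing before counting keeps a contaminated name such as `foo<|channel|>commentary` from being reported as a separate function.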

Document algorithmic deviations (native OpenAI tool calling for IMPORT, reward-based selection by default, PBEBench-shaped few-shots, strict **Solution** parsing for pbebench), faithful elements (3-mode generation, K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and infrastructural patches (JSONL checkpointing, reasoning_content fallback, 60s executor timeout, defensive <|-truncation sanitizer). Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the backend coverage caveat (smoke run is vLLM-served gpt-oss only). Made-with: Cursor
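The C*log_20(n) trimming rule can be sketched as below. The function name, the dict-of-counts representation, and the choice to keep functions whose usage count is at or above the threshold are assumptions; only the threshold formula and C=1.0 come from the text above.

```python
import math


def trim_toolbox(usage_counts, n_seen, C=1.0):
    """Keep functions whose usage count reaches C * log_20(n_seen)."""
    if n_seen < 1:
        return dict(usage_counts)  # nothing seen yet, nothing to trim
    threshold = C * math.log(n_seen, 20)
    return {name: count for name, count in usage_counts.items()
            if count >= threshold}
```

With n_seen = 400 the threshold is about 2, so a function used three times survives while one used once is dropped.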

The TroVE controller emits passive telemetry (won_mode, import_eligible, import_was_winner, tool_calls, tool_call_count, tools_called, actually_called, trove_stopped_reason, library_snapshot) on the in-memory result dict, but main._append_task_output was dropping all of it before the JSONL was written. scripts/analyze_trove_run.py would then read empty/missing fields and report misleading numbers (e.g. all won_mode as '?', 'No IMPORT-eligible tasks' on healthy runs). Pass these keys through verbatim when present. Keys are absent on non-TroVE runs, so other frameworks (ssl_bcr, regal, react_mem, etc.) are unaffected. Made-with: Cursor
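A minimal sketch of the pass-through is given below. The key list is copied from the commit message; the function name and the shape of the base record are assumptions and do not reflect the actual main._append_task_output signature.

```python
import json

TELEMETRY_KEYS = (
    "won_mode", "import_eligible", "import_was_winner", "tool_calls",
    "tool_call_count", "tools_called", "actually_called",
    "trove_stopped_reason", "library_snapshot",
)


def with_telemetry(base_record: dict, result: dict) -> str:
    """Return one JSONL line: base fields plus any telemetry keys present."""
    out = dict(base_record)
    for key in TELEMETRY_KEYS:
        if key in result:  # absent on non-TroVE runs, so nothing is added
            out[key] = result[key]
    return json.dumps(out)
```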

End-to-end Jupyter notebook for the PBEBench-Lite smoke run on RunPod: launches vLLM with the native tool-calling flags, polls /v1/models until ready, runs main.py with --framework trove --trove-task-family pbebench --trove-selection reward against the 50-task lite_pilot_tasks.jsonl split, then invokes scripts/analyze_trove_run.py and previews telemetry. Defaults to gpt-oss-20b on a single A100/H100; flip MODEL and TENSOR_PARALLEL for 120b. Made-with: Cursor
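The readiness poll against /v1/models might be written along these lines. This is a sketch using only the standard library; the function names, attempt count, and delay are assumptions, and the `probe` and `sleep` parameters exist only to make the loop easy to exercise without a live server.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet, or connection refused


def wait_until_ready(url, probe=http_ok, attempts=120, delay_s=5.0,
                     sleep=time.sleep):
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

In the notebook setting this would be called as `wait_until_ready("http://localhost:8000/v1/models")` before starting the run.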

We are only running the TroVE PBEBench smoke on gpt-oss-20b. Add a 20b-specific vLLM launcher (TP=1 + the three tool-calling flags), retarget scripts/run_trove_vllm.sh at 20b + the new TroVE flags (--trove-task-family pbebench, --trove-selection reward, --max-programs 5, lite_pilot_tasks.jsonl, port 8000), and simplify the runpod notebook to a 20b-only configuration. The 120b launcher remains in place for the other (non-TroVE) baselines that still use it. Made-with: Cursor

… to disk
Adds a tail_vllm_log() cell so the latest vllm_logs/vllm_*.log can be spot-checked during a long run, and tees the TroVE run cell's stdout into outputs/trove_pbebench_lite_smoke_<ts>.log so logs survive a disconnected browser session.
Made-with: Cursor

vLLM exposes gpt-oss text in message.reasoning when content is empty, so TroVE was parsing empty generations and producing blank solutions. Add a shared extractor and regression test for the OpenAI/vLLM response shape. Made-with: Cursor
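A shared extractor of this kind could look like the sketch below. The function name and the field order (content first, then reasoning, then reasoning_content) are assumptions; the commit messages in this PR mention both message.reasoning and message.reasoning_content as fallbacks without stating a priority between them.

```python
from types import SimpleNamespace


def extract_message_text(message) -> str:
    """Return the first non-empty text field found on a chat message object."""
    for field in ("content", "reasoning", "reasoning_content"):
        value = getattr(message, field, None)
        if isinstance(value, str) and value.strip():
            return value
    return ""


# Stand-in for a vLLM response where content is empty and the text
# arrives in the reasoning field instead.
example = SimpleNamespace(content="", reasoning="**Solution**\nprint([])")
text = extract_message_text(example)
```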

PBEBench rewards parse stdout as a list of replace() call strings, but the TroVE few-shots were demonstrating transformed output strings. Update PBEBench CREATE, SKIP, and IMPORT-with-tools examples and add prompt regression tests. Made-with: Cursor

Reward-based PBEBench selection was choosing tiny direct solutions over equally correct CREATE/IMPORT candidates that populate or use the toolbox. Prefer reusable functions and tool calls on reward ties, then fall back to smallest AST, and make PBEBench CREATE prompts require a helper in **Tools**. Made-with: Cursor
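The tie-break order described above (reward, then reusable functions or tool calls, then smallest AST) can be expressed as a single sort key. The candidate dict layout and function names below are assumptions for illustration, not the controller's actual data structures.

```python
import ast


def ast_size(code: str) -> int:
    """Number of AST nodes in a piece of Python source."""
    return sum(1 for _ in ast.walk(ast.parse(code)))


def select_best_sketch(candidates):
    """Pick by reward, then toolbox use, then smallest solution AST."""
    def key(cand):
        reusable = bool(cand.get("tools") or cand.get("tool_calls"))
        return (cand["reward"], reusable, -ast_size(cand["solution"]))
    return max(candidates, key=key)
```

Because the reward comes first in the key, a candidate that populates or uses the toolbox only wins when its reward ties with the best direct solution.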

Made-with: Cursor

mathuryash5 added 24 commits April 25, 2026 17:19

fix(trove): encourage reusable PBEBench helpers

94bc0d1

Made-with: Cursor

fix(trove): show PBEBench helper signatures in CREATE prompt

4352d31

Made-with: Cursor

chore(trove): remove superpowers planning docs

51edc0c

Made-with: Cursor

mathuryash5 wants to merge 24 commits into main from trove_baseline
