
Trove baseline #1

Open
mathuryash5 wants to merge 24 commits into main from trove_baseline

Conversation

@mathuryash5
Collaborator

No description provided.

Specifies the architecture for adapting TroVE's IMPORT mode to use native
OpenAI tool calling with vLLM-served gpt-oss models. CREATE and SKIP
remain text-based; reward selection, K-sampling, and trimming stay
faithful to the paper. Includes telemetry plan, vLLM version requirements
(>= v0.16.0 for PR #28729), defensive sanitizer for the open Harmony
control-token leakage bug (PR #35906), and the smoke-run done criteria.

Made-with: Cursor
Step-by-step plan for implementing the design spec
(2026-04-25-trove-native-tool-calling-design.md). Eleven tasks covering
infra patches, tools_api module, chat_with_tools, controller branch,
prompts, CLI flags, vLLM launcher, analyzer script, deviations doc,
and the 50-task PBEBench-Lite smoke run + report.

Made-with: Cursor
- toolbox.trim default C=1.0 (matches original TroVE)
- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
- llm._call_openai falls back to message.reasoning_content when
  message.content is empty (gpt-oss Harmony channel split)

Made-with: Cursor
…esponse

- imported_callsites(solution, tools, names) -> set: AST-walks Solution
  code and returns names from the candidate set that are actually called.
  Handles bare Name and Attribute (toolbox.foo) callees.
- parse_response(text, task_family="default"): when task_family="pbebench"
  the parser does not fall back to the first python block when **Solution**
  is missing. Prevents CoT scratchpad from being promoted to the answer.

Made-with: Cursor
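
The call-site check described in the commit above can be sketched as follows. This is a minimal illustration rather than the PR's code: it drops the `tools` argument from the stated signature and assumes `names` is a plain set of candidate function names.

```python
import ast


def imported_callsites(solution: str, names: set) -> set:
    """Return the subset of `names` that the solution code actually calls.

    Simplified, hypothetical signature: the commit describes
    imported_callsites(solution, tools, names); `tools` is omitted here.
    """
    called = set()
    for node in ast.walk(ast.parse(solution)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # bare call: foo(...)
            callee = func.id
        elif isinstance(func, ast.Attribute):  # attribute call: toolbox.foo(...)
            callee = func.attr
        else:
            continue
        if callee in names:
            called.add(callee)
    return called
```

A name that is only referenced, and never called, is not counted. `ast.parse` raises `SyntaxError` on malformed code, so a caller would likely want to catch that.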
- Add task_family parameter to all build_* prompt builders.
- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating
  replace()-chain solutions and a find_replace_chain helper.
- Add build_import_with_tools_prompt for native tool calling: no
  **Toolbox** markdown block (toolbox is conveyed via tools=[...]).
- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example
  models the desired format directly).

Made-with: Cursor
- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox
  functions into OpenAI Chat Completions tool schemas. Infers parameter
  types from inspect.signature; functions with *args/**kwargs are
  silently excluded.
- dispatch_tool_call(toolbox, tool_call): runs the requested function
  in the sandbox executor, returns stdout truncated to 4096 chars or
  a JSON error string. Sanitizes Harmony control-token contamination
  in tool names (defensive vs. open vLLM PR #35906).

Made-with: Cursor
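
One way the schema conversion above might look is sketched here. The toolbox is assumed to be a list of plain Python callables already sorted by rank, and the type mapping and `"string"` fallback are illustrative choices, not details taken from the PR.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema types.
_JSON_TYPES = {
    int: "integer",
    float: "number",
    bool: "boolean",
    str: "string",
    list: "array",
    dict: "object",
}


def function_to_tool_schema(fn):
    """Build one Chat Completions tool schema, or None for variadic functions."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            return None  # *args / **kwargs functions are excluded
        # Unannotated or unknown types fall back to "string" (an assumption).
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def toolbox_to_openai_tools(functions, topk=10):
    """Convert the first `topk` functions, skipping any that are excluded."""
    schemas = (function_to_tool_schema(fn) for fn in functions[:topk])
    return [schema for schema in schemas if schema is not None]
```

The resulting list is the shape that the `tools=[...]` argument of an OpenAI-compatible chat completion request expects.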
executor.run_solution returns proc.stdout.strip(), not stderr. Rename the
JSON error key from 'stderr' to 'stdout' so the field name matches what is
actually being returned. Caught in code-quality review for Task 4.

Made-with: Cursor
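
A rough sketch of the tool-result shape this rename implies is given below. Only the 4096-character limit and the `stdout` key come from the commit messages; the `ok` flag, the `error` key, and its message are hypothetical.

```python
import json

MAX_TOOL_OUTPUT_CHARS = 4096  # truncation limit stated in the commit message


def format_tool_result(stdout: str, ok: bool) -> str:
    """Return truncated stdout on success, or a JSON error string on failure.

    The `ok` flag and the "error" key are assumptions made for illustration.
    """
    truncated = stdout[:MAX_TOOL_OUTPUT_CHARS]
    if ok:
        return truncated
    # The captured text comes from stdout, so the key is named accordingly.
    return json.dumps({"error": "tool execution failed", "stdout": truncated})
```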
Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM:
appends assistant message + tool result messages until the model returns
no tool_calls or max_tool_iters is reached. Records each call as
{name, args_preview, result_preview, ok} for downstream telemetry.
Reuses the existing 3-attempt retry, debug logging, and token accounting.

Anthropic backend raises NotImplementedError as a defensive guard;
controllers branch on self.backend == "openai" before calling.

Made-with: Cursor
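
The multi-turn loop described above might be structured roughly like this. The sketch is backend-agnostic: `complete` and `dispatch` are hypothetical injected callables standing in for the OpenAI client call and the sandboxed tool dispatcher, messages are plain dicts, and the behaviour when `max_tool_iters` is exhausted is an assumption.

```python
import json


def chat_with_tools(complete, dispatch, messages, tools, max_tool_iters=4):
    """Run a tool-calling conversation until the model stops requesting tools.

    complete(messages, tools) -> assistant message dict (hypothetical helper)
    dispatch(name, args)      -> tool output (hypothetical helper)
    Returns (final_text, call_records, stopped_reason).
    """
    messages = list(messages)  # do not mutate the caller's history
    records = []
    for _ in range(max_tool_iters):
        reply = complete(messages, tools)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content") or "", records, "no_tool_calls"
        for call in tool_calls:
            name = call["function"]["name"]
            raw_args = call["function"]["arguments"]  # JSON-encoded string
            try:
                result, ok = str(dispatch(name, json.loads(raw_args))), True
            except Exception as exc:
                result, ok = json.dumps({"error": str(exc)}), False
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
            records.append({
                "name": name,
                "args_preview": raw_args[:200],
                "result_preview": result[:200],
                "ok": ok,
            })
    # Assumed behaviour: give up with empty text once the budget is spent.
    return "", records, "max_tool_iters"
```

Injecting `complete` keeps the loop testable without a running server; the retry, debug logging, and token accounting mentioned in the commit are left out of this sketch.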
- Add task_family and selection params to TroVEController.__init__.
- IMPORT branch dispatches to _generate_import_with_tools when toolbox
  is non-empty and backend is openai; otherwise falls back to legacy
  text-based IMPORT.
- _generate_import_with_tools builds K multi-turn trajectories via
  TroVELLMClient.chat_with_tools, parses **Solution** strictly for
  pbebench, and runs the result through the executor.
- _update_library credits frequency by unique tool_call.function.name
  for the native path; legacy path still credits parsed functions.
- _make_result emits won_mode, import_eligible, import_was_winner,
  tool_calls, tool_call_count, tools_called, actually_called,
  trove_stopped_reason as passive telemetry.
- _select_best honors selection="consistency" or "reward" (default).

Made-with: Cursor
Update the class-level Parameters block to:
- Reflect trim_C default of 1.0 (matches __init__).
- Document task_family, selection, max_tool_iters, tool_schema_topk.
- Note that base_url governs which backend is used and that native
  tool-calling IMPORT requires the openai backend.

Made-with: Cursor
- --trove-selection {reward,consistency} (default: reward).
- --trove-task-family {default,pbebench} (default: default). Plumbed
  through to TroVEController; PBEBench runs should pass --trove-task-family
  pbebench to enable PBEBench-shaped few-shots and strict **Solution**
  parsing.

Made-with: Cursor
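
For reference, the two flags could be declared with `argparse` along these lines. The flag names, choices, and defaults are taken from the commit message; the surrounding `build_parser` function is a hypothetical stand-in for however `main.py` actually builds its parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser containing only the two TroVE flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--trove-selection",
        choices=["reward", "consistency"],
        default="reward",
    )
    parser.add_argument(
        "--trove-task-family",
        choices=["default", "pbebench"],
        default="default",
    )
    return parser
```

`argparse` exposes these as `args.trove_selection` and `args.trove_task_family`, which would then be passed on to `TroVEController`.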
Add three flags required for OpenAI-compatible tool calling on gpt-oss
served by vLLM >= v0.16.0:
  --enable-auto-tool-choice
  --tool-call-parser openai
  --reasoning-parser openai_gptoss

Without these the controller's chat_with_tools loop sees no tool_calls
in the response and degrades to no-tool behavior.

Made-with: Cursor
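
As an illustration, a launcher could assemble the server command like this. The three tool-calling flags are the ones listed above; the helper name, the model identifier in the usage note, and the port and tensor-parallel defaults are assumptions.

```python
def vllm_serve_command(model: str, port: int = 8000, tensor_parallel: int = 1) -> list:
    """Build an argv list for `vllm serve` with native tool calling enabled."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel),
        # The three flags named in the commit message:
        "--enable-auto-tool-choice",
        "--tool-call-parser", "openai",
        "--reasoning-parser", "openai_gptoss",
    ]
```

The list can be handed to `subprocess.Popen`, for example `subprocess.Popen(vllm_serve_command("openai/gpt-oss-20b"))`, where the model identifier is assumed rather than taken from the PR.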
Reads a TroVE JSONL output and reports overall accuracy, final toolbox
size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate,
mean calls/task, success rate), and the top-10 most-called toolbox
functions. Sanitizes Harmony control-token contamination in tool names
when aggregating.

Made-with: Cursor
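
A condensed sketch of this kind of aggregation is shown below. The field names (`won_mode`, `import_eligible`, `tool_call_count`, `tools_called`) come from the telemetry listed elsewhere in this PR, while their value types, the `<|` truncation rule, and the `summarize` function itself are assumptions, not the contents of `scripts/analyze_trove_run.py`.

```python
import json
from collections import Counter


def sanitize_tool_name(name: str) -> str:
    """Drop anything from the first '<|' onward (assumed sanitizer rule)."""
    return name.split("<|", 1)[0].strip()


def summarize(lines):
    """Aggregate per-mode wins and IMPORT tool-use stats from JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    wins = Counter(r.get("won_mode", "?") for r in records)
    eligible = [r for r in records if r.get("import_eligible")]
    calls = [r.get("tool_call_count", 0) for r in eligible]
    tools = Counter(
        sanitize_tool_name(name)
        for r in eligible
        for name in (r.get("tools_called") or [])
    )
    n = len(eligible)
    return {
        "wins": dict(wins),
        "import_call_rate": sum(c >= 1 for c in calls) / n if n else None,
        "mean_calls": sum(calls) / n if n else None,
        "top_tools": tools.most_common(10),
    }
```

Sanitizing before counting means a contaminated name such as `chain<|channel|>commentary` is credited to `chain` instead of appearing as a separate function.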
Document algorithmic deviations (native OpenAI tool calling for IMPORT,
reward-based selection by default, PBEBench-shaped few-shots, strict
**Solution** parsing for pbebench), faithful elements (3-mode generation,
K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and
infrastructural patches (JSONL checkpointing, reasoning_content
fallback, 60s executor timeout, defensive <|-truncation sanitizer).

Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the
backend coverage caveat (smoke run is vLLM-served gpt-oss only).

Made-with: Cursor
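
The `C*log_20(n)` trimming rule can be illustrated with a short sketch. It assumes that `n` is the number of tasks processed so far and that a function survives when its usage count is at least the threshold; neither detail is spelled out in the commit message.

```python
import math


def trim_threshold(n_seen: int, C: float = 1.0) -> float:
    """Usage threshold C * log_20(n); n_seen must be at least 1."""
    return C * math.log(n_seen, 20)


def trim_toolbox(usage_counts: dict, n_seen: int, C: float = 1.0) -> dict:
    """Keep functions whose usage count meets the threshold (assumed `>=`)."""
    threshold = trim_threshold(n_seen, C)
    return {name: count for name, count in usage_counts.items() if count >= threshold}
```

With `C=1.0`, the threshold is 1 after 20 tasks and roughly 2 after 400, so the bar for keeping a function rises slowly as more tasks are seen.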
The TroVE controller emits passive telemetry (won_mode, import_eligible,
import_was_winner, tool_calls, tool_call_count, tools_called,
actually_called, trove_stopped_reason, library_snapshot) on the in-memory
result dict, but main._append_task_output was dropping all of it before
the JSONL was written. scripts/analyze_trove_run.py would then read
empty/missing fields and report misleading numbers (e.g. all won_mode
as '?', 'No IMPORT-eligible tasks' on healthy runs).

Pass these keys through verbatim when present. Keys are absent on
non-TroVE runs, so other frameworks (ssl_bcr, regal, react_mem, etc.)
are unaffected.

Made-with: Cursor
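
The pass-through can be pictured as a small key filter. The key list mirrors the telemetry named in the commit message; the `with_telemetry` helper and the shape of the output row are hypothetical and do not reproduce `main._append_task_output`.

```python
TELEMETRY_KEYS = (
    "won_mode", "import_eligible", "import_was_winner", "tool_calls",
    "tool_call_count", "tools_called", "actually_called",
    "trove_stopped_reason", "library_snapshot",
)


def with_telemetry(row: dict, result: dict) -> dict:
    """Copy telemetry keys from `result` into the output row when present."""
    out = dict(row)
    out.update({key: result[key] for key in TELEMETRY_KEYS if key in result})
    return out
```

Because absent keys are skipped, results from frameworks that emit no TroVE telemetry produce exactly the same row as before.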
End-to-end Jupyter notebook for the PBEBench-Lite smoke run on RunPod:
launches vLLM with the native tool-calling flags, polls /v1/models
until ready, runs main.py with --framework trove
--trove-task-family pbebench --trove-selection reward against the
50-task lite_pilot_tasks.jsonl split, then invokes
scripts/analyze_trove_run.py and previews telemetry. Defaults to
gpt-oss-20b on a single A100/H100; flip MODEL and TENSOR_PARALLEL
for 120b.

Made-with: Cursor
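
The readiness poll against `/v1/models` could look something like the sketch below. The timeout, interval, and injectable `probe` parameter are illustrative choices, not values from the notebook.

```python
import time
import urllib.error
import urllib.request


def _http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet, or connection refused


def wait_for_server(base_url, timeout_s=600.0, interval_s=5.0, probe=None):
    """Poll <base_url>/v1/models until it responds or the timeout expires."""
    probe = probe or _http_ok
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

A caller would typically start the vLLM process first, then call `wait_for_server("http://localhost:8000")` and abort the run if it returns `False`.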
We are only running the TroVE PBEBench smoke run on gpt-oss-20b. Add a
20b-specific vLLM launcher (TP=1 + the three tool-calling flags),
retarget scripts/run_trove_vllm.sh at 20b + the new TroVE flags
(--trove-task-family pbebench, --trove-selection reward,
--max-programs 5, lite_pilot_tasks.jsonl, port 8000), and simplify
the runpod notebook to a 20b-only configuration. The 120b launcher
remains in place for the other (non-TroVE) baselines that still use it.

Made-with: Cursor
… to disk

Adds a tail_vllm_log() cell so the latest vllm_logs/vllm_*.log can be
spot-checked during a long run, and tees the TroVE run cell's stdout
into outputs/trove_pbebench_lite_smoke_<ts>.log so logs survive a
disconnected browser session.

Made-with: Cursor
vLLM exposes gpt-oss text in message.reasoning when content is empty, so TroVE was parsing empty generations and producing blank solutions. Add a shared extractor and regression test for the OpenAI/vLLM response shape.

Made-with: Cursor
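
A shared extractor of the kind described might be sketched as follows. It accepts either a dict or an attribute-style message object; the helper name and the order in which `reasoning` and `reasoning_content` are tried are assumptions.

```python
def extract_message_text(message) -> str:
    """Return the first non-empty text field from a chat completion message.

    Tries `content` first, then the reasoning fields that vLLM may populate
    for gpt-oss when `content` is empty.
    """
    for field in ("content", "reasoning", "reasoning_content"):
        if isinstance(message, dict):
            value = message.get(field)
        else:
            value = getattr(message, field, None)
        if isinstance(value, str) and value.strip():
            return value
    return ""
```

Returning an empty string rather than `None` lets downstream parsing treat a missing generation the same way as an empty one.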
PBEBench rewards parse stdout as a list of replace() call strings, but the TroVE few-shots were demonstrating transformed output strings. Update PBEBench CREATE, SKIP, and IMPORT-with-tools examples and add prompt regression tests.

Made-with: Cursor
Reward-based PBEBench selection was choosing tiny direct solutions over equally correct CREATE/IMPORT candidates that populate or use the toolbox. Prefer reusable functions and tool calls on reward ties, then fall back to smallest AST, and make PBEBench CREATE prompts require a helper in **Tools**.

Made-with: Cursor
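
The tie-break order described above (highest reward, then reusable functions or tool calls, then smallest AST) can be expressed as a sort key. The `Candidate` fields and the node-count measure of AST size are assumptions made for this sketch.

```python
import ast
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one sampled solution."""
    mode: str
    code: str
    reward: float
    defines_tool: bool = False
    tool_call_count: int = 0


def ast_size(code: str) -> int:
    """Number of AST nodes in the solution (assumed size measure)."""
    return sum(1 for _ in ast.walk(ast.parse(code)))


def select_best(candidates):
    """Pick by reward, then reusability, then smallest AST."""
    def key(c):
        reusable = c.defines_tool or c.tool_call_count > 0
        # min() sorts ascending: negate reward, and False (reusable) < True.
        return (-c.reward, not reusable, ast_size(c.code))
    return min(candidates, key=key)
```

Reusability only matters among candidates with equal reward, so a correct direct solution still beats an incorrect one that happens to call tools.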