Conversation
Add scripts to run intervention experiments that inject steps from successful/failed traces into new agent runs to measure knowledge vs reasoning gaps across scientific environments. Pipeline: select tasks (from reports_v2) -> run baseline -> pick traces from baseline -> run intervention conditions -> analyze.
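The five pipeline stages above can be sketched as a small driver. This is a hypothetical sequence, not the actual scripts' interface: only `setup_envs.sh`, `launch_sweep.sh`, `run_intervention.py`, and the `--start-servers`/`--stop-servers`/`--trials` flags come from this PR; everything else is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical pipeline driver; paths and flags beyond setup_envs.sh,
# launch_sweep.sh, run_intervention.py and the sweep flags are assumptions.
set -u

run() {
  # With DRY_RUN=1, print the command instead of executing it, so the
  # stage order can be inspected without live task servers.
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

pipeline() {
  run ./scripts/setup_envs.sh                    # one-time venv creation
  run ./scripts/launch_sweep.sh --start-servers  # stateful servers, one per port
  run ./scripts/launch_sweep.sh --trials 1       # baseline runs (smoke test)
  run python scripts/run_intervention.py         # pick traces, run conditions
  run ./scripts/launch_sweep.sh --stop-servers
}
```

`DRY_RUN=1 pipeline` prints the planned command sequence without touching any environment.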
- Each env now has two server ports (react/toolcalling) to allow safe parallel runs; the server is stateful, so concurrent agents would clash
- Add scripts/setup_envs.sh for one-time venv creation (uv for spectra/resistor, micromamba for wetlab due to conda-only reaktoro)
- launch_sweep.sh gains --start-servers/--stop-servers/--server-status
- Resistor env.py uses argparse with --mode single/chained (no path needed)
- Wetlab pyproject.toml updated with corral dep and uv.sources
Replace declare -A (bash 4+) with case-based lookup functions. Tested on macOS bash 3.2.57. Also add generated task_selection.json.
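A minimal sketch of the `case`-based lookup pattern that replaces `declare -A` (associative arrays require bash 4+, while macOS ships bash 3.2). The function name and port numbers here are illustrative, not the values used in the repo:

```shell
# Map env + agent type to a server port without associative arrays,
# so the script runs on bash 3.2. Ports below are made up for illustration.
server_port() {
  local env="$1" agent="$2"
  case "${env}:${agent}" in
    spectra:react)        echo 8001 ;;
    spectra:toolcalling)  echo 8002 ;;
    resistor:react)       echo 8003 ;;
    resistor:toolcalling) echo 8004 ;;
    wetlab:react)         echo 8005 ;;
    wetlab:toolcalling)   echo 8006 ;;
    *) echo "unknown env/agent: ${env}:${agent}" >&2; return 1 ;;
  esac
}
```

The `env:agent` key trick keeps the lookup a single `case`, which is why each env can get two ports (react/toolcalling) without any bash-4 features.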
- setup_envs.sh: upgrade promptstore and install boto3 for Bedrock
- launch_sweep.sh: add --trials flag for smoke testing (e.g. --trials 1)
- run_intervention.py: cap k_values at trials count to avoid validation error
- Verified end-to-end: setup venvs → start servers → launch baselines → reports
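The k_values cap is a one-line guard in run_intervention.py; expressed in shell for illustration (the function and variable names are assumptions, only the cap-at-trials behaviour comes from the PR):

```shell
# Clamp a requested k to the number of trials actually run, mirroring the
# run_intervention.py fix; names here are illustrative.
cap_k() {
  local k="$1" trials="$2"
  if (( k > trials )); then k="$trials"; fi
  echo "$k"
}
```

With `--trials 1`, any larger k collapses to 1 instead of tripping the downstream validation error.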
Updated uv.lock files across all task environments after upgrading promptstore. Added generated prompts/index.json.
@n0w0f can we close this?