Revisions v1.5 by alexandonian · Pull Request #38 · Future-House/BixBench

alexandonian · 2025-09-29T02:55:47Z

This pull request significantly updates the README.md documentation to reflect the latest BixBench v1.5 dataset, streamline the user experience, and introduce comprehensive automation scripts for evaluation. The changes clarify dataset details, add quick start instructions, and provide explicit guidance for both manual and automated evaluation workflows. Additionally, a new results file (bixbench-v1.5_results/zero_shot_baselines.json) is added to showcase zero-shot baseline metrics for major models.

Documentation updates and workflow improvements:

- Updated the dataset description to reflect the v1.5 release, including the new count of 205 questions from 60 Jupyter notebooks and improved dataset/resource links.
- Added a "Quick Start" section and detailed instructions for the automated evaluation scripts (run_zeroshot.sh and run_agentic.sh), enabling fast reproduction of results and clarifying resource requirements and expected runtime.
- Expanded documentation on manual configuration and evaluation, including clearer explanations of trajectory generation, grading, and directory structure for both v1.5 and original results.
- Improved clarity and navigation by updating section headers, links, and example commands, and by reorganizing content for easier use.

Results and reproducibility:

- Added bixbench-v1.5_results/zero_shot_baselines.json containing accuracy, precision, and coverage metrics for zero-shot baselines (GPT-4o and Claude-3.5-Sonnet) across open-ended and MCQ modes, supporting transparent benchmarking and analysis.
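
As a rough illustration of how the three metrics named above might relate to each other, the sketch below computes them from a list of graded answers. The definitions used here (accuracy = correct / total, coverage = answered / total, precision = correct / answered) and the `GradedAnswer` record are assumptions made for illustration; they are not taken from the BixBench code or from the JSON file's schema.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """Hypothetical record for one graded question (not the BixBench schema)."""

    correct: bool
    refused: bool  # True when the model declined to answer


def summarize(answers: list[GradedAnswer]) -> dict[str, float]:
    """Compute accuracy, precision, and coverage under the assumed definitions."""
    total = len(answers)
    if total == 0:
        return {"accuracy": 0.0, "precision": 0.0, "coverage": 0.0}
    # Only non-refused answers count as attempts.
    answered = [a for a in answers if not a.refused]
    n_correct = sum(a.correct for a in answered)
    return {
        "accuracy": n_correct / total,
        "precision": n_correct / len(answered) if answered else 0.0,
        "coverage": len(answered) / total,
    }
```

Under these assumed definitions, a model that refuses often can have high precision but low coverage, which is one reason to report the metrics side by side.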

- Updated argument parsing to include `--dataset-split` and `--num-examples` options.
- Refactored evaluation logic to streamline processing of questions and capsules.
- Introduced new grading prompt for range evaluations.
- Improved handling of evaluation modes in processing functions.
- Added support for replica IDs in trajectory generation.
- Enhanced data loading and processing for better performance and clarity.
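
For readers unfamiliar with how such options are typically wired up, here is a minimal `argparse` sketch. Only the flag names `--dataset-split` and `--num-examples` come from the commit message; the parser description, types, defaults, and help text are assumptions, and this is not the repository's actual parsing code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser exposing the two options named in the commit."""
    parser = argparse.ArgumentParser(description="Hypothetical evaluation entry point")
    parser.add_argument(
        "--dataset-split",
        default="train",  # assumed default, not confirmed by the source
        help="Name of the dataset split to evaluate.",
    )
    parser.add_argument(
        "--num-examples",
        type=int,
        default=None,  # assumed meaning: None evaluates every example
        help="Optional cap on the number of examples to evaluate.",
    )
    return parser
```

Calling `build_parser().parse_args(["--num-examples", "10"])` would yield a namespace with `num_examples == 10` and the assumed default split.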

- Consolidated dependency list for clarity.
- Added new dependencies: `aiofiles`, `crow-client`, `datasets`, `huggingface-hub`.
- Restored previously removed dependencies: `ldp`, `matplotlib`, `numpy`, `scikit-learn`, `scipy`, `seaborn`, `statsmodels`, `google-cloud-storage`.

- Introduced YAML configuration files for various agent setups: `4o_image`, `4o_no_image`, `claude_image`, and `claude_no_image`.
- Each configuration specifies agent parameters, rollout settings, notebook details, capsule prompt templates, and paths for data storage.
- Added a new configuration file `v2_paper_results.yaml` to facilitate comparison of results across different models and settings.

…results configuration

- Changed the `trajectories_dir` path from `bixbench-v2_results/trajectories` to `bixbench-v1.5_results/trajectories` in the existing configuration files: `4o_image.yaml`, `4o_no_image.yaml`, `claude_image.yaml`, and `claude_no_image.yaml`.
- Introduced a new configuration file `v1.5_paper_results.yaml` to facilitate the comparison of results for the BixBench project.

- Adjusted figure size in `plotting_utils.py` for better visualization.
- Modified x-axis limits to dynamically reflect the maximum k value.
- Streamlined code for error bar calculations in bar plots.
- Updated `v1.5_paper_results.yaml` to change `k_value` from 10 to 5 and enabled the replication of paper results.
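
To make the role of `k_value` concrete, the sketch below shows one common way to compute majority-vote accuracy over the first k replicas of each question. The function name, the input layout (question ID mapped to a list of predicted answers), and the tie-breaking rule are assumptions for illustration; the actual logic in `plotting_utils.py` may differ.

```python
from collections import Counter


def majority_vote_accuracy(
    predictions: dict[str, list[str]],
    targets: dict[str, str],
    k: int,
) -> float:
    """Fraction of questions whose most common answer among the first k replicas is correct.

    Hypothetical helper: ties go to the answer seen first, and a question
    with no predictions counts as incorrect.
    """
    if not targets:
        return 0.0
    n_correct = 0
    for question_id, target in targets.items():
        votes = predictions.get(question_id, [])[:k]
        if not votes:
            continue
        # most_common(1) returns the highest-count answer; equal counts keep first-seen order.
        winner, _count = Counter(votes).most_common(1)[0]
        n_correct += winner == target
    return n_correct / len(targets)
```

With five replicas per question, `k` can range from 1 to 5, which is consistent with deriving the x-axis limit from the maximum k rather than hard-coding it.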

- Introduced `run_agentic.sh` to automate agentic evaluations, including configuration checks, running evaluations for multiple setups, and postprocessing results.
- Added `run_zeroshot.sh` for zero-shot evaluations, encompassing grading and result aggregation into a single JSON file.
- Both scripts ensure proper directory structure and provide user feedback during execution.

- Corrected the dataset question count from 296 to 205 and updated related links for consistency.
- Added a new "Quick Start" section with automated evaluation scripts for easier reproduction of results.
- Enhanced instructions for trajectory generation and evaluation, including details on API costs and execution time.
- Improved formatting and clarity throughout the document, including updated links and section titles.

- Introduced new binary image files for results comparison: `bixbench_results_comparison.png`, `majority_vote_accuracy_image_comparison.png`, and `majority_vote_accuracy_refusal_option_comparison.png`.
- Added JSON file `zero_shot_baselines.json` containing baseline evaluation metrics for various models.
- Created CSV files for zero-shot grading results: `claude-3-5-sonnet-latest-grader-mcq-refusal-False.csv`, `claude-3-5-sonnet-latest-grader-mcq-refusal-True.csv`, `claude-3-5-sonnet-latest-grader-openended.csv`, `gpt-4o-grader-mcq-refusal-False.csv`, `gpt-4o-grader-mcq-refusal-True.csv`, and `gpt-4o-grader-openended.csv`.
- These additions enhance the evaluation framework and facilitate better comparison of model performance across different scenarios.

- Updated `ignore-words-list` in `pyproject.toml` to include "LasR" and modified the `skip` pattern to encompass additional result directories.
- Corrected typos in multiple CSV files, ensuring consistency in the phrasing of questions and improving clarity in the dataset.
- Enhanced the overall quality of the zero-shot grading results for better evaluation accuracy.

alexandonian · 2025-09-29T21:25:00Z

+  "crow-client >= 0.3.4",
+  "datasets",
  "fhaviary[server] >= 0.18.0",
  "fhda @ git+https://github.com/Future-House/data-analysis-crow@v1.0.0",


@ludomitch We'll need to update this to point to the corresponding release of data-analysis-crow once approved.

Good catch!

Why is crow-client here?

ludomitch

Nice work 🙌

ludomitch · 2025-09-29T21:43:23Z

 - [Using Your Own Agent](#using-your-own-agent)
- [Zero-shot Evaluations](#zero-shot-evaluations)
+- [Zero-shot Evaluations & Grading](#zero-shot-evaluations--grading)
+- [Automated Evaluation Scripts](#automated-evaluation-scripts)


Zero-shot vs. automated is confusing. Shouldn't it be zero-shot vs. agentic?

ludomitch · 2025-09-29T21:48:17Z

Why is crow-client here?

- Removed the `crow-client` dependency from `pyproject.toml` to streamline requirements.
- Updated the model name in `test_zeroshot.py` from `claude-3-opus` to `anthropic/claude-3-5-sonnet-20241022` for consistency with the latest model version.

…data-analysis-crow release

alexandonian added 11 commits September 25, 2025 03:16

Refactor YAML and JSON files to satisfy pre-commit lints

5b6ef07

Refactor plotting utility to satisfy pre-commit lints

ba8aa01

Update README.md to improve clarity and detail for evaluation scripts

d088e13

alexandonian requested a review from ludomitch September 29, 2025 02:55

alexandonian commented Sep 29, 2025

ludomitch approved these changes Sep 29, 2025

alexandonian added 4 commits September 30, 2025 00:01

Update uv.lock to reflect new package revisions

01bb86f

Update dependencies in pyproject.toml and uv.lock to point to latest …

28c440e

…data-analysis-crow release

Streamline README organization and update figures for v1.5

c3767cc

alexandonian merged commit f1e7f96 into main Oct 2, 2025
2 checks passed

alexandonian merged 16 commits into main from revisions
