Revisions v1.5 #38

Merged
alexandonian merged 16 commits into main from revisions on Oct 2, 2025

@alexandonian
Contributor

This pull request significantly updates the README.md documentation to reflect the latest BixBench v1.5 dataset, streamline the user experience, and introduce comprehensive automation scripts for evaluation. The changes clarify dataset details, add quick start instructions, and provide explicit guidance for both manual and automated evaluation workflows. Additionally, a new results file (bixbench-v1.5_results/zero_shot_baselines.json) is added to showcase zero-shot baseline metrics for major models.

Documentation updates and workflow improvements:

  • Updated dataset description to reflect the v1.5 release, including the new count of 205 questions from 60 Jupyter notebooks and improved dataset/resource links.
  • Added a "Quick Start" section and detailed instructions for automated evaluation scripts (run_zeroshot.sh and run_agentic.sh), enabling fast reproduction of results and clarifying resource requirements and expected runtime.
  • Expanded documentation on manual configuration and evaluation, including clearer explanations of trajectory generation, grading, and directory structure for both v1.5 and original results.
  • Improved clarity and navigation by updating section headers, links, and example commands, and reorganizing content for easier use and comprehension.

Results and reproducibility:

  • Added bixbench-v1.5_results/zero_shot_baselines.json containing accuracy, precision, and coverage metrics for zero-shot baselines (GPT-4o and Claude-3.5-Sonnet) across open-ended and MCQ modes, supporting transparent benchmarking and analysis.
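
For readers unfamiliar with the three metrics named above, here is a minimal sketch of how they could relate to one another. The definitions are an assumption, not taken from the repository: accuracy is correct answers over all questions, coverage is the fraction of questions the model did not refuse, and precision is correct answers over answered questions. The `GradedAnswer` type and `summarize` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One graded question (hypothetical record layout, not the repo's schema)."""
    correct: bool   # grader judged the answer correct
    refused: bool   # model picked the refusal option or declined to answer


def summarize(rows: list[GradedAnswer]) -> dict[str, float]:
    """Compute accuracy, precision, and coverage under the assumed definitions."""
    total = len(rows)
    answered = [r for r in rows if not r.refused]
    n_correct = sum(r.correct for r in answered)
    return {
        # correct answers over every question, refusals included
        "accuracy": n_correct / total if total else 0.0,
        # correct answers over the questions the model actually attempted
        "precision": n_correct / len(answered) if answered else 0.0,
        # fraction of questions the model attempted
        "coverage": len(answered) / total if total else 0.0,
    }
```

Under these assumed definitions, accuracy equals precision times coverage, so a model that refuses often can show high precision with low accuracy.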

- Updated argument parsing to include `--dataset-split` and `--num-examples` options.
- Refactored evaluation logic to streamline processing of questions and capsules.
- Introduced new grading prompt for range evaluations.
- Improved handling of evaluation modes in processing functions.
- Added support for replica IDs in trajectory generation.
- Enhanced data loading and processing for better performance and clarity.
- Consolidated dependency list for clarity.
- Added new dependencies: `aiofiles`, `crow-client`, `datasets`, `huggingface-hub`.
- Restored previously removed dependencies: `ldp`, `matplotlib`, `numpy`, `scikit-learn`, `scipy`, `seaborn`, `statsmodels`, `google-cloud-storage`.
- Introduced YAML configuration files for various agent setups: `4o_image`, `4o_no_image`, `claude_image`, and `claude_no_image`.
- Each configuration specifies agent parameters, rollout settings, notebook details, capsule prompt templates, and paths for data storage.
- Added a new configuration file `v2_paper_results.yaml` to facilitate comparison of results across different models and settings.
…results configuration

- Changed the `trajectories_dir` path from `bixbench-v2_results/trajectories` to `bixbench-v1.5_results/trajectories` in the existing configuration files: `4o_image.yaml`, `4o_no_image.yaml`, `claude_image.yaml`, and `claude_no_image.yaml`.
- Introduced a new configuration file `v1.5_paper_results.yaml` to facilitate the comparison of results for the BixBench project.
- Adjusted figure size in `plotting_utils.py` for better visualization.
- Modified x-axis limits to dynamically reflect the maximum k value.
- Streamlined code for error bar calculations in bar plots.
- Updated `v1.5_paper_results.yaml` to change `k_value` from 10 to 5 and enabled the replication of paper results.
- Introduced `run_agentic.sh` to automate agentic evaluations, including configuration checks, running evaluations for multiple setups, and postprocessing results.
- Added `run_zeroshot.sh` for zero-shot evaluations, encompassing grading and result aggregation into a single JSON file.
- Both scripts ensure proper directory structure and provide user feedback during execution.
- Corrected the dataset question count from 296 to 205 and updated related links for consistency.
- Added a new "Quick Start" section with automated evaluation scripts for easier reproduction of results.
- Enhanced instructions for trajectory generation and evaluation, including details on API costs and execution time.
- Improved formatting and clarity throughout the document, including updated links and section titles.
- Introduced new binary image files for results comparison: `bixbench_results_comparison.png`, `majority_vote_accuracy_image_comparison.png`, and `majority_vote_accuracy_refusal_option_comparison.png`.
- Added JSON file `zero_shot_baselines.json` containing baseline evaluation metrics for various models.
- Created CSV files for zero-shot grading results, including `claude-3-5-sonnet-latest-grader-mcq-refusal-False.csv`, `claude-3-5-sonnet-latest-grader-mcq-refusal-True.csv`, `claude-3-5-sonnet-latest-grader-openended.csv`, and `gpt-4o-grader-mcq-refusal-False.csv`, `gpt-4o-grader-mcq-refusal-True.csv`, `gpt-4o-grader-openended.csv`.
- These additions enhance the evaluation framework and facilitate better comparison of model performance across different scenarios.
- Updated `ignore-words-list` in `pyproject.toml` to include "LasR" and modified the `skip` pattern to encompass additional result directories.
- Corrected typos in multiple CSV files, ensuring consistency in the phrasing of questions and improving clarity in the dataset.
- Enhanced the overall quality of the zero-shot grading results for better evaluation accuracy.
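
The `k_value` setting and the majority-vote plots mentioned above can be understood through a simplified sketch of majority voting over k replicas. This is not the repository's implementation: the function names and data layout are hypothetical, the first k answers per question are used rather than any sampling scheme, and ties go to the answer seen first.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties resolve to the first one seen."""
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(
    runs: dict[str, list[str]], targets: dict[str, str], k: int
) -> float:
    """Fraction of questions whose majority answer over k replicas is correct.

    runs maps a question id to its answers across replicas (assumed layout);
    targets maps a question id to the reference answer.
    """
    if not targets:
        return 0.0
    correct = 0
    for qid, target in targets.items():
        votes = runs.get(qid, [])[:k]  # simplification: take the first k replicas
        if votes and majority_vote(votes) == target:
            correct += 1
    return correct / len(targets)
```

Plotting this value for k from 1 up to the configured maximum gives a curve whose x-axis limit follows the largest k, which is the behavior the plotting change describes.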
Comment thread pyproject.toml Outdated
"crow-client >= 0.3.4",
"datasets",
"fhaviary[server] >= 0.18.0",
"fhda @ git+https://github.com/Future-House/data-analysis-crow@v1.0.0",
Contributor Author

@ludomitch We'll need to update this to point to the corresponding release of data-analysis-crow once approved.

Collaborator

Good catch!

Why is crow-client here?

Collaborator

@ludomitch left a comment

Nice work 🙌

Comment thread README.md Outdated
- [Using Your Own Agent](#using-your-own-agent)
- [Zero-shot Evaluations](#zero-shot-evaluations)
- [Zero-shot Evaluations & Grading](#zero-shot-evaluations--grading)
- [Automated Evaluation Scripts](#automated-evaluation-scripts)
Collaborator

Zero shot vs automated is confusing. Shouldn't it be zero-shot vs agentic?

- Removed the `crow-client` dependency from `pyproject.toml` to streamline requirements.
- Updated the model name in `test_zeroshot.py` from `claude-3-opus` to `anthropic/claude-3-5-sonnet-20241022` for consistency with the latest model version.
@alexandonian merged commit f1e7f96 into main on Oct 2, 2025
2 checks passed