Revisions v1.5#38
Merged
- Updated argument parsing to include `--dataset-split` and `--num-examples` options.
- Refactored evaluation logic to streamline processing of questions and capsules.
- Introduced new grading prompt for range evaluations.
- Improved handling of evaluation modes in processing functions.
- Added support for replica IDs in trajectory generation.
- Enhanced data loading and processing for better performance and clarity.

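The first bullet above names two new command-line options. As a minimal sketch of how such options are commonly wired up with `argparse`, assuming a string split name and an optional integer cap (the defaults, help texts, and the `build_parser` and `select_examples` helpers are hypothetical and not taken from the repository):

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two options named in the commit message."""
    parser = argparse.ArgumentParser(description="Hypothetical evaluation entry point")
    # The default split name "train" is an assumption, not the repository's value.
    parser.add_argument(
        "--dataset-split",
        type=str,
        default="train",
        help="Name of the dataset split to load",
    )
    # A default of None meaning "use every example" is also an assumption.
    parser.add_argument(
        "--num-examples",
        type=int,
        default=None,
        help="Optional cap on the number of examples to process",
    )
    return parser


def select_examples(examples: List, num_examples: Optional[int]) -> List:
    """Return the first num_examples items, or all items when the cap is None."""
    if num_examples is None:
        return list(examples)
    if num_examples < 0:
        raise ValueError("--num-examples must be non-negative")
    return list(examples[:num_examples])
```

Note that `argparse` exposes `--dataset-split` as `args.dataset_split` and `--num-examples` as `args.num_examples`, since hyphens in option names become underscores in the parsed namespace.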
- Consolidated dependency list for clarity.
- Added new dependencies: `aiofiles`, `crow-client`, `datasets`, `huggingface-hub`.
- Restored previously removed dependencies: `ldp`, `matplotlib`, `numpy`, `scikit-learn`, `scipy`, `seaborn`, `statsmodels`, `google-cloud-storage`.

- Introduced YAML configuration files for various agent setups: `4o_image`, `4o_no_image`, `claude_image`, and `claude_no_image`.
- Each configuration specifies agent parameters, rollout settings, notebook details, capsule prompt templates, and paths for data storage.
- Added a new configuration file `v2_paper_results.yaml` to facilitate comparison of results across different models and settings.

…results configuration
- Changed the `trajectories_dir` path from `bixbench-v2_results/trajectories` to `bixbench-v1.5_results/trajectories` in the existing configuration files: `4o_image.yaml`, `4o_no_image.yaml`, `claude_image.yaml`, and `claude_no_image.yaml`.
- Introduced a new configuration file `v1.5_paper_results.yaml` to facilitate the comparison of results for the BixBench project.

- Adjusted figure size in `plotting_utils.py` for better visualization.
- Modified x-axis limits to dynamically reflect the maximum k value.
- Streamlined code for error bar calculations in bar plots.
- Updated `v1.5_paper_results.yaml` to change `k_value` from 10 to 5 and enabled the replication of paper results.

- Introduced `run_agentic.sh` to automate agentic evaluations, including configuration checks, running evaluations for multiple setups, and postprocessing results.
- Added `run_zeroshot.sh` for zero-shot evaluations, encompassing grading and result aggregation into a single JSON file.
- Both scripts ensure proper directory structure and provide user feedback during execution.

- Corrected the dataset question count from 296 to 205 and updated related links for consistency.
- Added a new "Quick Start" section with automated evaluation scripts for easier reproduction of results.
- Enhanced instructions for trajectory generation and evaluation, including details on API costs and execution time.
- Improved formatting and clarity throughout the document, including updated links and section titles.

- Introduced new binary image files for results comparison: `bixbench_results_comparison.png`, `majority_vote_accuracy_image_comparison.png`, and `majority_vote_accuracy_refusal_option_comparison.png`.
- Added JSON file `zero_shot_baselines.json` containing baseline evaluation metrics for various models.
- Created CSV files for zero-shot grading results: `claude-3-5-sonnet-latest-grader-mcq-refusal-False.csv`, `claude-3-5-sonnet-latest-grader-mcq-refusal-True.csv`, `claude-3-5-sonnet-latest-grader-openended.csv`, `gpt-4o-grader-mcq-refusal-False.csv`, `gpt-4o-grader-mcq-refusal-True.csv`, and `gpt-4o-grader-openended.csv`.
- These additions enhance the evaluation framework and facilitate better comparison of model performance across different scenarios.

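The image filenames above refer to majority-vote accuracy, which the commit message does not define. One common reading is sketched below, purely as an assumption: each question is answered in several independent runs, the most frequent answer among the first `k` runs is taken as the prediction, and accuracy is the fraction of questions where that prediction matches the target. The function names and the tie-breaking rule (first occurrence wins) are hypothetical and may differ from what `plotting_utils.py` does.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer, or None when there are no answers.

    Ties go to the answer that appears first; this rule is an assumption.
    """
    if not answers:
        return None
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    return None


def majority_vote_accuracy(
    runs: Sequence[Sequence[str]], targets: Sequence[str], k: int
) -> float:
    """Fraction of questions whose majority answer over the first k runs is correct."""
    if len(runs) != len(targets):
        raise ValueError("runs and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        majority_vote(list(question_runs[:k])) == target
        for question_runs, target in zip(runs, targets)
    )
    return correct / len(targets)
```

Under this reading, sweeping `k` from 1 up to the configured `k_value` would produce the kind of accuracy-versus-k curve that the comparison figures suggest.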
- Updated `ignore-words-list` in `pyproject.toml` to include "LasR" and modified the `skip` pattern to encompass additional result directories.
- Corrected typos in multiple CSV files, ensuring consistency in the phrasing of questions and improving clarity in the dataset.
- Enhanced the overall quality of the zero-shot grading results for better evaluation accuracy.

alexandonian
commented
Sep 29, 2025
    "crow-client >= 0.3.4",
    "datasets",
    "fhaviary[server] >= 0.18.0",
    "fhda @ git+https://github.com/Future-House/data-analysis-crow@v1.0.0",
Contributor
Author
@ludomitch We'll need to update this to point to the corresponding release of data-analysis-crow once approved.
Collaborator
Good catch!
Why is crow-client here?
ludomitch
approved these changes
Sep 29, 2025
    - [Using Your Own Agent](#using-your-own-agent)
    - [Zero-shot Evaluations](#zero-shot-evaluations)
    - [Zero-shot Evaluations & Grading](#zero-shot-evaluations--grading)
    - [Automated Evaluation Scripts](#automated-evaluation-scripts)
Collaborator
Zero-shot vs. automated is confusing. Shouldn't it be zero-shot vs. agentic?
- Removed the `crow-client` dependency from `pyproject.toml` to streamline requirements.
- Updated the model name in `test_zeroshot.py` from `claude-3-opus` to `anthropic/claude-3-5-sonnet-20241022` for consistency with the latest model version.

…data-analysis-crow release
This pull request significantly updates the `README.md` documentation to reflect the latest BixBench v1.5 dataset, streamline the user experience, and introduce comprehensive automation scripts for evaluation. The changes clarify dataset details, add quick start instructions, and provide explicit guidance for both manual and automated evaluation workflows. Additionally, a new results file (`bixbench-v1.5_results/zero_shot_baselines.json`) is added to showcase zero-shot baseline metrics for major models.

Documentation updates and workflow improvements:
- Added automated evaluation scripts (`run_zeroshot.sh` and `run_agentic.sh`), enabling fast reproduction of results and clarifying resource requirements and expected runtime.

Results and reproducibility:
- Added `bixbench-v1.5_results/zero_shot_baselines.json`, containing accuracy, precision, and coverage metrics for zero-shot baselines (GPT-4o and Claude-3.5-Sonnet) across open-ended and MCQ modes, supporting transparent benchmarking and analysis.
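
The description names accuracy, precision, and coverage but does not define them. A minimal sketch of one plausible set of definitions for a setting with a refusal option follows; it is an assumption, not the repository's grading code. Here coverage is the share of questions the model attempted, precision is the share of attempted questions answered correctly, and accuracy is the share of all questions answered correctly. The `Metrics` class and `score` function are hypothetical names, and a refusal is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Metrics:
    accuracy: float   # correct / total questions
    precision: float  # correct / attempted questions
    coverage: float   # attempted / total questions


def score(predictions: Sequence[Optional[str]], targets: Sequence[str]) -> Metrics:
    """Compute the three metrics; a prediction of None is treated as a refusal."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total = len(targets)
    if total == 0:
        return Metrics(accuracy=0.0, precision=0.0, coverage=0.0)
    attempted = sum(p is not None for p in predictions)
    correct = sum(p is not None and p == t for p, t in zip(predictions, targets))
    return Metrics(
        accuracy=correct / total,
        # Precision is reported as 0.0 when every question was refused (assumption).
        precision=correct / attempted if attempted else 0.0,
        coverage=attempted / total,
    )
```

With these definitions, accuracy equals precision times coverage, so a model that refuses often can show high precision alongside low accuracy. That is one reason to report all three side by side.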