Merged
16 commits
76369ac
Enhance zero-shot evaluation and grading functionality
alexandonian Sep 25, 2025
6c0498a
Update dependencies in pyproject.toml for improved functionality
alexandonian Sep 25, 2025
e3af14c
Add new configuration files for BixBench runs
alexandonian Sep 25, 2025
0fdd93a
Update trajectory directory paths in configuration files and add new …
alexandonian Sep 26, 2025
fd2ccf8
Update plotting utility and configuration for majority vote accuracy
alexandonian Sep 26, 2025
8e22017
Add scripts for agentic and zero-shot evaluations
alexandonian Sep 26, 2025
df6eaca
Update README.md for clarity and new features
alexandonian Sep 26, 2025
28909d8
Add new results and comparison files for BixBench evaluation
alexandonian Sep 26, 2025
5b6ef07
Refactor YAML and JSON files to satisfy pre-commit lints
alexandonian Sep 29, 2025
ba8aa01
Refactor plotting utility to satisfy pre-commit lints
alexandonian Sep 29, 2025
d088e13
Update README.md to improve clarity and detail for evaluation scripts
alexandonian Sep 29, 2025
0efdf6f
Update pyproject.toml and CSV files for improved evaluation consistency
alexandonian Sep 29, 2025
7d3ae65
Update dependencies and model references in configuration files
alexandonian Sep 30, 2025
01bb86f
Update uv.lock to reflect new package revisions
alexandonian Sep 30, 2025
28c440e
Update dependencies in pyproject.toml and uv.lock to point to latest …
alexandonian Sep 30, 2025
c3767cc
Streamline README organization and update figures for v1.5
alexandonian Sep 30, 2025
118 changes: 92 additions & 26 deletions README.md
@@ -1,4 +1,4 @@
<p align="center">
<p align="center"> <!-- markdownlint-disable MD041 -->
<a href="https://arxiv.org/abs/2503.00096">
<img alt="Paper" src="https://img.shields.io/badge/arXiv-arXiv:2409.11363-b31b1b.svg">
<a href = "https://github.com/Future-House/BixBench">
@@ -7,7 +7,7 @@
<img alt="Dataset" src="https://img.shields.io/badge/Hugging%20Face-Dataset-yellow.svg">
<img alt="Tests" src="https://github.com/Future-House/BixBench/actions/workflows/tests.yml/badge.svg">
<a href="https://github.com/Future-House/BixBench/actions/workflows/tests.yml">
</p>
</p>

# BixBench: A Comprehensive Benchmark for LLM-based Agents in Computational Biology

@@ -19,9 +19,9 @@ This benchmark tests AI agents' ability to:
- Interpret nuanced results in the context of a research question

BixBench presents AI agents with open-ended or multiple-choice tasks, requiring them to navigate datasets, execute code (Python, R, Bash), generate scientific hypotheses, and validate them.
The dataset contains 296 questions derived from 53 real-world, published Jupyter notebooks and related data (capsules).
The dataset contains 205 questions derived from 60 real-world, published Jupyter notebooks and related data (capsules).

You can find the BixBench dataset in [Hugging Face](https://huggingface.co/datasets/futurehouse/BixBench), read details in the paper [here](https://arxiv.org/abs/2503.00096), and read our the blog post announcement [here](https://www.futurehouse.org/research-announcements/bixbench).
You can find the BixBench dataset on [Hugging Face](https://huggingface.co/datasets/futurehouse/BixBench), read details in the [paper](https://arxiv.org/abs/2503.00096), and read our [blog post announcement](https://www.futurehouse.org/research-announcements/bixbench).

This repository enables three separate functions:

@@ -32,9 +32,10 @@ This repository enables three separate functions:
## Links

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Agentic Evaluations](#agentic-evaluations)
- [Using Your Own Agent](#using-your-own-agent)
- [Zero-shot Evaluations](#zero-shot-evaluations)
- [Zero-shot Evaluations & Grading](#zero-shot-evaluations--grading)
- [Replicating the BixBench Paper Results](#replicating-the-bixbench-paper-results)
- [Acknowledgments](#acknowledgments)

@@ -58,7 +59,7 @@ Next, you will need to be able to access the BixBench dataset. To do this, you w
huggingface-cli login
```

See [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli) for how to get started with the Hugging Face CLI and [here](https://huggingface.co/docs/huggingface_hub/en/guides/security-tokens) for more information on how to create a token.
See the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) for how to get started with the Hugging Face CLI and [Hugging Face Security Tokens](https://huggingface.co/docs/huggingface_hub/en/guides/security-tokens) for more information on how to create a token.

Finally, the agent executes its data analysis code in a containerized environment. So to run it, you will need to pull the docker image:

@@ -67,15 +68,41 @@ Finally, the agent executes its data analysis code in a containerized environmen
docker pull futurehouse/bixbench:aviary-notebook-env
```

See [here](https://www.docker.com/get-started/) for instructions on how to set up Docker.
See [Docker's Getting Started Guide](https://docs.docker.com/get-started/) for instructions on how to set up Docker.

## Quick Start

For quick reproduction of BixBench results, we provide automated scripts that handle the entire evaluation pipeline:

### Option 1: Automated Evaluation (Recommended)

```bash
# Run zero-shot evaluations and grading (fastest)
bash scripts/run_zeroshot.sh

# Run agentic evaluations with multiple replicas (takes 24-48 hours)
bash scripts/run_agentic.sh
```

### Option 2: Manual Configuration

```bash
# Generate trajectories with specific configuration
python bixbench/generate_trajectories.py --config_file bixbench/run_configuration/4o_image.yaml

# Run postprocessing to generate results
python bixbench/postprocessing.py --config_file bixbench/run_configuration/v1.5_paper_results.yaml
```

⚠️ **Note**: The automated agentic evaluation script will run multiple models (GPT-4o, Claude) across 5 replicas each, which requires significant API credits and 24-48 hours to complete.

## Prerequisites

### API Keys

We support all LLMs that are supported by [litellm](https://github.com/BerriAI/litellm). Create a `.env` file with the API keys for the LLMs you want to evaluate. For example:

```
```env
OPENAI_API_KEY = "your-openai-api-key"
ANTHROPIC_API_KEY = "your-anthropic-api-key"
```
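
As a quick sanity check before launching a long run, the `.env` contents can be parsed and checked for the keys you expect. This is a minimal standard-library sketch: the helper name `parse_env` is hypothetical, and how the repository itself loads the file is not shown in this diff.

```python
# Minimal sketch: parse KEY = "value" lines and report missing API keys.
# `parse_env` is a hypothetical helper, not part of BixBench or litellm.

def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY = "value" lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and optional single or double quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Same shape as the example above; replace with Path(".env").read_text().
example = (
    'OPENAI_API_KEY = "your-openai-api-key"\n'
    'ANTHROPIC_API_KEY = "your-anthropic-api-key"\n'
)
keys = parse_env(example)
missing = [
    name for name in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY") if not keys.get(name)
]
print("missing keys:", missing)
```

Only the providers you plan to evaluate need a key, so adjust the tuple of expected names accordingly.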
@@ -101,7 +128,7 @@ This will:
2. Preprocess each capsule in the dataset
3. Generate and store trajectories including the final agent answer and Jupyter notebook in the directory specified in the YAML file

Trajectories are saved in the `bixbench_results/` directory as json files.
Trajectories are saved in the specified `trajectories_dir` in the YAML file as json files (default is `data/trajectories/`).
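
To inspect the saved files, something like the following could be used. It is a sketch under assumptions: it only relies on the files being JSON inside the configured directory (`data/trajectories/` is the default stated above), makes no assumption about the fields inside each trajectory, and `load_trajectories` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_trajectories(trajectories_dir: str) -> dict[str, object]:
    """Load every *.json file in the directory, keyed by file name stem."""
    directory = Path(trajectories_dir)
    if not directory.is_dir():
        # Nothing generated yet (or a different trajectories_dir is configured).
        return {}
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(directory.glob("*.json"))
    }


trajectories = load_trajectories("data/trajectories")
print(f"Loaded {len(trajectories)} trajectory files")
```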

### Customization

@@ -118,10 +145,6 @@ Edit or create a new YAML file to modify:

To use your own agent, use the `generate_trajectories.py` script by editing the [`custom_rollout`](https://github.com/Future-House/BixBench/blob/6c28217959d5d7dd6f48c59894534fced7c6c040/bixbench/generate_trajectories.py#L239) function to generate trajectories in the same format as the BixBench trajectories, then use the `postprocessing.py` script to evaluate your agent's performance.

### Hosted trajectory generation

Coming soon!

### Evaluate trajectories

Similarly, to evaluate the trajectories, we use the `postprocessing.py` script alongside a YAML configuration file:
@@ -151,48 +174,91 @@ You can run zero-shot evaluations using the `generate_zeroshot_evals.py` script

The scripts can be configured to run with open-ended questions, multiple-choice questions (with or without a refusal option), different models, and different temperatures. To explore the different options, run the scripts with the `--help` flag.

**Example: Generate zero-shot answers in MCQ setting with the "refusal option" (in addition to the original distractors)**
### Example: Generate zero-shot answers in MCQ setting with the "refusal option" (in addition to the original distractors)

```bash
python generate_zeroshot_evals.py \
--answer-mode "mcq" \
--model "gpt-4o" \
--with-refusal
--answer-mode "mcq" \
--model "gpt-4o" \
--with-refusal
```

**Example: Grade the zero-shot answers from the previous step**
### Example: Grade the zero-shot answers from the previous step

```bash
python grade_outputs.py \
--input-file path/to/zeroshot.csv \
--answer-mode "mcq"
--input-file path/to/zeroshot.csv \
--answer-mode "mcq"
```

## Replicating the BixBench Paper Results

To replicate the BixBench paper results for agentic evaluations, you can download the raw data from 2,120 trajectories and its respective postprocessed evaluation dataframe:
### v1.5 Results (Latest)

For the latest BixBench v1.5 results using the enhanced 205-question dataset, use the automated scripts or the v1.5 configuration:

#### Quick Reproduction

```bash
# Complete reproduction using automation scripts
bash scripts/run_zeroshot.sh # Zero-shot baselines
bash scripts/run_agentic.sh # Agentic evaluations (24-48 hours)
```

#### Manual v1.5 Reproduction

```bash
# Generate v1.5 results
python bixbench/postprocessing.py --config_file bixbench/run_configuration/v1.5_paper_results.yaml
```

The v1.5 configuration includes:

- **Majority vote analysis** with k=5 replicas
- **Image comparison analysis** (with/without image support)
- **Refusal option comparison** (with/without refusal options in MCQs)
- **Zero-shot baseline integration** for comprehensive model comparison
- **Updated result paths** using `bixbench-v1.5_results/` directory structure
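
For readers unfamiliar with majority voting over replicas, the idea can be sketched as follows. This illustrates the general technique only, not the repository's implementation: the function names, the question IDs, and the tie-breaking rule (first answer seen wins) are all assumptions.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(
    replica_answers: dict[str, list[str]], gold: dict[str, str]
) -> float:
    """Fraction of questions whose majority-voted answer matches the gold answer."""
    correct = sum(
        majority_vote(answers) == gold[question_id]
        for question_id, answers in replica_answers.items()
    )
    return correct / len(replica_answers)


# Hypothetical data: three questions, k=5 replica answers each.
replica_answers = {
    "q1": ["A", "A", "B", "A", "C"],
    "q2": ["B", "C", "C", "D", "C"],
    "q3": ["A", "B", "B", "A", "D"],
}
gold = {"q1": "A", "q2": "B", "q3": "B"}
print(majority_vote_accuracy(replica_answers, gold))
```

With a tie such as `q3`, the result depends on the tie-breaking rule, so any real analysis should state which rule it uses.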

### Original Paper Results

To replicate the original BixBench paper results, you can download the raw data from 2,120 trajectories and its respective postprocessed evaluation dataframe:

```bash
wget https://storage.googleapis.com/bixbench-results/raw_trajectory_data.csv -P bixbench_results/
wget https://storage.googleapis.com/bixbench-results/eval_df.csv -P bixbench_results/
```

You can then run the postprocessing script to generate the evaluation dataframe and analysis plots using the `bixbench/run_configuration/bixbench_paper_results.yaml` configuration file:
You can then run the postprocessing script to generate the evaluation dataframe and analysis plots using the original configuration file:

```bash
python bixbench/postprocessing.py --config_file bixbench/run_configuration/bixbench_paper_results.yaml
```

### Generated Figures

The evaluation process will generate the following comparative visualizations:

**v1.5 Results:**

You will see the following figures from the paper:
![Performance Comparison](bixbench_results/bixbench_results_comparison.png)
![Performance Comparison](bixbench-v1.5_results/bixbench_results_comparison.png)

![Majority Vote Accuracy](bixbench-v1.5_results/majority_vote_accuracy_refusal_option_comparison.png)

**Original Results:**

![Majority Vote Accuracy](bixbench_results/majority_vote_accuracy_refusal_option_comparison.png)
- `bixbench_results/bixbench_results_comparison.png` - Original performance comparison
- `bixbench_results/majority_vote_accuracy_refusal_option_comparison.png` - Original majority vote analysis

## Gotchas

- The BixBench dataset is large and may take several minutes to download.
- When generating trajectories, the default batch size is set to 4 to optimize processing speed. You may need to adjust this value in the [configuration file](https://github.com/Future-House/BixBench/blob/8c57d3562044e4ce574a09438066033e21155f54/bixbench/run_configuration/generate_trajectories.yaml#L14) based on your API rate limits and available compute resources.
- While the agent uses the local Jupyter kernel by default, we recommend using our custom Docker environment for improved performance. To enable this, pull the Docker image as described in the [Installation](#installation) section and set the environment variable `USE_DOCKER=true` when running the `generate_trajectories.py` script.
- **API Costs**: The automated agentic evaluation script (`run_agentic.sh`) will incur significant API costs as it runs 5 replicas across multiple model configurations (GPT-4o, Claude-3.5-Sonnet). Estimate your costs before running.
- **Execution Time**: Complete agentic evaluations take 24-48 hours. Use the zero-shot script (`run_zeroshot.sh`) for faster results if you only need baseline comparisons.
- When generating trajectories manually, the default batch size is set to 4 to optimize processing speed. You may need to adjust this value in the [configuration file](https://github.com/Future-House/BixBench/blob/8c57d3562044e4ce574a09438066033e21155f54/bixbench/run_configuration/generate_trajectories.yaml#L14) based on your API rate limits and available compute resources.
- While the agent can use a local Jupyter kernel, we recommend using our custom Docker environment for improved performance (default is `USE_DOCKER=true`). Be sure to pull the Docker image as described in the [Installation](#installation) section. If you would like to use a local Jupyter kernel, set use_docker to false in the notebook section of the configuration file.
- **Directory Structure**: The v1.5 automation scripts create and use `bixbench-v1.5_results/` while manual configurations may use `bixbench_results/` or other directories as specified in the YAML files.

## Acknowledgments

50 changes: 50 additions & 0 deletions bixbench-v1.5_results/zero_shot_baselines.json
@@ -0,0 +1,50 @@
{
"gpt-4o-grader-mcq-refusal-False": {
"accuracy": 0.36097560975609755,
"precision": 0.36097560975609755,
"coverage": 1.0,
"n_total": 205,
"n_correct": 74,
"n_sure": 205
},
"gpt-4o-grader-mcq-refusal-True": {
"accuracy": 0.03902439024390244,
"precision": 0.4,
"coverage": 0.0975609756097561,
"n_total": 205,
"n_correct": 8,
"n_sure": 20
},
"claude-3-5-sonnet-latest-grader-mcq-refusal-True": {
"accuracy": 0.08292682926829269,
"precision": 0.38636363636363635,
"coverage": 0.2146341463414634,
"n_total": 205,
"n_correct": 17,
"n_sure": 44
},
"claude-3-5-sonnet-latest-grader-mcq-refusal-False": {
"accuracy": 0.34146341463414637,
"precision": 0.34146341463414637,
"coverage": 1.0,
"n_total": 205,
"n_correct": 70,
"n_sure": 205
},
"gpt-4o-grader-openended": {
"accuracy": 0.02926829268292683,
"precision": 0.02926829268292683,
"coverage": 1.0,
"n_total": 205,
"n_correct": 6,
"n_sure": 205
},
"claude-3-5-sonnet-latest-grader-openended": {
"accuracy": 0.02926829268292683,
"precision": 0.02926829268292683,
"coverage": 1.0,
"n_total": 205,
"n_correct": 6,
"n_sure": 205
}
}
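
The three ratios in each entry appear to follow directly from the counts: `accuracy = n_correct / n_total`, `precision = n_correct / n_sure`, and `coverage = n_sure / n_total`. This relationship is inferred from the numbers above rather than from the grading code, and the interpretation of `n_sure` as the number of questions answered without choosing the refusal option is an assumption. A small sketch to recompute one entry:

```python
def summarize(n_total: int, n_correct: int, n_sure: int) -> dict[str, float]:
    """Recompute accuracy, precision, and coverage from the raw counts."""
    return {
        "accuracy": n_correct / n_total,
        # Assumption: report 0.0 when the model never commits to an answer.
        "precision": n_correct / n_sure if n_sure else 0.0,
        "coverage": n_sure / n_total,
    }


# Values copied from the "gpt-4o-grader-mcq-refusal-True" entry.
reported = {
    "accuracy": 0.03902439024390244,
    "precision": 0.4,
    "coverage": 0.0975609756097561,
}
recomputed = summarize(n_total=205, n_correct=8, n_sure=20)
for name, value in reported.items():
    print(f"{name}: reported={value} recomputed={recomputed[name]}")
```

Read this way, the refusal option lowers accuracy sharply (8 correct out of 205) while precision on the 20 attempted questions stays near the no-refusal level.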