- Create and activate the Conda environment:

```bash
conda env create -f environment.yml
conda activate researchcodebench
```
- Run the sanity check:

```bash
python sanity_check_tests.py
```

This script runs the unit tests for each codebase. On a correct install, you should see timings similar to:
| Repository ID | Test Time (s) |
| --- | --- |
| advantage-alignment | 0.022 |
| Diff-Transformer | 0.037 |
| DiffusionDPO | 0.004 |
| DyT | 0.016 |
| eomt | 0.390 |
| fractalgen | 0.026 |
| GMFlow | 0.107 |
| GPS | 0.013 |
| grid-cell-conformal-isometry | 0.030 |
| hyla | 1.914 |
| LEN | 0.051 |
| llm-sci-use | 2.342 |
| minp | 0.004 |
| OptimalSteps | 0.288 |
| REPA-E | 19.240 |
| schedule_free | 0.454 |
| semanticist | 0.023 |
| SISS | 0.016 |
| TabDiff | 0.014 |
| Tanh-Init | 0.001 |
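If you want to confirm that every codebase was exercised, a minimal sketch along these lines works, assuming you first redirect the script's output to a log file (the `sanity_check.log` name here is hypothetical):

```python
# Hypothetical check: verify every benchmark repository appears in the
# captured sanity-check output. Run first:
#   python sanity_check_tests.py > sanity_check.log
EXPECTED_REPOS = {
    "advantage-alignment", "Diff-Transformer", "DiffusionDPO", "DyT", "eomt",
    "fractalgen", "GMFlow", "GPS", "grid-cell-conformal-isometry", "hyla",
    "LEN", "llm-sci-use", "minp", "OptimalSteps", "REPA-E", "schedule_free",
    "semanticist", "SISS", "TabDiff", "Tanh-Init",
}

with open("sanity_check.log") as f:
    output = f.read()

# Substring match is a heuristic, but is enough to flag a repo that never ran.
missing = sorted(repo for repo in EXPECTED_REPOS if repo not in output)
print("missing repositories:", missing or "none")
```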
Before running any models, export your API keys:

```bash
export OPENAI_API_KEY="…"
export GOOGLE_API_KEY="…"
export XAI_API_KEY="…"
export DEEPSEEK_API_KEY="…"
export OPENROUTER_API_KEY="…"
```
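Before launching a run, you can confirm the keys are actually visible to the process with a short check like this sketch (which keys you need depends on the models you evaluate):

```python
import os

# Keys used by the evaluation scripts; omit any you don't need for your model subset.
REQUIRED_KEYS = [
    "OPENAI_API_KEY", "GOOGLE_API_KEY", "XAI_API_KEY",
    "DEEPSEEK_API_KEY", "OPENROUTER_API_KEY",
]

missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All API keys are set.")
```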
On any CPU-only machine:

```bash
sh run_greedy.sh
```

- By default, this evaluates all 32 models from the paper.
- To target a single model, edit `run_greedy.sh`.
Results are saved to:

```
outputs/20llms_greedy/<YYYY-MM-DD-HH-MM-SS>/
```
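Because run directories are named by timestamp, the latest run can be located programmatically; a minimal sketch, assuming at least one run has completed:

```python
from pathlib import Path

# Timestamped directory names (YYYY-MM-DD-HH-MM-SS) sort chronologically,
# so the lexicographically largest name is the most recent run.
runs = [p for p in Path("outputs/20llms_greedy").iterdir() if p.is_dir()]
latest = max(runs, key=lambda p: p.name)
print(latest)
```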
A JSON file produced by the command above is provided so you can reproduce Figure 2 (Scaled Pass@1 results).
- Ensure you have the JSON file: `outputs/20llms_greedy/2025-05-12-17-13-20/overall_stats.json` (a quick parse check is sketched below this list)
- Run the plotting script:

```bash
python visualize/main_results_blue.py --json_path outputs/20llms_greedy/2025-05-12-17-13-20/overall_stats.json
```

- The plots will be saved as:

```
outputs_main_results/model_line_rates.png
outputs_main_results/model_line_rates.pdf
```
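If the plotting script fails, a quick first step is to confirm the stats file parses as JSON. This sketch assumes only that the file is valid JSON, not any particular schema:

```python
import json
from pathlib import Path

stats_path = Path("outputs/20llms_greedy/2025-05-12-17-13-20/overall_stats.json")
with stats_path.open() as f:
    stats = json.load(f)

# Print the top-level structure to see what the file contains.
if isinstance(stats, dict):
    print("top-level keys:", list(stats))
else:
    print("top-level type:", type(stats).__name__)
```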