This is a benchmark to evaluate AI agents' ability to do fair, data-driven decision-making.
The benchmark consists of several tasks.
A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve high fairness metrics?
- To capture disparities in selection rates:
  - Disparate Impact
  - Statistical Parity Difference
- To assess differences in true positive rates:
  - Equal Opportunity Difference
- To quantify misclassification disparities:
  - Error Rate Difference
  - Error Rate Ratio
- To examine disparities in false negatives across groups:
  - False Omission Rate Difference
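These metrics follow standard definitions (as in libraries such as AIF360). A minimal plain-Python sketch, assuming a binary classifier and a binary protected attribute (0 = unprivileged group, 1 = privileged group); the function and variable names are illustrative, not taken from the benchmark's code:

```python
# Sketch of the benchmark's fairness metrics for binary labels/predictions.
# Group convention: 0 = unprivileged, 1 = privileged.

def rates(y_true, y_pred):
    """Selection rate, TPR, error rate, and false omission rate for one group."""
    n = len(y_true)
    sel = sum(y_pred) / n                                      # P(yhat = 1)
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    tpr = sum(pos) / len(pos) if pos else 0.0                  # P(yhat = 1 | y = 1)
    err = sum(t != p for t, p in zip(y_true, y_pred)) / n      # P(yhat != y)
    neg_pred = [t for t, p in zip(y_true, y_pred) if p == 0]
    fomr = sum(neg_pred) / len(neg_pred) if neg_pred else 0.0  # P(y = 1 | yhat = 0)
    return sel, tpr, err, fomr

def fairness_metrics(y_true, y_pred, group):
    unpriv = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 0]
    priv   = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 1]
    su, tu, eu, fu = rates(*zip(*unpriv))
    sp, tp, ep, fp = rates(*zip(*priv))
    return {
        "disparate_impact": su / sp if sp else float("inf"),
        "statistical_parity_difference": su - sp,
        "equal_opportunity_difference": tu - tp,
        "error_rate_difference": eu - ep,
        "error_rate_ratio": eu / ep if ep else float("inf"),
        "false_omission_rate_difference": fu - fp,
    }
```

A difference of 0 (or a ratio of 1) indicates parity between groups on that metric.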
We use a variety of open-source and paid LLMs.
- Meta's Llama-3.3-70B (open source)
- Alibaba's Qwen-2.5-72B (open source)
- OpenAI's GPT-4o (paid)
- Anthropic's Claude 3.7 Sonnet (paid)
Run baseline.sh on a task (or list of tasks) to run the baseline train.py provided in the task, so that its accuracy and fairness metrics can be compared with the values reached by the agents.
Run eval.sh with a list of tasks. eval.sh runs eval.py, which in turn runs the different levels of evaluation:
- The task's specific eval.py (to evaluate accuracy and fairness metrics)
- A Flake8-based eval that inspects the training script generated by the agent via its Python AST and checks for use of fairness libraries
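One way such a static check could work is sketched below; this is only an illustration of the idea, not the benchmark's actual eval code, and the set of library names is an assumption:

```python
import ast

# Hypothetical sketch: parse the agent-generated training script and
# report whether it imports a known fairness library. (The benchmark's
# real AST/Flake8 check may look different.)
FAIRNESS_LIBS = {"aif360", "fairlearn"}  # illustrative list

def uses_fairness_library(source: str) -> bool:
    tree = ast.parse(source)  # raises SyntaxError on invalid Python
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in FAIRNESS_LIBS
                   for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FAIRNESS_LIBS:
                return True
    return False
```

Walking the AST rather than grepping the source avoids false positives from comments or strings.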
- From the fairnessbench_analysis directory, run explode_results.py (make sure to set the result paths to the folder that eval.sh wrote to) to prepare CSV files with all the collected results
- Use the other Python scripts in the analysis folder to generate all the plots
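The flattening step can be sketched as follows. The actual result schema is whatever eval.sh writes out, so the JSON field names below ("task", "model", "metrics") are assumptions for illustration:

```python
import csv
import json
import pathlib

# Hypothetical sketch in the spirit of explode_results.py: flatten
# per-task result JSON files into one long-format CSV suitable for
# plotting. Field names are illustrative, not the real schema.
def explode_results(results_dir: str, out_csv: str) -> int:
    rows = []
    for path in sorted(pathlib.Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        for metric, value in data.get("metrics", {}).items():
            rows.append({"task": data.get("task"),
                         "model": data.get("model"),
                         "metric": metric,
                         "value": value})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["task", "model", "metric", "value"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A long format (one metric per row) keeps the downstream plotting scripts simple, since each plot can just filter on the metric column.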
- Task-specific: environment files for the task, the baseline train.py, and the dataset files
- Benchmarking infrastructure: code needed for the overall run of the benchmark, scoring, etc. (environment.py, run.py, eval-<type>.py)
- Agent: agent tools, agent prompts, etc.
- Pick a task (or list of tasks) to run from tasks.json
- Pick the LLMs you want to benchmark (make sure the required API keys are in the root directory of the app)
- Run using run_experiment.sh
- Log_dir: The path to a directory for the environment to keep the logs
- Models: The models you want to evaluate on the tasks
- Available options are:
- Paid: claude-2.1, gpt-4-0125-preview, gpt-4o-mini, gpt-4o, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
- Local: gemini-pro, llama, qwen, granite
- Local models will be downloaded into your cache if not already present; you can set the cache location with
export HF_HOME=<path_to_model>
- edit_script_model & fast_llm: LLMs used specifically for smaller actions such as editing a script or summarizing a long observation; these can optionally be different from the main agent models
- Log_dir: Directory where the LLM placed the experiment logs
- json_folder: Directory to place results in
- All tasks: List of tasks to be evaluated on
- Models: The models being evaluated on the above tasks