a benchmark for evaluating LLM Agents at Fair ML tasks

ml4sts/fairnessBench


Fairness Bench

This benchmark evaluates the capabilities of AI agents at fair, data-driven decision-making.

The benchmark consists of several tasks.

A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve strong fairness metrics?

Fairness Metrics:

  • To capture disparities:
    • Disparate Impact
    • Statistical Parity Difference
  • To assess differences in true positive rates:
    • Equal Opportunity Difference
  • To quantify misclassification disparities:
    • Error Rate Difference
    • Error Rate Ratio
  • To examine disparities in false negatives across groups:
    • False Omission Rate Difference
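To make two of these concrete, here is a minimal sketch of how Disparate Impact and Statistical Parity Difference can be computed from binary predictions and group labels. The function names and data are illustrative, not the benchmark's actual evaluation code.

```python
def selection_rate(y_pred, group, value):
    """Fraction of positive predictions within one protected group."""
    idx = [i for i, g in enumerate(group) if g == value]
    return sum(y_pred[i] for i in idx) / len(idx)

def statistical_parity_difference(y_pred, group, privileged, unprivileged):
    """P(yhat=1 | unprivileged) - P(yhat=1 | privileged); 0 is ideal."""
    return (selection_rate(y_pred, group, unprivileged)
            - selection_rate(y_pred, group, privileged))

def disparate_impact(y_pred, group, privileged, unprivileged):
    """Ratio of selection rates; 1.0 is ideal, and values below 0.8
    are often treated as evidence of disparity (the "four-fifths rule")."""
    return (selection_rate(y_pred, group, unprivileged)
            / selection_rate(y_pred, group, privileged))

# Toy example: group "a" is selected at 3/4, group "b" at 1/4.
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(statistical_parity_difference(y_pred, group, "a", "b"))  # -0.5
print(disparate_impact(y_pred, group, "a", "b"))               # ~0.333
```

The remaining metrics follow the same pattern, replacing the selection rate with the relevant per-group true positive, error, or false omission rates.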

LLMs used for the agent

We use a variety of open-source and paid LLMs.

  • Meta's Llama-3.3-70B (open source)
  • Alibaba's Qwen-2.5-72B (open source)
  • OpenAI's GPT-4o (paid)
  • Anthropic's Claude 3.7 Sonnet (paid)

Baseline

Run baseline.sh on a task (or list of tasks) to execute the baseline train.py provided with each task, then compare its accuracy and fairness metrics with the values reached by the agents.
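For intuition, a task's baseline train.py might look roughly like the sketch below: fit a plain logistic regression and report accuracy, with no fairness intervention. The synthetic data and variable names are assumptions for illustration, not the actual task files.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: in a real task this would be loaded from the dataset files.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "very simple" baseline: an out-of-the-box logistic regression.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"baseline accuracy: {accuracy:.3f}")
```

The agent's job is then to edit a script like this so the fairness metrics above improve without giving up too much accuracy.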

What does eval do?

Run eval.sh with a list of tasks. eval.sh runs eval.py, which in turn runs the different levels of evaluation:

  • The task's specific eval.py (to evaluate accuracy and fairness metrics)
  • A flake8 eval that inspects the agent-generated training script's Python AST and checks for use of fairness libraries
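The AST-level check could work along the lines of the sketch below, which parses the generated script and flags imports of fairness libraries. The library list and function name are assumptions for illustration; the benchmark's actual check may differ.

```python
import ast

# Illustrative set of fairness libraries to detect.
FAIRNESS_LIBS = {"aif360", "fairlearn"}

def uses_fairness_library(source: str) -> bool:
    """Return True if the script imports any known fairness library."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in FAIRNESS_LIBS
                   for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FAIRNESS_LIBS:
                return True
    return False

print(uses_fairness_library("from fairlearn.metrics import demographic_parity_difference"))  # True
print(uses_fairness_library("import numpy as np"))  # False
```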

Reading eval results

  • From the fairnessbench_analysis directory, run explode_results.py (making sure the result paths point to the folder eval.sh wrote to) to prepare CSV files with all the collected results
  • Use the other Python scripts in the analysis folder to generate the plots

Roles

  • Task-specific: environment files for the task, the baseline train.py, and the dataset files.
  • Benchmarking infrastructure: code needed for the overall benchmark run, scoring, etc. (environment.py, run.py, eval-<type>.py)
  • Agent: agent tools, agent prompts, etc.

Instructions for running the benchmark:

  • Pick a task (or list of tasks) to run from tasks.json
  • Pick LLMs wanted for the benchmark (make sure the required API keys are in the root directory of the app)
  • Run using run_experiment.sh

run_experiment.sh

  • Log_dir: The path to a directory for the environment to keep the logs
  • Models: The models you want to evaluate on the tasks
    • Available options are:
      • Paid: claude-2.1, gpt-4-0125-preview, gpt-4o-mini, gpt-4o, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
      • Local: gemini-pro, llama, qwen, granite
    • Local models will be downloaded into your cache if not already loaded with export HF_HOME=<path_to_model>
  • edit_script_model & fast_llm: LLMs used for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models

eval.sh

  • Log_dir: Directory where the LLM placed the experiment logs
  • json_folder: Directory to place results in
  • All tasks: List of tasks to be evaluated
  • Models: Models being evaluated on the above tasks
