a benchmark for evaluating LLM Agents at Fair ML tasks

ml4sts/fairnessBench


Fairness Bench

This benchmark evaluates the capabilities of AI agents at fair, data-driven decision-making.

The benchmark consists of several tasks.

A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve strong fairness metrics?

Fairness Metrics:

  • To capture disparities:
    • Disparate Impact
    • Statistical Parity Difference
  • To assess differences in true positive rates:
    • Equal Opportunity Difference
  • To quantify misclassification disparities:
    • Error Rate Difference
    • Error Rate Ratio
  • To examine disparities in false negatives across groups:
    • False Omission Rate Difference
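To make two of these concrete, here is a minimal sketch of how Disparate Impact and Statistical Parity Difference can be computed from binary predictions and group labels. The function names and data are illustrative, not the benchmark's actual evaluation code.

```python
def selection_rate(y_pred, group, value):
    """Fraction of positive predictions within one protected group."""
    idx = [i for i, g in enumerate(group) if g == value]
    return sum(y_pred[i] for i in idx) / len(idx)

def statistical_parity_difference(y_pred, group, privileged, unprivileged):
    """P(yhat=1 | unprivileged) - P(yhat=1 | privileged); 0 is ideal."""
    return (selection_rate(y_pred, group, unprivileged)
            - selection_rate(y_pred, group, privileged))

def disparate_impact(y_pred, group, privileged, unprivileged):
    """Ratio of selection rates; 1.0 is ideal, and values below 0.8
    are often treated as evidence of disparity (the "four-fifths rule")."""
    return (selection_rate(y_pred, group, unprivileged)
            / selection_rate(y_pred, group, privileged))

# Toy example: group "a" is selected at 3/4, group "b" at 1/4.
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(statistical_parity_difference(y_pred, group, "a", "b"))  # -0.5
print(disparate_impact(y_pred, group, "a", "b"))               # ~0.333
```

The remaining metrics follow the same pattern, replacing the selection rate with the relevant per-group true positive, error, or false omission rates.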

LLMs used for the agent

We use a variety of open-source and paid LLMs.

  • Meta's Llama-3.3-70B (open source)
  • Alibaba's Qwen-2.5-72B (open source)
  • OpenAI's GPT-4o (paid)
  • Anthropic's Claude 3.7 Sonnet (paid)

Baseline

Run baseline.sh on a task (or list of tasks) to execute the baseline train.py provided with each task, then compare its accuracy and fairness metrics with the values reached by the agents.
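For intuition, a task's baseline train.py might look roughly like the sketch below: fit a plain logistic regression and report accuracy, with no fairness intervention. The synthetic data and variable names are assumptions for illustration, not the actual task files.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: in a real task this would be loaded from the dataset files.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "very simple" baseline: an out-of-the-box logistic regression.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"baseline accuracy: {accuracy:.3f}")
```

The agent's job is then to edit a script like this so the fairness metrics above improve without giving up too much accuracy.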

What does eval do?

Run eval.sh with a list of tasks. eval.sh runs eval.py, which in turn runs the different levels of evaluation:

  • The task's specific eval.py (to evaluate accuracy and fairness metrics)
  • A flake8 eval that inspects the agent-generated training script's Python AST and checks for use of fairness libraries
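The AST-level check could work along the lines of the sketch below, which parses the generated script and flags imports of fairness libraries. The library list and function name are assumptions for illustration; the benchmark's actual check may differ.

```python
import ast

# Illustrative set of fairness libraries to detect.
FAIRNESS_LIBS = {"aif360", "fairlearn"}

def uses_fairness_library(source: str) -> bool:
    """Return True if the script imports any known fairness library."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in FAIRNESS_LIBS
                   for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FAIRNESS_LIBS:
                return True
    return False

print(uses_fairness_library("from fairlearn.metrics import demographic_parity_difference"))  # True
print(uses_fairness_library("import numpy as np"))  # False
```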

Reading eval results

  • From the fairnessbench_analysis directory, run explode_results.py (making sure the result paths point to the folder eval.sh wrote to) to prepare CSV files with all the collected results
  • Use the other Python scripts in the analysis folder to generate the plots

Roles

  • Task-specific: environment files for the task, the baseline train.py, and the dataset files.
  • Benchmarking infrastructure: code needed for the overall benchmark run, scoring, etc. (environment.py, run.py, eval-<type>.py)
  • Agent: agent tools, agent prompts, etc.

Instructions for running the benchmark:

  • Pick a task (or list of tasks) to run from tasks.json
  • Pick LLMs wanted for the benchmark (make sure the required API keys are in the root directory of the app)
  • Run using run_experiment.sh

run_experiment.sh

  • Log_dir: The path to a directory for the environment to keep the logs
  • Models: The models you want to evaluate on the tasks
    • Available options are:
      • Paid: claude-2.1, gpt-4-0125-preview, gpt-4o-mini, gpt-4o, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
      • Local: gemini-pro, llama, qwen, granite
    • Local models will be downloaded into your cache if not already loaded with export HF_HOME=<path_to_model>
  • edit_script_model & fast_llm: LLMs used for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models

eval.sh

  • Log_dir: Directory where the LLM placed the experiment logs
  • json_folder: Directory to place results in
  • All tasks: List of tasks to be evaluated
  • Models: Models being evaluated on the above tasks
