This is a modification of the AHA benchmark. Many of the original scorer instructions were irrelevant to, or overly penalized, models that have not undergone extensive post-training. Only the scorer.py file has been modified.
The Animal Harm Assessment (AHA) Benchmark is an evaluation framework designed to assess the potential for harm to nonhuman animals in text generated by large language models (LLMs). It combines questions retrieved from Reddit with synthetic questions based on the animal ethics literature.
For more information, see the preprint: https://arxiv.org/abs/2503.04804
The AHA benchmark investigates how LLM-generated text can impact animal welfare by evaluating responses to a diverse set of questions. The original question set includes:
- 1,850 curated Reddit questions (filtered for relevance to animal ethics), and
- 2,500 synthetic questions created by combining 50 animal categories with 50 ethical scenarios.
A stratified sample of 3,045 questions (70% of 4,350) is used as the public test set (data_public.json). The remaining 1,305 questions (30%) are retained privately to track and test for contamination.
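The dataset sizes above are internally consistent; a quick arithmetic check (not part of the benchmark code) confirms the split:

```python
# Sanity-check the question counts described above.
synthetic = 50 * 50            # 50 animal categories x 50 ethical scenarios
total = 1850 + synthetic       # curated Reddit questions + synthetic questions
public = total * 70 // 100     # stratified 70% public test set
private = total - public       # 30% retained privately for contamination checks

print(synthetic, total, public, private)  # 2500 4350 3045 1305
```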
The evaluation framework uses LLMs-as-judges with a three-level scoring system:
- [-1] – Increases risk of harm
- [0] – No clear effect or uncertain
- [1] – Decreases risk of harm
Full details of the scoring rubric are in scoring.py.
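As a rough illustration of how per-judge scores on the [-1, 0, 1] scale might be combined across the three default judges, consider averaging them. The `aggregate` helper below is purely illustrative and is not taken from scoring.py, which defines the benchmark's actual rubric:

```python
from statistics import mean

def aggregate(judge_scores):
    """Average per-judge scores on the [-1, 0, 1] scale.

    `judge_scores` maps a judge model name to its score for one response.
    Illustrative only; see scoring.py for the benchmark's actual logic.
    """
    return mean(judge_scores.values())

# Example: two judges see reduced risk of harm, one is uncertain.
scores = {
    "anthropic/claude-3-5-sonnet-20241022": 1,
    "google/gemini-1.5-pro-002": 0,
    "openai/gpt-4o-2024-08-06": 1,
}
print(aggregate(scores))  # ~0.67: on balance, decreases risk of harm
```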
Follow these steps in a Google Colab notebook to run the AHA benchmark:
```python
# 1. Clone the repository
!git clone https://github.com/AI-for-Animals/aha.git aha
%cd /content/aha

# 2.1 Install dependencies. If necessary, restart the session.
!pip install inspect-ai anthropic google-generativeai openai
!pip install --upgrade google-genai

# 2.2 Retrieve API keys. Three API keys are needed to use the default
# LLMs-as-judges. Specify fewer/other judges if needed (see how in 3.2).
import os
from google.colab import userdata
os.environ['ANTHROPIC_API_KEY'] = userdata.get('ANTHROPIC_API_KEY')
os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# 3. Run examples
# 3.1 A simple example. The '--run-analysis' option saves & combines results in CSV files.
# The default three LLMs-as-judges are:
# anthropic/claude-3-5-sonnet-20241022, google/gemini-1.5-pro-002, openai/gpt-4o-2024-08-06
!python aha.py --model 'anthropic/claude-3-5-haiku-20241022' --batch_size 2 --run-analysis

# 3.2 A longer example with more options explicitly set (see aha.py for all options):
!python aha.py \
  --model 'openai/gpt-4o-mini-2024-07-18' \
  --judges 'anthropic/claude-3-5-haiku-20241022,google/gemini-1.5-flash-002,openai/gpt-4o-mini-2024-07-18' \
  --batch_size 2 --num_batches 2 \
  --seed 0 --model_temperature 1 --judge_temperature 0 \
  --run-analysis

# 4. Standard evaluations (uncomment to run)
# 4.1 Small sample (default, 100 questions)
# !python aha.py --model 'anthropic/claude-3-5-haiku-20241022' --shuffle --run-analysis
# 4.2 Full sample (3045 questions)
# !python aha.py --model 'anthropic/claude-3-5-haiku-20241022' --batch_size 435 --num_batches 7 --run-analysis

# For additional analysis, including self-preference adjustments for
# LLMs-as-judges and related models:
# !pip install krippendorff
# !python results_summary_raw.py
# !python results_summary.py

# More scripts for analysis:
# !python tag_analysis.py
# !python figures_final.py        # NB! Uses hardcoded values.
# !python pairwise_comparison.py  # NB! BETA. Observed to work correctly with
#                                 # identically formatted separate input files per model.

# It can be useful to store and retrieve files remotely:
# from google.colab import drive
# drive.mount('/content/drive')
```
- aha.py – Main evaluation script.
- scoring.py – Module implementing the LLM-as-a-judge scoring function.
- analysis.py – Script for combining CSV results and analyzing benchmark outputs.
- utils.py – Shared utility functions for logging, file I/O, CSV operations, and timestamp handling.
- data_public.json – Input data.
The default LLMs-as-judges are Anthropic, Google, and OpenAI models. Dependencies: inspect-ai, anthropic, google-generativeai, openai. API keys: required for Anthropic, Google, and OpenAI.
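Since a run fails partway through if a judge's key is absent, it can help to check the environment first. A minimal sketch (the `missing_keys` helper is not part of the repo); trim the `required` tuple if you run fewer judges:

```python
import os

def missing_keys(required=("ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "OPENAI_API_KEY")):
    """Return the names of any judge API keys not set in the environment."""
    return [k for k in required if not os.environ.get(k)]

# Run this before launching aha.py to fail fast on a missing key.
if missing_keys():
    print("Warning: missing API keys:", ", ".join(missing_keys()))
```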
This project is licensed under the MIT License.
Development of this benchmark has been supported by Hive / AI for Animals. The setup is built on the UK AI Safety Institute's Inspect Evals.