holi-lab/psychometric-contamination

Quantifying Data Contamination in Psychometric Evaluations of LLMs

EACL 2026 Findings

Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo

Paper PDF

TL;DR

LLMs recognize psychometric inventory items and even understand what they measure. We systematically quantify data contamination on psychometric inventories and provide evidence that it occurs.

  • LLMs exhibit prior exposure to psychometric inventories
  • Contamination checks should precede psychometric evaluation of LLMs
  • Larger models tend to exhibit stronger contamination patterns

This repository provides the code for the contamination checks described in the paper. It lets readers inspect task prompts, choose a model and questionnaire, and run the method themselves.

Quick Start

Install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a local .env file for live API runs:

OPENAI_API_KEY=...
OPENROUTER_API_KEY=...

The semantic item-memorization task uses OPENAI_API_KEY for the gpt-5.2-mini refusal filter and text-embedding-3-large embeddings, even when the generation model is called through OpenRouter.
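Because the semantic task needs OPENAI_API_KEY regardless of the generation provider, it can help to check the .env file before a live run. The following is a minimal sketch, not part of the repository: the helper names (parse_env_text, required_keys, missing_keys) are hypothetical, and the mapping from provider to key for tasks other than semantic is an assumption.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def required_keys(task: str, uses_openrouter: bool) -> set[str]:
    """Keys a run needs (assumed mapping; only the semantic rule is from the README)."""
    keys = {"OPENROUTER_API_KEY"} if uses_openrouter else {"OPENAI_API_KEY"}
    if task == "semantic":
        # Refusal filter and embeddings always go through the OpenAI API.
        keys.add("OPENAI_API_KEY")
    return keys


def missing_keys(env: dict[str, str], required: set[str]) -> list[str]:
    """Return the required keys that are absent or empty, sorted by name."""
    return sorted(k for k in required if not env.get(k))


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    print("missing:", missing_keys(env, required_keys("semantic", True)))
```

Running the check before a semantic run through OpenRouter reports both keys if either is missing, so a failure surfaces before any API call is made.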

Inspect the default run plan without making API calls:

python main.py --dry-run

Run one small task:

python main.py --tasks keyitem --questionnaires bfi44 --models gpt-4o-mini

List supported task names:

python main.py --list-tasks

Generated outputs are written to outputs/, which is ignored by git.

Equivalent Make targets are available:

make dry-run
make list-tasks
make test

Method Tasks

main.py is the user-facing launcher. It reads configs/example_method_config.yaml, accepts CLI overrides, groups requested task names, and calls the matching task module under method_tasks/.

Supported task groups:

  • item_memorization: semantic, keyitem
  • evaluation_memorization: recognition, option_score
  • target_score_matching: target_score_per_item
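The grouping step can be sketched as follows. This is an illustration under stated assumptions, not the launcher's actual code: the TASK_GROUPS table is copied from the list above, while the function name group_tasks and its error handling are hypothetical.

```python
# Task groups and task names as listed in this README.
TASK_GROUPS = {
    "item_memorization": ("semantic", "keyitem"),
    "evaluation_memorization": ("recognition", "option_score"),
    "target_score_matching": ("target_score_per_item",),
}


def group_tasks(requested: list[str]) -> dict[str, list[str]]:
    """Map requested task names to their groups, preserving request order."""
    task_to_group = {
        task: group for group, tasks in TASK_GROUPS.items() for task in tasks
    }
    unknown = [task for task in requested if task not in task_to_group]
    if unknown:
        raise ValueError(f"unknown task(s): {', '.join(unknown)}")
    grouped: dict[str, list[str]] = {}
    for task in requested:
        grouped.setdefault(task_to_group[task], []).append(task)
    return grouped


if __name__ == "__main__":
    print(group_tasks(["keyitem", "recognition", "semantic"]))
```

Grouping first means each task module is invoked once with all of its requested tasks, rather than once per task name.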

The task implementations and supporting scripts are:

  • method_tasks/item_memorization/main.py
  • method_tasks/evaluation_memorization/main.py
  • method_tasks/target_score_matching/main.py
  • scripts/summarize_results.py: aggregates repeated run outputs into mean metrics with 95% confidence intervals.
  • scripts/compute_semantic_baseline.py: computes the dimension-description semantic baseline.
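To illustrate what aggregating repeated runs into a mean with a 95% confidence interval involves, here is a minimal sketch. It assumes a normal approximation (z = 1.96) over per-run metric values; scripts/summarize_results.py may use a different interval (for example a t-distribution or a bootstrap), and the function name mean_ci95 is hypothetical.

```python
import math
import statistics


def mean_ci95(values: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation 95% CI."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one run")
    mean = statistics.fmean(values)
    if n < 2:
        # A single run has no spread to estimate; report a zero-width interval.
        return mean, mean, mean
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical metric values from three repeated runs.
    mean, low, high = mean_ci95([0.71, 0.68, 0.74])
    print(f"{mean:.3f} [{low:.3f}, {high:.3f}]")
```

With only a handful of repeated runs, the normal approximation understates the interval width, so a t-based interval would be the more conservative choice.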

Each task module reads questionnaire files from data/ and prompt templates from its own method_tasks/*/prompts/ directory.

Repository Layout

  • main.py: user-facing launcher for the method.
  • method_tasks/: task implementations and prompt templates.
  • configs/example_method_config.yaml: minimal runnable config.
  • configs/*_config.yaml: larger provider/model config examples.
  • data/: questionnaire JSON files used by the tasks, plus data/custom_example.json and data/README.md for adding new inventories.
  • scripts/: result summarization and baseline utilities.
  • tests/: lightweight checks for the launcher and repository layout.

Notes

The questionnaire inventories may have their own redistribution terms. Confirm that your intended use is compatible with those terms before republishing or running large-scale experiments.

The repository source code is released under the MIT License. Questionnaire item text in data/*.json remains subject to the original instruments' terms; see DATA_LICENSE.md.

Citation

If you use this code, please cite the associated paper:

@inproceedings{han-etal-2026-quantifying,
    title = "Quantifying Data Contamination in Psychometric Evaluations of {LLM}s",
    author = "Han, Jongwook  and
      Song, Woojung  and
      Lee, Jonggeun  and
      Jo, Yohan",
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-eacl.319/",
    pages = "6070--6088",
}
