Quantifying Data Contamination in Psychometric Evaluations of LLMs

EACL 2026 Findings

Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo

TL;DR

LLMs recognize psychometric inventory items and even understand what they measure. We systematically quantify this data contamination and provide evidence of it across psychometric inventories.

LLMs exhibit prior exposure to psychometric inventories
Contamination checks should precede psychometric evaluation of LLMs
Larger models tend to exhibit stronger contamination patterns

This repository provides the code for the contamination checks described in the paper. It lets readers inspect the task prompts, choose a model and questionnaire, and run the method themselves.

Quick Start

Install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a local .env file for live API runs:

OPENAI_API_KEY=...
OPENROUTER_API_KEY=...

The semantic item-memorization task uses OPENAI_API_KEY for the gpt-5.2-mini refusal filter and text-embedding-3-large embeddings, even when the generation model is called through OpenRouter.
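The key requirement above can be checked before launching a run. The following is a minimal sketch, not code from this repository: the function names and the provider labels "openai" and "openrouter" are hypothetical, and it assumes that semantic is the only task that needs OPENAI_API_KEY regardless of the generation provider.

```python
import os

# Assumption: "semantic" is the only task that always needs the OpenAI key
# (for the refusal filter and the embeddings), whatever the generation provider.
SEMANTIC_TASK = "semantic"

# Hypothetical provider labels mapped to the environment variable each one needs.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def required_keys(tasks, provider):
    """Return the set of environment variable names a run would need."""
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"Unknown provider: {provider!r}")
    keys = {PROVIDER_KEYS[provider]}
    if SEMANTIC_TASK in tasks:
        keys.add("OPENAI_API_KEY")
    return keys


def missing_keys(tasks, provider, env=None):
    """Return the sorted list of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return sorted(k for k in required_keys(tasks, provider) if not env.get(k))
```

For example, missing_keys(["semantic"], "openrouter") reports OPENAI_API_KEY as missing when only OPENROUTER_API_KEY is set, which matches the note above.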

Inspect the default run plan without making API calls:

python main.py --dry-run

Run one small task:

python main.py --tasks keyitem --questionnaires bfi44 --models gpt-4o-mini

List supported task names:

python main.py --list-tasks

Generated outputs are written to outputs/, which is ignored by git.

Equivalent Make targets are available:

make dry-run
make list-tasks
make test

Method Tasks

main.py is the user-facing launcher. It reads configs/example_method_config.yaml, accepts CLI overrides, groups requested task names, and calls the matching task module under method_tasks/.

Supported task groups:

item_memorization: semantic, keyitem
evaluation_memorization: recognition, option_score
target_score_matching: target_score_per_item
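
The grouping step can be pictured as a lookup from task name to task group. This is a hedged sketch of that idea using the groups listed above; the TASK_GROUPS constant and the group_tasks function are illustrative names, and the launcher's actual internals may differ.

```python
# Task groups as listed in this README; the names below are illustrative only.
TASK_GROUPS = {
    "item_memorization": ["semantic", "keyitem"],
    "evaluation_memorization": ["recognition", "option_score"],
    "target_score_matching": ["target_score_per_item"],
}


def group_tasks(requested):
    """Map requested task names to their groups, preserving request order."""
    task_to_group = {
        task: group for group, tasks in TASK_GROUPS.items() for task in tasks
    }
    grouped = {}
    for name in requested:
        if name not in task_to_group:
            raise ValueError(f"Unknown task: {name!r}")
        grouped.setdefault(task_to_group[name], []).append(name)
    return grouped


if __name__ == "__main__":
    # Two item-memorization tasks and one evaluation-memorization task.
    print(group_tasks(["keyitem", "recognition", "semantic"]))
```

Grouping first means each task module only needs to be invoked once per run, with the list of tasks that belong to it.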

The task implementations and supporting scripts are:

method_tasks/item_memorization/main.py
method_tasks/evaluation_memorization/main.py
method_tasks/target_score_matching/main.py
scripts/summarize_results.py: aggregates repeated run outputs into mean metrics with 95% confidence intervals.
scripts/compute_semantic_baseline.py: computes the dimension-description semantic baseline.
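
To illustrate what a mean with a 95% confidence interval over repeated runs looks like, here is a minimal sketch. It is not the code in scripts/summarize_results.py: the function name is hypothetical, and it assumes a normal approximation to the sampling distribution, whereas the actual script may use a different interval (for example a t-based one).

```python
import math
import statistics


def mean_with_ci(values, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Assumption: the runs are independent and the normal approximation is
    acceptable; with very few runs a t-based interval would be wider.
    """
    if len(values) < 2:
        raise ValueError("Need at least two runs to estimate an interval")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * sem


if __name__ == "__main__":
    # Illustrative metric values from three repeated runs (made-up numbers).
    mean, half_width = mean_with_ci([0.8, 0.9, 1.0])
    print(f"{mean:.3f} +/- {half_width:.3f}")
```

The interval is reported as the mean plus or minus the half width, so a summary row can show a single metric value together with its uncertainty.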

Each task module reads questionnaire files from data/ and prompt templates from its own method_tasks/*/prompts/ directory.

Repository Layout

main.py: user-facing launcher for the method.
method_tasks/: task implementations and prompt templates.
configs/example_method_config.yaml: minimal runnable config.
configs/*_config.yaml: larger provider/model config examples.
data/: questionnaire JSON files used by the tasks, plus data/custom_example.json and data/README.md for adding new inventories.
scripts/: result summarization and baseline utilities.
tests/: lightweight checks for the launcher and repository layout.

Notes

The questionnaire inventories may have their own redistribution terms. Confirm that your intended use is compatible with those terms before republishing or running large-scale experiments.

The repository source code is released under the MIT License. Questionnaire item text in data/*.json remains subject to the original instruments' terms; see DATA_LICENSE.md.

Citation

If you use this code, please cite the associated paper:

@inproceedings{han-etal-2026-quantifying,
    title = "Quantifying Data Contamination in Psychometric Evaluations of {LLM}s",
    author = "Han, Jongwook  and
      Song, Woojung  and
      Lee, Jonggeun  and
      Jo, Yohan",
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-eacl.319/",
    pages = "6070--6088",
}
