EACL 2026 Findings
Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo
LLMs recognize psychometric inventory items and even understand what they measure. We systematically quantify this data contamination and provide evidence that it affects psychometric inventories.
- LLMs exhibit prior exposure to psychometric inventories
- Contamination checks should precede psychometric evaluation of LLMs
- Larger models tend to exhibit stronger contamination patterns
This repository provides the code for the contamination checks described in the paper. It lets readers inspect task prompts, choose a model and questionnaire, and run the method on their own.
Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Create a local .env file for live API runs:
OPENAI_API_KEY=...
OPENROUTER_API_KEY=...
The semantic item-memorization task uses OPENAI_API_KEY for the
gpt-5.2-mini refusal filter and text-embedding-3-large embeddings, even
when the generation model is called through OpenRouter.
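The key requirement above can be checked before starting a live run. The following is a minimal sketch, not part of the repository: the function names `required_keys` and `missing_keys` are hypothetical, and the rule that non-OpenRouter runs need OPENAI_API_KEY is an assumption. Only the two variable names and the semantic-task rule come from the text above.

```python
import os


def required_keys(task: str, via_openrouter: bool) -> set:
    """Return the API key names a run is assumed to need (hypothetical helper)."""
    # Assumption: the generation model needs exactly one provider key.
    keys = {"OPENROUTER_API_KEY" if via_openrouter else "OPENAI_API_KEY"}
    # From the README: the semantic task always uses OPENAI_API_KEY for the
    # refusal filter and the embeddings, even when generating via OpenRouter.
    if task == "semantic":
        keys.add("OPENAI_API_KEY")
    return keys


def missing_keys(task: str, via_openrouter: bool, env=None) -> list:
    """List required key names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return sorted(k for k in required_keys(task, via_openrouter) if not env.get(k))


if __name__ == "__main__":
    # Example: a semantic run through OpenRouter with only the OpenRouter key set.
    print(missing_keys("semantic", True, {"OPENROUTER_API_KEY": "placeholder"}))
```

Passing a plain dictionary as `env` keeps the check testable without touching the real environment.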
Inspect the default run plan without making API calls:
python main.py --dry-run
Run one small task:
python main.py --tasks keyitem --questionnaires bfi44 --models gpt-4o-mini
List supported task names:
python main.py --list-tasks
Generated outputs are written to outputs/, which is ignored by git.
Equivalent Make targets are available:
make dry-run
make list-tasks
make test
main.py is the user-facing launcher. It reads configs/example_method_config.yaml,
accepts CLI overrides, groups requested task names, and calls the matching task
module under method_tasks/.
Supported task groups:
- item_memorization: semantic, keyitem
- evaluation_memorization: recognition, option_score
- target_score_matching: target_score_per_item
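To illustrate how requested task names might be grouped before dispatch, here is a minimal sketch. It uses only the group and task names listed above. The `TASK_GROUPS` constant and the `group_tasks` function are hypothetical and are not taken from main.py, whose actual logic may differ.

```python
from collections import defaultdict

# Task groups and task names as listed in this README.
TASK_GROUPS = {
    "item_memorization": ["semantic", "keyitem"],
    "evaluation_memorization": ["recognition", "option_score"],
    "target_score_matching": ["target_score_per_item"],
}


def group_tasks(requested):
    """Map requested task names to their groups, preserving request order."""
    # Invert the table so each task name points at its group.
    task_to_group = {
        task: group for group, tasks in TASK_GROUPS.items() for task in tasks
    }
    grouped = defaultdict(list)
    for name in requested:
        if name not in task_to_group:
            raise ValueError(f"Unknown task name: {name!r}")
        grouped[task_to_group[name]].append(name)
    return dict(grouped)


if __name__ == "__main__":
    print(group_tasks(["keyitem", "recognition", "semantic"]))
```

Each key of the returned dictionary would then correspond to one task module under method_tasks/, called once with its list of task names.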
The task implementations are:
- method_tasks/item_memorization/main.py
- method_tasks/evaluation_memorization/main.py
- method_tasks/target_score_matching/main.py
- scripts/summarize_results.py: aggregates repeated run outputs into mean metrics with 95% confidence intervals.
- scripts/compute_semantic_baseline.py: computes the dimension-description semantic baseline.
Each task module reads questionnaire files from data/ and prompt templates
from its own method_tasks/*/prompts/ directory.
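The aggregation attributed to scripts/summarize_results.py above (mean metrics with 95% confidence intervals over repeated runs) can be illustrated with a short sketch. This is an assumption-laden example, not the script's code: it uses a normal approximation (z = 1.96) on the standard error of the mean, whereas the repository may use a t-distribution, a bootstrap, or another method. The function name `mean_with_ci95` is hypothetical.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(values)
    if n < 2:
        raise ValueError("At least two runs are needed to estimate an interval.")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    # Assumption: z = 1.96; small run counts may call for a t critical value.
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical metric values from five repeated runs.
    mean, lower, upper = mean_with_ci95([0.81, 0.79, 0.83, 0.80, 0.82])
    print(f"{mean:.3f} [{lower:.3f}, {upper:.3f}]")
```

With only a handful of repeated runs, the normal approximation gives intervals that are somewhat too narrow, so a t-based interval would be the more conservative choice.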
- main.py: user-facing launcher for the method.
- method_tasks/: task implementations and prompt templates.
- configs/example_method_config.yaml: minimal runnable config.
- configs/*_config.yaml: larger provider/model config examples.
- data/: questionnaire JSON files used by the tasks, plus data/custom_example.json and data/README.md for adding new inventories.
- scripts/: result summarization and baseline utilities.
- tests/: lightweight checks for the launcher and repository layout.
The questionnaire inventories may have their own redistribution terms. Confirm that your intended use is compatible with those terms before republishing or running large-scale experiments.
The repository source code is released under the MIT License. Questionnaire
item text in data/*.json remains subject to the original instruments' terms;
see DATA_LICENSE.md.
If you use this code, please cite the associated paper:
@inproceedings{han-etal-2026-quantifying,
title = "Quantifying Data Contamination in Psychometric Evaluations of {LLM}s",
author = "Han, Jongwook and
Song, Woojung and
Lee, Jonggeun and
Jo, Yohan",
booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.findings-eacl.319/",
pages = "6070--6088",
}