This repository contains the runnable evaluation scripts for the MMErroR benchmark. It is the maintained, config-driven evaluation bundle for the two benchmark tasks:
- ETC: Error Type Classification
- EPD: Error Presence Detection
MMErroR evaluates whether vision-language models can identify erroneous reasoning, rather than only produce correct answers. In ETC, the model is told that an error exists and must classify its type. In EPD, the model must first decide whether an error is present and only then diagnose the type if needed.
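The difference between the two tasks can be illustrated with a small sketch. The function names and the correctness rule below are assumptions made for illustration only; they are not taken from the evaluator's scoring code, which may differ.

```python
from typing import Optional


def etc_is_correct(pred_type: str, gold_type: str) -> bool:
    """ETC: an error is known to exist, so only the predicted type is judged."""
    return pred_type == gold_type


def epd_is_correct(
    pred_has_error: bool,
    pred_type: Optional[str],
    gold_has_error: bool,
    gold_type: Optional[str],
) -> bool:
    """EPD: judge the presence decision first, then the type if an error exists.

    Hypothetical rule: a wrong presence decision is always counted as wrong.
    """
    if pred_has_error != gold_has_error:
        return False
    if not gold_has_error:
        # Correctly reported "no error", so there is no type to diagnose.
        return True
    return pred_type == gold_type
```

Under this assumed rule, an EPD prediction only earns credit for the error type after it has detected the error.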
- mmerror_eval/: main config-driven evaluator
- prompts/: prompt templates for ETC and EPD
- run.py: primary CLI entrypoint
- eval_config.yaml: default evaluation config
- smoke_test.yaml: local smoke test config using a mock provider
- smoke_test_data/: bundled minimal samples for the local smoke test
- auto_eval_all_models.py: batch runner for multiple models and tasks
- main.sh: shell wrapper for batch evaluation
- legacy/: older evaluation scripts kept for reference
This repository intentionally excludes dataset assets, plotting scripts, paper sources, and historical result files.
The dataset is hosted on Hugging Face as the dataset repository s-u-do/mmerror-benchmark.
The evaluator needs two folders:
- data/jsons/
- data/images/
Default layout expected by the example commands below:
<parent-of-this-repo>/
+-- MMerroR-Eval/
`-- data/
    +-- jsons/
    |   +-- MMErroR_00001.json
    |   `-- ...
    `-- images/
        +-- MMErroR_00001.png
        `-- ...
If you keep the downloaded Hugging Face snapshot as-is, the dataset layout will be:
mmerror-benchmark/
`-- data/
    +-- jsons/
    `-- images/
In that case, pass those exact folders with --data-dir and --image-dir.
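Before starting a long run, it can help to confirm that both folders exist and that the samples line up. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes each JSON file has an image with the same file stem, as the example layout above suggests.

```python
from pathlib import Path


def resolve_dataset_dirs(root):
    """Return (json_dir, image_dir) under root/data, or raise if either is missing."""
    root = Path(root)
    json_dir = root / "data" / "jsons"
    image_dir = root / "data" / "images"
    for folder in (json_dir, image_dir):
        if not folder.is_dir():
            raise FileNotFoundError(f"expected dataset folder not found: {folder}")
    return json_dir, image_dir


def find_unpaired_samples(json_dir, image_dir):
    """List JSON stems that have no image file with the same stem.

    Assumption: one image per sample, named after the JSON file.
    """
    image_stems = {p.stem for p in Path(image_dir).iterdir() if p.is_file()}
    return sorted(
        p.stem for p in Path(json_dir).glob("*.json") if p.stem not in image_stems
    )
```

Calling resolve_dataset_dirs("..") covers the default layout, and resolve_dataset_dirs("../mmerror-benchmark") covers an unmodified snapshot. The returned paths are the values to pass to --data-dir and --image-dir.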
Example download with huggingface_hub:
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='s-u-do/mmerror-benchmark', repo_type='dataset', local_dir='../mmerror-benchmark')"

Then either:
- Move ..\mmerror-benchmark\data\jsons and ..\mmerror-benchmark\data\images to ..\data\jsons and ..\data\images, so the default commands work unchanged.
- Or run the evaluator directly on the downloaded snapshot:

python .\run.py --data-dir ..\mmerror-benchmark\data\jsons --image-dir ..\mmerror-benchmark\data\images --task-mode epd

Install the dependencies and create a local .env file from the bundled example:

python -m pip install -r requirements.txt
Copy-Item .\.env.example .\.env

Set your API credentials in .env or through environment variables:
$env:MMERROR_API_BASE="https://your-endpoint/v1/chat/completions"
$env:MMERROR_API_KEY="your-key"

Run a single task with the main config-driven evaluator:

python .\run.py --data-dir ..\data\jsons --image-dir ..\data\images --task-mode epd

Supported task modes:
- etc: Error Type Classification
- epd: Error Presence Detection
Useful overrides:
- --config: choose a different YAML config
- --models: run only selected models from the config
- --output-dir: change where reports are written
- --limit: evaluate only the first N samples
- --env-file: load credentials from a specific .env file
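When the evaluator is driven from another Python script, the flags above can be assembled programmatically. This is a minimal sketch with a hypothetical helper name. It uses only the flags listed in this README and passes each value through as a single string. The exact value format that --models expects is not stated here, so treat that part as an assumption.

```python
import sys


def build_run_command(data_dir, image_dir, task_mode, config=None, models=None,
                      output_dir=None, limit=None, env_file=None):
    """Build an argument list for run.py from the documented flags."""
    if task_mode not in ("etc", "epd"):
        raise ValueError(f"unsupported task mode: {task_mode!r}")
    cmd = [
        sys.executable, "run.py",
        "--data-dir", str(data_dir),
        "--image-dir", str(image_dir),
        "--task-mode", task_mode,
    ]
    optional = {
        "--config": config,
        "--models": models,
        "--output-dir": output_dir,
        "--limit": limit,
        "--env-file": env_file,
    }
    # Only append the overrides that were actually provided.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the repository root. For example, build_run_command("../data/jsons", "../data/images", "epd", limit=5) produces a quick five-sample run.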
You can also run the packaged smoke test without any API calls or external dataset download:
python .\run.py --config .\smoke_test.yaml

To evaluate several models and tasks in one batch, use auto_eval_all_models.py:

python .\auto_eval_all_models.py ^
--data-dir ..\data\jsons ^
--image-dir ..\data\images ^
--api-base "https://your-endpoint/v1/chat/completions" ^
--key "your-key" ^
--models "gpt-5.2,claude-opus-4.5"

Or with environment variables:
$env:MMERROR_API_BASE="https://your-endpoint/v1/chat/completions"
$env:MMERROR_API_KEY="your-key"
python .\auto_eval_all_models.py --data-dir ..\data\jsons --image-dir ..\data\images

For shell environments:

MMERROR_API_BASE="https://your-endpoint/v1/chat/completions" MMERROR_API_KEY="your-key" sh main.sh

API_BASE / API_KEY are still accepted as compatibility aliases, but MMERROR_API_BASE / MMERROR_API_KEY are the canonical names.
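The canonical and alias variable names can be resolved with a small lookup like the one below. This is a sketch, not the evaluator's own code. The helper name is hypothetical, and letting the canonical names win when both are set is an assumption; the README only states which names are canonical.

```python
import os


def resolve_credentials(env=None):
    """Return (api_base, api_key), preferring the MMERROR_* names over the aliases."""
    env = os.environ if env is None else env
    # Assumption: canonical names take precedence over the compatibility aliases.
    api_base = env.get("MMERROR_API_BASE") or env.get("API_BASE")
    api_key = env.get("MMERROR_API_KEY") or env.get("API_KEY")
    missing = [
        name
        for name, value in (("MMERROR_API_BASE", api_base), ("MMERROR_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return api_base, api_key
```

Passing a plain dictionary instead of reading os.environ makes the lookup easy to check without touching the real environment.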
- ETC results are written to ../result/ETC/
- EPD results are written to ../result/EPD/
- Each model gets a latest.json report
- Each task also gets summary.json and summary.md
Use --output-dir to override the default result root.
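To collect the per-model reports after a run, a recursive search avoids depending on exactly where each latest.json sits inside the task folder, which this README does not specify. The helper below is a hypothetical sketch. It assumes only that the reports are valid JSON files named latest.json under the task folder, and it makes no assumption about their fields.

```python
import json
from pathlib import Path


def load_latest_reports(result_root, task):
    """Map each latest.json path (relative to the task folder) to its parsed content."""
    task_dir = Path(result_root) / task.upper()
    if not task_dir.is_dir():
        return {}
    reports = {}
    for path in sorted(task_dir.rglob("latest.json")):
        with path.open(encoding="utf-8") as handle:
            # Use POSIX-style keys so the result looks the same on every platform.
            reports[path.relative_to(task_dir).as_posix()] = json.load(handle)
    return reports
```

For the default result root, load_latest_reports("../result", "epd") returns every EPD report that has been written so far.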
- The main evaluator in mmerror_eval/ is the maintained path.
- The scripts under legacy/ preserve older behavior and may require more explicit arguments.
- This repository only provides the runnable evaluation code; it does not bundle the dataset itself.
We thank the authors of the following benchmarks for making their datasets publicly available:
If you find MMErroR helpful, please cite:
@misc{shi2026mmerror,
title={MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models},
author={Yang Shi and Yifeng Xie and Minzhe Guo and Liangsi Lu and Mingxuan Huang and Jingchao Wang and Zhihong Zhu and Boyan Xu and Zhiqi Huang},
year={2026},
eprint={2601.03331},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.03331}
}