EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
A multi-modal question answering dataset that combines structured Electronic Health Records (EHRs) and chest X-ray images, designed to facilitate joint reasoning across imaging and table modalities in EHR Question Answering (QA) systems.
Electronic Health Records (EHRs) contain patients' medical histories in various multi-modal formats, yet the potential for joint reasoning across imaging and table modalities remains underexplored in current EHR Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop our dataset, we first construct two uni-modal resources: 1) the MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we construct a multi-modal EHR QA dataset that necessitates both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. This pioneering endeavor enhances engagement with multi-modal EHR sources, and we believe our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research.
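To make the NeuralSQL idea concrete, below is a minimal, hypothetical sketch of how a query with an embedded VQA call could be resolved by dispatching the image question to an external VQA API and splicing the answer back into plain SQL. The `FUNC_VQA` operator name, the `vqa_api` stub, and the literal study-ID argument are illustrative assumptions for this sketch, not the repository's actual interface.

```python
import re
import sqlite3

def vqa_api(question: str, study_id: int) -> str:
    """Stand-in for an external VQA model queried on the chest X-ray
    image linked to `study_id`; a real system would run a VQA model
    here instead of returning a canned answer."""
    return "yes"

def execute_neural_sql(query: str, conn: sqlite3.Connection):
    """Toy executor: replace each FUNC_VQA("question", study_id) call
    with the (quoted) VQA answer, then run the remaining plain SQL."""
    pattern = r'FUNC_VQA\("([^"]+)",\s*(\d+)\)'

    def resolve(match: re.Match) -> str:
        answer = vqa_api(match.group(1), int(match.group(2)))
        return f"'{answer}'"  # splice the answer back into the SQL text

    return conn.execute(re.sub(pattern, resolve, query)).fetchall()

# Example: a multi-modal question that needs both the image and the tables.
conn = sqlite3.connect(":memory:")
print(execute_neural_sql(
    'SELECT FUNC_VQA("is there any anatomical finding?", 55411)', conn
))
```

In the full dataset, such calls typically reference image identifiers selected by the surrounding SQL rather than literal IDs, so real execution is more involved than this one-shot text substitution.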
- [07/24/2024] We released the EHRXQA dataset on PhysioNet.
- [12/12/2023] We presented our research work at NeurIPS 2023 Datasets and Benchmarks Track as a poster.
- [10/28/2023] We released our research paper on arXiv.
- Python 3.12+
- PhysioNet credentialed account with signed DUAs for:
  - MIMIC-IV
  - MIMIC-CXR-JPG
  - Chest ImaGenome
New to PhysioNet? Click to see credentialing instructions
- Register for a PhysioNet account
- Follow the credentialing instructions
- Complete the CITI Data or Specimens Only Research training course
- Sign the DUA for each required dataset listed above
Using UV (Recommended)
```bash
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

# Create environment and install dependencies
uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install pandas tqdm scikit-learn dask
```

Using Conda
```bash
# Clone repository
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

# Create environment and install dependencies
conda create -n ehrxqa python=3.12
conda activate ehrxqa
pip install pandas tqdm scikit-learn dask
```

Then run the dataset build script:

```bash
bash build_dataset.sh
```

When prompted, enter your PhysioNet credentials (the password will not be displayed).
What this script does:
- Downloads source datasets from PhysioNet (MIMIC-CXR-JPG, Chest ImaGenome, MIMIC-IV)
- Preprocesses the datasets
- Constructs integrated database (MIMIC-IV + MIMIC-CXR)
- Generates complete EHRXQA dataset with ground-truth answers
Dataset Structure
```
ehrxqa/
└── dataset/
    ├── _train.json   # Pre-release (without answers)
    ├── _valid.json   # Pre-release (without answers)
    ├── _test.json    # Pre-release (without answers)
    ├── train.json    # Generated after running script
    ├── valid.json    # Generated after running script
    └── test.json     # Generated after running script
```
Pre-release files (_*.json) are intentionally incomplete to safeguard privacy. Complete files with answers are generated after running the reproduction script with valid PhysioNet credentials.
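Once the build script finishes, a quick sanity check is to load each split and confirm that answers are present. The sketch below assumes each complete split file is a JSON array of objects containing the fields described in the schema that follows; adjust `DATASET_DIR` if you run it from outside the repository root.

```python
import json
from pathlib import Path

DATASET_DIR = Path("dataset")  # assumed location, relative to the repo root

for split in ("train", "valid", "test"):
    path = DATASET_DIR / f"{split}.json"
    if not path.exists():
        print(f"{split}: not generated yet (run build_dataset.sh first)")
        continue
    samples = json.loads(path.read_text())
    # Complete files should carry an answer for every sample;
    # the pre-release _*.json files do not.
    missing = sum(1 for s in samples if not s.get("answer"))
    print(f"{split}: {len(samples)} samples, {missing} missing answers")
```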
Dataset Schema
Each QA sample is a JSON object with the following fields:
Core Fields:
- `db_id`: Database ID
- `split`: Dataset split (train/valid/test)
- `id`: Unique instance identifier
- `question`: Natural language question
- `answer`: Answer string (generated by script)
Template Fields:
- `template`: Question template with database values
- `query`: NeuralSQL/SQL query
- `value`: Key-value pairs from database
Tag Fields:
- `q_tag`: Question template structure
- `t_tag`: Time templates
- `o_tag`: Operational values
- `v_tag`: Visual values (object, category, attribute)
- `tag`: Synthesized tag
Metadata:
- `para_type`: Paraphrase source (machine/GPT-4)
- `is_impossible`: Whether the question is answerable
Example:
```json
{
  "db_id": "mimic_iv_cxr",
  "split": "train",
  "id": 0,
  "question": "how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?",
  "template": "how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?",
  "query": "select 1 * ( strftime('%J',current_time) - strftime('%J',t1.studydatetime) ) from ...",
  "answer": "42"
}
```
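As a small usage example of the schema above, the snippet below reports the paraphrase-source distribution and filters out unanswerable questions from the training split. The file path and the list-of-objects layout are assumptions consistent with the structure shown earlier.

```python
import json
from collections import Counter

with open("dataset/train.json") as f:  # generated by build_dataset.sh
    train = json.load(f)

# Distribution of paraphrase sources (machine vs. GPT-4)
print(Counter(sample["para_type"] for sample in train))

# Keep only answerable questions for downstream use
answerable = [s for s in train if not s.get("is_impossible", False)]
print(f"{len(answerable)} answerable out of {len(train)} total samples")
```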
Current: v1.0.0

This project uses semantic versioning. For detailed changes, see the CHANGELOG.
If you use the EHRXQA dataset, please cite the following:
```bibtex
@article{bae2023ehrxqa,
  title={EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
  journal={arXiv preprint arXiv:2310.18652},
  year={2023}
}
```

The code in this repository is provided under the terms of the MIT License. The final output dataset (EHRXQA) is subject to the terms and conditions of the original PhysioNet datasets: the MIMIC-CXR-JPG License, the Chest ImaGenome License, and the MIMIC-IV License.
For questions or concerns regarding this dataset, please contact:
- Seongsu Bae (seongsu@kaist.ac.kr)
- Daeun Kyung (kyungdaeun@kaist.ac.kr)