EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
A multi-modal question answering dataset that combines structured Electronic Health Records (EHRs) and chest X-ray images, designed to facilitate joint reasoning across imaging and table modalities in EHR Question Answering (QA) systems.
Electronic Health Records (EHRs) contain patients' medical histories in various multi-modal formats, yet the potential for joint reasoning across imaging and table modalities remains underexplored in current EHR Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop our dataset, we first construct two uni-modal resources: 1) the MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we construct a multi-modal EHR QA dataset that necessitates both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. This pioneering endeavor enhances engagement with multi-modal EHR sources, and we believe our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research.
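To make the NeuralSQL idea concrete, below is a minimal, hypothetical sketch of how a query with an embedded VQA call could be resolved by dispatching the image question to an external VQA API and splicing the answer back into plain SQL. The `FUNC_VQA` operator name, the `vqa_api` stub, and the literal study-ID argument are illustrative assumptions for this sketch, not the repository's actual interface.

```python
import re
import sqlite3

def vqa_api(question: str, study_id: int) -> str:
    """Stand-in for an external VQA model queried on the chest X-ray
    image linked to `study_id`; a real system would run a VQA model
    here instead of returning a canned answer."""
    return "yes"

def execute_neural_sql(query: str, conn: sqlite3.Connection):
    """Toy executor: replace each FUNC_VQA("question", study_id) call
    with the (quoted) VQA answer, then run the remaining plain SQL."""
    pattern = r'FUNC_VQA\("([^"]+)",\s*(\d+)\)'

    def resolve(match: re.Match) -> str:
        answer = vqa_api(match.group(1), int(match.group(2)))
        return f"'{answer}'"  # splice the answer back into the SQL text

    return conn.execute(re.sub(pattern, resolve, query)).fetchall()

# Example: a multi-modal question that needs both the image and the tables.
conn = sqlite3.connect(":memory:")
print(execute_neural_sql(
    'SELECT FUNC_VQA("is there any anatomical finding?", 55411)', conn
))
```

In the full dataset, such calls typically reference image identifiers selected by the surrounding SQL rather than literal IDs, so real execution is more involved than this one-shot text substitution.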
- [07/24/2024] We released the EHRXQA dataset on PhysioNet.
- [12/12/2023] We presented our research work at NeurIPS 2023 Datasets and Benchmarks Track as a poster.
- [10/28/2023] We released our research paper on arXiv.
- Python 3.12+
- PhysioNet credentialed account with signed DUAs for:
  - MIMIC-IV
  - MIMIC-CXR-JPG
  - Chest ImaGenome
New to PhysioNet? Click to see credentialing instructions
- Register for a PhysioNet account
- Follow the credentialing instructions
- Complete the CITI Data or Specimens Only Research training course
- Sign the DUA for each required dataset listed above
Using UV (Recommended)
```bash
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

# Create environment and install dependencies
uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install pandas tqdm scikit-learn dask
```

Using Conda
```bash
# Clone repository
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

# Create environment and install dependencies
conda create -n ehrxqa python=3.12
conda activate ehrxqa
pip install pandas tqdm scikit-learn dask
```

Then run the dataset build script:

```bash
bash build_dataset.sh
```

When prompted, enter your PhysioNet credentials (the password will not be displayed).
What this script does:
- Downloads source datasets from PhysioNet (MIMIC-CXR-JPG, Chest ImaGenome, MIMIC-IV)
- Preprocesses the datasets
- Constructs integrated database (MIMIC-IV + MIMIC-CXR)
- Generates complete EHRXQA dataset with ground-truth answers
Dataset Structure
```
ehrxqa/
└── dataset/
    ├── _train.json   # Pre-release (without answers)
    ├── _valid.json   # Pre-release (without answers)
    ├── _test.json    # Pre-release (without answers)
    ├── train.json    # Generated after running script
    ├── valid.json    # Generated after running script
    └── test.json     # Generated after running script
```
Pre-release files (_*.json) are intentionally incomplete to safeguard privacy. Complete files with answers are generated after running the reproduction script with valid PhysioNet credentials.
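Once the build script finishes, a quick sanity check is to load each split and confirm that answers are present. The sketch below assumes each complete split file is a JSON array of objects containing the fields described in the schema that follows; adjust `DATASET_DIR` if you run it from outside the repository root.

```python
import json
from pathlib import Path

DATASET_DIR = Path("dataset")  # assumed location, relative to the repo root

for split in ("train", "valid", "test"):
    path = DATASET_DIR / f"{split}.json"
    if not path.exists():
        print(f"{split}: not generated yet (run build_dataset.sh first)")
        continue
    samples = json.loads(path.read_text())
    # Complete files should carry an answer for every sample;
    # the pre-release _*.json files do not.
    missing = sum(1 for s in samples if not s.get("answer"))
    print(f"{split}: {len(samples)} samples, {missing} missing answers")
```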
Dataset Schema
Each QA sample is a JSON object with the following fields:
Core Fields:
- `db_id`: Database ID
- `split`: Dataset split (train/valid/test)
- `id`: Unique instance identifier
- `question`: Natural language question
- `answer`: Answer string (generated by script)
Template Fields:
- `template`: Question template with database values
- `query`: NeuralSQL/SQL query
- `value`: Key-value pairs from database
Tag Fields:
- `q_tag`: Question template structure
- `t_tag`: Time templates
- `o_tag`: Operational values
- `v_tag`: Visual values (object, category, attribute)
- `tag`: Synthesized tag
Metadata:
- `para_type`: Paraphrase source (machine/GPT-4)
- `is_impossible`: Whether the question is answerable
Example:
```json
{
  "db_id": "mimic_iv_cxr",
  "split": "train",
  "id": 0,
  "question": "how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?",
  "template": "how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?",
  "query": "select 1 * ( strftime('%J',current_time) - strftime('%J',t1.studydatetime) ) from ...",
  "answer": "42"
}
```
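As a small usage example of the schema above, the snippet below reports the paraphrase-source distribution and filters out unanswerable questions from the training split. The file path and the list-of-objects layout are assumptions consistent with the structure shown earlier.

```python
import json
from collections import Counter

with open("dataset/train.json") as f:  # generated by build_dataset.sh
    train = json.load(f)

# Distribution of paraphrase sources (machine vs. GPT-4)
print(Counter(sample["para_type"] for sample in train))

# Keep only answerable questions for downstream use
answerable = [s for s in train if not s.get("is_impossible", False)]
print(f"{len(answerable)} answerable out of {len(train)} total samples")
```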
Current: v1.0.0

This project uses semantic versioning. For detailed changes, see the CHANGELOG.
If you use the EHRXQA dataset, please cite the following:
```bibtex
@article{bae2023ehrxqa,
  title={EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
  journal={arXiv preprint arXiv:2310.18652},
  year={2023}
}
```

The code in this repository is provided under the terms of the MIT License. The final output dataset (EHRXQA) is subject to the terms and conditions of the original PhysioNet datasets: the MIMIC-CXR-JPG License, the Chest ImaGenome License, and the MIMIC-IV License.
For questions or concerns regarding this dataset, please contact:
- Seongsu Bae (seongsu@kaist.ac.kr)
- Daeun Kyung (kyungdaeun@kaist.ac.kr)