
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

A multi-modal question answering dataset that combines structured Electronic Health Records (EHRs) and chest X-ray images, designed to facilitate joint reasoning across imaging and table modalities in EHR Question Answering (QA) systems.

Overview

Electronic Health Records (EHRs) contain patients' medical histories in various multi-modal formats, yet the potential for joint reasoning across imaging and table modalities remains underexplored in current EHR Question Answering (QA) systems. EHRXQA is a novel multi-modal question answering dataset that combines structured EHRs and chest X-ray images. To develop the dataset, we first construct two uni-modal resources: 1) MIMIC-CXR-VQA, our newly created medical visual question answering (VQA) benchmark, designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we construct a multi-modal EHR QA dataset that requires both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. We believe this dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research.

Updates

  • [07/24/2024] We released the EHRXQA dataset on PhysioNet.
  • [12/12/2023] We presented EHRXQA as a poster at the NeurIPS 2023 Datasets and Benchmarks Track.
  • [10/28/2023] We released our research paper on arXiv.

Reproducing the Dataset

Prerequisites

New to PhysioNet? Complete the following credentialing steps:
  1. Register for a PhysioNet account
  2. Follow the credentialing instructions
  3. Complete the CITI Data or Specimens Only Research training course
  4. Sign the data use agreement (DUA) for each required dataset (MIMIC-CXR-JPG, Chest ImaGenome, and MIMIC-IV)

Environment Setup

Using uv (Recommended)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

# Create environment and install dependencies
uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install pandas tqdm scikit-learn dask

Using Conda

# Clone repository
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

# Create environment and install dependencies
conda create -n ehrxqa python=3.12
conda activate ehrxqa
pip install pandas tqdm scikit-learn dask
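
To confirm the environment is ready, a quick sanity check can be run with Python (a minimal sketch; the package list simply mirrors the install commands above):

# check_env.py -- verify that the dependencies installed above are importable
import importlib

for name in ("pandas", "tqdm", "sklearn", "dask"):
    module = importlib.import_module(name)
    print(f"{name:10s} {getattr(module, '__version__', 'unknown')}")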

Running the Reproduction Script

bash build_dataset.sh

When prompted, enter your PhysioNet credentials (password will not be displayed).

What this script does:

  1. Downloads source datasets from PhysioNet (MIMIC-CXR-JPG, Chest ImaGenome, MIMIC-IV)
  2. Preprocesses the datasets
  3. Constructs an integrated database (MIMIC-IV + MIMIC-CXR)
  4. Generates the complete EHRXQA dataset with ground-truth answers

Dataset Structure

ehrxqa/
└── dataset/
    ├── _train.json       # Pre-release (without answers)
    ├── _valid.json       # Pre-release (without answers)
    ├── _test.json        # Pre-release (without answers)
    ├── train.json        # Generated after running script
    ├── valid.json        # Generated after running script
    └── test.json         # Generated after running script

Pre-release files (_*.json) are intentionally incomplete to safeguard privacy. Complete files with answers are generated after running the reproduction script with valid PhysioNet credentials.
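
Once the script completes, the generated splits can be checked with a short Python script (a minimal sketch; it assumes each split file is a JSON array of QA objects, following the layout shown above):

# inspect_splits.py -- count QA instances in each generated split
import json
from pathlib import Path

dataset_dir = Path("dataset")
for split in ("train", "valid", "test"):
    path = dataset_dir / f"{split}.json"
    if not path.exists():
        print(f"{split}: not found (run build_dataset.sh first)")
        continue
    with path.open() as f:
        samples = json.load(f)
    print(f"{split}: {len(samples)} QA instances")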

Dataset Schema

Each QA sample is a JSON object with the following fields:

Core Fields:

  • db_id: Database ID
  • split: Dataset split (train/valid/test)
  • id: Unique instance identifier
  • question: Natural language question
  • answer: Answer string (generated by script)

Template Fields:

  • template: Question template with database values
  • query: NeuralSQL/SQL query
  • value: Key-value pairs from database

Tag Fields:

  • q_tag: Question template structure
  • t_tag: Time templates
  • o_tag: Operational values
  • v_tag: Visual values (object, category, attribute)
  • tag: Synthesized tag

Metadata:

  • para_type: Paraphrase source (machine/GPT-4)
  • is_impossible: Whether the question is unanswerable given the database

Example:

{
    "db_id": "mimic_iv_cxr",
    "split": "train",
    "id": 0,
    "question": "how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?",
    "template": "how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?",
    "query": "select 1 * ( strftime('%J',current_time) - strftime('%J',t1.studydatetime) ) from ...",
    "answer": "42"
}
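
The fields above can be filtered directly for downstream use. The sketch below tallies training questions by paraphrase source and answerability (again assuming each split file is a JSON array of objects with the fields listed above):

# summarize_train.py -- tally questions by paraphrase source and answerability
import json
from collections import Counter

with open("dataset/train.json") as f:
    train = json.load(f)

para_counts = Counter(sample.get("para_type") for sample in train)
unanswerable = sum(bool(sample.get("is_impossible")) for sample in train)

print("paraphrase sources:", dict(para_counts))
print(f"unanswerable questions: {unanswerable} / {len(train)}")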

Version

Current: v1.0.0

This project uses semantic versioning. For detailed changes, see CHANGELOG.

Citation

If you use the EHRXQA dataset, we would appreciate it if you cite the following:

@article{bae2023ehrxqa,
  title={EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
  journal={arXiv preprint arXiv:2310.18652},
  year={2023}
}

License

The code in this repository is provided under the terms of the MIT License. The final output dataset (EHRXQA) is subject to the terms and conditions of the original datasets from PhysioNet: the MIMIC-CXR-JPG License, the Chest ImaGenome License, and the MIMIC-IV License.

Contact

For questions or concerns regarding this dataset, please contact the authors listed in the citation above.
