A preprocessing pipeline for cardiac MRI DICOM studies, converting raw acquisition files into structured, ML-ready HDF5 datasets. Supplementary repository for the paper "A Generalizable Deep Learning System for Cardiac MRI".
cmr_toolkit handles the full preprocessing lifecycle for multi-institutional cardiac MRI data — from raw tar.gz DICOM archives through standardized array storage — with built-in data integrity validation and cloud integration. All scripts scale linearly to ~64 CPU cores and can process upwards of 100k MRI scans in under 3 hours.
DICOM Archives (tar.gz)
↓
[preprocess_mri.py] ← parallel DICOM to HDF5 conversion, multi-institution
↓
HDF5 Filestore ← per-patient, per-accession, arrays shaped [frames, H, W] or [frames, channels, H, W] (if RGB)
↓
[generate_checksums.py] ← SHA256 validation against ground-truth manifest
After preprocess_mri.py runs, output is organized as institution_anon_mrn/anon_accession.h5. Each HDF5 file contains one dataset per MRI series, keyed by the raw DICOM SeriesDescription string:
upenn_Zx3da3244/
└── Gf3lv2173.h5
    ├── 4CH_FIESTA_BH          [frames, channels, H, W]  attrs: total_images, slice_frames
    ├── SAX_FIESTA_BH_1        [frames, channels, H, W]  attrs: total_images, slice_frames
    ├── SAX_FIESTA_BH_2        [frames, channels, H, W]  attrs: total_images, slice_frames
    └── STACK_LV3CH_FIESTA_BH  [frames, channels, H, W]  attrs: total_images, slice_frames
DICOM files arrive in institution-specific layouts that differ in nested folder structure, file-naming conventions, and how series are distributed across folders. This step reads DICOM directories and writes compressed .tar.gz files in a standardized format, anonymizing MRN and Accession (among other PHI) if needed. Files are named: anon_mrn-anon_accession.tgz
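As a minimal sketch of how the archive naming convention relates to the HDF5 output layout described earlier, the helper below maps an archive name to its expected output path. The function name `archive_to_h5_path` is hypothetical and not part of the repository, and the sketch assumes anonymized IDs never contain a hyphen.

```python
from pathlib import Path


def archive_to_h5_path(archive_name: str, institution: str) -> Path:
    """Map 'anon_mrn-anon_accession.tgz' to 'institution_anon_mrn/anon_accession.h5'."""
    stem = Path(archive_name).name
    # Accept both spellings of the gzip-compressed tar extension.
    for suffix in (".tar.gz", ".tgz"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    else:
        raise ValueError(f"not a .tar.gz/.tgz archive: {archive_name}")

    # Assumption: the first '-' separates the anonymized MRN from the accession.
    anon_mrn, sep, anon_accession = stem.partition("-")
    if not sep or not anon_mrn or not anon_accession:
        raise ValueError(f"expected 'anon_mrn-anon_accession', got: {stem}")

    return Path(f"{institution}_{anon_mrn}") / f"{anon_accession}.h5"


print(archive_to_h5_path("Zx3da3244-Gf3lv2173.tgz", "upenn").as_posix())
```

If the real anonymized IDs can contain hyphens, the split rule would need to change, for example by splitting on the last hyphen or by looking the pair up in the crosswalk.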
Main entry point. Reads tar.gz DICOM archives, extracts pixel arrays, and writes compressed HDF5 files. Key behaviors:
- Handles institution-specific DICOM quirks (Stanford, UCSF, MedStar, UK Biobank, UPenn)
- Sorts frames by SliceLocation + InstanceNumber for correct temporal ordering
- Resizes frames via torchvision transforms (default 480px)
- Supports RGB and greyscale storage modes; greyscale reduces storage ~50–70%
- Downcasts source float16 pixel data to uint8 by default
- Optional direct upload to Google Cloud Storage during processing
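The frame-ordering behavior listed above can be illustrated with a small sketch. It assumes the rule is a compound sort key (SliceLocation first, InstanceNumber second); the `FrameMeta` type and `sort_frames` function are hypothetical names used only for illustration, and the actual script may combine the two fields differently.

```python
from typing import NamedTuple


class FrameMeta(NamedTuple):
    """Subset of DICOM header fields needed for ordering (illustrative only)."""
    slice_location: float
    instance_number: int
    filename: str


def sort_frames(frames: list[FrameMeta]) -> list[FrameMeta]:
    # Group frames by slice position, then order them in time within each slice.
    return sorted(frames, key=lambda f: (f.slice_location, f.instance_number))


frames = [
    FrameMeta(12.5, 2, "c.dcm"),
    FrameMeta(-3.0, 2, "b.dcm"),
    FrameMeta(12.5, 1, "d.dcm"),
    FrameMeta(-3.0, 1, "a.dcm"),
]
ordered = [f.filename for f in sort_frames(frames)]
print(ordered)
```

Sorting on a tuple keeps frames from the same slice contiguous, which is what a cine stack needs before it is written as a single [frames, H, W] array.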
python utils/preprocess_mri.py \
-r /path/to/dicoms \
-o /path/to/output \
-i stanford \
-c 16 \
--channels rgb

| Argument | Description |
|---|---|
| -r / --root_dir | DICOM archive directory or GCS bucket (gs://...) |
| -o / --output_dir | HDF5 output location (required) |
| -i / --institution | Institution prefix: stanford, ucsf, medstar, ukbiobank, upenn |
| -c / --cpus | CPU cores for multiprocessing (default: 4) |
| -s / --framesize | Resize frames to this pixel size (default: 480) |
| -z / --compression | gzip or lzf (default: gzip) |
| --channels | rgb (default) or grey |
| --gcs_bucket_upload | Optional GCS bucket for direct upload |
| -d / --debug | Report statistics without converting |
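For readers who want to script against this interface, the argument table can be mirrored in an `argparse` parser roughly as follows. This is a reconstruction from the table only, not the parser in preprocess_mri.py: the help strings, the `choices` restrictions, and the `build_parser` name are assumptions.

```python
import argparse

INSTITUTIONS = ["stanford", "ucsf", "medstar", "ukbiobank", "upenn"]


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags and defaults (illustrative sketch)."""
    p = argparse.ArgumentParser(description="Convert DICOM archives to HDF5")
    p.add_argument("-r", "--root_dir", help="DICOM archive directory or gs:// bucket")
    p.add_argument("-o", "--output_dir", required=True, help="HDF5 output location")
    p.add_argument("-i", "--institution", choices=INSTITUTIONS, help="institution prefix")
    p.add_argument("-c", "--cpus", type=int, default=4, help="CPU cores to use")
    p.add_argument("-s", "--framesize", type=int, default=480, help="frame size in pixels")
    p.add_argument("-z", "--compression", choices=["gzip", "lzf"], default="gzip")
    p.add_argument("--channels", choices=["rgb", "grey"], default="rgb")
    p.add_argument("--gcs_bucket_upload", default=None, help="optional GCS bucket")
    p.add_argument("-d", "--debug", action="store_true", help="report statistics only")
    return p


args = build_parser().parse_args(
    ["-r", "/path/to/dicoms", "-o", "/path/to/output", "-i", "stanford", "-c", "16"]
)
print(args.cpus, args.framesize, args.compression, args.channels)
```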
Deprecated. Post-processes raw HDF5 output by renaming datasets from raw DICOM SeriesDescription strings to standardized view labels (4CH, SAX, 3CH, LAX) using the lookup table in series_descriptions_master.csv.
Computes SHA256 checksums over HDF5 pixel data (not file headers) for reproducibility validation. Supports comparison against a reference manifest CSV to detect regressions between runs.
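The idea of hashing pixel data rather than whole files, then diffing against a reference manifest, can be sketched with the standard library alone. In this sketch plain `bytes` stand in for the pixel arrays that would be read from each HDF5 dataset, and the manifest column names (`file`, `sha256`) and both function names are hypothetical. Including the dataset name in the digest is also an assumption.

```python
import csv
import hashlib
import io


def checksum_pixels(datasets: dict[str, bytes]) -> str:
    """SHA256 over dataset names and raw pixel bytes, visited in sorted key order."""
    digest = hashlib.sha256()
    for name in sorted(datasets):  # fixed order keeps the digest reproducible
        digest.update(name.encode("utf-8"))
        digest.update(datasets[name])
    return digest.hexdigest()


def compare_to_manifest(computed: dict[str, str], manifest_csv: str) -> list[str]:
    """Return the files whose checksum is missing from or differs in the manifest."""
    reader = csv.DictReader(io.StringIO(manifest_csv))
    reference = {row["file"]: row["sha256"] for row in reader}
    return sorted(name for name, value in computed.items() if reference.get(name) != value)


good = checksum_pixels({"SAX_FIESTA_BH_1": b"\x00\x01\x02", "4CH_FIESTA_BH": b"\x03\x04"})
manifest = f"file,sha256\nGf3lv2173.h5,{good}\n"
print(compare_to_manifest({"Gf3lv2173.h5": good}, manifest))
```

Hashing decoded pixel values rather than file bytes presumably keeps the checksum stable when only container-level details of the HDF5 file change between runs.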
Scans DICOM archives to extract metadata (SeriesDescription, SliceLocation, Manufacturer, field strength, MRN, AccessionNumber) and outputs a CSV.
Compresses extracted DICOM folders back to tar.gz. Supports anonymization via a CSV crosswalk that remaps (mrn, accession) → (anon_mrn, anon_accession) during recompression.
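A crosswalk lookup of this kind might look like the sketch below. It assumes the CSV has columns named `mrn`, `accession`, `anon_mrn`, and `anon_accession`; the function names are hypothetical and the repository's actual crosswalk format may differ.

```python
import csv
import io


def load_crosswalk(csv_text: str) -> dict[tuple[str, str], tuple[str, str]]:
    """Build a {(mrn, accession): (anon_mrn, anon_accession)} lookup from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row["mrn"], row["accession"]): (row["anon_mrn"], row["anon_accession"])
        for row in reader
    }


def anonymized_archive_name(mrn: str, accession: str, crosswalk: dict) -> str:
    """Name the recompressed archive using the anonymized identifiers."""
    # A missing pair raises KeyError so un-mapped studies are never written out.
    anon_mrn, anon_accession = crosswalk[(mrn, accession)]
    return f"{anon_mrn}-{anon_accession}.tgz"


crosswalk = load_crosswalk(
    "mrn,accession,anon_mrn,anon_accession\n"
    "1234567,A0001,Zx3da3244,Gf3lv2173\n"
)
print(anonymized_archive_name("1234567", "A0001", crosswalk))
```

Failing loudly on an unmapped (mrn, accession) pair is a deliberate choice in this sketch: silently falling back to the original identifiers would leak PHI into the archive name.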
Converts HDF5 cine arrays to MP4 videos via FFmpeg for visual QC. Supports both greyscale and RGB modes.
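One common way to do this kind of conversion is to pipe raw frames into FFmpeg over stdin. The sketch below only builds the command line and does not claim to match the repository's script: the function name, the libx264 encoder, and the yuv420p output format are assumptions.

```python
def build_ffmpeg_command(
    width: int, height: int, fps: int, out_path: str, channels: str = "grey"
) -> list[str]:
    """Return an FFmpeg argv that encodes raw frames read from stdin into an MP4."""
    # FFmpeg's raw pixel format names: 'gray' is 8-bit single channel, 'rgb24' is packed RGB.
    pix_fmt = {"grey": "gray", "rgb": "rgb24"}[channels]
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-f", "rawvideo",             # input is headerless raw frames
        "-pix_fmt", pix_fmt,          # input pixel layout
        "-s", f"{width}x{height}",    # input frame size
        "-r", str(fps),               # input frame rate
        "-i", "-",                    # read frames from stdin
        "-an",                        # no audio stream
        "-c:v", "libx264",            # assumed encoder; requires an FFmpeg build with x264
        "-pix_fmt", "yuv420p",        # widely playable output format (needs even dimensions)
        out_path,
    ]


cmd = build_ffmpeg_command(480, 480, 30, "Gf3lv2173_SAX.mp4")
print(" ".join(cmd))
```

The frames would then be supplied with something like `subprocess.run(cmd, input=frame_bytes, check=True)`. For RGB data stored as [frames, channels, H, W], the array would first need reordering to [frames, H, W, channels], because rgb24 expects interleaved channels.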
Google Cloud Storage utilities: asynchronous upload queue, GCS bucket mount/unmount via gcsfuse, and disk-full throttling (pauses pipeline when temp storage exceeds 90%).
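The disk-full throttling described above can be approximated with `shutil.disk_usage`. This is a sketch under assumptions: the function names, the polling interval, and the injectable `usage_fn` / `sleep_fn` parameters (added here so the loop can be exercised without filling a disk) are not taken from the repository.

```python
import shutil
import time
from typing import Callable


def disk_fraction_used(path: str) -> float:
    """Fraction of the filesystem containing `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def wait_for_space(
    path: str,
    threshold: float = 0.90,
    poll_seconds: float = 5.0,
    usage_fn: Callable[[str], float] = disk_fraction_used,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> int:
    """Block while usage exceeds `threshold`; return how many times we paused."""
    pauses = 0
    while usage_fn(path) > threshold:
        sleep_fn(poll_seconds)  # give the upload queue time to drain temp storage
        pauses += 1
    return pauses


# Simulated readings: two polls above 90%, then space frees up.
readings = iter([0.95, 0.92, 0.50])
pauses = wait_for_space("/tmp", usage_fn=lambda _p: next(readings), sleep_fn=lambda _s: None)
print(pauses)
```

A worker would call `wait_for_space(tmp_dir)` before extracting each archive, so conversion stalls rather than crashing while uploads catch up.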
Wrapper around the ukbfetch CLI for bulk UK Biobank downloads. Chunks large bulk files into 1000-row batches and runs parallel downloads (default: 20 concurrent connections).
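The batching and concurrency pattern described here can be sketched as follows. The `worker` callable is a stand-in for whatever invokes ukbfetch on one batch; no ukbfetch flags are assumed, and `chunk_rows` and `run_batches` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def chunk_rows(rows: Sequence[str], size: int = 1000) -> list[list[str]]:
    """Split bulk-file rows into consecutive batches of at most `size` rows."""
    return [list(rows[i : i + size]) for i in range(0, len(rows), size)]


def run_batches(
    batches: list[list[str]],
    worker: Callable[[list[str]], int],
    max_workers: int = 20,
) -> list[int]:
    """Run `worker` on every batch with up to `max_workers` concurrent threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves batch order in the returned results.
        return list(pool.map(worker, batches))


rows = [f"{eid} 20209_2_0" for eid in range(2500)]  # made-up bulk-file rows
batches = chunk_rows(rows)
results = run_batches(batches, worker=len)  # `len` stands in for a real download call
print(results)
```

Threads suit this workload because each download is I/O-bound and the actual work happens in an external process.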
De-identification pipeline. llm_deid.py dispatches local Ollama LLM instances across multiple GPU devices to detect PHI in free-text DICOM fields and clinical reports.
Device/cluster-specific settings (paths, CPU counts, GCS bucket name, Slack credentials). This file is machine-specific and not committed. Run tests/docker_prep.py to generate a .env from it before running the Docker pipeline. Example structure:
global_settings:
  bucket_name: 'your_gcs_bucket'
  slack_bot_token: 'xoxb-...'
sherlock: # per-machine block
  tmp_dir: '/scratch/tmp'
  num_cpus: 48

The full preprocessing pipeline runs inside a reproducible Docker environment (Ubuntu 24.04, Python 3.13). tests/docker-compose.yml orchestrates sequential preprocessing across all supported institutions in both RGB and greyscale modes, followed by checksum generation and comparison against the ground-truth manifests in tests/checksums/.
A pre-push git hook enforces this automatically on every git push:
git push
→ hooks/pre-push
→ tests/run_docker_tests.sh (workstation-guarded: jarvis_lambda only)
→ tests/docker_prep.py (regenerate .env from local_config.yaml)
→ docker compose up (build image, run full multi-institution pipeline)
→ generate_checksums.py (compare output against tests/checksums/ manifests)
→ push proceeds only if all checksums match
This blocks any commit that breaks a known-working preprocessing result from reaching the remote.
After cloning, run once to install the pre-push hook:
bash hooks/install.sh

| Institution | Prefix | Notes |
|---|---|---|
| Stanford | stanford | Standard DICOM metadata |
| UCSF | ucsf | Standard DICOM metadata |
| MedStar | medstar | MRN/accession extracted from filename |
| UK Biobank | ukbiobank | SAX split across multiple folders; EID-based naming |
| UPenn | upenn | MRN/accession extracted from filename |
- engine/torch_dataset.py: PyTorch Dataset class that reads HDF5 files directly, supports train/val/test CSV-driven splits, random or full-frame sampling, and optional transforms.
- engine/labeller.py: PyTorch Lightning module for CMR view/modality classification using a Facebook DINO ViT-S/16 backbone. In active development.
If you use this codebase, or otherwise found our work valuable, please cite:
@article{shad2026generalizabledeeplearningcardiac,
title={A Generalizable Deep Learning System for Cardiac MRI},
author={Rohan Shad and Cyril Zakka and Dhamanpreet Kaur and Robyn Fong and Ross Warren Filice and John Mongan and Kimberly Kalianos and Nishith Khandwala and David Eng and Matthew Leipzig and Walter Witschey and Alejandro de Feria and Victor Ferrari and Euan Ashley and Michael A. Acker and Curtis Langlotz and William Hiesinger},
journal={Nature Biomedical Engineering},
year={2026},
doi={10.1038/s41551-026-01637-3},
url={https://www.nature.com/articles/s41551-026-01637-3},
}
