From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories
Download the MIMIC-IV v2.2 dataset from Physionet and save it to dataset/mimiciv/2.2.
The dataset/mimiciv/2.2 directory should contain hosp and icu directories.
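Before converting, you can sanity-check the layout with a short script (this helper is our own illustration, not part of the repo):

```python
from pathlib import Path

def check_mimiciv_layout(root: str) -> list[str]:
    """Return the names of expected MIMIC-IV subdirectories missing under `root`."""
    base = Path(root)
    return [name for name in ("hosp", "icu") if not (base / name).is_dir()]

missing = check_mimiciv_layout("dataset/mimiciv/2.2")
if missing:
    print(f"Missing subdirectories: {missing}")
```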
- We need to set up a separate environment to convert MIMIC-IV to the MEDS format first:
```
conda create -n meds-etl-env python==3.10.0 -y
conda activate meds-etl-env
git submodule update --init --recursive
pip install -e external/meds_etl
meds_etl_mimic dataset/mimiciv/2.2 dataset/mimiciv/2.2_meds 2.2
```
This produces a dataset/mimiciv/2.2_meds directory containing the MEDS-formatted data, which the subsequent Quick Start steps use.
- Set up and switch to the main Python environment:
```
conda create -n coogee python==3.13.0 -y
conda activate coogee
pip install -r requirements.txt

TIMESTAMP=$(date '+%Y-%m-%d_%H_%M_%S')
LOG_DIR="output/coogee-final-sanity-check/${TIMESTAMP}"
mkdir -p "$LOG_DIR"
mkdir -p "$LOG_DIR/logs"

python -m scripts.process_mimiciv_meds --LOG_DIR $LOG_DIR --icd10cm_top_n -1 --icd10pcs_top_n -1 --atc_top_n -1 --lab_top_n -1
python -m scripts.token_processor_complete \
    --config "configs/coogee.yaml" \
    --LOG_DIR $LOG_DIR
python -m scripts.map_concept_label --LOG_DIR $LOG_DIR
```
The pre-processing and tokenization steps require about 80 GB of RAM. The following files are generated:
- cohort_stat: statistics of the cohort;
- logs: process_mimiciv_meds.log and token_processor_complete.log;
- raw_sequences: the untokenized medical event sequences, split into train/val/test;
- tokenized_sequences: the tokenized sequences, split into train/val/test;
- tokenizer: contains vocab.csv.
We construct two types of knowledge-aware representations for each token in our vocabulary, leveraging external biomedical knowledge sources.
We need to download code2tokens.json, all_codes_mappings.parquet, code2embeddings.json, and kg.csv from MedTok and save them to artifacts/medtok_files.
```
python -m scripts.map_vocab_to_medtok --LOG_DIR $LOG_DIR
```
This script maps each vocabulary code to MedTok concepts:
- ICD10CM/PCS codes: normalized and matched to the closest MedTok code by prefix/character matching;
- ATC codes: direct mapping with manual corrections for deprecated codes;
- Lab test codes: semantically matched to SNOMED concepts using ClinicalBERT similarity;
- Special tokens: assigned zero embeddings;
```
python -m scripts.construct_knowledge_embd --LOG_DIR $LOG_DIR
```
This script produces two embeddings per token.
Hierarchy Embeddings capture structural relationships from PrimeKG, a comprehensive biomedical knowledge graph:
- For each medical code, extract its local subgraph from PrimeKG (including connected diseases, drugs, proteins, etc.)
- Train a Relational Graph Convolutional Network (RGCN) using link prediction loss to learn node representations that preserve graph structure
- Pool subgraph node embeddings via mean pooling to obtain a single per-code hierarchy embedding
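The pooling step above can be sketched as follows (pure Python with made-up toy vectors; the actual pipeline pools RGCN node outputs):

```python
def mean_pool(node_embeddings: list[list[float]]) -> list[float]:
    """Element-wise mean over a subgraph's node embeddings -> one per-code vector."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(vec[i] for vec in node_embeddings) / n for i in range(dim)]

# Toy subgraph: three nodes (e.g. a disease, a drug, a protein) with 4-d embeddings.
subgraph = [[1.0, 0.0, 2.0, 4.0],
            [3.0, 2.0, 0.0, 0.0],
            [2.0, 4.0, 1.0, 2.0]]
hierarchy_embedding = mean_pool(subgraph)  # [2.0, 2.0, 1.0, 2.0]
```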
Semantic Embeddings capture textual meaning from code descriptions:
- Look up the natural language description for each medical code (e.g., "Type 2 diabetes mellitus" for ICD-10 E11)
- Encode descriptions using ClinicalBERT
- Special tokens use predefined descriptions (e.g., "This token represents the death of the patient" for DEATH)
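A minimal sketch of the description lookup with the special-token fallback; the dictionaries here are illustrative stand-ins (the real descriptions come from the code systems and the repo's configuration):

```python
# Illustrative description tables; only the two example entries come from this README.
CODE_DESCRIPTIONS = {"E11": "Type 2 diabetes mellitus"}
SPECIAL_DESCRIPTIONS = {"DEATH": "This token represents the death of the patient"}

def describe(token: str) -> str:
    """Return the text that would be fed to ClinicalBERT for `token`'s semantic embedding."""
    if token in SPECIAL_DESCRIPTIONS:
        return SPECIAL_DESCRIPTIONS[token]
    return CODE_DESCRIPTIONS.get(token, token)  # fall back to the raw code string
```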
The GNN training may take a few hours to complete. The final checkpoint is provided at artifacts/coogee_final/knowledge_embd/gnn_checkpoint_172.pt.
```
RUN_DIR="$LOG_DIR/sweeps/w_knowledge_embd"
n_embd=384
hidden_dim=1024
n_layers=6
n_heads=6
n_kv=2
batch_size=32

python -m model.train \
    --LOG_DIR "$RUN_DIR" \
    --n_embd "$n_embd" \
    --hidden_dim "$hidden_dim" \
    --n_layers "$n_layers" \
    --num_attention_heads "$n_heads" \
    --num_key_value_heads "$n_kv" \
    --batch_size "$batch_size" \
    --wandb_log \
    --use_hierarchy_embd \
    --use_semantic_embd

python -m model.generate --LOG_DIR "$LOG_DIR" --wandb_log

python -m scripts.construct_final_patient_timelines --LOG_DIR "$LOG_DIR"
```
```
python -m scripts.post_process_icd10cm_syn --LOG_DIR "$LOG_DIR"

# Split the timelines so multiple GPUs can process them in parallel
python -m scripts.split_synthetic_timelines_into_parts --LOG_DIR "$LOG_DIR"
```
```
qsub pbs_cmd/pbs_llm_as_judge/llm_as_judge_1.sh

python -m scripts.compute_realism_stats --LOG_DIR "$LOG_DIR" --NUM 7

python -m evaluation.eval_synthetic --LOG_DIR "$LOG_DIR/synthetic_data_topp_0.98_temperature_1.0_realism_geq_7_w_reasoning" --fidelity --utility --privacy --heatmap
```
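The realism filter implied by the `realism_geq_7` directory name can be sketched as follows; the JSONL record format and field names below are assumptions, not the repo's actual schema:

```python
import json

def filter_by_realism(jsonl_lines: list[str], threshold: int = 7) -> list[dict]:
    """Keep judged timelines whose LLM-assigned realism score meets the threshold."""
    kept = []
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("realism", 0) >= threshold:
            kept.append(record)
    return kept

lines = ['{"id": 1, "realism": 9}', '{"id": 2, "realism": 4}']
print([r["id"] for r in filter_by_realism(lines)])  # -> [1]
```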
```
# Plot time fidelity (Figure 1 g-i)
python -m evaluation.eval_fidelity_time --LOG_DIR "$LOG_DIR"
```
If you find our work useful in your research, please consider citing:
```
@misc{zhou2026statisticalfidelityclinicalconsistency,
  title={From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories},
  author={Guanglin Zhou and Armin Catic and Motahare Shabestari and Matthew Young and Chaiquan Li and Katrina Poppe and Sebastiano Barbieri},
  year={2026},
  eprint={2603.06720},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.06720},
}
```
Coogee © 2026 by Guanglin Zhou is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.