jameszhou-gl/Coogee

From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories


Data

Download the MIMIC-IV v2.2 dataset from PhysioNet and save it to dataset/mimiciv/2.2. The dataset/mimiciv/2.2 directory should contain the hosp and icu subdirectories.

Install

  1. First, set up a separate environment to convert MIMIC-IV to MEDS format:
conda create -n meds-etl-env python==3.10.0 -y
conda activate meds-etl-env
git submodule update --init --recursive
pip install -e external/meds_etl
meds_etl_mimic dataset/mimiciv/2.2 dataset/mimiciv/2.2_meds 2.2

This creates a dataset/mimiciv/2.2_meds directory containing the MEDS-formatted data, which the Quick Start steps below use.

  2. Set up and activate the main Python environment:
conda create -n coogee python==3.13.0 -y
conda activate coogee
pip install -r requirements.txt

Quick Start

1. tokenization

TIMESTAMP=$(date '+%Y-%m-%d_%H_%M_%S')
LOG_DIR="output/coogee-final-sanity-check/${TIMESTAMP}"
mkdir -p "$LOG_DIR"
mkdir -p "$LOG_DIR/logs"
python -m scripts.process_mimiciv_meds --LOG_DIR $LOG_DIR --icd10cm_top_n -1 --icd10pcs_top_n -1 --atc_top_n -1 --lab_top_n -1
python -m scripts.token_processor_complete \
  --config "configs/coogee.yaml" \
  --LOG_DIR $LOG_DIR
python -m scripts.map_concept_label --LOG_DIR $LOG_DIR

The pre-processing and tokenization steps require about 80 GB of RAM. The following files are generated:

  1. cohort_stat: statistics of the cohort;
  2. logs: process_mimiciv_meds.log and token_processor_complete.log;
  3. raw_sequences: untokenized medical event sequences split into train/val/test;
  4. tokenized_sequences: tokenized sequences split into train/val/test;
  5. tokenizer: contains vocab.csv.
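After the tokenization step finishes, one can quickly verify that the expected output directories exist. This is a minimal sketch (the `missing_outputs` helper is hypothetical, not part of the repository); the directory names are taken from the list above:

```python
from pathlib import Path

# Output directories expected under LOG_DIR after tokenization
EXPECTED = ["cohort_stat", "logs", "raw_sequences", "tokenized_sequences", "tokenizer"]

def missing_outputs(log_dir: str) -> list[str]:
    """Return the expected subdirectories that are absent from log_dir."""
    root = Path(log_dir)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```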

2. knowledge-aware representation

We construct two types of knowledge-aware representations for each token in our vocabulary, leveraging external biomedical knowledge sources.

1. preparation from MedTok

We need to download code2tokens.json, all_codes_mappings.parquet, code2embeddings.json, and kg.csv from MedTok and save them to artifacts/medtok_files.

python -m scripts.map_vocab_to_medtok --LOG_DIR $LOG_DIR
This script maps each vocabulary code to MedTok concepts.
  • ICD10CM/PCS codes: normalized and matched to the closest MedTok code by prefix/character matching;
  • ATC codes: direct mapping with manual corrections for deprecated codes;
  • Lab test codes: semantically matched to SNOMED concepts using ClinicalBERT similarity;
  • Special tokens: assigned zero embeddings;

2. construct knowledge-aware embeddings for each token

python -m scripts.construct_knowledge_embd --LOG_DIR $LOG_DIR
This script produces two embeddings per token.

Hierarchy Embeddings capture structural relationships from PrimeKG, a comprehensive biomedical knowledge graph:

  1. For each medical code, extract its local subgraph from PrimeKG (including connected diseases, drugs, proteins, etc.)
  2. Train a Relational Graph Convolutional Network (RGCN) using link prediction loss to learn node representations that preserve graph structure
  3. Pool subgraph node embeddings via mean pooling to obtain a single per-code hierarchy embedding
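Step 3 (mean pooling of subgraph node embeddings) can be sketched with NumPy. This illustrates only the pooling operation; in the pipeline, the node embeddings would come from the trained RGCN:

```python
import numpy as np

def pool_subgraph(node_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool (num_nodes, dim) RGCN node embeddings into one per-code vector."""
    return node_embeddings.mean(axis=0)

# Toy example: a 4-node subgraph with 8-dimensional node embeddings
rng = np.random.default_rng(0)
nodes = rng.normal(size=(4, 8))
code_embedding = pool_subgraph(nodes)  # one hierarchy embedding, shape (8,)
```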

Semantic Embeddings capture textual meaning from code descriptions:

  1. Look up the natural language description for each medical code (e.g., "Type 2 diabetes mellitus" for ICD-10 E11)
  2. Encode descriptions using ClinicalBERT
  3. Special tokens use predefined descriptions (e.g., "This token represents the death of the patient" for DEATH)
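The description lookup with a special-token fallback (steps 1 and 3) might look like the sketch below. Only the DEATH description comes from the text above; the PAD entry and the `description_for` helper are hypothetical illustrations:

```python
# Predefined descriptions for special tokens
# (only DEATH is from the paper's text; PAD is a hypothetical placeholder)
SPECIAL_DESCRIPTIONS = {
    "DEATH": "This token represents the death of the patient",
    "PAD": "This token is a padding placeholder",
}

def description_for(token: str, code_descriptions: dict[str, str]) -> str:
    """Return the text that would be encoded by ClinicalBERT for a token."""
    if token in SPECIAL_DESCRIPTIONS:
        return SPECIAL_DESCRIPTIONS[token]
    # Fall back to the raw code string when no description is available
    return code_descriptions.get(token, token)
```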

The GNN training may take a few hours to complete. A trained checkpoint is provided at artifacts/coogee_final/knowledge_embd/gnn_checkpoint_172.pt.

3. training

RUN_DIR="$LOG_DIR/sweeps/w_knowledge_embd"
n_embd=384
hidden_dim=1024
n_layers=6
n_heads=6
n_kv=2
batch_size=32
python -m model.train \
  --LOG_DIR "$RUN_DIR" \
  --n_embd "$n_embd" \
  --hidden_dim "$hidden_dim" \
  --n_layers "$n_layers" \
  --num_attention_heads "$n_heads" \
  --num_key_value_heads "$n_kv" \
  --batch_size "$batch_size" \
  --wandb_log \
  --use_hierarchy_embd \
  --use_semantic_embd

4. generation

python -m model.generate --LOG_DIR "$LOG_DIR" --wandb_log

5. construct final patient timelines (used for clinical evaluation in our study)

python -m scripts.construct_final_patient_timelines --LOG_DIR "$LOG_DIR"
python -m scripts.post_process_icd10cm_syn --LOG_DIR "$LOG_DIR"

6. LLM-as-a-judge experiment

# split the synthetic timelines into parts so multiple GPUs can process them in parallel
python -m scripts.split_synthetic_timelines_into_parts --LOG_DIR "$LOG_DIR"

qsub pbs_cmd/pbs_llm_as_judge/llm_as_judge_1.sh
python -m scripts.compute_realism_stats --LOG_DIR "$LOG_DIR" --NUM 7
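The realism filtering implied by the --NUM 7 threshold (and the realism_geq_7 directory name used in the evaluation step) can be sketched as follows. This is a simplified illustration; the actual output format of compute_realism_stats is not reproduced here:

```python
def realism_summary(scores: list[float], threshold: float = 7.0) -> dict:
    """Summarize LLM-judge scores: mean score and fraction of timelines kept."""
    kept = [s for s in scores if s >= threshold]
    return {
        "mean_score": sum(scores) / len(scores),
        "kept_fraction": len(kept) / len(scores),
        "num_kept": len(kept),
    }
```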

7. evaluation

python -m evaluation.eval_synthetic --LOG_DIR "$LOG_DIR/synthetic_data_topp_0.98_temperature_1.0_realism_geq_7_w_reasoning" --fidelity --utility --privacy --heatmap
# plot time fidelity shown in Figure 1 g-i
python -m evaluation.eval_fidelity_time --LOG_DIR "$LOG_DIR"

Citation (to be updated)

If you find our work useful in your research, please consider citing:

@misc{zhou2026statisticalfidelityclinicalconsistency,
      title={From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories}, 
      author={Guanglin Zhou and Armin Catic and Motahare Shabestari and Matthew Young and Chaiquan Li and Katrina Poppe and Sebastiano Barbieri},
      year={2026},
      eprint={2603.06720},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.06720}, 
}

License


Coogee © 2026 by Guanglin Zhou is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.


About

Coogee: An integrated pipeline for generating and auditing clinically consistent synthetic patient trajectories using MIMIC-IV and large language models.
