This repository contains the code used for multimodal subtype discovery, prognostic modelling, external evaluation, longitudinal imaging validation, and AI versus clinician benchmarking in Alzheimer's disease research.
The manuscript-aligned analysis addresses four linked questions:
- Whether clinically similar participants with mild cognitive impairment contain structurally distinct multimodal subtypes.
- Whether those subtype differences correspond to divergent longitudinal brain atrophy.
- Whether a frozen holdout prediction pipeline generalizes to an independent Alzheimer's Disease Neuroimaging Initiative test set.
- Whether the framework extends across external cohorts through subtype transportability in A4 and AIBL and framework adaptation in the Harvard Aging Brain Study.
step1_preprocess_APOE.py
step2_preprocess_CSF.py
step3_preprocess_Clinical.py
step4_preprocess_sMRI.py
step5_preprocess_PET.py
step6_create_outcome.py
step7_Cohort Integration.py
step8_vae_clustering.py
step9A_cross_cohort_analysis.py
step9B_biomarker_validation.py
step10_differential_analysis.R
step11_predictive_modeling.R
step12_cluster_signatures.R
step13_conversion_differential.R
step14_cluster_validation.R
step15_cross_modal_validation.R
step16_habs_validation.R
step17_shap_analysis.R
step18_evidence_synthesis.R
step19_ADNI_discovery.R
step20_AIBL _Validation.R
step21_A4_validation.R
step22_neuroimaging_endotypes.R
Directory: AI_vs_Clinician_Analysis
Step 1 Prepare Test.R
Step 2 AI Prediction.py
Step 3 Expert Assessment Workflow.R
Step 4 AI vs Expert Comparison Analysis
Run step18_evidence_synthesis.R only after the independent holdout branch, HABS validation, and A4 and AIBL summary outputs have all been generated.
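This ordering constraint can be enforced with a small pre-flight check before invoking step18_evidence_synthesis.R. The sketch below is illustrative: the summary file names come from this README, but the directory layout is an assumption based on the example commands shown later.

```python
from pathlib import Path

def missing_prerequisites(paths):
    """Return the subset of expected upstream output files that do not exist yet."""
    return [p for p in paths if not Path(p).exists()]

# Outputs that must exist before step18_evidence_synthesis.R is run.
# Directory names are illustrative; adjust to your configured output dirs.
REQUIRED = [
    "step16_results/step16_manuscript_summary.csv",  # HABS validation summary
    "step20_results/step20_aibl_summary.csv",        # AIBL summary
    "step21_results/step21_a4_summary.csv",          # A4 summary
]

missing = missing_prerequisites(REQUIRED)
print("missing prerequisites:", missing)  # empty list means step18 is safe to run
```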
This repository does not redistribute cohort data. Access must be obtained directly from the source studies under their respective data use agreements.
Alzheimer's Disease Neuroimaging Initiative
Anti Amyloid Treatment in Asymptomatic Alzheimer's Disease study
Australian Imaging, Biomarker and Lifestyle study
Harvard Aging Brain Study
Fox Laboratory boundary shift integral longitudinal magnetic resonance imaging measures
Clinical_data.csv
metabolites.csv
RNA_plasma.csv
subtype_assignments.csv
latent_representations.csv
vae_summary.json
independent_test_set.csv
HABS_Baseline_Integrated.csv
AIBL_Baseline_Integrated.csv
A4_Baseline_Integrated.csv
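Because vae_summary.json records the final discovery feature list, it can be read back to verify that a cohort table carries every required column before projection. A minimal sketch, assuming the list is stored under a "feature_list" key (check the file actually written by step8_vae_clustering.py for the exact field name; the feature names in the test are hypothetical):

```python
import json

def load_feature_list(summary_path):
    """Read the discovery feature list recorded in vae_summary.json.

    The "feature_list" key is an assumption; inspect the file produced by
    step8_vae_clustering.py for the exact field name.
    """
    with open(summary_path) as fh:
        summary = json.load(fh)
    return summary["feature_list"]

def absent_features(table_columns, feature_list):
    """Report which recorded discovery features are missing from a cohort table."""
    cols = set(table_columns)
    return [f for f in feature_list if f not in cols]
```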
Because local file names may differ across environments, inspect each script argument list before execution.
Recommended Python 3.10 or higher
Install dependencies with pip install -r requirements.txt
Recommended R 4.2 or higher
Install required packages listed in the script headers before execution.
The repository reflects the current manuscript aligned logic.
- The variational autoencoder uses a modality weighted reconstruction loss so that cerebrospinal fluid, clinical, and magnetic resonance imaging blocks contribute comparably despite unequal dimensionality.
- The final variational autoencoder feature list is stored in vae_summary.json and should be treated as the executable record of the discovery input matrix.
- A4 and AIBL are implemented as direct subtype transportability analyses based on latent projection and centroid assignment.
- The Harvard Aging Brain Study is implemented as framework adaptation with cohort-specific Firth logistic regression rather than direct transfer of the Alzheimer's Disease Neuroimaging Initiative subtype model.
- The holdout workflow now leaves unavailable variables such as GDS as missing and lets the discovery-fitted imputation pipeline supply the reference value. Holdout preprocessing does not estimate imputation or scaling parameters from the holdout set.
- AI_vs_Clinician_Analysis\Step 1 Prepare Test.R now creates a strict 36-month conversion endpoint so that the public holdout label matches the expert three-year assessment task.
- step14_cluster_validation.R now checks archived follow-up variables first and writes Cox_Time_Source_Metadata.csv so that discovery survival analyses can be reported transparently.
- step16_habs_validation.R writes step16_manuscript_summary.csv, which should be used as the manuscript-facing source for Harvard Aging Brain Study sample size, event count, event rate, and area-under-the-curve reporting.
- step20_AIBL _Validation.R writes step20_aibl_summary.csv, and step21_A4_validation.R writes step21_a4_summary.csv, for cross-cohort synthesis.
- step18_evidence_synthesis.R uses direct step outputs whenever available and no longer relies on hardcoded holdout metrics.
- step11_predictive_modeling.R currently uses multiple imputation by chained equations and then carries forward one completed dataset with complete(mice_obj, 1). It does not pool estimates under Rubin's rules.
- The current public discovery modelling workflow does not implement additional inverse probability class weighting.
- Several preprocessing scripts are broader than the final primary manuscript path. In particular, step5_preprocess_PET.py is retained for traceability but is not part of the final primary analysis path.
- The file name step20_AIBL _Validation.R contains a space and should be called exactly as written.
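The modality-weighted reconstruction loss mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the code in step8_vae_clustering.py; the inverse-dimensionality weighting shown here is one common choice for making unequal-width blocks contribute comparably, and is an assumption about the exact scheme used.

```python
import numpy as np

def modality_weighted_recon_loss(x, x_hat, blocks):
    """Reconstruction loss with one weight per modality block.

    x, x_hat : (n_samples, n_features) original and reconstructed matrices
    blocks   : dict mapping modality name -> list of column indices

    Weighting each block's squared error by the inverse of its dimensionality
    (an illustrative choice) keeps a narrow CSF block from being swamped by a
    wide MRI block in the summed loss.
    """
    total = 0.0
    for name, idx in blocks.items():
        block_sse = np.sum((x[:, idx] - x_hat[:, idx]) ** 2, axis=1)
        # Inverse-dimensionality weight: each modality contributes its
        # per-feature mean error regardless of block width.
        total += np.mean(block_sse) / len(idx)
    return total
```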
python step8_vae_clustering.py --input_dir ./processed_data --output_dir ./vae_output
python step9A_cross_cohort_analysis.py --external_file ./A4_Baseline_Integrated.csv --vae_dir ./vae_output --output_dir ./a4_projection --cohort_name A4
python AI_vs_Clinician_Analysis/"Step 2 AI Prediction.py" --base_dir ./AI_vs_Clinician_Analysis --output_dir ./AI_vs_Clinician_Analysis/results
Rscript step11_predictive_modeling.R --vae_dir ./vae_output --data_dir ./processed_data --output_dir ./step11_results
Rscript step16_habs_validation.R --habs_file ./HABS_Baseline_Integrated.csv --output_dir ./step16_results
Rscript "step20_AIBL _Validation.R" --projection_file ./aibl_projection/AIBL_projected_subtypes.csv --baseline_file ./AIBL_Baseline_Integrated.csv --output_dir ./step20_results
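The A4 and AIBL transportability step assigns each externally projected participant to the nearest discovery subtype centroid in latent space. A minimal sketch of that assignment rule (Euclidean distance is an assumption; check step9A_cross_cohort_analysis.py for the metric actually used):

```python
import numpy as np

def assign_subtypes(latent, centroids):
    """Assign each latent row to the nearest centroid (Euclidean distance).

    latent    : (n_samples, n_latent) externally projected representations
    centroids : (n_subtypes, n_latent) discovery subtype centroids
    Returns an integer array of subtype indices, one per sample.
    """
    # Pairwise distances via broadcasting: (n, 1, d) - (1, k, d) -> (n, k)
    dists = np.linalg.norm(latent[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```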
Representative outputs include the following files.
Subtype assignments and latent representations
Variational autoencoder summary files and subtype centroids
Feature importance tables and model comparison files
Calibration, receiver operating characteristic, and confusion matrix plots
Boundary shift integral longitudinal analysis tables and figures
External cohort projected subtype files
Harvard Aging Brain Study performance summaries, decision curves, and kernel SHAP outputs
Independent holdout prediction outputs and AI versus clinician comparison results
Cross-cohort manuscript summary tables, including Cox_Time_Source_Metadata.csv, step16_manuscript_summary.csv, step20_aibl_summary.csv, and step21_a4_summary.csv
Set seeds are embedded in the scripts where applicable.
Cohort specific preprocessing assumptions remain script dependent and should be documented in any derivative publication.
Before manuscript submission, verify that the text matches the executable code for endpoint definition, imputation strategy, time source handling, and external cohort summary outputs.
If you use this repository, please cite the associated manuscript and the originating cohort studies.
This repository is released under the MIT License. See LICENSE.