Code accompanying the manuscript:
Plasma Proteomic Signatures Characterize Type 2 Diabetes Heterogeneity
Dandan Tan, Yefeng Yang, Masashi Hasebe, Julia Carrasco-Zanini-Sanchez, Chen-Yang Su, Urvashi Singh, Leighton Smith, Megan Tsao, Aaron Leong, Miriam S. Udler, Jason Flannick, Guillaume Butler-Laporte, Claudia Langenberg, Tianyuan Lu, Satoshi Yoshiji
This repository contains the core analysis scripts used to:
- Preprocess and impute missing Olink Explore NPX values (panel-wise imputation; then merge).
- Select a T2D-related protein panel using L1-regularized logistic regression (LASSO).
- Residualize proteins against covariates and apply rank-based inverse normal transformation (RNT).
- Build a 2D latent proteomic space with Monocle2/DDRTree among incident T2D cases.
- Test associations between proteomic dimensions / clusters and pathway-partitioned genetic risk scores (GRS).
Note: Scripts currently contain hard-coded paths (e.g., /scratch/...). You will need to edit paths (and sometimes column indices) to match your environment.
This repo is intentionally “flat” (one script per step). Recommended execution order:
01.protein_by_panel.R
- Loads a curated participant-level table (curated_full_df.tsv) and Olink assay metadata (olink_assay.dat)
- Removes proteins with >40% missingness and individuals with >60% missingness
- Writes panel-specific TSVs:
  proteomics_Cardiometabolic.tsv, proteomics_Cardiometabolic_II.tsv,
  proteomics_Inflammation.tsv, proteomics_Inflammation_II.tsv,
  proteomics_Neurology.tsv, proteomics_Neurology_II.tsv,
  proteomics_Oncology.tsv, proteomics_Oncology_II.tsv
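The missingness filter in this step can be illustrated with a minimal Python sketch. It is not the repository's R code: the function name, the row/column representation, and the order of operations (proteins first, then individuals evaluated on the remaining proteins) are assumptions made for illustration. Only the 40% and 60% thresholds come from the description above.

```python
def filter_by_missingness(rows, protein_cols,
                          max_protein_missing=0.40,
                          max_person_missing=0.60):
    """Drop proteins, then individuals, that exceed a missingness threshold.

    rows: list of dicts mapping column name -> value (None marks a missing NPX).
    Returns (kept_rows, kept_proteins).
    Assumption: proteins are filtered first and individuals are then
    evaluated on the proteins that remain.
    """
    if not rows:
        return [], []
    n = len(rows)
    kept_proteins = [
        p for p in protein_cols
        if sum(r[p] is None for r in rows) / n <= max_protein_missing
    ]
    if not kept_proteins:
        return [], []
    kept_rows = [
        r for r in rows
        if sum(r[p] is None for p in kept_proteins) / len(kept_proteins)
        <= max_person_missing
    ]
    return kept_rows, kept_proteins


# Hypothetical toy data: P2 is missing in 3 of 4 people; person 3 has no data.
toy = [
    {"eid": 1, "P1": 0.1, "P2": None, "P3": 1.2},
    {"eid": 2, "P1": 0.3, "P2": None, "P3": 0.7},
    {"eid": 3, "P1": None, "P2": None, "P3": None},
    {"eid": 4, "P1": 0.2, "P2": 0.5, "P3": 0.9},
]
kept_rows, kept_proteins = filter_by_missingness(toy, ["P1", "P2", "P3"])
```

In this toy example P2 is dropped (75% missing) and participant 3 is dropped (100% missing on the remaining proteins).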
02.Imputation_per_pancel.R
- Runs missForest per panel; sex and age are included as covariates in the imputation matrix
- Takes the panel name as a command-line argument
  Example: Rscript 02.Imputation_per_pancel.R Cardiometabolic
- Outputs: Imputed_NPX_missForest_<Panel>.RData
03.Imputation_qc.R
- Loads all Imputed_NPX_missForest_<Panel>.RData files and concatenates imputed proteins across panels
- Merges imputed proteins back to the curated participant table
- Optionally runs a PCA sanity check
- Joins genetic ancestry labels (expects an Ancestry_cluster.RData with an Ancestry field)
- Applies rank-based inverse normal transformation (RNT) to each protein
- Outputs the main downstream file: imputed_participants_clinical_olink_RNT_anc.tsv
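For readers unfamiliar with RNT, the following Python sketch shows one common form of the transformation. It is an illustration only, not the repository's R implementation: the Blom offset (c = 3/8), the average-rank handling of ties, and the function name are assumptions, and the actual script may use different conventions.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Rank-based inverse normal transformation of one protein.

    values: list of floats, with None for missing entries (kept as None).
    Ties receive the average of their ranks. The offset c = 3/8 (Blom)
    is an assumed convention, not taken from the repository.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    out = [None] * len(values)
    std_normal = NormalDist()
    pos = 0
    while pos < n:
        # Find the end of the current run of tied values.
        end = pos
        while end + 1 < n and observed[end + 1][0] == observed[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # 1-based average rank
        z = std_normal.inv_cdf((avg_rank - c) / (n - 2.0 * c + 1.0))
        for k in range(pos, end + 1):
            out[observed[k][1]] = z
        pos = end + 1
    return out


# Hypothetical NPX values for one protein, with one missing entry.
example = rank_inverse_normal([3.0, 1.0, None, 2.0])
```

The output preserves the ordering of the input while mapping it onto standard normal quantiles, so the middle value of three maps to approximately zero.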
04_a.Lasso_all_proteins.py
- Reads: imputed_participants_clinical_olink_RNT_anc.tsv
- Predicts: incident_T2D_case_control (must exist in the input table)
- Uses LogisticRegressionCV(penalty="l1", solver="saga", scoring="roc_auc")
- Saves:
  - train_indices.txt, test_indices.txt (reproducible split)
  - logistic_cv_model.pkl (trained model)
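The idea behind saving a reproducible split can be sketched with the standard library alone. This is not how the script performs the split; the function name, the 80/20 ratio, the seed value, and the one-index-per-line text format are all assumptions chosen for illustration.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train, test) row indices from a seeded shuffle.

    test_fraction and seed are illustrative values, not the script's settings.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, no global state touched
    n_test = int(round(n_samples * test_fraction))
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


train_idx, test_idx = split_indices(10)

# One index per line, in the form that could be written to
# train_indices.txt and test_indices.txt (assumed format).
train_text = "\n".join(str(i) for i in train_idx)
test_text = "\n".join(str(i) for i in test_idx)
```

Because the shuffle is driven by a fixed seed, calling the function again with the same arguments returns the same indices, which is what makes the saved files reproducible.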
04_b.Lasso_select_proteins.py
- Loads: logistic_cv_model.pkl
- Extracts feature names + coefficients
- Outputs: ranked_features.csv (feature, coefficient, absolute coefficient, rank)
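The ranking logic of this step can be sketched as follows. The feature names and coefficients here are made-up placeholders standing in for what would be read from the trained model, and the CSV column names are assumptions based on the description of ranked_features.csv above.

```python
import csv
import io


def rank_features(feature_names, coefficients):
    """Rank features by absolute coefficient, largest first (rank 1 = largest)."""
    rows = [
        {"feature": f, "coefficient": c, "abs_coefficient": abs(c)}
        for f, c in zip(feature_names, coefficients)
    ]
    rows.sort(key=lambda r: r["abs_coefficient"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


# Hypothetical names and coefficients, for illustration only.
names = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
coefs = [0.12, -0.80, 0.0, 0.35]
ranked = rank_features(names, coefs)

# Features that the L1 penalty did not shrink to exactly zero.
nonzero = [r["feature"] for r in ranked if r["coefficient"] != 0.0]

# Write to an in-memory buffer; a real run would open ranked_features.csv.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["feature", "coefficient", "abs_coefficient", "rank"]
)
writer.writeheader()
writer.writerows(ranked)
```

Ranking by absolute value treats strongly negative and strongly positive coefficients alike, which matches the "absolute coefficient" column described above.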
04_c.Logisitic_Linear_regressions.R
- Reads: imputed_participants_clinical_olink_RNT_anc.tsv and ranked_features.csv
- Filters to incident T2D cases (incident_T2D_case_control == 1)
- For each selected protein, fits:
  protein ~ sex + age + PCs + centre + smoking_status + Batch + time_to_olink_processing + Ancestry
- Stores residuals, then applies RNT to residuals
- Outputs: Linear_regression_Olink_sig_proteins_RNT_t2d_incident_lasso.Rdata
Note: The current script selects 483 proteins by absolute LASSO coefficient. You can modify it to keep all proteins with non-zero coefficients or the top k, depending on your analysis plan.
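The residualization in 04_c can be illustrated with a short NumPy sketch. The repository's script is written in R, so this is only a Python illustration of regressing a protein on covariates and keeping the residuals. The simulated data, the two numeric covariates, and the function name are assumptions; categorical covariates such as centre, Batch, or Ancestry would need to be dummy-coded before being passed in.

```python
import numpy as np


def residualize(y, covariates):
    """Return ordinary least squares residuals of y on covariates + intercept.

    covariates: 2D array-like, one row per participant, numeric columns only
    (categorical variables must be dummy-coded beforehand).
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


# Simulated data (hypothetical effect sizes, for illustration only).
rng = np.random.default_rng(0)
age = rng.normal(60, 8, size=200)
sex = rng.integers(0, 2, size=200)
protein = 0.05 * age + 0.4 * sex + rng.normal(size=200)

resid = residualize(protein, np.column_stack([age, sex]))
```

By construction the residuals have mean zero and are uncorrelated with each covariate in the model; in the pipeline described above they would then be rank-transformed (RNT).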
05_a.DDRTree_monocle.R
- Loads the residualized (RNT) protein matrix from the previous step
- Creates a Monocle2 CellDataSet
- Runs:
  - reduceDimension(..., reduction_method="DDRTree", max_components=2)
  - orderCells() (pseudotime + state/branch assignment)
- Produces DDRTree trajectory plots colored by Pseudotime and State
05_b.DDRTree_results.R
- Loads the residualized (RNT) protein matrix and the DDRTree-derived dimensions and groups from the previous step
- Runs multiple analyses: linear regression of clinical variables, logistic regression of clinical variables, and figure generation
05_c.DDRTree_mapping.R
- Reads:
  - Linear_regression_Olink_LASSO_proteins_RNT_t2d_incident.Rdata (residualized proteomics)
  - monocle2_t2d_incident_lasso.Rdata (DDRTree object: reduced dimensions, weights, cluster/state assignments)
  - imputed_participants_clinical_olink_RNT_anc.tsv (clinical + proteomics)
  - ranked_features.csv (LASSO-selected proteins)
- Defines:
  - normalize_weight() and DDRTree_map() for projecting new samples into the incident T2D DDRTree space
  - 2D kernel density estimation helper for visualization
- Runs:
  - Projection of incident T2D cases (original DDRTree embedding)
  - Projection of prevalent T2D cases into the incident-derived DDRTree space
  - Projection of prediabetes candidates into the same space
  - Kernel density estimation on DDRTree coordinates
- Outputs:
  - Combined DDRTree visualization showing:
    - Incident T2D cases
    - Prediabetes candidates
    - Prevalent T2D cases
    - Corresponding 2D density maps
  - DDRTree_mapping_density_prevalent_incident_pret2d.pdf
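To show what a 2D kernel density estimate on DDRTree coordinates involves, here is a minimal pure-Python sketch. It is not the helper defined in the R script: the isotropic Gaussian kernel, the fixed bandwidth, the grid, and the coordinates are all assumptions made for illustration.

```python
import math


def kde2d(points, grid_x, grid_y, bandwidth):
    """Evaluate a 2D Gaussian kernel density estimate on a grid.

    points: list of (x, y) coordinates, e.g. the two DDRTree dimensions.
    Returns density[i][j] for the grid location (grid_x[i], grid_y[j]).
    A single isotropic bandwidth is an assumed simplification.
    """
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)
    density = []
    for gx in grid_x:
        row = []
        for gy in grid_y:
            total = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2)
                         / (2.0 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        density.append(row)
    return density


# Hypothetical 2D coordinates forming two loose groups.
coords = [(-1.0, 0.0), (-0.8, 0.1), (1.0, 0.5), (1.2, 0.4), (1.1, 0.6)]
step = 0.25
grid = [i * step - 3.0 for i in range(25)]  # -3.0 to 3.0 inclusive
dens = kde2d(coords, grid, grid, bandwidth=0.5)

# The density should integrate to roughly 1 over a grid that covers the data.
total_mass = sum(sum(row) for row in dens) * step * step
```

Evaluating the same function separately for each group (incident, prevalent, prediabetes) on a shared grid would give density maps that can be compared directly.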
You will need (at minimum):
- A participant-level table with clinical covariates + Olink NPX values (used as curated_full_df.tsv)
- Olink assay-to-panel mapping (olink_assay.dat)
- Genetic ancestry labels (Ancestry_cluster.RData or equivalent table with eid and Ancestry)
- Case/control label: incident_T2D_case_control in the participant table
Because UK Biobank and EPIC-Norfolk are controlled-access resources, individual-level data are not distributed in this repository.
Key R packages used by the current scripts include: dplyr, data.table, tidyr, missForest, monocle, ggplot2 (plus others loaded in the scripts).
Key Python packages: pandas, numpy, scikit-learn, joblib, matplotlib.
- The LASSO training script sets a fixed random seed and saves train/test indices.
- Several scripts use explicit column ranges (e.g., proteins starting at a specific column index). If you change the input table layout, update these indices accordingly.
This repository does not currently include a license file. If you plan to distribute this code publicly, add a LICENSE file (e.g., MIT or BSD-3-Clause are common defaults for academic analysis code).