HaibinLi0817/T2D_heterogeneity

T2D_heterogeneity

Code accompanying the manuscript:

Plasma Proteomic Signatures Characterize Type 2 Diabetes Heterogeneity
Dandan Tan, Yefeng Yang, Masashi Hasebe, Julia Carrasco-Zanini-Sanchez, Chen-Yang Su, Urvashi Singh, Leighton Smith, Megan Tsao, Aaron Leong, Miriam S. Udler, Jason Flannick, Guillaume Butler-Laporte, Claudia Langenberg, Tianyuan Lu, Satoshi Yoshiji


What this repo does

This repository contains the core analysis scripts used to:

  1. Preprocess and impute missing Olink Explore NPX values (panel-wise imputation; then merge).
  2. Select a T2D-related protein panel using L1-regularized logistic regression (LASSO).
  3. Residualize proteins against covariates and apply rank-based inverse normal transformation (RNT).
  4. Build a 2D latent proteomic space with Monocle2/DDRTree among incident T2D cases.
  5. Test associations between proteomic dimensions / clusters and pathway-partitioned genetic risk scores (GRS).

Note: Scripts currently contain hard-coded paths (e.g., /scratch/...). You will need to edit paths (and sometimes column indices) to match your environment.


Repository structure (run order)

This repo is intentionally “flat” (one script per step). Recommended execution order:

01. Split proteins by Olink panel + basic missingness filtering

  • 01.protein_by_panel.R
    • Loads a curated participant-level table (curated_full_df.tsv) and Olink assay metadata (olink_assay.dat)
    • Removes proteins with >40% missingness and individuals with >60% missingness
    • Writes panel-specific TSVs:
      • proteomics_Cardiometabolic.tsv, proteomics_Cardiometabolic_II.tsv,
        proteomics_Inflammation.tsv, proteomics_Inflammation_II.tsv,
        proteomics_Neurology.tsv, proteomics_Neurology_II.tsv,
        proteomics_Oncology.tsv, proteomics_Oncology_II.tsv
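
The two missingness thresholds can be sketched as follows. This is a minimal Python illustration, not the repository's R code: the function name `filter_by_missingness`, the toy table, and the order of the two filters (proteins first, then individuals) are assumptions.

```python
def filter_by_missingness(rows, max_protein_missing=0.40, max_sample_missing=0.60):
    """Drop sparse proteins, then sparse individuals.

    rows: dict mapping participant ID -> dict of protein name -> NPX (None = missing).
    Assumption: proteins are filtered first and individuals second.
    """
    proteins = sorted({p for values in rows.values() for p in values})
    n = len(rows)

    # Keep proteins whose missing fraction does not exceed the threshold.
    kept_proteins = [
        p for p in proteins
        if sum(values.get(p) is None for values in rows.values()) / n
        <= max_protein_missing
    ]

    # Keep individuals whose missing fraction over the kept proteins
    # does not exceed the threshold.
    filtered = {}
    for pid, values in rows.items():
        missing = sum(values.get(p) is None for p in kept_proteins)
        if kept_proteins and missing / len(kept_proteins) <= max_sample_missing:
            filtered[pid] = {p: values.get(p) for p in kept_proteins}
    return kept_proteins, filtered


# Toy data (hypothetical participant IDs and protein names).
toy = {
    "id1": {"P1": 1.0, "P2": None, "P3": 0.5},
    "id2": {"P1": 0.2, "P2": None, "P3": 0.8},
    "id3": {"P1": None, "P2": None, "P3": None},
    "id4": {"P1": 0.7, "P2": 0.1, "P3": 0.9},
    "id5": {"P1": 0.3, "P2": None, "P3": 0.4},
}
kept_proteins, filtered = filter_by_missingness(toy)
```

In the toy table, P2 is missing for 4 of 5 participants and is dropped, and id3 has no observed values and is dropped as well.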

02. Impute missing NPX values (panel-wise missForest)

  • 02.Imputation_per_pancel.R
    • Runs missForest per panel, with sex and age included as covariates in the imputation matrix
    • Takes the panel name as a command-line argument
      Example:
      Rscript 02.Imputation_per_pancel.R Cardiometabolic
    • Outputs: Imputed_NPX_missForest_<Panel>.RData
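
If you want to run the imputation for every panel in one go, a small driver along these lines may help. It is a sketch only: the panel names are inferred from the proteomics_<Panel>.tsv files written in step 01, and it is an assumption that the R script accepts exactly these strings. The helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess

# Panel names inferred from the step 01 output files
# (assumption: the R script accepts exactly these strings).
PANELS = [
    "Cardiometabolic", "Cardiometabolic_II",
    "Inflammation", "Inflammation_II",
    "Neurology", "Neurology_II",
    "Oncology", "Oncology_II",
]


def build_commands(script="02.Imputation_per_pancel.R", panels=PANELS):
    """Return one Rscript command (as an argument list) per panel."""
    return [["Rscript", script, panel] for panel in panels]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing panel


commands = build_commands()
run_all(commands)  # dry run: prints the eight commands without executing them
```

Call `run_all(commands, dry_run=False)` to actually launch the jobs. On a cluster you would more likely submit each command as a separate job, since missForest can be slow on large panels.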

03. Merge imputed panels + QC + add ancestry + RNT

  • 03.Imputation_qc.R
    • Loads all Imputed_NPX_missForest_<Panel>.RData files and concatenates imputed proteins across panels
    • Merges imputed proteins back to the curated participant table
    • Optionally runs a PCA sanity check
    • Joins genetic ancestry labels (expects an Ancestry_cluster.RData with an Ancestry field)
    • Applies rank-based inverse normal transformation (RNT) to each protein
    • Outputs the main downstream file:
      • imputed_participants_clinical_olink_RNT_anc.tsv
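
For readers unfamiliar with RNT, the transformation can be sketched in a few lines of Python. This is an illustration rather than the repository's R implementation: the function name `rank_inverse_normal` is hypothetical, and the Blom offset c = 3/8 and average ranks for ties are assumptions (the script may use a different offset or tie rule). Input values are assumed to be complete, as they would be after imputation.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform, using average ranks for ties.

    Assumption: Blom offset c = 3/8; other offsets (e.g. c = 0.5) are also common.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Map ranks to (0, 1) quantiles, then to standard normal scores.
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Toy NPX values for one protein (hypothetical).
transformed = rank_inverse_normal([2.3, -0.1, 5.8, 0.4, 1.7])
```

The output keeps the ordering of the input but follows an approximately standard normal distribution, which reduces the influence of outliers in downstream regressions.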

Protein panel selection (LASSO)

04.1 Train cross-validated L1 logistic regression (LASSO)

  • 04_a.Lasso_all_proteins.py
    • Reads: imputed_participants_clinical_olink_RNT_anc.tsv
    • Predicts: incident_T2D_case_control (must exist in the input table)
    • Uses LogisticRegressionCV(penalty="l1", solver="saga", scoring="roc_auc")
    • Saves:
      • train_indices.txt, test_indices.txt (reproducible split)
      • logistic_cv_model.pkl (trained model)

04.2 Export and rank selected proteins

  • 04_b.Lasso_select_proteins.py
    • Loads: logistic_cv_model.pkl
    • Extracts feature names + coefficients
    • Outputs:
      • ranked_features.csv (feature, coefficient, absolute coefficient, rank)
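
The ranking step can be sketched as below. The feature names and coefficients are made up, the helper names `rank_features` and `to_csv` are hypothetical, and the CSV column names are assumptions based on the description above. With a fitted scikit-learn model, the coefficients would come from its `coef_` attribute.

```python
import csv
import io

FIELDS = ["feature", "coefficient", "abs_coefficient", "rank"]  # assumed column names


def rank_features(names, coefs):
    """Return rows sorted by |coefficient| (largest first) with a 1-based rank."""
    pairs = sorted(zip(names, coefs), key=lambda item: abs(item[1]), reverse=True)
    return [
        {"feature": name, "coefficient": coef, "abs_coefficient": abs(coef), "rank": rank}
        for rank, (name, coef) in enumerate(pairs, start=1)
    ]


def to_csv(rows):
    """Serialize ranked rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical feature names and coefficients, for illustration only.
names = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
coefs = [0.12, -0.80, 0.0, 0.35]

ranked = rank_features(names, coefs)
# LASSO shrinks uninformative coefficients to exactly zero,
# so the non-zero rows form the selected panel.
nonzero = [row["feature"] for row in ranked if row["coefficient"] != 0.0]
csv_text = to_csv(ranked)
```

Ranking by absolute value treats risk-increasing and protective proteins alike; the sign is still available in the coefficient column.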

Residualization (covariate regression) for DDRTree input

04.3 Regress out covariates and RNT residuals (incident T2D cases)

  • 04_c.Logisitic_Linear_regressions.R
    • Reads: imputed_participants_clinical_olink_RNT_anc.tsv and ranked_features.csv
    • Filters to incident T2D cases (incident_T2D_case_control == 1)
    • For each selected protein, fits: protein ~ sex + age + PCs + centre + smoking_status + Batch + time_to_olink_processing + Ancestry
    • Stores residuals, then applies RNT to residuals
    • Outputs:
      • Linear_regression_Olink_sig_proteins_RNT_t2d_incident_lasso.Rdata
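
The residualization idea can be sketched with ordinary least squares on simulated data. This is a Python/NumPy illustration, not the repository's R code: the function name `residualize` is hypothetical, only two numeric covariates (age, sex) are simulated, and categorical covariates such as centre, smoking_status, Batch, and Ancestry would need dummy coding before entering the design matrix.

```python
import numpy as np


def residualize(y, covariates):
    """Return OLS residuals of y after regressing on covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


# Simulated data (hypothetical effect sizes, for illustration only).
rng = np.random.default_rng(0)
n = 200
age = rng.normal(60, 8, n)
sex = rng.integers(0, 2, n).astype(float)
protein = 0.05 * age + 0.4 * sex + rng.normal(0, 1, n)

# Residuals are uncorrelated with the covariates by construction.
resid = residualize(protein, np.column_stack([age, sex]))
```

The residuals carry the protein variation not explained by the covariates. In the workflow described above, RNT is then applied to these residuals before they enter DDRTree.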

Note: The current script selects 483 proteins ranked by absolute LASSO coefficient. You can modify it to keep all proteins with non-zero coefficients, or the top k, depending on your analysis plan.


DDRTree latent space construction (Monocle2)

05. DDRTree embedding and “state” (cluster) assignment

  • 05_a.DDRTree_monocle.R
    • Loads the residualized (RNT) protein matrix from the previous step
    • Creates a Monocle2 CellDataSet
    • Runs:
      • reduceDimension(..., reduction_method="DDRTree", max_components=2)
      • orderCells() (pseudotime + state/branch assignment)
    • Produces DDRTree trajectory plots colored by Pseudotime and State

  • 05_b.DDRTree_results.R
    • Loads the residualized (RNT) protein matrix and the DDRTree-derived dimensions and groups from the previous step
    • Runs multiple analyses: linear regression of clinical variables, logistic regression of clinical variables, and figure generation

05_c. Mapping function

  • 05_c.DDRTree_mapping.R
    • Reads:
      • Linear_regression_Olink_LASSO_proteins_RNT_t2d_incident.Rdata (residualized proteomics)
      • monocle2_t2d_incident_lasso.Rdata (DDRTree object: reduced dimensions, weights, cluster/state assignments)
      • imputed_participants_clinical_olink_RNT_anc.tsv (clinical + proteomics)
      • ranked_features.csv (LASSO-selected proteins)
    • Defines:
      • normalize_weight() and DDRTree_map() for projecting new samples into the incident T2D DDRTree space
      • 2D kernel density estimation helper for visualization
    • Runs:
      • Projection of incident T2D cases (original DDRTree embedding)
      • Projection of prevalent T2D cases into the incident-derived DDRTree space
      • Projection of prediabetes candidates into the same space
      • Kernel density estimation on DDRTree coordinates
    • Outputs:
      • Combined DDRTree visualization showing:
        • Incident T2D cases
        • Prediabetes candidates
        • Prevalent T2D cases
        • Corresponding 2D density maps
      • DDRTree_mapping_density_prevalent_incident_pret2d.pdf

Inputs expected (high level)

You will need (at minimum):

  • A participant-level table with clinical covariates + Olink NPX values (used as curated_full_df.tsv)
  • Olink assay-to-panel mapping (olink_assay.dat)
  • Genetic ancestry labels (Ancestry_cluster.RData or equivalent table with eid and Ancestry)
  • Case/control label: incident_T2D_case_control in the participant table

Because UK Biobank and EPIC-Norfolk are controlled-access resources, individual-level data are not distributed in this repository.


Software

R

Key packages used by the current scripts include: dplyr, data.table, tidyr, missForest, monocle, ggplot2 (plus others loaded in scripts).

Python

Key packages: pandas, numpy, scikit-learn, joblib, matplotlib.


Reproducibility notes

  • The LASSO training script sets a fixed random seed and saves train/test indices.
  • Several scripts use explicit column ranges (e.g., proteins starting at a specific column index). If you change the input table layout, update these indices accordingly.

License

This repository does not currently include a license file. If you plan to distribute this code publicly, add a LICENSE file (e.g., MIT or BSD-3-Clause are common defaults for academic analysis code).
