T2D_heterogeneity

Code accompanying the manuscript:

Plasma Proteomic Signatures Characterize Type 2 Diabetes Heterogeneity
Dandan Tan, Yefeng Yang, Masashi Hasebe, Julia Carrasco-Zanini-Sanchez, Chen-Yang Su, Urvashi Singh, Leighton Smith, Megan Tsao, Aaron Leong, Miriam S. Udler, Jason Flannick, Guillaume Butler-Laporte, Claudia Langenberg, Tianyuan Lu, Satoshi Yoshiji

What this repo does

This repository contains the core analysis scripts used to:

- Preprocess and impute missing Olink Explore NPX values (panel-wise imputation, then merge).
- Select a T2D-related protein panel using L1-regularized logistic regression (LASSO).
- Residualize proteins against covariates and apply rank-based inverse normal transformation (RNT).
- Build a 2D latent proteomic space with Monocle2/DDRTree among incident T2D cases.
- Test associations between proteomic dimensions / clusters and pathway-partitioned genetic risk scores (GRS).

Note: Scripts currently contain hard-coded paths (e.g., /scratch/...). You will need to edit paths (and sometimes column indices) to match your environment.

Repository structure (run order)

This repo is intentionally “flat” (one script per step). Recommended execution order:

01. Split proteins by Olink panel + basic missingness filtering

01.protein_by_panel.R
- Loads a curated participant-level table (curated_full_df.tsv) and Olink assay metadata (olink_assay.dat)
- Removes proteins with >40% missingness and individuals with >60% missingness
- Writes panel-specific TSVs:
  - proteomics_Cardiometabolic.tsv, proteomics_Cardiometabolic_II.tsv,
    proteomics_Inflammation.tsv, proteomics_Inflammation_II.tsv,
    proteomics_Neurology.tsv, proteomics_Neurology_II.tsv,
    proteomics_Oncology.tsv, proteomics_Oncology_II.tsv
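
The missingness filter above can be sketched in Python as follows. This is a minimal illustration, not the code in 01.protein_by_panel.R: the function name, the toy column names, and the order of the two filters (proteins first, then individuals) are assumptions.

```python
import numpy as np
import pandas as pd


def filter_missingness(npx, max_protein_missing=0.40, max_sample_missing=0.60):
    """Drop proteins, then individuals, whose missing fraction exceeds a threshold."""
    # Fraction of missing values per protein (column).
    protein_missing = npx.isna().mean(axis=0)
    kept = npx.loc[:, protein_missing <= max_protein_missing]
    # Fraction of missing values per individual (row), computed on the kept proteins.
    sample_missing = kept.isna().mean(axis=1)
    return kept.loc[sample_missing <= max_sample_missing]


# Toy NPX table: rows are participants, columns are proteins (hypothetical names).
toy = pd.DataFrame(
    {
        "PROT_A": [1.2, 0.8, np.nan, 1.1, 0.9],
        "PROT_B": [np.nan, np.nan, np.nan, 0.4, 0.5],  # 60% missing, so dropped
        "PROT_C": [0.3, 0.6, np.nan, 0.2, 0.1],
    },
    index=["id1", "id2", "id3", "id4", "id5"],
)
filtered = filter_missingness(toy)
print(filtered)
```

In this toy table PROT_B is removed at the protein step, and participant id3 is then removed because all of its remaining proteins are missing.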

02. Impute missing NPX values (panel-wise missForest)

02.Imputation_per_panel.R
- Runs missForest per panel; sex and age are included as covariates in the imputation matrix
- Takes the panel name as a command-line argument
  Example:
```
Rscript 02.Imputation_per_panel.R Cardiometabolic
```
- Outputs: Imputed_NPX_missForest_<Panel>.RData

03. Merge imputed panels + QC + add ancestry + RNT

03.Imputation_qc_PCA.R
- Loads all Imputed_NPX_missForest_<Panel>.RData files and concatenates imputed proteins across panels
- Merges imputed proteins back to the curated participant table
- Optionally runs a PCA sanity check
- Joins genetic ancestry labels (expects an Ancestry_cluster.RData with an Ancestry field)
- Applies rank-based inverse normal transformation (RNT) to each protein
- Outputs the main downstream file:
  - imputed_participants_clinical_olink_RNT_anc.tsv
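
For readers unfamiliar with RNT, the sketch below shows one common form of the transformation: values are replaced by their ranks, the ranks are mapped to quantiles, and the quantiles are passed through the inverse standard normal CDF. The function name and the offset `c = 0.5` are assumptions for illustration; the scripts in this repository are written in R and may use a different offset (for example Blom's 3/8).

```python
from statistics import NormalDist

import pandas as pd


def rank_inverse_normal(values, c=0.5):
    """Rank-based inverse normal transformation; missing values stay missing."""
    series = pd.Series(values, dtype=float)
    # Average ranks for ties; NaN entries receive a NaN rank.
    ranks = series.rank(method="average")
    n = series.notna().sum()
    # Map ranks into the open interval (0, 1) before applying the inverse CDF.
    quantiles = (ranks - c) / (n - 2 * c + 1)
    inverse_cdf = NormalDist().inv_cdf
    return quantiles.map(lambda q: inverse_cdf(q), na_action="ignore")


# Hypothetical NPX values for a single protein.
transformed = rank_inverse_normal([2.0, 5.0, 3.0, 9.0, 4.0])
print(transformed.round(3).tolist())
```

The transformation preserves the ordering of the input values while forcing the marginal distribution towards a standard normal, which is why it is applied per protein.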

Protein panel selection (LASSO)

04.1 Train cross-validated L1 logistic regression (LASSO)

04_a.Lasso_all_proteins.py
- Reads: imputed_participants_clinical_olink_RNT_anc.tsv
- Predicts: incident_T2D_case_control (must exist in the input table)
- Uses LogisticRegressionCV(penalty="l1", solver="saga", scoring="roc_auc")
- Saves:
  - train_indices.txt, test_indices.txt (reproducible split)
  - logistic_cv_model.pkl (trained model)
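
A minimal sketch of this training step on synthetic data is shown below. Only the `penalty`, `solver`, and `scoring` arguments come from the description above; the simulated data, the 80/20 stratified split, the seed, `Cs`, `cv`, and `max_iter` are assumptions chosen for illustration. Depending on the installed scikit-learn version, this call may emit deprecation warnings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the protein matrix: only the first three features carry signal.
rng = np.random.default_rng(0)
n_samples, n_features = 300, 20
X = rng.normal(size=(n_samples, n_features))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2]
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Split on row indices so the split itself can be saved and reused.
indices = np.arange(n_samples)
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=y
)

# Cross-validated L1-penalized logistic regression, scored by ROC AUC.
model = LogisticRegressionCV(
    Cs=10,
    cv=5,
    penalty="l1",
    solver="saga",
    scoring="roc_auc",
    max_iter=5000,
    random_state=42,
)
model.fit(X[train_idx], y[train_idx])

n_selected = int(np.count_nonzero(model.coef_))
print(f"{n_selected} of {n_features} features have non-zero coefficients")
```

Keeping the split as index arrays makes it straightforward to write them to text files and to evaluate the saved model on exactly the same held-out rows later.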

04.2 Export and rank selected proteins

04_b.lasso_select_proteins.py
- Loads: logistic_cv_model.pkl
- Extracts feature names + coefficients
- Outputs:
  - ranked_features.csv (feature, coefficient, absolute coefficient, rank)
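
The ranking step can be sketched as follows. The function name, the column names, and the toy coefficients are hypothetical; in the actual workflow the feature names and coefficients would come from the loaded model rather than from hard-coded lists.

```python
import numpy as np
import pandas as pd


def rank_features(feature_names, coefficients):
    """Build a table of features ordered by absolute coefficient (largest first)."""
    table = pd.DataFrame(
        {
            "feature": list(feature_names),
            "coefficient": np.asarray(coefficients, dtype=float),
        }
    )
    table["abs_coefficient"] = table["coefficient"].abs()
    # Stable sort keeps the original order among features with equal magnitude.
    table = table.sort_values(
        "abs_coefficient", ascending=False, kind="stable"
    ).reset_index(drop=True)
    table["rank"] = np.arange(1, len(table) + 1)
    return table


# Hypothetical protein names and LASSO coefficients.
names = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
coefs = [0.10, -0.75, 0.0, 0.30]
ranked = rank_features(names, coefs)
selected = ranked[ranked["abs_coefficient"] > 0]  # proteins retained by the L1 penalty
print(ranked)
```

Calling `ranked.to_csv("ranked_features.csv", index=False)` would write a table with the same four kinds of columns listed above.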

Residualization (covariate regression) for DDRTree input

04.3 Regress out covariates and RNT residuals (incident T2D cases)

04_c.Logisitic_Linear_regressions.R
- Reads: imputed_participants_clinical_olink_RNT_anc.tsv and ranked_features.csv
- Filters to incident T2D cases (incident_T2D_case_control == 1)
- For each selected protein, fits: protein ~ sex + age + PCs + centre + smoking_status + Batch + time_to_olink_processing + Ancestry
- Stores residuals, then applies RNT to residuals
- Outputs:
  - Linear_regression_Olink_sig_proteins_RNT_t2d_incident_lasso.Rdata

Note: The current script selects the top 483 proteins ranked by absolute LASSO coefficient (you can modify this to keep all non-zero proteins or a different top-k, depending on your analysis plan).
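
The residualization idea can be illustrated with an ordinary least squares fit in Python. This is a sketch under stated assumptions rather than a port of the R script: the function name, the simulated covariates, and the reduced covariate set (age, sex, centre) are hypothetical, and the subsequent RNT of the residuals is not shown.

```python
import numpy as np
import pandas as pd


def residualize(protein, covariates):
    """Return residuals of an ordinary least squares fit: protein ~ covariates."""
    # One-hot encode categorical covariates, dropping one level per factor.
    design = pd.get_dummies(covariates, drop_first=True).astype(float)
    design.insert(0, "intercept", 1.0)
    X = design.to_numpy()
    y = np.asarray(protein, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


# Simulated covariates and one protein that depends on age and sex.
rng = np.random.default_rng(1)
n = 200
covariates = pd.DataFrame(
    {
        "age": rng.normal(55, 8, n),
        "sex": rng.choice(["F", "M"], n),
        "centre": rng.choice(["c1", "c2", "c3"], n),
    }
)
protein = 0.05 * covariates["age"] + 0.4 * (covariates["sex"] == "M") + rng.normal(size=n)

residuals = residualize(protein, covariates)
print(np.corrcoef(residuals, covariates["age"])[0, 1])
```

By construction the residuals are uncorrelated with every column of the design matrix, so downstream structure in the DDRTree space is not driven by the linear effects of these covariates.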

DDRTree latent space construction (Monocle2)

05. DDRTree embedding and “state” (cluster) assignment

05_a.DDRTree_monocle.R
- Loads the residualized (RNT) protein matrix from the previous step
- Creates a Monocle2 CellDataSet
- Runs:
  - reduceDimension(..., reduction_method="DDRTree", max_components=2)
  - orderCells() (pseudotime + state/branch assignment)
- Produces DDRTree trajectory plots colored by Pseudotime and State

05_b.DDRTree_results.R
- Loads the residualized (RNT) protein matrix and the DDRTree-derived dimensions and groups from the previous step
- Runs multiple analyses: linear and logistic regressions of clinical variables, plus figure generation

05_c. Mapping function

05_c.DDRTree_mapping.R
- Reads:
  - Linear_regression_Olink_LASSO_proteins_RNT_t2d_incident.Rdata (residualized proteomics)
  - monocle2_t2d_incident_lasso.Rdata (DDRTree object: reduced dimensions, weights, cluster/state assignments)
  - imputed_participants_clinical_olink_RNT_anc.tsv (clinical + proteomics)
  - ranked_features.csv (LASSO-selected proteins)
- Defines:
  - normalize_weight() and DDRTree_map() for projecting new samples into the incident T2D DDRTree space
  - 2D kernel density estimation helper for visualization
- Runs:
  - Projection of incident T2D cases (original DDRTree embedding)
  - Projection of prevalent T2D cases into the incident-derived DDRTree space
  - Projection of prediabetes candidates into the same space
  - Kernel density estimation on DDRTree coordinates
- Outputs:
  - Combined DDRTree visualization showing:
    - Incident T2D cases
    - Prediabetes candidates
    - Prevalent T2D cases
    - Corresponding 2D density maps
  - DDRTree_mapping_density_prevalent_incident_pret2d.pdf

Inputs expected (high level)

You will need (at minimum):

- A participant-level table with clinical covariates + Olink NPX values (used as curated_full_df.tsv)
- Olink assay-to-panel mapping (olink_assay.dat)
- Genetic ancestry labels (Ancestry_cluster.RData or equivalent table with eid and Ancestry)
- Case/control label: incident_T2D_case_control in the participant table

Because UK Biobank and EPIC-Norfolk are controlled-access resources, individual-level data are not distributed in this repository.

Software

R

Key packages used by the current scripts include: dplyr, data.table, tidyr, missForest, monocle, ggplot2 (plus others loaded in scripts).

Python

Key packages: pandas, numpy, scikit-learn, joblib, matplotlib.

Reproducibility notes

- The LASSO training script sets a fixed random seed and saves train/test indices.
- Several scripts use explicit column ranges (e.g., proteins starting at a specific column index). If you change the input table layout, update these indices accordingly.
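
If you adapt the scripts, one way to reduce dependence on column positions is to select protein columns by name. The sketch below is a hypothetical helper, not part of the repository; it assumes you have a list of assay names (for example taken from the assay-to-panel mapping) that match the column names of the participant table.

```python
import pandas as pd


def protein_columns(table, assay_names):
    """Return the columns of `table` listed in `assay_names`, in table order."""
    wanted = set(assay_names)
    found = [col for col in table.columns if col in wanted]
    missing = sorted(wanted - set(found))
    if missing:
        # Fail loudly instead of silently analysing the wrong columns.
        raise KeyError(f"assays absent from table: {missing}")
    return found


# Toy participant table with hypothetical clinical and protein columns.
participants = pd.DataFrame(
    {
        "eid": [1, 2],
        "age": [54, 61],
        "PROT_A": [0.2, 0.5],
        "PROT_B": [1.1, 0.9],
    }
)
cols = protein_columns(participants, ["PROT_B", "PROT_A"])
print(participants[cols])
```

Selecting by name keeps the analysis correct if clinical columns are added or reordered, at the cost of requiring consistent assay naming across files.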

License

This repository does not currently include a license file. If you plan to distribute this code publicly, add a LICENSE file (e.g., MIT or BSD-3-Clause are common defaults for academic analysis code).
