guomics-lab/GNHSF
GNHSF: Large Scale Metaproteomics Reveals Key Microbial Functions in Metabolic Diseases and Aging
This repository contains the main analysis code used to generate the figures in the GNHSF study.

Code Description
figure_R
This directory contains the R scripts used to generate the figures in the research paper.

Generate_matrix.R   #Prepares specific sample cohorts for downstream processing: N=1385 for cross-sectional analysis, N=954 for longitudinal analysis, and N=1039 for metagenomic integration.

arrange_link.R  #This script establishes the correspondence relationships among peptides, proteins, and taxa, generating the protein-taxa mapping files required by all subsequent analysis scripts.

Fig1D_FigS1ABC.R    #This script generates Figure 1D and Supplementary Figure 1. Note: Run figs1_part_getBC.R first to calculate Bray-Curtis distance matrices for all replicate types. Includes:
Fig 1D: Spearman correlations across all types of QCs.
Fig S1A-B, Fig 1C: Correlation coefficients and Bray-Curtis distances for all replicates
Fig S1C: PCoA of all 2,512 samples
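
The Bray-Curtis distance used to compare replicates can be illustrated with a short Python sketch. This is not the code in figs1_part_getBC.R; the function name `bray_curtis` and the intensity values are hypothetical, and returning 0.0 for two all-zero samples is a convention chosen here for illustration.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        # Assumption: treat two all-zero samples as identical.
        return 0.0
    # Sum of absolute differences divided by the total abundance of both samples.
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical feature intensities for one sample and its technical replicate.
sample = [10.0, 0.0, 5.0, 3.0]
replicate = [8.0, 1.0, 5.0, 4.0]
bc = bray_curtis(sample, replicate)
```

A value near 0 indicates that the two replicates have nearly identical profiles, and a value of 1 means they share no features.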

Fig1E_FigS1FG.R #This script generates:
Fig 1E: Average identification counts per sample at each taxonomic level
Fig S1F: Distribution histogram of sample identification counts
Fig S1G: Breakdown of microbial proteins versus human proteins

Fig2.R  #This script processes and visualizes results from Generalized Linear Model (GLM) analysis. Note: Run fig2_glm_1385.R first to obtain complete GLM results. Includes:
Fig S3D: GLM associations grouped by clinical categories
Fig 2A: Summary of top 6 associations
Fig 2B-D: Heatmap visualization of the most significant associations at different taxonomic levels

Fig3.R  #This script analyzes metaproteomic features associated with aging. Note: Run fig3_glmm_954.R first to calculate within-subject associations using Generalized Linear Mixed Models (GLMM). Includes:
Fig 3A-F: Aging-associated metaproteomic features

Fig4.R  #This script identifies and visualizes metaproteomic features commonly associated with metabolic diseases. Includes:
Fig 4A-C: Shared metaproteomic signatures across metabolic diseases

Fig5.R  #This script performs medication-weighted GLM calculations and generates related visualizations. Includes:
Fig 5B-G: Medication-responsive metaproteomic features in metabolic diseases
Fig S7D: Medication-specific proteins and their corresponding species in T2D

Fig6_FigS6_FigS7.R  #This script analyzes and visualizes T2D-associated features. Note: Perform GLM analysis on the FH cohort and run machine learning code to export ML-related features before executing relevant sections. Includes:
Fig 6A: Network visualization of T2D-associated species
Fig 6C: Network visualization of T2D-associated metaproteomic features
Fig S6B: Comparison between metaproteomics and metagenomics data
Fig S6C: T2D-related species and their produced microbial protein groups
Fig S7C: GLM associations of M. elsdenii proteins with T2D and T2D medication

Fig7.R  #This script visualizes in vivo and in vitro biological validation data. Includes:
Fig 7: All panels for biological validation experiments

FigS1DE_mapping.R   #This script calculates, for each sample, the proportion annotated to taxa or functions. Generates Fig S1D-E: Annotation coverage statistics.

FigS2_count.R   #This script generates Fig S2A-H: Count statistics of top features

FigS2A-H.R  #This script generates all panels in Fig S2 A-H.

FigS2I-K.R  #This script generates all panels in Fig S2 I-K.

FigS3_FigS4.R   #This script calculates and visualizes all core features of the GNHSF metaproteomic dataset. Includes Fig S3 and Fig S4: All panels showing core metaproteomic features

FigS5.R #This script performs the PERMANOVA analysis for Fig S5A.

ML_py
This directory contains the Python scripts used for machine learning in the research paper. Includes:
evaluate.py                # Calculate AUC and other metrics from predicted probabilities and true labels.
test_model_extra.py        # Evaluate model performance on the external test set.
test_model_inter.py        # Evaluate model performance on the internal test set.
train_model.py             # Train models on different proteomics datasets.
validation_analysis.py     # Calculate model validation metrics.
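
As an illustration of computing AUC from predicted probabilities and true labels, the following minimal sketch uses the pairwise (Mann-Whitney) definition of ROC-AUC with the standard library only. It is not the implementation in evaluate.py; the function name `roc_auc` and the example labels and scores are hypothetical.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute ROC-AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = case, 0 = control) and predicted probabilities.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, probabilities)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large test sets.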

figures
This directory contains the Python scripts and results of ROC-AUC and PR-AUC of machine learning models. Includes:
auc_curves                     # This directory contains the ROC-AUC curve plots of internal and external tests
    external_roc.pdf
    internal_roc.pdf
boxplot                        # This directory contains the scripts and box plots of ROC-AUC and PR-AUC. Includes:
    seed_auc_boxplot_external.pdf
    seed_auc_boxplot_internal.pdf
    seed_prauc_boxplot_external.pdf
    seed_prauc_boxplot_internal.pdf
    plot_auc_boxplots.py          # Generate boxplots illustrating the distribution of ROC-AUC scores across 20 random seeds.
    plot_pr_auc_boxplots.py       # Create boxplots that depict the distribution of PR-AUC (Precision-Recall Area Under the Curve) scores across 20 random seeds.
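
A box plot of scores across random seeds is drawn from a five-number summary. The sketch below shows one way to compute that summary with the standard library. It is not taken from plot_auc_boxplots.py; the function name `five_number_summary` is hypothetical, and the scores are made-up placeholder values (five instead of 20, for brevity), not results from the study.

```python
import statistics


def five_number_summary(scores):
    """Return min, Q1, median, Q3 and max for a list of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    # method="inclusive" interpolates linearly between observed values.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }


# Placeholder scores, one per random seed (illustrative only).
seed_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
summary = five_number_summary(seed_scores)
```

The box spans `q1` to `q3` with a line at `median`; a narrow box indicates that model performance is stable across seeds.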

Usage Notes
Some scripts require prerequisite scripts to be run first (as noted in descriptions above).
Ensure all dependency files and intermediate results are properly generated before running downstream analyses.
Scripts are named according to their corresponding figures in the manuscript.
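
The prerequisite relationships noted in the script descriptions can be turned into a valid run order with a topological sort. The sketch below encodes only the three dependencies stated explicitly above and assumes Python 3.9 or later for `graphlib`; the variable names are illustrative, and other scripts may have additional requirements that are not captured here.

```python
from graphlib import TopologicalSorter

# Each key is a script; its value is the set of scripts that must run first.
# Only the prerequisites named in the descriptions above are listed.
prerequisites = {
    "Fig1D_FigS1ABC.R": {"figs1_part_getBC.R"},
    "Fig2.R": {"fig2_glm_1385.R"},
    "Fig3.R": {"fig3_glmm_954.R"},
}

# static_order() yields every script after all of its prerequisites.
run_order = list(TopologicalSorter(prerequisites).static_order())

for script in run_order:
    print(script)
```

If a dependency cycle were introduced by mistake, `static_order()` would raise `graphlib.CycleError` rather than return an invalid order.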

About

Population-based metaproteomics reveals key microbial functions in metabolic diseases and aging