A machine learning–based framework designed to systematically infer regulatory mechanisms underlying metabolic dysregulation in different conditions
Three Python scripts are provided, corresponding to the three major steps of MetaSage:
Feature_generation.py
This script generates per-metabolite input files for downstream model training. For each target metabolite, the output file contains:
- The abundance of the target metabolite
- Multi-omics–derived features associated with that metabolite
The script requires the following input files:
- gene_expression_file: Gene expression matrix derived from omics data (e.g., RNA-seq or proteomics).
  - Rows: genes (gene symbols)
  - Columns: samples (sample IDs in the first row)
- metabolite_expression_file: Metabolite expression matrix from metabolomics data.
  - Rows: metabolites (unified metabolite names)
  - Columns: samples (sample IDs in the first row)
- meta_gene_relation_file: A curated mapping file describing, for each target metabolite:
  - Associated genes
  - Upstream reactants
  These relationships are derived from known genome-scale metabolic models (GEMs) and filtered based on the study-specific multi-omics datasets.
- ESTIMATE_score_results: A matrix containing four inferred tumor microenvironment scores generated by the ESTIMATE algorithm:
  - Stromal score
  - Immune score
  - ESTIMATE score
  - Tumor purity
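The way these inputs combine into one per-metabolite table can be sketched as follows. This is a minimal illustration with pandas, not the code in Feature_generation.py: the function name `build_feature_table`, the dictionary layout of `relations`, and the toy gene, metabolite, and score values are all assumptions made for the example.

```python
import pandas as pd

def build_feature_table(target, gene_expr, metab_expr, relations, estimate_scores):
    """Assemble a samples x features table for one target metabolite.

    gene_expr, metab_expr, estimate_scores: DataFrames with features as rows
    and sample IDs as columns (the layout described above).
    relations: hypothetical dict, metabolite -> {"genes": [...], "reactants": [...]}.
    """
    rel = relations[target]
    # Keep only the genes and reactants that were actually measured.
    genes = [g for g in rel["genes"] if g in gene_expr.index]
    reactants = [m for m in rel["reactants"] if m in metab_expr.index]

    parts = [
        metab_expr.loc[[target]].T.rename(columns={target: "target"}),
        gene_expr.loc[genes].T,
        metab_expr.loc[reactants].T,
        estimate_scores.T,
    ]
    # Inner join on sample IDs so every row has values from every data type.
    return pd.concat(parts, axis=1, join="inner")

# Toy inputs (illustrative values only).
samples = ["S1", "S2", "S3"]
gene_expr = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                         index=["HK1", "PFKM"], columns=samples)
metab_expr = pd.DataFrame([[0.5, 0.7, 0.9], [1.5, 1.1, 1.3]],
                          index=["glucose", "G6P"], columns=samples)
estimate_scores = pd.DataFrame([[10.0, 20.0, 30.0]],
                               index=["ImmuneScore"], columns=samples)
relations = {"G6P": {"genes": ["HK1", "MISSING_GENE"], "reactants": ["glucose"]}}

table = build_feature_table("G6P", gene_expr, metab_expr, relations, estimate_scores)
print(table)
```

Each row of `table` is a sample, the `target` column holds the metabolite abundance, and the remaining columns are candidate features. Genes listed in the mapping but absent from the expression matrix (here `MISSING_GENE`) are dropped silently in this sketch.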
Predictability_assessment.py
This script implements an XGBoost-based regression model to assess the predictability of each target metabolite.
- Input: feature files generated in the Feature Generation step
- Output: a .tsv file summarizing model performance, including the Pearson correlation coefficients and p-values between the observed and predicted metabolite abundance from 5-fold cross-validation.
Regulator_prioritization.py
This script retrains the model on the complete dataset for each metabolite identified as predictable in the previous step. Feature importance is evaluated using SHAP (SHapley Additive exPlanations) values, and features are ranked by their mean absolute SHAP value. The top-ranking features are considered the most influential regulators of the corresponding metabolite.
- Input: feature files generated in the Feature Generation step
- Output: a .tsv file summarizing the average absolute SHAP values of all features, and a visualization illustrating the SHAP values at the individual-sample level.
MetaSage: Machine Learning-Based Prioritization of Metabolic Regulators from Multi-Omics Data
Chenwei Wang, John M. Elizarraras, Bing Zhang