Skip to content

MetabioinfomicsLab/MICROPHERRET

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 

Repository files navigation

MICROPHERRET

MICROPHERRET: MICRObial PHEnotypic tRait classifieR using machine lEarning Techniques

MICROPHERRET is a machine learning-based for the prediction of 89 metabolic and ecological prokaryotic phenotypes from gene annotations. Starting from KEGG Orthologs, it can efficiently classify genomes and metagenomes by integrating different supervised optimized machine learning models - logistic regression, random forest, support vector machines and neural networks - maximizing the prediction accuracy. MICROPHERRET is an efficient approach for functional prediction of MAGs till 70% of genome completeness. The tool can be easily expanded and refined by the users to predict functions of interest.

Installation

Create a conda environment with Python 3.9 and packages:

conda create -n micropherret python=3.9 pandas biopython scikit-learn=1.1.2 numpy tensorflow
pip install tensorflow-addons

To enter the environment:

conda activate micropherret

Clone this git repository

git clone https://github.com/MetabioinfomicsLab/MICROPHERRET

The files needed to run the scripts can be retrieved here: https://mega.nz/folder/FAVSRYTT#ElNlwvpSMuXSZu9NDoIp3w

To run the scripts, create a folder called "MICROPHERRET" and add the files and script into it, respecting the folder structure.

Requirements

Random forest models were saved in .sav files with the parameter n_job set to 6 or fewer. Consequently, training MICROPHERRET requires at least 6 CPUs. To modify this parameter, users can retrain and optimize the models by following the previously described procedure.

Usage

Functional prediction

For simply predicting the functions of your genome of interest you need to provide the .annotations file from EggNOG-mapper. In the MICROPHERRET folder, put the annotations of your genomes or MAGs in a folder called "annotations". Run "get_annotation_matrix.ipynb" and specify the folder where the files are located. The obtained output, called "ko_df_all.csv", stores a matrix with the genetic information (KO copy number) per genome, with genomes as rows and KO as columns.

Here's an example of ko_df_all.csv's structure from "get_annotation_matrix.ipynb":

- KO1 KO2 ... KO3
GCF_XXXXXXXXXX.X 2 0 ... 0
GCF_XXXXXXXXXX.X 0 1 ... 5
... ... ... ... ...
GCF_XXXXXXXXXX.X 1 3 ... 2

Run "predict_functions.py" from the "annotation" folder and provide the user-generated annotation matrix ko_df_all.csv as input to the script (-i file.csv). Additionally, a matrix containing real functions as columns and genomes as rows (e.g. function_db.csv) can be provided using the optional parameter -f file.csv to evaluate of prediction results.

Example:

python predict_functions.py -i ko_df_all.csv -f function_db.csv

The resulting predict_function.csv file contains the genomes used in input as rows and the prediction for each functional class of MICROPHERRET in the columns. If the prediction is positive and the genome performs a particular function, a "1" is present in the cell, otherwise a "0". Here's an example:

- hydrocarbon_degradation methanogenesis ... human_gut
GCF_XXXXXXXXXX.X 0 1 ... 1
GCF_XXXXXXXXXX.X 1 1 ... 0
... ... ... ... ...
GCF_XXXXXXXXXX.X 0 0 ... 1

The number of genomes classified into each function is also saved in the file predict_sum.csv. The script predicts the functions of 1 genome/second.

Class generation and refinement

MICROPHERRET's models can be refined to fit the user's needs by changing the sets of genomes performing the functions stored in the training set. At the same time, the training set can be expanded by modifying the training set adding genomes linked to the function the user wishes to predict. In both cases, the users can follow the instructions in "create_new_class.py". The inputs required to run the script are .annotations files from EggNOG-mapper for all the genomes linked to the function of interest. The script will then open and parse .annotations files and from each file it retrieves the KOs. Those are stored in a dictionary with KOs as keys and copy numbers as values. The dictionary is then used with the main script KO matrix to train the different machine-learning algorithms and after selecting the best-performing model it saves it. The resulting model and scaler can then be added to the folder containing MICROPHERRET models.

Parameters to use:

  • -g folder

Folder containing the .annotations files from EggNOG-mapper of the genomes that will be used for model training

  • --from_faprotax file.csv

Dataset of genomes from FAPROTAX used for MICROPHERRET training, i.e. genome_ko_all.csv from the matrix folder

  • -f string

Name of the function-specific classifier that will be trained with the genomes. If it is new, a new model will be trained. Otherwise, a previous model will be retrained using the provided genomes

  • -r file.txt

optional .txt file with the list of genomes that we want to remove from the previous training set, one per line. If genomes present in FAPROTAX are suspected to perform the function but were not provided in -g, they should be absolutely removed

Example: python create_new_class.py -g genomes_folder/ --from_faprotax faprotax_annotation_matrix.csv -f new_function -r genomes_to_remove.txt

After creating the class, add the saved model and scaler with the others in the MICROPHERRET folder to run your predictions.

Class refinement example: acetoclastic methanogenesis

The code to run the script is available with the other files in the "Example" folder. After installing MICROPHERRET and activating the environment, simply run create_new_class.py from the folder to obtain a refined version of the classifier to predict acetoclastic methanogenesis. The exact command to run is the following:

python3 create_new_class.py -g annotations/ --from_faprotax my_faprotax_db.csv  -f acetoclastic_methanogenesis 

The resulting files will be the .sav and .scaler models that need to be put into the model folder in order to predict MICROPHERRET functions.

Update: MICROPHERRET for anaerobic digestion phenotypes

MICROPHERRET has been updated to allow the prediction of 9 new phenotypes of the anaerobic digestion process, most of which are currently understudied and underrepresented in the literature. The models were trained following the Class generation and refinement strategy described above.

Usage

First check that the folders MICROPHERRET/matrix/Training_sets and MICROPHERRET/saved_models are not empty. If so, download again the corresponding folders from Meta (see above).

Then, a new version of the script predict_functions_v2.py needs to be launched specifying the -ad parameter:

python3 predict_functions_v2.py -i <path_csv_file or path_annotations_files> -o <path_output_folder> -ad

Parameters to use:

  • -i path_csv_file or path_annotations_files

File .csv storing KO information obtained launching get_annotation_matrix.ipynb (see example and explanation above). UPDATE: The path to the folder containing the .annotations files from EggNOG-mapper can be specified instead, the script will generate the desired matrix automatically.

  • -o path_csv_file or path_annotations_files

Path to output folder

  • -ad

If specified, the new 9 anaerobic digestion specific models will be included in the prediction, resulting in 98 predicted phenotypes! Remove the -ad argument if interested in the 89 phenotypes described in the paper, as done in the previous version (see above). If -ad is specified, the script will take time to load all the different training datasets so you will need to be more patient than usual!

Important: the refined, more correct version of the acetoclastic methanogenesis model (described in the Class refinement and generation section as case study) is employed in this new version of the script!

Models performances

Models were trained on custom-made datasets and evaluated with cross validation. Information on the number on the models, their performances (measured with Matthew's correlation coefficient) and the functional class size (No. genomes) used in the training set can be found below:

Function No. genomes Average MCC Status
Starch degradation 51 0.85 New
Cellulose degradation 25 0.89 Refined
Butanoate oxidation 22 0.98 New
Propionate oxidation 25 1.00 New
Homoacetogenesis 33 0.98 New
Homoacetogenesis + CO oxidation 42 0.80 New
CO oxidation 14 0.80 New
Hydrogen production 34 0.85 New
Sulfate reduction 30 1.00 Refined

Citation

If you use MICROPHERRET, please cite:

Bizzotto, E., Fraulini, S., Zampieri, G. et al. MICROPHERRET: MICRObial PHEnotypic tRait ClassifieR using Machine lEarning Techniques. Environmental Microbiome 19, 58 (2024). https://doi.org/10.1186/s40793-024-00600-6

About

MICROPHERRET: MICRObial PHEnotypic tRait ClassifieR using machine lEarning Techniques

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •