MICROPHERRET

MICROPHERRET: MICRObial PHEnotypic tRait classifieR using machine lEarning Techniques

MICROPHERRET is a machine learning-based for the prediction of 89 metabolic and ecological prokaryotic phenotypes from gene annotations. Starting from KEGG Orthologs, it can efficiently classify genomes and metagenomes by integrating different supervised optimized machine learning models - logistic regression, random forest, support vector machines and neural networks - maximizing the prediction accuracy. MICROPHERRET is an efficient approach for functional prediction of MAGs till 70% of genome completeness. The tool can be easily expanded and refined by the users to predict functions of interest.

Installation

Create a conda environment with Python 3.9 and packages:

conda create -n micropherret python=3.9 pandas biopython scikit-learn=1.1.2 numpy tensorflow
pip install tensorflow-addons

To enter the environment:

conda activate micropherret

Clone this git repository

git clone https://github.com/MetabioinfomicsLab/MICROPHERRET

The files needed to run the scripts can be retrieved here: https://mega.nz/folder/FAVSRYTT#ElNlwvpSMuXSZu9NDoIp3w

To run the scripts, create a folder called "MICROPHERRET" and add the files and script into it, respecting the folder structure.

Requirements

Random forest models were saved in .sav files with the parameter n_job set to 6 or fewer. Consequently, training MICROPHERRET requires at least 6 CPUs. To modify this parameter, users can retrain and optimize the models by following the previously described procedure.

Usage

Functional prediction

For simply predicting the functions of your genome of interest you need to provide the .annotations file from EggNOG-mapper. In the MICROPHERRET folder, put the annotations of your genomes or MAGs in a folder called "annotations". Run "get_annotation_matrix.ipynb" and specify the folder where the files are located. The obtained output, called "ko_df_all.csv", stores a matrix with the genetic information (KO copy number) per genome, with genomes as rows and KO as columns.

Here's an example of ko_df_all.csv's structure from "get_annotation_matrix.ipynb":

-	KO₁	KO₂	...	KO₃
GCF_XXXXXXXXXX.X	2	0	...	0
GCF_XXXXXXXXXX.X	0	1	...	5
...	...	...	...	...
GCF_XXXXXXXXXX.X	1	3	...	2

Run "predict_functions.py" from the "annotation" folder and provide the user-generated annotation matrix ko_df_all.csv as input to the script (-i file.csv). Additionally, a matrix containing real functions as columns and genomes as rows (e.g. function_db.csv) can be provided using the optional parameter -f file.csv to evaluate of prediction results.

Example:

python predict_functions.py -i ko_df_all.csv -f function_db.csv

The resulting predict_function.csv file contains the genomes used in input as rows and the prediction for each functional class of MICROPHERRET in the columns. If the prediction is positive and the genome performs a particular function, a "1" is present in the cell, otherwise a "0". Here's an example:

-	hydrocarbon_degradation	methanogenesis	...	human_gut
GCF_XXXXXXXXXX.X	0	1	...	1
GCF_XXXXXXXXXX.X	1	1	...	0
...	...	...	...	...
GCF_XXXXXXXXXX.X	0	0	...	1

The number of genomes classified into each function is also saved in the file predict_sum.csv. The script predicts the functions of 1 genome/second.

Class generation and refinement

MICROPHERRET's models can be refined to fit the user's needs by changing the sets of genomes performing the functions stored in the training set. At the same time, the training set can be expanded by modifying the training set adding genomes linked to the function the user wishes to predict. In both cases, the users can follow the instructions in "create_new_class.py". The inputs required to run the script are .annotations files from EggNOG-mapper for all the genomes linked to the function of interest. The script will then open and parse .annotations files and from each file it retrieves the KOs. Those are stored in a dictionary with KOs as keys and copy numbers as values. The dictionary is then used with the main script KO matrix to train the different machine-learning algorithms and after selecting the best-performing model it saves it. The resulting model and scaler can then be added to the folder containing MICROPHERRET models.

Parameters to use:

-g folder

Folder containing the .annotations files from EggNOG-mapper of the genomes that will be used for model training

--from_faprotax file.csv

Dataset of genomes from FAPROTAX used for MICROPHERRET training, i.e. genome_ko_all.csv from the matrix folder

-f string

Name of the function-specific classifier that will be trained with the genomes. If it is new, a new model will be trained. Otherwise, a previous model will be retrained using the provided genomes

-r file.txt

optional .txt file with the list of genomes that we want to remove from the previous training set, one per line. If genomes present in FAPROTAX are suspected to perform the function but were not provided in -g, they should be absolutely removed

Example: python create_new_class.py -g genomes_folder/ --from_faprotax faprotax_annotation_matrix.csv -f new_function -r genomes_to_remove.txt

After creating the class, add the saved model and scaler with the others in the MICROPHERRET folder to run your predictions.

Class refinement example: acetoclastic methanogenesis

The code to run the script is available with the other files in the "Example" folder. After installing MICROPHERRET and activating the environment, simply run create_new_class.py from the folder to obtain a refined version of the classifier to predict acetoclastic methanogenesis. The exact command to run is the following:

python3 create_new_class.py -g annotations/ --from_faprotax my_faprotax_db.csv  -f acetoclastic_methanogenesis

The resulting files will be the .sav and .scaler models that need to be put into the model folder in order to predict MICROPHERRET functions.

Update: MICROPHERRET for anaerobic digestion phenotypes

MICROPHERRET has been updated to allow the prediction of 9 new phenotypes of the anaerobic digestion process, most of which are currently understudied and underrepresented in the literature. The models were trained following the Class generation and refinement strategy described above.

Usage

First check that the folders MICROPHERRET/matrix/Training_sets and MICROPHERRET/saved_models are not empty. If so, download again the corresponding folders from Meta (see above).

Then, a new version of the script predict_functions_v2.py needs to be launched specifying the -ad parameter:

python3 predict_functions_v2.py -i <path_csv_file or path_annotations_files> -o <path_output_folder> -ad

Parameters to use:

-i path_csv_file or path_annotations_files

File .csv storing KO information obtained launching get_annotation_matrix.ipynb (see example and explanation above). UPDATE: The path to the folder containing the .annotations files from EggNOG-mapper can be specified instead, the script will generate the desired matrix automatically.

-o path_csv_file or path_annotations_files

Path to output folder

-ad

If specified, the new 9 anaerobic digestion specific models will be included in the prediction, resulting in 98 predicted phenotypes! Remove the -ad argument if interested in the 89 phenotypes described in the paper, as done in the previous version (see above). If -ad is specified, the script will take time to load all the different training datasets so you will need to be more patient than usual!

Important: the refined, more correct version of the acetoclastic methanogenesis model (described in the Class refinement and generation section as case study) is employed in this new version of the script!

Models performances

Models were trained on custom-made datasets and evaluated with cross validation. Information on the number on the models, their performances (measured with Matthew's correlation coefficient) and the functional class size (No. genomes) used in the training set can be found below:

Function	No. genomes	Average MCC	Status
Starch degradation	51	0.85	New
Cellulose degradation	25	0.89	Refined
Butanoate oxidation	22	0.98	New
Propionate oxidation	25	1.00	New
Homoacetogenesis	33	0.98	New
Homoacetogenesis + CO oxidation	42	0.80	New
CO oxidation	14	0.80	New
Hydrogen production	34	0.85	New
Sulfate reduction	30	1.00	Refined

Citation

If you use MICROPHERRET, please cite:

Bizzotto, E., Fraulini, S., Zampieri, G. et al. MICROPHERRET: MICRObial PHEnotypic tRait ClassifieR using Machine lEarning Techniques. Environmental Microbiome 19, 58 (2024). https://doi.org/10.1186/s40793-024-00600-6

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
Scripts		Scripts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MICROPHERRET

Installation

Requirements

Usage

Functional prediction

Class generation and refinement

Class refinement example: acetoclastic methanogenesis

Update: MICROPHERRET for anaerobic digestion phenotypes

Usage

Models performances

Citation

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

MetabioinfomicsLab/MICROPHERRET

Folders and files

Latest commit

History

Repository files navigation

MICROPHERRET

Installation

Requirements

Usage

Functional prediction

Class generation and refinement

Class refinement example: acetoclastic methanogenesis

Update: MICROPHERRET for anaerobic digestion phenotypes

Usage

Models performances

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages