seDSM is a model for predicting deleterious synonymous mutations based on a selective ensemble scheme.
Figure 1. Experimental flowchart. (A) Base classifier training. Multiple balanced training subsets are generated from the imbalanced benchmark training set by random under-sampling, and each balanced subset is used to build base classifiers with random feature selection. Three machine learning algorithms are used in this process: support vector machine, decision tree and logistic regression. (B) Model selection. A diversity measure is calculated for each model in the model pool, and the models with the better diversity measures are selected for integration. Finally, the models are evaluated on the validation data.
Although previous studies have suggested that synonymous mutations drive or participate in various complex human diseases, accurately distinguishing deleterious synonymous mutations from benign ones remains a challenge in medical genomics. Several computational tools have been developed to predict the harmfulness of synonymous mutations. However, most of these tools were built on balanced training sets and ignore the abundant negative samples, which may lead to deficient performance. In this study, we propose seDSM, a novel model for predicting deleterious synonymous mutations that makes full use of the abundant negative samples through a selective ensemble scheme based on pairwise diversity. First, we built a model pool containing a large number of candidate classifiers, each trained on a balanced training subset randomly sampled from the imbalanced training set. Second, we selected a number of base classifiers from the model pool based on pairwise diversity measures and integrated them by soft voting. Finally, we constructed seDSM and compared its performance with that of other tools. On the two independent test sets, seDSM surpasses the other tools in this field on multiple evaluation indicators, suggesting strong predictive performance for deleterious synonymous mutations. We hope that our model will contribute to further study of the prediction of deleterious synonymous mutations.
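The selective ensemble idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the seDSM implementation: the function names are hypothetical, base classifier training is omitted, and the disagreement measure is assumed here as one common pairwise diversity measure, since the text does not name the measure that seDSM uses.

```python
import random
from itertools import combinations


def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Random under-sampling: pair all positives with an equally sized
    random draw of negatives (assumes len(negatives) >= len(positives))."""
    rng = random.Random(seed)
    k = len(positives)
    return [(list(positives), rng.sample(negatives, k)) for _ in range(n_subsets)]


def disagreement(pred_a, pred_b):
    """Fraction of samples on which two classifiers predict different labels."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def select_diverse(predictions, n_select):
    """Rank each model by its total disagreement with all other models
    and return the indices of the n_select most diverse ones."""
    n = len(predictions)
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        d = disagreement(predictions[i], predictions[j])
        totals[i] += d
        totals[j] += d
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:n_select])


def soft_vote(probabilities, threshold=0.5):
    """Average the positive-class probabilities of the selected models
    per sample, then threshold the mean to get a 0/1 label."""
    n_models = len(probabilities)
    means = [sum(column) / n_models for column in zip(*probabilities)]
    return [int(p >= threshold) for p in means]
```

In this sketch, each balanced subset would be used to train one or more candidate classifiers, `select_diverse` would be applied to their predicted labels on held-out data, and `soft_vote` would combine the predicted probabilities of the selected models.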
- Install Python 3.9 on Linux or Windows.
- Because the program is written in Python 3.9, Python 3.9 with the pip tool must be installed first.
- seDSM uses the following dependencies: numpy, pandas, scikit-learn (imported as sklearn) and DESlib. You can install these packages with the following commands:
pip install numpy
pip install pandas
pip install scikit-learn
pip install deslib
- If pip is not yet installed on Linux (Debian or Ubuntu), install it first with the following command:
sudo apt install python3-pip
- After that, run the pip install commands above again.
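After installation, a quick check along the following lines can confirm that the dependencies are importable. This is only a sketch and not part of seDSM; the function name `missing_modules` is hypothetical, and the import names listed are the standard ones for these packages.

```python
from importlib.util import find_spec

# Import names can differ from pip package names:
# scikit-learn is imported as "sklearn", DESlib as "deslib".
REQUIRED_MODULES = ("numpy", "pandas", "sklearn", "deslib")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name it reports as missing can be installed with pip as described above.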
Open cmd on Windows or a terminal on Linux, then cd to the BBBPred-master/codes folder, which contains predict.py.
To predict general synonymous mutations using our model, run:
python predict.py --input [input data in CSV format] --output [prediction results in CSV format]
Example:
python predict.py --input ./example/example.csv --output ./results/results.csv
Here, --input is the CSV file containing the data you want to predict, and --output is the CSV file in which the predicted results are stored.
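If you prefer to call the script from Python, for example to process several files, a small wrapper such as the following may help. It is a sketch only: `build_command` and `run_prediction` are hypothetical helper names, it assumes it is run from the folder containing predict.py, and it creates the output folder beforehand because it is not stated whether predict.py does so itself.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_csv, output_csv, script="predict.py"):
    """Assemble the predict.py call with its --input and --output flags."""
    return [sys.executable, script,
            "--input", str(input_csv),
            "--output", str(output_csv)]


def run_prediction(input_csv, output_csv, script="predict.py"):
    """Check that the input exists, create the output folder, run the script."""
    if not Path(input_csv).is_file():
        raise FileNotFoundError(f"input file not found: {input_csv}")
    # Assumption: the output folder may not exist yet, so create it first.
    Path(output_csv).parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if predict.py exits with an error.
    subprocess.run(build_command(input_csv, output_csv, script), check=True)
```

For instance, `run_prediction("./example/example.csv", "./results/results.csv")` corresponds to the example command above.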