This repository has been migrated to https://github.com/alcantarar/BiomchBERT, where it is being loosely maintained.
We use Machine Learning to predict the general topic of a biomechanics-related paper given its title. To accomplish this, we:
- Developed an HTML web scraper to extract the paper information and assigned paper topic from every Biomch-L Literature Update since 2010 (webscraper.py).
- Trained and compared multiple classification Machine Learning algorithms (keras_1.py & test_many_ML_algorithms_nn.ipynb).
- Created a Python notebook (literature_search.ipynb) that:
  - Searches PubMed for Biomechanics-related papers published in the past week,
  - Uses the top-performing Machine Learning model (keras-1, a Deep Neural Network with 73.5% accuracy) to predict the paper topic for the week’s papers,
  - Compiles papers, formats their citations, and organizes them by topic, saving the result to a .md file in Literature Updates.
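
As a rough illustration of the last step (compiling papers and organizing them by topic), here is a minimal sketch in plain Python. It is not the code from literature_search.ipynb: the dictionary keys (`authors`, `title`, `journal`, `year`, `topic`), the citation layout, and the function names are all assumptions made for this example.

```python
from collections import defaultdict


def format_citation(paper):
    """Format one paper as a single citation line (hypothetical layout)."""
    return f"{paper['authors']}. {paper['title']}. {paper['journal']} ({paper['year']})."


def render_update(papers):
    """Group papers by predicted topic and return markdown text."""
    by_topic = defaultdict(list)
    for paper in papers:
        by_topic[paper["topic"]].append(paper)

    lines = []
    for topic in sorted(by_topic):  # alphabetical topic order
        lines.append(f"## {topic}")
        lines.append("")
        for paper in by_topic[topic]:
            lines.append(f"- {format_citation(paper)}")
        lines.append("")
    return "\n".join(lines)
```

The returned string could then be written to a .md file with an ordinary `open(path, "w")` call.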
Contains the files to construct the models. The two main files are keras_1.py and test_many_ML_algorithms_nn.ipynb.
- keras_1.py - Fits a deep neural network to the data contained in Data. Saves the models into models. The vectorizer and label encoders are saved here as well.
- test_many_ML_algorithms_nn.ipynb - Fits multiple machine learning methods to the data in Data. Includes Multinomial Naive Bayes, Logistic Regression, Stochastic Gradient Descent (SGD), Linear Support Vector Classification, and Multi-layer Perceptron Classifier. Saves the fitted models into models. The vectorizer and label encoders are saved here as well.
- keras_eval.py - A small script to evaluate the Keras neural network on test strings.
Where the webscraped data is stored.
- RYANDATA.csv - The full csv file including paper number, Category/Topic, Authors, Title, Journal, Year, Volume and Issue, DOI, and Abstract. Named this way because Gary just thought he would hand the data off and not get really really caught up in this. Boy, was he wrong.
- RYANDATA_filt.csv - Has all the same headers as RYANDATA.csv, but filters out topics that represent less than 5% of the total papers.
- RYANDATA_filt_even.csv - An evenly downsampled (by topic) csv of RYANDATA_filt.csv. Each topic has the same number of representations in this csv.
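
The two derived files can be thought of as a filter followed by an even downsample. The sketch below shows one way to do that with the standard library only. It is an assumption about the procedure, not the script that produced the CSVs, and the `topic` key and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def filter_rare_topics(rows, min_share=0.05):
    """Keep rows whose topic makes up at least min_share of all rows."""
    counts = Counter(row["topic"] for row in rows)
    total = len(rows)
    keep = {topic for topic, n in counts.items() if n / total >= min_share}
    return [row for row in rows if row["topic"] in keep]


def downsample_evenly(rows, seed=0):
    """Randomly sample every topic down to the size of the smallest topic."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_topic = defaultdict(list)
    for row in rows:
        by_topic[row["topic"]].append(row)
    if not by_topic:
        return []
    smallest = min(len(group) for group in by_topic.values())
    sampled = []
    for topic in sorted(by_topic):
        sampled.extend(rng.sample(by_topic[topic], smallest))
    return sampled
```

Rows here are plain dictionaries, such as those produced by `csv.DictReader`, so the same functions would work on any of the CSV files once the topic column is mapped to the `topic` key.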
Where weekly updates can be stored in markdown & csv format for publishing.
Where all the model files are saved after being created.
- Keras_model - Location of all the Keras Neural Net files. Some neural net files are too large to upload to Git on their own, so they are split. Using 7-zip (Windows) or Keka (macOS), you can recombine these files to create the model file and weights file.
- Many_ML_models - Location where the files from testing the many ML methods are saved. The mpl file will need to be recombined using 7-zip/Keka, similar to the Keras Neural Net files.
Model validation plots are saved here. Usually a confusion matrix.
The python file to scrape the Biomch-L forum.
IPython Notebook to generate the literature update. Uses Biopython v1.73 to perform a literature search, then a given ML model to classify the papers. Saves the results in a markdown file in literature update.
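
For readers unfamiliar with Biopython, a past-week PubMed search might look roughly like the sketch below. This is an illustration under stated assumptions, not the notebook's actual code: the function names, the default result limit, and any search term you pass in are hypothetical. `reldate` and `datetype` are standard NCBI ESearch parameters, and `Entrez.esearch` / `Entrez.read` are part of Biopython's `Bio.Entrez` module.

```python
def build_search_params(term, days=7, max_results=200):
    """Build ESearch parameters for papers published in the last `days` days."""
    return {
        "db": "pubmed",
        "term": term,
        "reldate": days,       # only items dated within the last `days` days
        "datetype": "pdat",    # measure that window by publication date
        "retmax": max_results, # upper bound on returned IDs (assumed value)
    }


def search_pubmed(params, email):
    """Run the search and return a list of PubMed IDs (needs network access)."""
    # Imported lazily so build_search_params works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(**params)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return record["IdList"]
```

Keeping the parameter construction separate from the network call makes the query easy to inspect or adjust before anything is sent to NCBI.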
- BeautifulSoup is used to scrape the web for the articles to feed into the ML models.
- Keras and Scikit-learn are used to construct ML models.
- Biopython is used to access PubMed. Requires version 1.73 or newer.
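
If you want to confirm the Biopython requirement before running the notebook, a small check along these lines may help. It is a sketch only: the helper names are hypothetical, and the parsing assumes simple dotted version strings such as `1.73` or `1.79.dev0`.

```python
def parse_version(text):
    """Turn '1.73' into (1, 73); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, required="1.73"):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)
```

With Biopython installed, the installed version string is available as `Bio.__version__`, so the check would be `meets_minimum(Bio.__version__)`.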
