Authors: Benjamin Levy, Shuying Ni, Zihao Xu, Liyang Zhao
Predicting gene expression levels from upstream promoter regions using deep learning. Collaboration between IACS and Inari.
`scripts/`: directory for production code
- `0-data-loading-processing/`:
    - `01-gene-expression.py`: downloads and processes gene expression data and saves it to `B73_genex.txt`.
    - `02-download-process-db-data.py`: downloads and processes gene sequences from a specified database: 'Ensembl', 'Maize', 'Maize_addition', or 'Refseq'.
    - `03-combine-databases.py`: combines all the downloaded sequences from all the databases.
    - `04a-merge-genex-maize_seq.py`:
    - `04b-merge-genex-b73.py`:
    - `05a-cluster-maize_seq.sh`: clusters the promoter sequences into groups with up to 80% sequence identity, which may be interpreted as paralogs.
    - `05b-train-test-split.py`: divides the promoter sequences into train and test sets, avoiding any pair flagged as closely related ("paralogs").
    - `06_transformer_preparation.py`:
    - `07_train_tokenizer.py`: trains a byte-level BPE tokenizer for the RoBERTa model (see the sketch after this list).
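For context on the last step, here is a minimal sketch of training a byte-level BPE tokenizer with Huggingface's `tokenizers` library. The corpus path, vocabulary size, and output directory below are assumptions for illustration, not the settings used in `07_train_tokenizer.py`:

```python
# Minimal sketch of byte-level BPE training with the `tokenizers` library.
# The corpus file, vocab size, and output directory are assumed, not taken
# from the actual script.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/processed/promoter_sequences.txt"],  # hypothetical corpus file
    vocab_size=5000,                                  # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, the files a RoBERTa tokenizer expects.
out_dir = Path("models/byte-level-bpe-tokenizer")
out_dir.mkdir(parents=True, exist_ok=True)
tokenizer.save_model(str(out_dir))
```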
- `1-modeling/`:
    - `pretrain.py`: trains the FLORABERT base model using a masked language modeling task (sketched below). Type `python scripts/1-modeling/pretrain.py --help` to see command line options, including the choice of dataset and whether to warm-start from a partially trained model. Note: not all options are used by this script.
    - `finetune.py`: trains the FLORABERT regression model (including a newly initialized regression head) on multitask regression for gene expression in all 10 tissues. Type `python scripts/1-modeling/finetune.py --help` to see command line options, mainly for specifying data inputs and the output directory for saving model weights.
    - `evaluate.py`: computes metrics for the trained FLORABERT model.
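The following is a hedged sketch of the kind of masked language modeling setup `pretrain.py` implements, using Huggingface's `Trainer`. All paths, hyperparameters, and the toy dataset are illustrative assumptions, not the script's actual configuration:

```python
# Hedged sketch of MLM pretraining with Huggingface's Trainer API.
# Paths, hyperparameters, and the toy dataset are assumptions.
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("models/byte-level-bpe-tokenizer")
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# Stand-in dataset: real training would use the processed promoter sequences.
train_dataset = Dataset.from_dict({"text": ["ACGTACGTACGT" * 20]}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    remove_columns=["text"],
)

# Randomly mask 15% of tokens, the standard RoBERTa MLM recipe.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="models/transformer/language-model",  # assumed output dir
        num_train_epochs=1,
    ),
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```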
- [`2-feature-visualization/`](https://github.com/benlevyx/florabert/tree/master/scripts/2-feature-visualization):
    - `embedding_vis.py`: computes a sample of BERT embeddings for the test data and saves them to a TensorBoard log. Specify how many embeddings to sample with `--num-embeddings XX`, where `XX` is the number of embeddings (must be an integer). A sketch of this workflow follows the list.
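A minimal sketch of that workflow, assuming mean-pooled embeddings from a standard `RobertaModel` and logging via `torch.utils.tensorboard`; the model/tokenizer paths and the pooling choice are assumptions, not details of `embedding_vis.py`:

```python
# Hedged sketch of writing a sample of model embeddings to a TensorBoard log.
import torch
from torch.utils.tensorboard import SummaryWriter
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("models/byte-level-bpe-tokenizer")
model = RobertaModel.from_pretrained("models/transformer/language-model")
model.eval()

sequences = ["ACGTACGTACGT", "TTGACGGGCCAA"]  # stand-in promoter sequences
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    hidden = model(**enc).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mask-aware mean pooling over real (non-padding) tokens.
mask = enc["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

writer = SummaryWriter("runs/embedding-vis")
writer.add_embedding(embeddings, metadata=sequences, tag="promoter_embeddings")
writer.close()
```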
`module/`: directory for our customized modules
- `florabert/`: our main module, which packages customized functions:
    - `config.py`: project-wide configuration settings and absolute paths to important directories/files
    - `dataio.py`: utilities for performing I/O operations (reading and writing to/from files)
    - `gene_db_io.py`: helper functions to download and process gene sequences
    - `metrics.py`: functions for evaluating models
    - `nlp.py`: custom classes and functions for working with text/sequences
    - `training.py`: helper functions that make it easier to train models in PyTorch and with Huggingface's Trainer API, as well as custom optimizers and schedulers
    - `transformers.py`: implementation of a RoBERTa model with mean-pooling of the final token embeddings (illustrated in the sketch after this list), as well as functions for loading and working with Huggingface's transformers library
    - `utils.py`: general-purpose functions and code
    - `visualization.py`: helper functions to perform random k-mer flips during data processing and to make model predictions
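To illustrate the mean-pooling idea behind `transformers.py`, here is a hedged sketch of a RoBERTa regression model that mean-pools the final token embeddings. The class name, head layout, and all other details are assumptions, not the repo's actual `RobertaForSequenceClassificationMeanPool` implementation; the 10 outputs correspond to the 10 tissues mentioned above:

```python
# Hedged sketch of mean-pooled RoBERTa regression; not the repo's actual class.
import torch
import torch.nn as nn
from transformers import RobertaModel

class RobertaMeanPoolRegressor(nn.Module):  # hypothetical name
    def __init__(self, pretrained_path: str, num_outputs: int = 10):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(pretrained_path)
        # One linear head producing an expression value per tissue.
        self.head = nn.Linear(self.roberta.config.hidden_size, num_outputs)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        hidden = self.roberta(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden_dim)
        # Average token embeddings, counting only non-padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(pooled)  # (batch, num_outputs)
```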
If you wish to experiment with our pre-trained FLORABERT models, you can find the saved PyTorch models and the Huggingface tokenizer files here.
Contents:
- `byte-level-bpe-tokenizer`: files expected by a Huggingface `transformers.PreTrainedTokenizer` (loading sketched below):
    - `merges.txt`
    - `vocab.json`
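Assuming the directory follows the standard RoBERTa tokenizer layout, loading it might look like this (the local path is an assumption):

```python
# Hedged sketch: load the saved tokenizer files from a local directory.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("byte-level-bpe-tokenizer")
print(tokenizer("ACGTACGT").input_ids)  # token ids for a short sequence
```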
- `transformer`: Both language models can instantiate any RoBERTa model from Huggingface's `transformers` library. The prediction model should instantiate our custom `RobertaForSequenceClassificationMeanPool` model class. See the loading sketch after this list.
    - `language-model`: trained on all plant promoter sequences
    - `language-model-finetuned`: further trained on just the maize promoter sequences
    - `prediction-model`: fine-tuned on the multitask regression problem
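A hedged sketch of loading these checkpoints. The directory names follow the list above, while the import path of the custom class, and its support for `from_pretrained` (which holds if it subclasses a Huggingface `PreTrainedModel`), are assumptions:

```python
# Hedged sketch of instantiating the saved checkpoints.
from transformers import RobertaForMaskedLM

# The two language models are standard RoBERTa MLM checkpoints:
lm = RobertaForMaskedLM.from_pretrained("transformer/language-model")
lm_maize = RobertaForMaskedLM.from_pretrained("transformer/language-model-finetuned")

# The prediction model uses the custom mean-pooling class from this repo;
# the import path below is an assumption about how the module is laid out.
from florabert.transformers import RobertaForSequenceClassificationMeanPool

predictor = RobertaForSequenceClassificationMeanPool.from_pretrained(
    "transformer/prediction-model"
)
```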