This repository contains the code and model weights necessary to run BGCat with user-supplied biosynthetic gene clusters (BGCs). BGCat provides detailed BGC product classification per the NPClassifier nomenclature based on the biosynthetic gene content of BGCs.
We recommend using conda for managing this environment. The environment requires Python (3.10 or later), RDKit, and ESM. At the time of publication, the environment can be installed as follows:
conda create -n bgcat -c conda-forge rdkit python=3.10
conda activate bgcat
pip install esm httpx
Once installed, clone the repository and enter its directory:
git clone https://github.com/HassounLab/bgcat
cd bgcat
Next, download the ESM Cambrian (ESM C) 600M weights, and place them in data/weights:
wget -P data/weights https://huggingface.co/EvolutionaryScale/esmc-600m-2024-12/resolve/main/data/weights/esmc_600m_2024_12_v0.pth
The BGCat workflow consists of two major steps: embedding of BGCs with ESM C, followed by BGC product classification.
Start by identifying your BGCs of interest along with their biosynthetic genes. If you're using antiSMASH, retrieve all genes labeled biosynthetic or biosynthetic-additional (do not include regulatory, transport, or other genes). Construct an input JSON file containing a dictionary, where the keys are BGC identifiers and the values are the corresponding lists of biosynthetic genes, with each gene represented by a string of nucleotides. An example input file is provided in data/example-bgcs.json.
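The input format described above can be produced with a few lines of Python. The following is a minimal sketch, not part of BGCat itself: the helper name write_bgc_input, the output file name my-bgcs.json, and the BGC identifiers and sequences are all hypothetical placeholders. It checks only that every BGC has at least one non-empty gene string; any further requirements of embed.py are not verified here.

```python
import json
from pathlib import Path


def write_bgc_input(bgcs: dict[str, list[str]], path: str) -> None:
    """Write a {BGC identifier: [gene sequence, ...]} mapping as JSON.

    Hypothetical helper for illustration; it is not part of the BGCat repository.
    """
    for bgc_id, genes in bgcs.items():
        if not genes:
            raise ValueError(f"{bgc_id}: no biosynthetic genes listed")
        for seq in genes:
            if not isinstance(seq, str) or not seq:
                raise ValueError(f"{bgc_id}: each gene must be a non-empty string")
    Path(path).write_text(json.dumps(bgcs, indent=2))


# Placeholder identifiers and made-up nucleotide strings, not real BGC data.
example = {
    "bgc_0001": ["ATGGCTAAATAA", "ATGCCGTTTTGA"],
    "bgc_0002": ["ATGAAACGTTAG"],
}
write_bgc_input(example, "my-bgcs.json")
```

The resulting file can then be passed to embed.py as its first argument in place of the example input.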
Generate BGC embeddings using the embed.py script. The arguments to the script are the path to the input file, followed by the path to the output file. For instance, the following command will process the example BGC input file:
./embed.py data/example-bgcs.json data/features.json
Next, perform BGC product classification using the classify.py script. As with the previous script, the first argument is the path to the input file (here, the file with embeddings), and the second argument is the path to the output file with predictions. For example, the following command will generate product classifications using the embeddings from the previous step:
./classify.py data/features.json data/predictions.json
The output file is a JSON dictionary with two keys. The labels key provides an ordered list of the 294 NPClassifier classes that the model can predict. The bgcs key provides a dictionary, where the keys are BGC identifiers and the values are lists of the top 20 predicted classes. The classes are listed in rank order, so the first element is the highest-likelihood class. Each class is represented by its index in the labels list.
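Because predictions are stored as indices, they need to be mapped back through the labels list to obtain class names. The sketch below shows one way to do this, assuming the indices are stored as JSON integers. The function name decode_predictions is hypothetical, and the toy dictionary uses placeholder labels rather than real NPClassifier class names.

```python
import json


def decode_predictions(predictions: dict) -> dict[str, list[str]]:
    """Map each BGC's ranked class indices to class names via the labels list.

    Hypothetical helper; assumes the layout described above, with integer indices.
    """
    labels = predictions["labels"]
    return {
        bgc_id: [labels[i] for i in ranked]
        for bgc_id, ranked in predictions["bgcs"].items()
    }


# Toy stand-in for the contents of data/predictions.json (placeholder labels).
toy = {
    "labels": ["Class A", "Class B", "Class C"],
    "bgcs": {"bgc_0001": [2, 0, 1]},
}
decoded = decode_predictions(toy)
print(decoded["bgc_0001"][0])  # prints "Class C", the top-ranked class
```

For a real run, load the output file first, for example with json.load(open("data/predictions.json")), and pass the resulting dictionary to the function.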
The code in this project is offered under the MIT license. See the LICENSE.md file for details.
