Structural Classification of Biosynthetic Gene Cluster Products

This repository contains the code and model weights necessary to run BGCat with user-supplied biosynthetic gene clusters (BGCs). BGCat provides detailed BGC product classification per the NPClassifier nomenclature based on the biosynthetic gene content of BGCs.

Environment Setup

We recommend using conda for managing this environment. The environment requires Python (3.10 or later), RDKit, and ESM. At the time of publication, the environment can be installed as follows:

conda create -n bgcat -c conda-forge rdkit python=3.10
conda activate bgcat
pip install esm httpx

Once installed, clone the repository and enter its directory.

git clone https://github.com/HassounLab/bgcat
cd bgcat

Next, download the ESM Cambrian (ESM C) 600M weights, and place them in data/weights:

wget -P data/weights https://huggingface.co/EvolutionaryScale/esmc-600m-2024-12/resolve/main/data/weights/esmc_600m_2024_12_v0.pth

Usage

The BGCat workflow consists of two major steps: embedding of BGCs with ESM C, followed by BGC product classification.

Start by identifying your BGCs of interest along with their biosynthetic genes. If you're using antiSMASH, retrieve all genes labeled biosynthetic or biosynthetic-additional (do not include regulatory, transport, or other genes). Construct an input JSON file containing a dictionary, where the keys are BGC identifiers and the values are the corresponding lists of biosynthetic genes, with each gene represented by a string of nucleotides. An example input file is provided in data/example-bgcs.json.
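The input format described above can be sketched in a few lines of Python. This is only an illustration, not part of BGCat: the helper name build_bgc_input, the BGC identifiers, the toy sequences, and the output file name my-bgcs.json are all made up, and the validation rules (upper-casing, an ACGTN alphabet) are assumptions rather than documented requirements of embed.py.

```python
import json

# Hypothetical helper, not part of BGCat. The accepted alphabet is an
# assumption: plain DNA bases plus N for ambiguous positions.
VALID_BASES = set("ACGTN")


def build_bgc_input(bgcs):
    """Return a {BGC identifier: [gene sequence, ...]} dictionary after basic checks."""
    cleaned = {}
    for bgc_id, genes in bgcs.items():
        if not genes:
            raise ValueError(f"{bgc_id}: no biosynthetic genes supplied")
        sequences = []
        for seq in genes:
            # Assumption: normalizing to upper case is harmless for the embedding step.
            seq = seq.strip().upper()
            if not seq or set(seq) - VALID_BASES:
                raise ValueError(f"{bgc_id}: invalid nucleotide sequence")
            sequences.append(seq)
        cleaned[bgc_id] = sequences
    return cleaned


# Made-up identifiers and toy sequences, for illustration only.
example = build_bgc_input({
    "bgc_1": ["ATGGCTAAATGA", "atgccctag"],
    "bgc_2": ["ATGAAATTTTAA"],
})

# Write the dictionary in the JSON layout described above.
with open("my-bgcs.json", "w") as handle:
    json.dump(example, handle, indent=2)
```

The resulting file can then be passed to the embedding script in place of the example input.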

Generate BGC embeddings using the embed.py script. The arguments to the script are the path to the input file, followed by the path to the output file. For instance, the following command will process the example BGC input file:

./embed.py data/example-bgcs.json data/features.json

Next, perform BGC product classification using the classify.py script. As with the previous script, the first argument is the path to the input file (here, the file with embeddings), and the second argument is the path to the output file with predictions. For example, the following command will generate product classifications using the embeddings from the previous step:

./classify.py data/features.json data/predictions.json

The output file is a JSON dictionary with two keys. The labels key provides an ordered list of the 294 NPClassifier classes that the model can predict. The bgcs key provides a dictionary, where the keys are BGC identifiers and the values are lists of the top 20 predicted classes. The classes are listed in rank order, so the first element is the highest-likelihood class. Each class is represented by its index in the labels list.
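To read the predictions as class names rather than indices, the indices can be looked up in the labels list. The sketch below assumes the indices are zero-based integers, which the description above does not state explicitly. The function names decode_predictions and decode_file are hypothetical, and the toy data uses placeholder label strings rather than the model's actual labels.

```python
import json


def decode_predictions(results, top_n=5):
    """Map each BGC's ranked class indices to class names from the labels list."""
    labels = results["labels"]
    return {
        # Assumption: indices are zero-based positions in the labels list.
        bgc_id: [labels[index] for index in ranked[:top_n]]
        for bgc_id, ranked in results["bgcs"].items()
    }


def decode_file(path, top_n=5):
    """Load a predictions file written by classify.py and decode it."""
    with open(path) as handle:
        return decode_predictions(json.load(handle), top_n)


# Toy stand-in for the predictions file. Real output has 294 labels and
# 20 ranked indices per BGC; these label strings are placeholders.
toy_results = {
    "labels": ["class A", "class B", "class C"],
    "bgcs": {"bgc_1": [2, 0, 1], "bgc_2": [1, 2, 0]},
}
decoded = decode_predictions(toy_results, top_n=2)
print(decoded)
```

With real output, calling decode_file("data/predictions.json") would return the same kind of dictionary, with the first name in each list being the highest-likelihood class.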

License

The code in this project is offered under the MIT license. See the LICENSE.md file for details.
