PubChem-scraper

Drug discovery is a long and cumbersome process. Computational methods such as molecular docking and simulations have reduced the burden to some extent, but creating, downloading and converting the files required for docking is still very repetitive.

In most cases, once a lead compound for a drug is found, we search for similar compounds to see whether any of them show increased affinity for the protein of interest. Databases such as PubChem and ChEMBL are used for this search. PubChem is a widely used database of chemical molecules and their activities, maintained by the National Center for Biotechnology Information (NCBI) under the National Institutes of Health (NIH), USA. Searching these databases by hand is slow and takes a lot of human effort.

To automate the search, I have created this tool as an extension of the previous PDBQT extractor. It searches the PubChem database for compounds similar to the input compound and downloads the sdf files of the ligand molecules from PubChem.

Docking with AutoDock requires PDBQT files, so the sdf files downloaded from PubChem need to be converted to PDBQT before they can be docked against the protein of interest. This conversion is done with OpenBabel, a widely used molecule file format converter that is available as a Linux command line tool. NOTE: The conversion should not be trusted blindly. A manual check may be required to confirm that the 3D structure and bond orders are correct.
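The sdf to PDBQT conversion step could look roughly like the sketch below. It assumes the `obabel` executable is on your PATH and uses its basic `input -O output` form. The helper names `obabel_command` and `convert` are hypothetical, and the actual script may call OpenBabel with different options.

```python
import subprocess
from pathlib import Path


def obabel_command(sdf_path, out_dir=None):
    """Build the obabel command that converts one sdf file to PDBQT.

    The output file keeps the input's base name and gets a .pdbqt suffix.
    (Hypothetical helper, not taken from the original script.)
    """
    sdf = Path(sdf_path)
    target_dir = Path(out_dir) if out_dir is not None else sdf.parent
    pdbqt = target_dir / (sdf.stem + ".pdbqt")
    # obabel infers both formats from the file extensions.
    return ["obabel", str(sdf), "-O", str(pdbqt)]


def convert(sdf_path, out_dir=None):
    """Run the conversion and return the path of the PDBQT file."""
    cmd = obabel_command(sdf_path, out_dir)
    # check=True raises CalledProcessError if obabel exits with an error.
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Building the command as a list, rather than a shell string, avoids quoting problems with unusual file names. The converted files still need the manual check described in the note above.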

To use this tool, you only need one of the following: the structure of the compound as a pdb, pdbqt or sdf file, the SMILES string of the compound, or its PubChem CID.
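As an illustration of how a SMILES string or CID can drive a similarity search, the sketch below builds request URLs for PubChem's PUG REST service: a 2D similarity search that returns matching CIDs, and an sdf download for a single CID. Whether this script uses these exact endpoints is not stated in this README, so treat the function names, the default threshold and the record limit as assumptions.

```python
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def similarity_url(identifier, kind="cid", threshold=90, max_records=50):
    """URL for a 2D similarity search that returns matching CIDs as JSON.

    kind is "cid" or "smiles"; threshold is the minimum Tanimoto
    similarity in percent. (Defaults here are illustrative assumptions.)
    """
    if kind not in ("cid", "smiles"):
        raise ValueError("kind must be 'cid' or 'smiles'")
    # Percent-encode the identifier so SMILES characters are URL safe.
    encoded = quote(str(identifier), safe="")
    options = urlencode({"Threshold": threshold, "MaxRecords": max_records})
    return f"{PUG_REST}/compound/fastsimilarity_2d/{kind}/{encoded}/cids/JSON?{options}"


def sdf_url(cid, three_d=True):
    """URL for downloading the sdf record of one compound."""
    url = f"{PUG_REST}/compound/cid/{int(cid)}/SDF"
    if three_d:
        # Request the 3D conformer record instead of the default 2D one.
        url += "?record_type=3d"
    return url
```

For example, `similarity_url("CC(=O)O", kind="smiles")` gives the search URL for acetic acid, and `sdf_url(2244)` gives the 3D sdf download URL for CID 2244. Not every compound has a 3D record in PubChem, so a real downloader would need a fallback to the 2D record.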

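The script uses the following Python packages. Of these, only wget and pandas need to be installed (for example with pip); sys, subprocess, csv and os are part of the Python standard library.

sys
subprocess
wget
pandas
csv
os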

Run the script from your Linux terminal as follows:

$ python pubchem_scraper.py molecule.file SMILES CID_of_the_compound

Here molecule.file is the structure file of the compound, SMILES is the SMILES (Simplified Molecular Input Line Entry System) string of the compound, and CID_of_the_compound is its PubChem CID. Only one of the three is required; set the other two to 0.
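The "one real value, the rest set to 0" convention can be checked with a few lines of Python. This is a minimal sketch with a hypothetical helper, `pick_input`. It assumes exactly one of the three positional arguments is non-zero. How the actual script behaves when several are given is not documented here.

```python
def pick_input(args):
    """Return (kind, value) for the single argument that is not "0".

    args is the list of three command line values, in the order
    structure file, SMILES, CID (for example sys.argv[1:]).
    """
    if len(args) != 3:
        raise ValueError("expected three arguments: file, SMILES, CID")
    labels = ("file", "smiles", "cid")
    # Keep only the arguments that were actually supplied.
    given = [(label, value) for label, value in zip(labels, args) if value != "0"]
    if len(given) != 1:
        raise ValueError("give exactly one input and set the other two to 0")
    return given[0]
```

Calling `pick_input(["0", "0", "2244"])` returns `("cid", "2244")`, which matches running the script with `0 0 2244` as arguments.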

Future prospects: Include more databases in the search and produce non-redundant search results.
