KDE analysis method for evaluating molecular fingerprinting methods and similarity functions with electronic structure data. Modules:

kde_analysis: Code to use the KDE analysis for evaluating similarity measures.
neighborhood_ratios: Code to use the neighborhood ratios analysis found in Patterson et al.
generate_db: Code to generate a MongoDB database with molecule pair information.
settings.py: This file contains settings for the rest of the code.

Installation

Clone this GitHub repository.

git clone git@github.com:D3TaLES/roboticsUI.git

Create a Conda environment.

conda create --name similarities --file similarities.yml
conda activate similarities

Settings

The file settings.py contains several default variables that are used throughout the rest of the code in this repository. Before running any code, be sure to check the variables in this file. In particular, check the default save directories. If using a MongoDB approach, also check the MongoDB connection variables.

Quickstart

First, install this git repository and configure your settings.py file. Then load your data as a pandas DataFrame, create an analysis object, and perform the analysis. Note that the DataFrame should contain a smiles column as well as one column for each electronic property defined in the settings.py file, named after that property.

The analysis below randomly pulls num_trials samples from your data, each of the size given by the size keyword argument, and performs the KDE analysis on them.

import pandas as pd 
from similarities.kde_analysis import SimilarityAnalysisRand

# Load data
all_d = pd.read_pickle("path_to_dataset.pkl")

# Create analysis object
sim_anal = SimilarityAnalysisRand(anal_percent=0.20, top_percent=0.1, orig_df=all_d, anal_name="TEST")

# Perform analysis 
kde_df = sim_anal.random_sampling(size=100000, num_trials=10, plot=True, return_plot=False)
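
To make the roles of size and num_trials concrete, here is a minimal sketch of repeated random sampling in plain Python. It is not the package's implementation: the helper name draw_trials and the choice to sample without replacement inside a trial are assumptions made for illustration.

```python
import random


def draw_trials(rows, size, num_trials, seed=None):
    """Return num_trials independent random samples, each holding `size` rows."""
    rng = random.Random(seed)
    # Assumption: rows are drawn without replacement inside one trial,
    # while different trials are allowed to share rows.
    return [rng.sample(rows, size) for _ in range(num_trials)]


rows = list(range(1000))  # stand-in for the rows of a dataset
trials = draw_trials(rows, size=100, num_trials=10, seed=0)
print(len(trials), len(trials[0]))
```

Each trial is analyzed on its own, so a larger num_trials gives a steadier average at the cost of more computation.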

Documentation

KDE Analysis

Using a local data file and performing random sampling

When to Use: This method is used to perform the analysis on random samples from a dataset.
This will be the best option for most use cases.

First, you must load your dataset as a pandas DataFrame. Note that the DataFrame should contain a smiles column as well as one column for each electronic property defined in the settings.py file, named after that property.

import pandas as pd 

all_d = pd.read_pickle("path_to_dataset.pkl")
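
Before creating the analysis object, it can help to confirm that the required columns are present. The sketch below is a hypothetical helper, not part of this package, and the property names "homo" and "lumo" are placeholders for whatever your settings.py defines.

```python
def missing_columns(columns, property_names):
    """List the required columns that are absent from `columns`."""
    required = ["smiles", *property_names]
    present = set(columns)
    return [name for name in required if name not in present]


# Placeholder property names; use the ones defined in your settings.py.
example_properties = ["homo", "lumo"]
example_columns = ["smiles", "homo"]
missing = missing_columns(example_columns, example_properties)
print(missing)
```

With a real dataset, pass all_d.columns as the first argument; an empty list means the DataFrame has every column the check looks for.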

Next, create an analysis object and perform the analysis.

from similarities.kde_analysis import SimilarityAnalysisRand

# Create analysis object
sim_anal = SimilarityAnalysisRand(anal_percent=0.10, top_percent=0.1, verbose=0, orig_df=all_d, anal_name="RandTEST")

# Perform analysis 
kde_df = sim_anal.random_sampling(size=100000, num_trials=10, plot=True, return_plot=False, upper_bound=0.172)

Note that the analysis call includes the keyword argument upper_bound. This upper bound can be found as described in the next section.

Boundaries

Boundaries can be easily calculated using the analysis object. The random_sampling function is called but the replace_sim keyword argument is used to replace all similarity values with random values from some distribution. This distribution can be either "uniform", "uniformCorr", "normal" or "normalCorr". For the most part, you should use "uniform" for the upper bound and "uniformCorr" for the lower bound.

from similarities.kde_analysis import SimilarityAnalysisRand

# Same analysis class and arguments as in the random sampling example above
sim_anal = SimilarityAnalysisRand(anal_percent=0.10, top_percent=0.10, verbose=0, orig_df=all_d, anal_name="RandTEST")

upper_bound_df = sim_anal.random_sampling(size=100000, num_trials=20, plot=True, replace_sim="uniform", return_plot=False)
upper_bound = upper_bound_df.mean().mean()
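
The last line reduces the result table to a single number: in pandas, DataFrame.mean() averages each column, and the second .mean() averages those column means. The sketch below reproduces that reduction in plain Python on a made-up table. The column names, the values, and the layout (one column per similarity measure, one row per trial) are assumptions for illustration and do not come from this package.

```python
from statistics import mean


def mean_of_column_means(table):
    """Average each column, then average the resulting column means."""
    return mean(mean(values) for values in table.values())


# Made-up results; real values come from random_sampling.
table = {
    "measure_a": [0.10, 0.20, 0.30],
    "measure_b": [0.40, 0.50, 0.60],
}
bound = mean_of_column_means(table)
print(round(bound, 3))
```

Because every column here has the same number of rows, the mean of the column means equals the mean over all values in the table.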

Using a MongoDB database

When to Use: This method performs the analysis on an entire dataset hosted in a MongoDB database. It is more difficult to set up because it requires a connection to a MongoDB database. However, it is necessary if the goal is to analyze all possible molecule pairs in a large dataset, where the memory requirement would be too large for the previously described method.
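
The memory concern follows from how quickly the number of molecule pairs grows. As a rough illustration (the dataset sizes and the 8 bytes per stored value are assumptions, not figures from this repository), the number of unordered pairs among n molecules is n(n-1)/2:

```python
import math


def pair_count(num_molecules):
    """Number of unordered pairs of distinct molecules."""
    return math.comb(num_molecules, 2)


for n in (1_000, 100_000):
    pairs = pair_count(n)
    # Assumption: one float64 (8 bytes) stored per pair and per measure.
    print(n, pairs, f"{pairs * 8 / 1e9:.3f} GB per stored value")
```

A dataset 100 times larger produces roughly 10,000 times more pairs, which is why the all-pairs analysis is kept in a database rather than in a single DataFrame.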

First, you must generate the MongoDB database collections.

from similarities.generate_db import *
d_percent, a_percent, t_percent = 0.10, 0.10, 0.10

# Create molecules MongoDB collection 
all_d = load_mols_db("path_to_dataset.pkl")

# Create molecule pairs MongoDB collection and all possible pair combination IDs. 
add_pairs_db_idx()

# Generate similarity and property difference data for all the molecule pairs IDs. 
create_pairs_db_parallel(verbose=2, sim_min=0.15)
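
For intuition about what "all possible pair combination IDs" means, the sketch below enumerates one ID per unordered pair with itertools.combinations. The helper name pair_ids and the "first_second" ID format are hypothetical; the package may build and store its IDs differently.

```python
from itertools import combinations


def pair_ids(molecule_ids):
    """Yield one ID per unordered pair of distinct molecules."""
    # Sorting first makes the ID independent of the input order.
    for first, second in combinations(sorted(molecule_ids), 2):
        # Hypothetical ID format, used here only for illustration.
        yield f"{first}_{second}"


ids = list(pair_ids(["mol3", "mol1", "mol2"]))
print(ids)
```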

Now, you may use this database to perform the KDE analysis on the entire dataset.

from similarities.kde_analysis import *

all_anal = SimilarityPairsDBAnalysis(anal_percent=0.10, top_percent=0.10, verbose=0)

# OPTIONAL: If you want to improve the computational efficiency of the KDE analysis, generate
# divides for all the similarity measures. This simply identifies the similarity value that divides
# the top_percent similarities from the rest. It can be relatively expensive to perform again and again. 
all_anal.gen_all_divides()

# Perform KDE analysis 
kde_results = all_anal.kde_all()

# Plot results 
ax = all_anal.plot_avg_df(kde_results, return_plot=True, red_labels=True, ratio_name="MongoDB", anal_name=f"all_{all_anal.perc_name}")
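
The "divide" mentioned in the comments above is a threshold: the similarity value that separates the top_percent most similar pairs from the rest. A minimal sketch of that idea in plain Python follows. It is not the package's implementation, and it assumes top_percent is a fraction (0.10 meaning 10%), as the calls in this README suggest.

```python
import math


def top_divide(similarities, top_percent):
    """Smallest similarity that still belongs to the top fraction."""
    ranked = sorted(similarities, reverse=True)
    # Keep at least one value so the divide is always defined.
    top_count = max(1, math.floor(len(ranked) * top_percent))
    return ranked[top_count - 1]


sims = [0.05, 0.92, 0.31, 0.77, 0.48, 0.66, 0.12, 0.85, 0.23, 0.59]
divide = top_divide(sims, top_percent=0.20)
print(divide)
```

Finding this threshold requires ranking every similarity value, which is why computing it once and reusing it can save time on a large pairs collection.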

You may also perform the random sampling analysis with the MongoDB database data.

from similarities.kde_analysis import *

sim_anal = SimilarityPairsDBAnalysis(anal_percent=0.10, top_percent=0.1, verbose=0)

kde_df = sim_anal.random_sampling(size=100000, num_trials=20, plot=True, return_plot=False, upper_bound=0.172)

Neighborhood Ratios

The neighborhood ratios analysis is derived from Patterson et al.

This analysis is performed much the same way as the KDE analysis. The only difference is that you must set the random_sampling keyword argument method to "nhr".

from similarities.kde_analysis import *


sim_anal = SimilarityAnalysisRand(anal_percent=1, top_percent=1, verbose=1)
nhr_df = sim_anal.random_sampling(size=100000, num_trials=10, plot=True, method="nhr", return_plot=False, lower_bound=1.75)

Note that boundaries can be calculated the same way as described above, with method="nhr". Note also that either the SimilarityPairsDBAnalysis or the SimilarityAnalysisRand class may be used.

Ranking Analysis

The ranking analysis measures how many of the top-ranked pairs according to similarity overlap with the top-ranked pairs according to property difference.

This analysis is performed much the same way as the KDE analysis. Again, the only difference is that you must set the random_sampling keyword argument method to "ranking".

from similarities.kde_analysis import *


sim_anal = SimilarityAnalysisRand(anal_percent=1, top_percent=1, verbose=1)
ranking_df = sim_anal.random_sampling(size=100000, num_trials=10, plot=True, method="ranking", return_plot=False, lower_bound=1.75)

Note that boundaries can be calculated the same way as described above, with method="ranking". Note also that either the SimilarityPairsDBAnalysis or the SimilarityAnalysisRand class may be used.
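
The overlap idea behind the ranking analysis can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's code: the helper name top_overlap and the record keys are hypothetical, and "top-ranked by property difference" is assumed to mean the pairs with the smallest difference.

```python
def top_overlap(pairs, top_fraction):
    """Fraction of the top pairs by similarity that are also top pairs by property."""
    k = max(1, int(len(pairs) * top_fraction))
    # Highest similarity first.
    by_sim = sorted(pairs, key=lambda p: p["sim"], reverse=True)[:k]
    # Assumption: smallest property difference ranks highest.
    by_diff = sorted(pairs, key=lambda p: p["diff"])[:k]
    shared = {p["id"] for p in by_sim} & {p["id"] for p in by_diff}
    return len(shared) / k


pairs = [
    {"id": "a", "sim": 0.9, "diff": 0.1},
    {"id": "b", "sim": 0.8, "diff": 0.9},
    {"id": "c", "sim": 0.3, "diff": 0.2},
    {"id": "d", "sim": 0.1, "diff": 0.8},
]
overlap = top_overlap(pairs, top_fraction=0.5)
print(overlap)
```

In this toy data only pair "a" is in both top halves, so the overlap is one out of two. A similarity measure that tracks the property well would push this fraction toward 1.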
