
Adspt

Toolbox for the paper: Cut-SOAP: A Descriptor for Machine Learning–Driven Adsorption Potential Energy Surface Modeling

Install dependencies

Most dependencies are included in environment.yml:

channels:
  - defaults
dependencies:
  - python
  - pandas
  - numpy
  - matplotlib
  - scikit-learn
  - pytorch
  - conda-forge::ase
  - conda-forge::dscribe

To install the dependencies automatically with conda, run:

conda env create -f environment.yml -n adspt

Qcalc is an additional in-house dependency, needed only to build datasets.

To install Qcalc directly from the GitHub repository, run:

pip install git+https://github.com/CIDAG/qcalc.git

Usage

The program provides four modules: build_dataset, train, predict, and convert.

build_dataset

Description: Build training dataset from a folder.

Syntax:

python3 adspt.py build_dataset [-h] [--dspath DS_PATH] [--testpct TEST_PCT] [--rmcorr RM_CORR] [--delpospes] {fhi-aims,vasp} {cm,cutcm,soap,cutsoap,spsoap,aasoap,mbtr,lmbtr,acsf} systems_folder output_folder

positional arguments:
  {fhi-aims,vasp}       What type of input format will be used
  {cm,cutcm,soap,cutsoap,spsoap,aasoap,mbtr,lmbtr,acsf}
                        Which descriptor to use
  systems_folder        Path to folder containing the systems used for training
  output_folder         Folder to output the files to (path must exist)

options:
  -h, --help            show this help message and exit
  --dspath DS_PATH      Path to file with settings for the selected descriptor
  --testpct TEST_PCT    Percentage of the data to be separated as test
  --rmcorr RM_CORR      Columns with correlation above this value are removed
  --delpospes           Delete data with positive PES from the dataset
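The --rmcorr filter drops feature columns that are highly correlated with columns already kept. A minimal sketch of how such a filter can work, using numpy (this is an illustration of the technique, not adspt's actual implementation):

```python
import numpy as np

def drop_correlated_columns(X, threshold=0.95):
    """Keep each column only if its absolute correlation with every
    previously kept column is at or below `threshold` (illustrative)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

# Example: column 1 is a scaled copy of column 0, so it gets dropped.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a, 2.0 * a, rng.normal(size=100)])
X_red, kept = drop_correlated_columns(X, threshold=0.95)
print(kept)  # [0, 2]
```

Greedy left-to-right selection like this keeps the first column of each correlated group, which is a common (order-dependent) convention for this kind of filter.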

Descriptor settings:

Descriptor settings are JSON files that contain the parameters to execute the desired chemical descriptor conversion of the input files. If no JSON file is specified, the program will use the default settings found in:

./modules/build_data/INPUT_FORMAT/descriptors/default_settings.py

This file is also the main reference on how to set up your own parameters for the desired descriptor. To understand the meaning of each parameter, check the DScribe documentation for the corresponding descriptor. Novel descriptors such as Cut-SOAP, which were built on top of the DScribe framework, are documented in the main paper for this repository.

A typical MBTR setup file would look something like:

{
    "geometry": {"function": "inverse_distance"},
    "grid": {"min": 0.1, "max": 2, "sigma": 0.1, "n": 50},
    "weighting": {"function": "exp", "scale": 0.5, "threshold": 1e-3},
    "normalization": "l2",
    "periodic": false
}
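The fallback behavior described above (use the JSON file from --dspath if given, otherwise the shipped defaults) can be sketched with the standard json module. The defaults dict and function name below are hypothetical stand-ins, not adspt's actual code:

```python
import json

# Hypothetical defaults, standing in for default_settings.py
DEFAULT_MBTR_SETTINGS = {
    "geometry": {"function": "inverse_distance"},
    "grid": {"min": 0.1, "max": 2, "sigma": 0.1, "n": 50},
    "normalization": "l2",
    "periodic": False,
}

def load_descriptor_settings(path=None):
    """Return settings from a user JSON file, or the defaults when
    no --dspath was given (illustrative sketch)."""
    if path is None:
        return dict(DEFAULT_MBTR_SETTINGS)
    with open(path) as fh:
        return json.load(fh)

settings = load_descriptor_settings()
print(settings["normalization"])  # l2
```

Note that json.load parses `false` into Python's `False`, which is why the settings file must use the JSON boolean rather than the string "false".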

Directory structure for FHI-Aims input:

  • Folders containing geometry.in files must follow the naming scheme ADSORBENT_ADSORBATE_REST, e.g. Cu13_CH3_026_fhi_aims_pbe_light2.
  • If the system DOES NOT contain atom species shared between the adsorbent and adsorbate, there are no restrictions on the atom ordering in geometry.in.
  • If the system does contain shared atom species, the atoms inside the geometry.in file must be listed in the order ADSORBATE, then ADSORBENT. The program assumes this ordering, and non-compliance may lead to errors or undesired results.
  • All folders with adsorption systems must be accompanied by adjacent folders for the adsorbent and adsorbate references, denoted by the prefixes clus_ and mol_, respectively.

In short, a valid directory structure should look like this:

clus_Zr16O32_fhi_aims_pbe_light2_OK/
mol_C_fhi_aims_pbe_light2_OK/
Zr16O32_C_001_fhi_aims_pbe_light2_OK/
Zr16O32_C_002_fhi_aims_pbe_light2_OK/
Zr16O32_C_003_fhi_aims_pbe_light2_OK/

where each folder contains the corresponding aims.out file.
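The naming rules above can be parsed with plain string splitting. A sketch of that logic (the helper name is hypothetical, not part of adspt):

```python
def classify_folder(name):
    """Classify a system folder by the naming rules above:
    clus_* -> adsorbent reference, mol_* -> adsorbate reference,
    otherwise ADSORBENT_ADSORBATE_REST -> adsorption system."""
    if name.startswith("clus_"):
        return ("adsorbent_ref", name[len("clus_"):].split("_")[0])
    if name.startswith("mol_"):
        return ("adsorbate_ref", name[len("mol_"):].split("_")[0])
    adsorbent, adsorbate = name.split("_")[:2]
    return ("system", adsorbent, adsorbate)

print(classify_folder("clus_Zr16O32_fhi_aims_pbe_light2_OK"))
# ('adsorbent_ref', 'Zr16O32')
print(classify_folder("Zr16O32_C_001_fhi_aims_pbe_light2_OK"))
# ('system', 'Zr16O32', 'C')
```

Because the scheme splits on underscores, adsorbent and adsorbate names themselves must not contain underscores for the first two fields to parse correctly.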

train

Description: Trains an ML model and predicts the Potential Energy Surface (PES)

Syntax:

python3 adspt.py train [-h] [--hppath HP_PATH] [--scaledata] {linear,krr,mlp} features_path property_path output_folder

positional arguments:
  {linear,krr,mlp}  Which Machine Learning model to use
  features_path     Path to CSV file containing the features
  property_path     Path to CSV file containing the property
  output_folder     Folder to save BIN files (model and scaler data)

options:
  -h, --help        show this help message and exit
  --hppath HP_PATH  Path to file with hyperparams for the model
  --scaledata       Scale the training dataset and save the scaler parameters

Model hyperparameters:

Model hyperparameters settings are specified inside JSON files, just like descriptor settings. If no JSON file is specified, the program will use the default settings found in:

modules/train_steps/predictors/default_settings.py

This file is also the main reference on how to set up your own hyperparameters for the desired predictor. To understand the meaning of each parameter, check the Scikit-learn and PyTorch documentation pages for the corresponding predictor algorithm.

A typical MLP setup file would look something like:

{
    "network": {
        "hidden_layers": [24, 12, 6],
        "activation_fn": "ReLU"
    },
    "optimization": {
        "optimizer": "Adam",
        "optim_params": {
            "lr": 0.001
        }
    },
    "training": {
        "train_size": 0.8,
        "epochs": 500,
        "batch_size": 10000
    }
}
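To illustrate how a hidden_layers spec like [24, 12, 6] expands into fully connected layer shapes (input features in, one PES value out), here is a small stdlib-only sketch; the actual model construction in adspt may differ:

```python
def layer_dims(n_features, hidden_layers, n_outputs=1):
    """Expand a hidden-layer spec into (in, out) size pairs, one per
    fully connected layer, e.g. for building nn.Linear modules."""
    sizes = [n_features] + list(hidden_layers) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))

dims = layer_dims(100, [24, 12, 6])
print(dims)  # [(100, 24), (24, 12), (12, 6), (6, 1)]
```

In a PyTorch model, each pair would become one nn.Linear layer, with the configured activation (ReLU above) applied between hidden layers.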

predict

Description: Predict Potential Energy Surface (PES) using a trained ML model

Syntax:

python3 adspt.py predict [-h] trained_model_file features_file scaler_path output_path

positional arguments:
  trained_model_file  Path to file containing a valid trained ML model
  features_file       Path to CSV file containing the features
  scaler_path         Path to scaler used in the training
  output_path         Path to save predicted PES

options:
  -h, --help          show this help message and exit

convert

Description: Convert new data to ML-friendly format

Syntax:

python3 adspt.py convert [-h] systems_folder output_folder descriptor

positional arguments:
  systems_folder  Path to folder containing the systems to convert
  output_folder   Path to the folder to output files
  descriptor      Path to descriptor algorithm

options:
  -h, --help      show this help message and exit

Datasets

The datasets used for training, validation, and testing, along with the additional synthetic data generated to assess real-world performance, are available at Zenodo.

The Zenodo record contains four folders:

  • binaries: serialized files for the fitted Cut-SOAP, trained model, and fitted scaler.
  • cutsoap_data: dataset formatted using Cut-SOAP.
  • cutsoap_data_rbl: dataset formatted using Cut-SOAP + oversampling rebalancing.
  • synthetic_data: synthetic dataset generated to assess the model's real-world performance.

With these files, only the predict part of the pipeline is needed to reproduce most of the results, e.g.:

python3 adspt.py predict binaries/trained_mlp.bin synthetic_data/convert_features.csv binaries/trained_scaler.bin output_folder

NOTE: Serialized binaries are sensitive to library versions. Here are the recommended versions:

  • pytorch: 2.1.0
  • python: 3.11.4
  • pandas: 1.5.3
  • scikit-learn: 1.2.2
  • ase: 3.22.1
  • dscribe: 2.0.0
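Before unpickling the binaries, you can compare the installed library versions against the list above with importlib.metadata. This check is a convenience sketch, not part of adspt; the keys follow the pip distribution names (PyTorch installs as "torch"):

```python
from importlib import metadata

RECOMMENDED = {
    "torch": "2.1.0",        # pytorch
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "ase": "3.22.1",
    "dscribe": "2.0.0",
}

def version_mismatches(recommended):
    """Return {package: (installed, recommended)} for every package
    whose installed version differs or which is missing entirely."""
    bad = {}
    for pkg, want in recommended.items():
        try:
            have = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            have = "not installed"
        if have != want:
            bad[pkg] = (have, want)
    return bad

for pkg, (have, want) in version_mismatches(RECOMMENDED).items():
    print(f"{pkg}: installed {have}, recommended {want}")
```

An empty result means the environment matches the recommended versions and the serialized binaries should load cleanly.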
