Toolbox for the paper: Cut-SOAP: A Descriptor for Machine Learning–Driven Adsorption Potential Energy Surface Modeling
Most dependencies are included in environment.yml:
channels:
- defaults
dependencies:
- python
- pandas
- numpy
- matplotlib
- scikit-learn
- pytorch
- conda-forge::ase
- conda-forge::dscribe

To automatically install the dependencies with anaconda, run:
conda env create -f environment.yml -n adspt
Qcalc is an extra in-house dependency, only needed to build datasets.
To install Qcalc directly from the GitHub repository, run:
pip install git+https://github.com/CIDAG/qcalc.git

The program contains 4 modules: build_dataset, train, predict, convert
Description: Build training dataset from a folder.
Syntax:
python3 adspt.py build_dataset [-h] [--dspath DS_PATH] [--testpct TEST_PCT] [--rmcorr RM_CORR] [--delpospes] {fhi-aims,vasp} {cm,cutcm,soap,cutsoap,spsoap,aasoap,mbtr,lmbtr,acsf} systems_folder output_folder
positional arguments:
{fhi-aims,vasp} What type of input format will be used
{cm,cutcm,soap,cutsoap,spsoap,aasoap,mbtr,lmbtr,acsf} Which descriptor to use
systems_folder Path to folder containing the systems used for training
output_folder Folder to output the files to (path must exist)
options:
-h, --help show this help message and exit
--dspath DS_PATH Path to file with settings for the selected descriptor
--testpct TEST_PCT Percentage of the data to be separated as test
--rmcorr RM_CORR Columns with correlation above this value are removed
--delpospes Delete data with positive PES from the dataset
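As a conceptual sketch of what --rmcorr style filtering does (this is an illustration with pandas, not the repository's implementation; drop_correlated is a hypothetical name):

```python
import pandas as pd

def drop_correlated(df, threshold):
    """Drop the later column of each pair whose absolute Pearson
    correlation exceeds `threshold` (conceptual --rmcorr behavior)."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        if cols[i] in to_drop:
            continue
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop))
```

Removing near-duplicate feature columns this way shrinks the descriptor matrix without discarding independent information.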
Descriptor settings:
Descriptor settings are JSON files that contain the parameters to execute the desired chemical descriptor conversion of the input files. If no JSON file is specified, the program will use the default settings found in:
./modules/build_data/INPUT_FORMAT/descriptors/default_settings.py
This file is also the main reference on how to set up your own parameters for the desired descriptor. To understand the meaning of each parameter, check the DScribe documentation for the corresponding descriptor. Novel descriptors, such as Cut-SOAP, that were built upon the DScribe framework are documented in the main paper for this repository.
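Conceptually, a user-supplied JSON file overrides the defaults key by key; a minimal stdlib sketch of that merging (DEFAULTS and load_settings are illustrative names, not the repository's API):

```python
import json

# Hypothetical defaults, mirroring the shape of default_settings.py.
DEFAULTS = {"normalization": "l2", "periodic": False}

def load_settings(json_path=None):
    """Return defaults, overridden by the keys of the JSON file if given."""
    settings = dict(DEFAULTS)
    if json_path is not None:
        with open(json_path) as fh:
            settings.update(json.load(fh))
    return settings
```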
A typical MBTR setup file would look something like:
{
"geometry": {"function": "inverse_distance"},
"grid": {"min": 0.1, "max": 2, "sigma": 0.1, "n": 50},
"weighting": {"function": "exp", "scale": 0.5, "threshold": 1e-3},
"normalization": "l2",
    "periodic": false
}

Directory structure for FHI-aims input:
- Folders containing geometry.in files must follow the naming scheme ADSORBENT_ADSORBATE_REST, e.g. Cu13_CH3_026_fhi_aims_pbe_light2.
- If the system DOES NOT contain overlapping atom species between the adsorbent and adsorbate, there are no restrictions on the geometry.in atom ordering.
- If the system contains overlapping atom species, the atoms inside the geometry.in file should be listed in the order ADSORBATE, then ADSORBENT. This is assumed by the program, and non-compliance may lead to errors or undesired effects.
- All folders with adsorption systems must have adjacent folders for the adsorbent and adsorbate references; these are denoted by the prefixes clus_ and mol_, respectively.
In short, a valid directory structure should look like this:
clus_Zr16O32_fhi_aims_pbe_light2_OK/
mol_C_fhi_aims_pbe_light2_OK/
Zr16O32_C_001_fhi_aims_pbe_light2_OK/
Zr16O32_C_002_fhi_aims_pbe_light2_OK/
Zr16O32_C_003_fhi_aims_pbe_light2_OK/
where each folder contains the corresponding aims.out file.
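The naming scheme above can be parsed mechanically; a hypothetical sketch of such a classifier (parse_system_folder is not a function from the repository):

```python
def parse_system_folder(name):
    """Classify a folder per the scheme above: clus_/mol_ prefixes mark
    reference folders; otherwise the name is ADSORBENT_ADSORBATE_REST."""
    if name.startswith("clus_"):
        return {"kind": "adsorbent reference", "label": name[len("clus_"):]}
    if name.startswith("mol_"):
        return {"kind": "adsorbate reference", "label": name[len("mol_"):]}
    adsorbent, adsorbate, *_ = name.split("_")
    return {"kind": "adsorption system",
            "adsorbent": adsorbent, "adsorbate": adsorbate}
```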
Description: Trains ML models and predicts Potential Energy Surface (PES)
Syntax:
python3 adspt.py train [-h] [--hppath HP_PATH] [--scaledata] {linear,krr,mlp} features_path property_path output_folder
positional arguments:
{linear,krr,mlp} Which Machine Learning model to use
features_path Path to CSV file containing the features
property_path Path to CSV file containing the property
output_folder Folder to save BIN files (model and scaler data)
options:
-h, --help show this help message and exit
--hppath HP_PATH Path to file with hyperparams for the model
--scaledata Scale the training dataset and save the scaler parameters
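What --scaledata standardization conceptually stores is a per-feature mean and standard deviation, applied to every column before training and prediction. A minimal stdlib sketch (fit_scaler/transform are illustrative names, not the repository's code):

```python
import statistics

def fit_scaler(columns):
    """Per-feature (mean, std) pairs; std of a constant column falls
    back to 1.0 so the transform stays defined."""
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0)
            for c in columns]

def transform(columns, params):
    """Standardize each column with its saved (mean, std) pair."""
    return [[(x - mean) / std for x in col]
            for col, (mean, std) in zip(columns, params)]
```

Saving these parameters alongside the model is what lets `predict` scale new data identically.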
Model hyperparameters:
Model hyperparameters settings are specified inside JSON files, just like descriptor settings. If no JSON file is specified, the program will use the default settings found in:
modules/train_steps/predictors/default_settings.py
This file is also the main reference on how to set up your own hyperparameters for the desired predictor. To understand the meaning of each parameter, check the Scikit-learn and PyTorch documentation pages for the corresponding predictor algorithm.
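To illustrate how a hidden_layers setting maps onto a network, a plain-Python sketch that only computes the linear-layer shapes (mlp_layer_sizes is a hypothetical helper, not the repository's builder):

```python
def mlp_layer_sizes(n_features, hidden_layers, n_outputs=1):
    """Turn a hidden_layers list into (in, out) size pairs, one per
    linear layer, ending in a single PES output."""
    dims = [n_features, *hidden_layers, n_outputs]
    return list(zip(dims[:-1], dims[1:]))

# e.g. mlp_layer_sizes(60, [24, 12, 6]) -> [(60, 24), (24, 12), (12, 6), (6, 1)]
```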
A typical MLP setup file would look something like:
{
"network": {
"hidden_layers": [24, 12, 6],
    "activation_fn": "ReLU"
},
"optimization": {
"optimizer": "Adam",
"optim_params": {
"lr": 0.001
}
},
"training": {
"train_size": 0.8,
"epochs": 500,
    "batch_size": 10000
}
}

Description: Predict Potential Energy Surface (PES) using a trained ML model
Syntax:
python3 adspt.py predict [-h] trained_model_file features_file scaler_path output_path
positional arguments:
trained_model_file Path to file containing a valid trained ML model
features_file Path to CSV file containing the features
scaler_path Path to scaler used in the training
output_path Path to save predicted PES
options:
-h, --help show this help message and exit
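The predict step boils down to: deserialize the model and scaler, transform the features, call predict. A self-contained stand-in for that flow (LinearModel is a hypothetical placeholder, not the trained MLP):

```python
import io
import pickle

class LinearModel:
    """Stand-in for a serialized model exposing a predict() method."""
    def __init__(self, coef):
        self.coef = coef
    def predict(self, rows):
        return [sum(c * x for c, x in zip(self.coef, row)) for row in rows]

buf = io.BytesIO()
pickle.dump(LinearModel([0.5, 0.5]), buf)   # plays the role of trained_model_file
buf.seek(0)
model = pickle.load(buf)                    # what predict would deserialize
preds = model.predict([[2.0, 4.0]])         # -> [3.0]
```

This also illustrates why serialized binaries are library-version sensitive: unpickling reconstructs objects against whatever class definitions are currently installed.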
Description: Convert new data to ML-friendly format
Syntax:
python3 adspt.py convert [-h] systems_folder output_folder descriptor
positional arguments:
systems_folder Path to folder containing the systems to convert
output_folder Path to the folder to output files
descriptor Path to descriptor algorithm
options:
-h, --help show this help message and exit
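Conceptually, convert walks the systems folder, applies the descriptor to each system, and writes one CSV of feature rows. A minimal sketch under those assumptions (the function name and CSV layout are illustrative, not the repository's):

```python
import csv
import pathlib

def convert(systems_folder, output_folder, descriptor):
    """Write one feature row per entry in systems_folder, using a
    caller-supplied descriptor callable."""
    out = pathlib.Path(output_folder) / "features.csv"
    with open(out, "w", newline="") as fh:
        writer = csv.writer(fh)
        for system in sorted(pathlib.Path(systems_folder).iterdir()):
            writer.writerow(descriptor(system))
    return out
```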
The datasets used for training, validation, and testing, along with the additional synthetic data generated to assess real-world performance, are available at Zenodo.
The archive contains four folders:
- binaries: serialized files for the fitted Cut-SOAP, trained model, and fitted scaler.
- cutsoap_data: dataset formatted using Cut-SOAP.
- cutsoap_data_rbl: dataset formatted using Cut-SOAP + oversampling rebalancing.
- synthetic_data: synthetic dataset generated to assess the model's real-world performance.
With these files, only the predict part of the pipeline is needed to reproduce most of the results, e.g.:
python3 adspt.py predict binaries/trained_mlp.bin synthetic_data/convert_features.csv binaries/trained_scaler.bin output_folder
NOTE: Serialized binaries are sensitive to library versions. Here are the recommended versions:
- pytorch: 2.1.0
- python: 3.11.4
- pandas: 1.5.3
- scikit-learn: 1.2.2
- ase: 3.22.1
- dscribe: 2.0.0