# OPV Property Prediction

Machine Learning models for predicting molecular properties of Organic Photovoltaics (OPVs) to accelerate the discovery of efficient and stable solar cell materials.


## 📋 Project Overview

This project develops predictive machine learning models to forecast critical molecular properties for Organic Photovoltaic (OPV) materials based solely on chemical structure (SMILES strings). By enabling rapid computational screening of candidate molecules, this work accelerates material discovery and reduces the time and cost associated with laboratory synthesis and testing.

### Key Objectives

- Predict HOMO energy levels (Highest Occupied Molecular Orbital)
- Predict LUMO energy levels (Lowest Unoccupied Molecular Orbital)
- Classify molecular stability (Low, Medium, High)

### Business Impact

Traditional material discovery requires weeks of laboratory work per molecule. These models enable:

- Screening thousands of molecules in minutes
- Focusing expensive lab resources on the most promising candidates
- Understanding structure-property relationships for rational molecular design

## 🎯 Model Performance

### Regression Models (HOMO & LUMO Prediction)

- LUMO Model: MAE 0.193 (within 6% of actual values), R² = 0.9244
- HOMO Model: MAE 0.132 (within 7% of actual values), R² = 0.8263
- Kaggle Competition: average MAE 0.1246 (public), 0.1283 (private)

### Classification Model (Stability Prediction)

- Cross-validation F1-score: 0.8951
- Kaggle Competition: F1-score 0.5119 (public), 0.5427 (private)

### Key Discoveries

  1. HOMO and LUMO are moderately correlated (r = 0.345): the two energy levels tend to move together, reflecting their coupled electronic structure
  2. Nitrogen is critical: nitrogen-containing aromatic structures strongly influence electronic properties
  3. Size-stability relationship: smaller, aromatic molecules tend to exhibit higher stability

## 📊 Dataset Description

The project uses molecular data from multiple sources:

| Dataset | Molecules | Contains |
|---|---|---|
| HOPV (Harvard) | 350 | SMILES, HOMO, LUMO, Stability |
| SQL Database | 232 | SMILES, HOMO, LUMO, Stability |
| Kaggle Test Set | 200 | SMILES (labels hidden) |
| Total Unique | 779 | - |

Data Sources:

- HOPV dataset from the Harvard University Clean Energy Project
- Additional SQL database with bond-level structural features
- Kaggle competition test set for model validation

Note: Target values have been synthetically generated based on realistic chemical distributions for educational purposes.

## 🚀 Installation & Setup

### Prerequisites

- Python 3.9 or higher
- Conda (recommended) or pip

### Option 1: Using Conda (Recommended)

```bash
# Clone the repository
git clone https://github.com/TimurKambarov/OPV-Property-Prediction.git
cd OPV-Property-Prediction

# Create and activate the conda environment
conda env create -f environment.yml
conda activate opv-env
```

### Option 2: Using pip

```bash
# Clone the repository
git clone https://github.com/TimurKambarov/OPV-Property-Prediction.git
cd OPV-Property-Prediction

# Install dependencies
pip install -r requirements.txt
```

๐Ÿ“ Project Structure

```
OPV-Property-Prediction/
├── README.md
├── LICENSE
├── requirements.txt
├── environment.yml
├── .gitignore
│
├── data/
│   ├── raw/                          # Original datasets
│   │   ├── HOPV_homolumo.data
│   │   ├── HOPV_stability.csv
│   │   ├── starter_dataset.xlsx
│   │   ├── 25261_classification_sample_submission.csv
│   │   └── 25261_regression_sample_submission.csv
│   └── processed/                    # Processed data
│
├── models/                           # Trained models
│   ├── xgboost_homo_regressor.joblib       # HOMO regression model
│   ├── xgboost_lumo_regressor.joblib       # LUMO regression model
│   └── xgboost_stability_classifier.joblib # Stability classifier
│
├── artifacts/                        # Model artifacts for inference
│   ├── feature_columns_homo.joblib         # HOMO model feature list
│   ├── feature_columns_lumo.joblib         # LUMO model feature list
│   ├── constant_columns_homo.joblib        # HOMO constant features to drop
│   ├── constant_columns_lumo.joblib        # LUMO constant features to drop
│   ├── feature_columns.joblib              # Stability model feature list
│   ├── constant_columns.joblib             # Stability constant features to drop
│   └── label_encoder.joblib                # Stability label encoder
│
├── notebooks/                        # Jupyter notebooks
│   ├── 01_clustering.ipynb           # Molecular clustering analysis
│   ├── 02_train_regression.ipynb     # Train HOMO/LUMO models
│   ├── 03_test_regression.ipynb      # Generate regression predictions
│   ├── 04_train_classification.ipynb # Train stability classifier
│   └── 05_test_classification.ipynb  # Generate classification predictions
│
├── results/                          # Model outputs
│   └── (Kaggle submission files)
│
├── docs/                             # Documentation
│   └── Report_Template.pdf           # Full project report
│
└── images/                           # Visualizations
```

## 💻 Usage

### 1. Clustering Analysis

Explore molecular families and structural patterns:

```bash
jupyter notebook notebooks/01_clustering.ipynb
```

### 2. Train Regression Models

Train XGBoost models for HOMO and LUMO prediction:

```bash
jupyter notebook notebooks/02_train_regression.ipynb
```

### 3. Generate Regression Predictions

Apply trained models to new molecules:

```bash
jupyter notebook notebooks/03_test_regression.ipynb
```

### 4. Train Classification Model

Train XGBoost classifier for stability prediction:

```bash
jupyter notebook notebooks/04_train_classification.ipynb
```

### 5. Generate Classification Predictions

Apply classifier to new molecules:

```bash
jupyter notebook notebooks/05_test_classification.ipynb
```

## 🔬 Methodology

### Data Processing

  1. SMILES to Descriptors: Convert molecular structures to numerical features using RDKit
  2. Feature Engineering: Extract bond-level features and create clustering-based features
  3. Data Integration: Combine HOPV and SQL datasets after duplicate removal
  4. Missing Value Treatment: Mean imputation for target variables
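Step 1 boils down to a dictionary of RDKit descriptor values per molecule. A minimal sketch with an illustrative molecule and a small descriptor subset (the project uses the full RDKit descriptor set plus bond-level features):

```python
# Hedged sketch of SMILES -> descriptors with RDKit. Pyridine and the four
# descriptors below are illustrative choices, not the project's exact setup.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "c1ccncc1"  # pyridine: a small aromatic nitrogen heterocycle
mol = Chem.MolFromSmiles(smiles)

features = {
    "MolWt": Descriptors.MolWt(mol),          # molecular weight
    "fr_Ar_N": Descriptors.fr_Ar_N(mol),      # aromatic nitrogen count
    "NumHDonors": Descriptors.NumHDonors(mol),
    "NHOHCount": Descriptors.NHOHCount(mol),
}
print(features)
```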

### Modeling Approach

- Algorithm: XGBoost (gradient boosting)
- Optimization: Optuna hyperparameter tuning (100 trials, 5-fold CV)
- Separate models: individually optimized models for HOMO, LUMO, and stability
- Feature selection: comprehensive RDKit descriptors plus bond-level features

### Key Features

Most important molecular descriptors identified:

- For LUMO: `fr_Ar_N` (aromatic nitrogen), `fr_NH0`, `SMR_VSA3`
- For HOMO: `fr_NH0`, `NHOHCount`, `NumHDonors`
- For stability: molecular size, aromatic content, bond characteristics

## 📈 Results & Insights

### Regression Performance

- LUMO predictions explain 92% of variance (highly reliable for screening)
- HOMO predictions explain 83% of variance (useful guidance, requires lab verification)
- Average prediction error of ~0.13 energy units

### Classification Performance

- Cross-validation F1-score of ~0.90
- Substantially lower scores on the Kaggle test set indicate a need for more diverse training data

### Scientific Insights

  1. Electronic coupling: HOMO and LUMO move together (r = 0.345)
  2. Design principle: Incorporate nitrogen atoms in aromatic rings for favorable electronic properties
  3. Stability pattern: Smaller, aromatic molecules (Cluster 0) show highest stability

## 🔮 Future Improvements

  1. Expand Training Data: Increase from ~500 to several thousand molecules
  2. Reduce Missing Values: Ensure complete property measurements for all molecules
  3. Real Laboratory Data: Train on actual experimental measurements vs. synthetic data
  4. Advanced Features: Explore graph neural networks for direct SMILES processing
  5. Active Learning: Implement feedback loop with experimental results

## 📚 References

- XGBoost: Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of KDD, 785-794.
- RDKit: Landrum, G. (2023). RDKit: Open-source cheminformatics software. https://www.rdkit.org
- Optuna: Akiba, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of KDD, 2623-2631.
- HOPV Dataset: Harvard University Clean Energy Project, Department of Chemistry and Chemical Biology

For detailed methodology and results, see `docs/Report_Template.pdf`.

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 👤 Author

Timur Kambarov

๐Ÿ™ Acknowledgments

- OMEGALAB for the business case and project motivation
- Harvard University Clean Energy Project for the HOPV dataset
- Anthropic's Claude for development assistance :)

## 📧 Contact

For questions or collaboration opportunities, please open an issue in this repository.


Note: This project was developed as part of a machine learning course focused on accelerating renewable energy material discovery.
