Machine Learning models for predicting molecular properties of Organic Photovoltaics (OPVs) to accelerate the discovery of efficient and stable solar cell materials.
This project develops predictive machine learning models to forecast critical molecular properties for Organic Photovoltaic (OPV) materials based solely on chemical structure (SMILES strings). By enabling rapid computational screening of candidate molecules, this work accelerates material discovery and reduces the time and cost associated with laboratory synthesis and testing.
- Predict HOMO energy levels (Highest Occupied Molecular Orbital)
- Predict LUMO energy levels (Lowest Unoccupied Molecular Orbital)
- Classify molecular stability (Low, Medium, High)
Traditional material discovery requires weeks of laboratory work per molecule. These models enable:
- Screening thousands of molecules in minutes
- Focusing expensive lab resources on the most promising candidates
- Understanding structure-property relationships for rational molecular design
- LUMO Model: MAE 0.193 (within 6% of actual values), R² = 0.9244
- HOMO Model: MAE 0.132 (within 7% of actual values), R² = 0.8263
- Kaggle Competition: Average MAE 0.1246 (public), 0.1283 (private)
- Cross-validation F1-score: 0.8951
- Kaggle Competition: F1-Score 0.5119 (public), 0.5427 (private)
- HOMO and LUMO are moderately correlated (r = 0.345): These properties tend to move together, reflecting coupled electronic structure
- Nitrogen is critical: Nitrogen-containing aromatic structures strongly influence electronic properties
- Size-stability relationship: Smaller, aromatic molecules tend to exhibit higher stability
The project uses molecular data from multiple sources:
| Dataset | Molecules | Contains |
|---|---|---|
| HOPV (Harvard) | 350 | SMILES, HOMO, LUMO, Stability |
| SQL Database | 232 | SMILES, HOMO, LUMO, Stability |
| Kaggle Test Set | 200 | SMILES (labels hidden) |
| Total Unique | 779 | - |
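The sources overlap, so rows are deduplicated at the SMILES level before training. A minimal sketch of that merge with pandas (the frames and column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy frames standing in for the HOPV and SQL datasets;
# real column names and values differ.
hopv = pd.DataFrame({"smiles": ["c1ccccc1", "c1ccncc1"], "homo": [-5.9, -6.1]})
sql = pd.DataFrame({"smiles": ["c1ccncc1", "CCO"], "homo": [-6.1, -7.0]})

# Concatenate, then keep one row per unique SMILES string
combined = pd.concat([hopv, sql], ignore_index=True)
unique = combined.drop_duplicates(subset="smiles").reset_index(drop=True)
print(len(unique))  # 3 unique molecules from 4 rows
```

In practice the SMILES strings should be canonicalized first (e.g., with RDKit), since one molecule can have several valid SMILES spellings that naive string comparison would treat as distinct.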
Data Sources:
- HOPV dataset from Harvard University Clean Energy Project
- Additional SQL database with bond-level structural features
- Kaggle competition test set for model validation
Note: Target values have been synthetically generated based on realistic chemical distributions for educational purposes.
- Python 3.9 or higher
- Conda (recommended) or pip
```bash
# Clone the repository
git clone https://github.com/TimurKambarov/OPV-Property-Prediction.git
cd OPV-Property-Prediction

# Create conda environment
conda env create -f environment.yml
conda activate opv-env
```

Alternatively, with pip:

```bash
# Clone the repository
git clone https://github.com/TimurKambarov/OPV-Property-Prediction.git
cd OPV-Property-Prediction

# Install dependencies
pip install -r requirements.txt
```

```
OPV-Property-Prediction/
├── README.md
├── LICENSE
├── requirements.txt
├── environment.yml
├── .gitignore
│
├── data/
│   ├── raw/                    # Original datasets
│   │   ├── HOPV_homolumo.data
│   │   ├── HOPV_stability.csv
│   │   ├── starter_dataset.xlsx
│   │   ├── 25261_classification_sample_submission.csv
│   │   └── 25261_regression_sample_submission.csv
│   └── processed/              # Processed data
│
├── models/                     # Trained models
│   ├── xgboost_homo_regressor.joblib        # HOMO regression model
│   ├── xgboost_lumo_regressor.joblib        # LUMO regression model
│   └── xgboost_stability_classifier.joblib  # Stability classifier
│
├── artifacts/                  # Model artifacts for inference
│   ├── feature_columns_homo.joblib    # HOMO model feature list
│   ├── feature_columns_lumo.joblib    # LUMO model feature list
│   ├── constant_columns_homo.joblib   # HOMO constant features to drop
│   ├── constant_columns_lumo.joblib   # LUMO constant features to drop
│   ├── feature_columns.joblib         # Stability model feature list
│   ├── constant_columns.joblib        # Stability constant features to drop
│   └── label_encoder.joblib           # Stability label encoder
│
├── notebooks/                  # Jupyter notebooks
│   ├── 01_clustering.ipynb             # Molecular clustering analysis
│   ├── 02_train_regression.ipynb       # Train HOMO/LUMO models
│   ├── 03_test_regression.ipynb        # Generate regression predictions
│   ├── 04_train_classification.ipynb   # Train stability classifier
│   └── 05_test_classification.ipynb    # Generate classification predictions
│
├── results/                    # Model outputs
│   └── (Kaggle submission files)
│
├── docs/                       # Documentation
│   └── Report_Template.pdf     # Full project report
│
└── images/                     # Visualizations
```
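The models/ and artifacts/ files are meant to be loaded together at inference time: restore the model, the saved feature list, and the constant-column list, then align new data to the training feature order. A hypothetical sketch of that pattern, with an in-memory buffer and a scikit-learn DummyRegressor standing in for the saved XGBoost model (all names and values here are illustrative):

```python
import io

import joblib
import pandas as pd
from sklearn.dummy import DummyRegressor

# --- Stand-ins for the saved artifacts (normally the .joblib files on disk) ---
feature_columns = ["MolWt", "NumHDonors", "fr_NH0"]  # illustrative descriptor names
constant_columns = ["DummyConstant"]                 # features dropped at train time
model = DummyRegressor(strategy="constant", constant=-5.5)
model.fit(pd.DataFrame([[0, 0, 0]], columns=feature_columns), [-5.5])

buf = io.BytesIO()
joblib.dump((model, feature_columns), buf)  # in place of joblib.dump(..., path)

# --- Inference: load artifacts, align features, predict ---
buf.seek(0)
model, cols = joblib.load(buf)

new_molecules = pd.DataFrame(
    {"MolWt": [180.2], "NumHDonors": [1], "fr_NH0": [2], "DummyConstant": [0]}
)
X = new_molecules.drop(columns=constant_columns)[cols]  # same columns, same order
preds = model.predict(X)
print(preds)  # [-5.5]
```

Reindexing with the saved feature list is the step that matters: XGBoost (like most tabular models) assumes the inference matrix has exactly the training columns in the training order.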
Explore molecular families and structural patterns:

```bash
jupyter notebook notebooks/01_clustering.ipynb
```

Train XGBoost models for HOMO and LUMO prediction:

```bash
jupyter notebook notebooks/02_train_regression.ipynb
```

Apply trained models to new molecules:

```bash
jupyter notebook notebooks/03_test_regression.ipynb
```

Train XGBoost classifier for stability prediction:

```bash
jupyter notebook notebooks/04_train_classification.ipynb
```

Apply classifier to new molecules:

```bash
jupyter notebook notebooks/05_test_classification.ipynb
```

- SMILES to Descriptors: Convert molecular structures to numerical features using RDKit
- Feature Engineering: Extract bond-level features and create clustering-based features
- Data Integration: Combine HOPV and SQL datasets after duplicate removal
- Missing Value Treatment: Mean imputation for target variables
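The descriptor computation itself relies on RDKit, but the imputation and constant-column cleanup in the steps above can be sketched with pandas alone (toy data; real descriptor tables have hundreds of columns):

```python
import pandas as pd

# Illustrative descriptor table with a missing target value
df = pd.DataFrame({
    "MolWt": [78.1, 79.1, 46.1],
    "fr_Ar_N": [0, 1, 0],
    "Constant": [1, 1, 1],       # zero-variance feature
    "homo": [-5.9, None, -7.0],  # missing target
})

# Mean imputation for the target, as described above
df["homo"] = df["homo"].fillna(df["homo"].mean())

# Drop zero-variance columns; in this project their names are what the
# constant_columns_*.joblib artifacts record for reuse at inference time
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
features = df.drop(columns=constant_cols + ["homo"])
print(constant_cols)            # ['Constant']
print(list(features.columns))   # ['MolWt', 'fr_Ar_N']
```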
- Algorithm: XGBoost (gradient boosting)
- Optimization: Optuna for hyperparameter tuning (100 trials, 5-fold CV)
- Separate Models: Individual optimized models for HOMO, LUMO, and stability
- Feature Selection: Comprehensive RDKit descriptors + bond features
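The tuning loop follows a standard pattern: sample a hyperparameter configuration, score it with 5-fold cross-validation, and keep the best trial. A runnable stand-in using scikit-learn's GradientBoostingRegressor and a simple seeded random search (the project itself uses XGBoost with Optuna, which automates exactly this loop over 100 trials):

```python
import random

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = random.Random(0)
X = np.random.default_rng(0).normal(size=(60, 4))  # toy descriptor matrix
y = X[:, 0] * 2 + X[:, 1]                          # toy target

def objective(params):
    """Score one hyperparameter sample with 5-fold CV (Optuna's objective plays this role)."""
    model = GradientBoostingRegressor(**params, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()

best_score, best_params = -np.inf, None
for _ in range(10):  # the project ran 100 Optuna trials
    params = {
        "n_estimators": rng.choice([50, 100, 200]),
        "learning_rate": rng.choice([0.03, 0.1, 0.3]),
        "max_depth": rng.choice([2, 3, 4]),
    }
    score = objective(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(-best_score, 3))
```

Optuna improves on pure random search by pruning bad trials early and focusing samples on promising regions, but the objective-function contract is the same.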
Most important molecular descriptors identified:
- For LUMO: fr_Ar_N (aromatic nitrogen), fr_NH0, SMR_VSA3
- For HOMO: fr_NH0, NHOHCount, NumHDonors
- For Stability: Molecular size, aromatic content, bond characteristics
- LUMO predictions explain 92% of variance (highly reliable for screening)
- HOMO predictions explain 83% of variance (useful guidance, requires lab verification)
- Average prediction error ~0.13 energy units
- Cross-validation F1-score of 0.8951
- Lower performance on Kaggle test set indicates need for more diverse training data
- Electronic coupling: HOMO and LUMO are moderately correlated (r = 0.345)
- Design principle: Incorporate nitrogen atoms in aromatic rings for favorable electronic properties
- Stability pattern: Smaller, aromatic molecules (Cluster 0) show highest stability
- Expand Training Data: Increase from ~500 to several thousand molecules
- Reduce Missing Values: Ensure complete property measurements for all molecules
- Real Laboratory Data: Train on actual experimental measurements vs. synthetic data
- Advanced Features: Explore graph neural networks for direct SMILES processing
- Active Learning: Implement feedback loop with experimental results
- XGBoost: Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of KDD, 785-794.
- RDKit: Landrum, G. (2023). RDKit: Open-source cheminformatics software. https://www.rdkit.org
- Optuna: Akiba, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of KDD, 2623-2631.
- HOPV Dataset: Harvard University Clean Energy Project, Department of Chemistry and Chemical Biology
For detailed methodology and results, see docs/Report_Template.pdf.
This project is licensed under the MIT License - see the LICENSE file for details.
Timur Kambarov
- OMEGALAB for the business case and project motivation
- Harvard University Clean Energy Project for the HOPV dataset
- Anthropic's Claude for development assistance :)
For questions or collaboration opportunities, please open an issue in this repository.
Note: This project was developed as part of a machine learning course focused on accelerating renewable energy material discovery.