Machine Learning models for predicting molecular properties of Organic Photovoltaics (OPVs) to accelerate the discovery of efficient and stable solar cell materials.
This project develops predictive machine learning models to forecast critical molecular properties for Organic Photovoltaic (OPV) materials based solely on chemical structure (SMILES strings). By enabling rapid computational screening of candidate molecules, this work accelerates material discovery and reduces the time and cost associated with laboratory synthesis and testing.
- Predict HOMO energy levels (Highest Occupied Molecular Orbital)
- Predict LUMO energy levels (Lowest Unoccupied Molecular Orbital)
- Classify molecular stability (Low, Medium, High)
Traditional material discovery requires weeks of laboratory work per molecule. These models enable:
- Screening thousands of molecules in minutes
- Focusing expensive lab resources on the most promising candidates
- Understanding structure-property relationships for rational molecular design
- LUMO Model: MAE 0.193 (within 6% of actual values), R² = 0.9244
- HOMO Model: MAE 0.132 (within 7% of actual values), R² = 0.8263
- Kaggle Competition: Average MAE 0.1246 (public), 0.1283 (private)
- Cross-validation F1-score: 0.8951
- Kaggle Competition: F1-Score 0.5119 (public), 0.5427 (private)
- HOMO and LUMO are moderately correlated (r = 0.345): These properties tend to move together, reflecting coupled electronic structure
- Nitrogen is critical: Nitrogen-containing aromatic structures strongly influence electronic properties
- Size-stability relationship: Smaller, aromatic molecules tend to exhibit higher stability
The project uses molecular data from multiple sources:
| Dataset | Molecules | Contains |
|---|---|---|
| HOPV (Harvard) | 350 | SMILES, HOMO, LUMO, Stability |
| SQL Database | 232 | SMILES, HOMO, LUMO, Stability |
| Kaggle Test Set | 200 | SMILES (labels hidden) |
| Total Unique | 779 | - |
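The sources overlap, so rows are deduplicated at the SMILES level before training. A minimal sketch of that merge with pandas (the frames and column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy frames standing in for the HOPV and SQL datasets;
# real column names and values differ.
hopv = pd.DataFrame({"smiles": ["c1ccccc1", "c1ccncc1"], "homo": [-5.9, -6.1]})
sql = pd.DataFrame({"smiles": ["c1ccncc1", "CCO"], "homo": [-6.1, -7.0]})

# Concatenate, then keep one row per unique SMILES string
combined = pd.concat([hopv, sql], ignore_index=True)
unique = combined.drop_duplicates(subset="smiles").reset_index(drop=True)
print(len(unique))  # 3 unique molecules from 4 rows
```

In practice the SMILES strings should be canonicalized first (e.g., with RDKit), since one molecule can have several valid SMILES spellings that naive string comparison would treat as distinct.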
Data Sources:
- HOPV dataset from Harvard University Clean Energy Project
- Additional SQL database with bond-level structural features
- Kaggle competition test set for model validation
Note: Target values have been synthetically generated based on realistic chemical distributions for educational purposes.
- Python 3.9 or higher
- Conda (recommended) or pip
```bash
# Clone the repository
git clone https://github.com/TimurKambarov/OPV-Property-Prediction.git
cd OPV-Property-Prediction

# Create conda environment
conda env create -f environment.yml
conda activate opv-env
```

Alternatively, with pip:

```bash
# Clone the repository
git clone https://github.com/TimurKambarov/OPV-Property-Prediction.git
cd OPV-Property-Prediction

# Install dependencies
pip install -r requirements.txt
```

```
OPV-Property-Prediction/
├── README.md
├── LICENSE
├── requirements.txt
├── environment.yml
├── .gitignore
│
├── data/
│   ├── raw/                    # Original datasets
│   │   ├── HOPV_homolumo.data
│   │   ├── HOPV_stability.csv
│   │   ├── starter_dataset.xlsx
│   │   ├── 25261_classification_sample_submission.csv
│   │   └── 25261_regression_sample_submission.csv
│   └── processed/              # Processed data
│
├── models/                     # Trained models
│   ├── xgboost_homo_regressor.joblib        # HOMO regression model
│   ├── xgboost_lumo_regressor.joblib        # LUMO regression model
│   └── xgboost_stability_classifier.joblib  # Stability classifier
│
├── artifacts/                  # Model artifacts for inference
│   ├── feature_columns_homo.joblib    # HOMO model feature list
│   ├── feature_columns_lumo.joblib    # LUMO model feature list
│   ├── constant_columns_homo.joblib   # HOMO constant features to drop
│   ├── constant_columns_lumo.joblib   # LUMO constant features to drop
│   ├── feature_columns.joblib         # Stability model feature list
│   ├── constant_columns.joblib        # Stability constant features to drop
│   └── label_encoder.joblib           # Stability label encoder
│
├── notebooks/                  # Jupyter notebooks
│   ├── 01_clustering.ipynb             # Molecular clustering analysis
│   ├── 02_train_regression.ipynb       # Train HOMO/LUMO models
│   ├── 03_test_regression.ipynb        # Generate regression predictions
│   ├── 04_train_classification.ipynb   # Train stability classifier
│   └── 05_test_classification.ipynb    # Generate classification predictions
│
├── results/                    # Model outputs
│   └── (Kaggle submission files)
│
├── docs/                       # Documentation
│   └── Report_Template.pdf     # Full project report
│
└── images/                     # Visualizations
```
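The models/ and artifacts/ files are meant to be loaded together at inference time: restore the model, the saved feature list, and the constant-column list, then align new data to the training feature order. A hypothetical sketch of that pattern, with an in-memory buffer and a scikit-learn DummyRegressor standing in for the saved XGBoost model (all names and values here are illustrative):

```python
import io

import joblib
import pandas as pd
from sklearn.dummy import DummyRegressor

# --- Stand-ins for the saved artifacts (normally the .joblib files on disk) ---
feature_columns = ["MolWt", "NumHDonors", "fr_NH0"]  # illustrative descriptor names
constant_columns = ["DummyConstant"]                 # features dropped at train time
model = DummyRegressor(strategy="constant", constant=-5.5)
model.fit(pd.DataFrame([[0, 0, 0]], columns=feature_columns), [-5.5])

buf = io.BytesIO()
joblib.dump((model, feature_columns), buf)  # in place of joblib.dump(..., path)

# --- Inference: load artifacts, align features, predict ---
buf.seek(0)
model, cols = joblib.load(buf)

new_molecules = pd.DataFrame(
    {"MolWt": [180.2], "NumHDonors": [1], "fr_NH0": [2], "DummyConstant": [0]}
)
X = new_molecules.drop(columns=constant_columns)[cols]  # same columns, same order
preds = model.predict(X)
print(preds)  # [-5.5]
```

Reindexing with the saved feature list is the step that matters: XGBoost (like most tabular models) assumes the inference matrix has exactly the training columns in the training order.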
Explore molecular families and structural patterns:

```bash
jupyter notebook notebooks/01_clustering.ipynb
```

Train XGBoost models for HOMO and LUMO prediction:

```bash
jupyter notebook notebooks/02_train_regression.ipynb
```

Apply trained models to new molecules:

```bash
jupyter notebook notebooks/03_test_regression.ipynb
```

Train XGBoost classifier for stability prediction:

```bash
jupyter notebook notebooks/04_train_classification.ipynb
```

Apply classifier to new molecules:

```bash
jupyter notebook notebooks/05_test_classification.ipynb
```

- SMILES to Descriptors: Convert molecular structures to numerical features using RDKit
- Feature Engineering: Extract bond-level features and create clustering-based features
- Data Integration: Combine HOPV and SQL datasets after duplicate removal
- Missing Value Treatment: Mean imputation for target variables
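The descriptor computation itself relies on RDKit, but the imputation and constant-column cleanup in the steps above can be sketched with pandas alone (toy data; real descriptor tables have hundreds of columns):

```python
import pandas as pd

# Illustrative descriptor table with a missing target value
df = pd.DataFrame({
    "MolWt": [78.1, 79.1, 46.1],
    "fr_Ar_N": [0, 1, 0],
    "Constant": [1, 1, 1],       # zero-variance feature
    "homo": [-5.9, None, -7.0],  # missing target
})

# Mean imputation for the target, as described above
df["homo"] = df["homo"].fillna(df["homo"].mean())

# Drop zero-variance columns; in this project their names are what the
# constant_columns_*.joblib artifacts record for reuse at inference time
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
features = df.drop(columns=constant_cols + ["homo"])
print(constant_cols)            # ['Constant']
print(list(features.columns))   # ['MolWt', 'fr_Ar_N']
```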
- Algorithm: XGBoost (gradient boosting)
- Optimization: Optuna for hyperparameter tuning (100 trials, 5-fold CV)
- Separate Models: Individual optimized models for HOMO, LUMO, and stability
- Feature Selection: Comprehensive RDKit descriptors + bond features
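The tuning loop follows a standard pattern: sample a hyperparameter configuration, score it with 5-fold cross-validation, and keep the best trial. A runnable stand-in using scikit-learn's GradientBoostingRegressor and a simple seeded random search (the project itself uses XGBoost with Optuna, which automates exactly this loop over 100 trials):

```python
import random

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = random.Random(0)
X = np.random.default_rng(0).normal(size=(60, 4))  # toy descriptor matrix
y = X[:, 0] * 2 + X[:, 1]                          # toy target

def objective(params):
    """Score one hyperparameter sample with 5-fold CV (Optuna's objective plays this role)."""
    model = GradientBoostingRegressor(**params, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()

best_score, best_params = -np.inf, None
for _ in range(10):  # the project ran 100 Optuna trials
    params = {
        "n_estimators": rng.choice([50, 100, 200]),
        "learning_rate": rng.choice([0.03, 0.1, 0.3]),
        "max_depth": rng.choice([2, 3, 4]),
    }
    score = objective(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(-best_score, 3))
```

Optuna improves on pure random search by pruning bad trials early and focusing samples on promising regions, but the objective-function contract is the same.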
Most important molecular descriptors identified:
- For LUMO: fr_Ar_N (aromatic nitrogen), fr_NH0, SMR_VSA3
- For HOMO: fr_NH0, NHOHCount, NumHDonors
- For Stability: Molecular size, aromatic content, bond characteristics
- LUMO predictions explain 92% of variance (highly reliable for screening)
- HOMO predictions explain 83% of variance (useful guidance, requires lab verification)
- Average prediction error ~0.13 energy units
- Cross-validation F1-score of 0.8951
- Lower performance on Kaggle test set indicates need for more diverse training data
- Electronic coupling: HOMO and LUMO are moderately correlated (r = 0.345)
- Design principle: Incorporate nitrogen atoms in aromatic rings for favorable electronic properties
- Stability pattern: Smaller, aromatic molecules (Cluster 0) show highest stability
- Expand Training Data: Increase from ~500 to several thousand molecules
- Reduce Missing Values: Ensure complete property measurements for all molecules
- Real Laboratory Data: Train on actual experimental measurements vs. synthetic data
- Advanced Features: Explore graph neural networks for direct SMILES processing
- Active Learning: Implement feedback loop with experimental results
- XGBoost: Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of KDD, 785-794.
- RDKit: Landrum, G. (2023). RDKit: Open-source cheminformatics software. https://www.rdkit.org
- Optuna: Akiba, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of KDD, 2623-2631.
- HOPV Dataset: Harvard University Clean Energy Project, Department of Chemistry and Chemical Biology
For detailed methodology and results, see docs/Report_Template.pdf.
This project is licensed under the MIT License - see the LICENSE file for details.
Timur Kambarov
- OMEGALAB for the business case and project motivation
- Harvard University Clean Energy Project for the HOPV dataset
- Anthropic's Claude for development assistance :)
For questions or collaboration opportunities, please open an issue in this repository.
Note: This project was developed as part of a machine learning course focused on accelerating renewable energy material discovery.