Companion code for:
Beyond ANOVA: A Structural Equation Modeling and Ensemble Machine Learning Approach to Batch Reactor Process Optimization
Author: Anfal Rababah
Preprint: ChemRxiv / Zenodo
DOI: https://doi.org/10.5281/zenodo.17623232
License: MIT / CC BY 4.0
This repository contains all analysis code used in the manuscript, including:
- Synthetic kinetic data generation (1,024-run factorial design)
- ANOVA with main effects and interaction terms
- Structural Equation Modeling (SEM) with mediation analysis
- Machine Learning (XGBoost) with SHAP interpretability
- Cross-method triangulation and factor-ranking comparison
The full pipeline enables transparent, reproducible reactor optimization using statistical, causal-modeling, and ML approaches.
## System Requirements
- Python: 3.8+
- OS: Windows, macOS, Linux
- RAM: 8 GB minimum (16 GB recommended for ML)
## Installation
Installation takes ~10–15 minutes, depending on internet speed.
### Option 1 — Using pip
# Clone the repository
cd Esterification_Optimization_Code
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
### Option 2 — Using conda
conda env create -f environment.yml
conda activate esterification-opt
## Required Python Packages
numpy>=1.21.0
pandas>=1.3.0
scipy>=1.7.0
matplotlib>=3.4.0
seaborn>=0.11.0
statsmodels>=0.13.0
scikit-learn>=1.0.0
xgboost>=1.5.0
shap>=0.40.0
semopy>=2.3.0
(Complete list in requirements.txt)
## Usage Instructions
### Step 1 — Generate Dataset (~30 s)
cd 01_Data_Generation
python generate_kinetic_data.py
Outputs
- Data/esterification_data_1024.csv
- Summary statistics
### Step 2 — ANOVA (~2 min)
cd 02_ANOVA_Analysis
python anova_main_effects.py
Outputs
- ANOVA table
- Output/Figures/Figure2_ANOVA_MainEffects.png
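To make the idea of a main effect concrete, the sketch below computes the share of total yield variance explained by one factor (between-level sum of squares divided by total sum of squares). In a balanced full factorial design this equals the factor's main-effect sum of squares in the ANOVA table. It is a plain-Python illustration, not the code in `anova_main_effects.py`; the function name `main_effect_share` and the toy rows are assumptions made for this example, and the yields are made up.

```python
from collections import defaultdict


def main_effect_share(rows, factor, response="Yield_pct"):
    """Return SS_between / SS_total for one factor (its eta-squared)."""
    values = [row[response] for row in rows]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Group the response by the levels of the chosen factor.
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[factor]].append(row[response])

    # Between-level sum of squares: n_level * (level mean - grand mean)^2.
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
        for ys in by_level.values()
    )
    return ss_between / ss_total


# Toy 2x2 balanced design with made-up yields (not data from the manuscript).
toy_rows = [
    {"Temperature_C": 60, "Time_min": 30, "Yield_pct": 40.0},
    {"Temperature_C": 60, "Time_min": 90, "Yield_pct": 41.0},
    {"Temperature_C": 80, "Time_min": 30, "Yield_pct": 50.0},
    {"Temperature_C": 80, "Time_min": 90, "Yield_pct": 51.0},
]
temperature_share = main_effect_share(toy_rows, "Temperature_C")
time_share = main_effect_share(toy_rows, "Time_min")
print(f"Temperature: {temperature_share:.3f}, Time: {time_share:.3f}")
```

With a pandas DataFrame, `df.to_dict("records")` produces rows in the shape this function expects.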
### Step 3 — SEM Analysis (~5 min)
cd 03_SEM_Analysis
python sem_model_comparison.py
Outputs
- Fit indices
- Path coefficients
- Mediation effects
- Figure3_SEM_PathDiagram.pdf
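For readers unfamiliar with mediation effects, the sketch below shows the product-of-coefficients idea for a single path x -> m -> y: the indirect effect is the x -> m slope multiplied by the m -> y slope (controlling for x). This is a simplified ordinary-least-squares stand-in written in plain Python, not the semopy model fitted by `sem_model_comparison.py`. The variables `x`, `m`, `y`, the helper names, and the simulated coefficients are all assumptions for illustration and do not correspond to the manuscript's variables or results.

```python
import random


def slope(x, y):
    """OLS slope of y on x (model includes an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def mediation_effects(x, m, y):
    """Product-of-coefficients mediation for the path x -> m -> y."""
    a = slope(x, m)  # path x -> m
    # Path m -> y controlling for x: regress residualized y on residualized m.
    b = slope(residuals(x, m), residuals(x, y))
    total = slope(x, y)  # total effect of x on y
    indirect = a * b
    return {"a": a, "b": b, "indirect": indirect,
            "direct": total - indirect, "total": total}


# Simulated data with assumed true paths a = 0.5, b = 0.8, direct = 0.3.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.8 * mi + rng.gauss(0, 0.1) for xi, mi in zip(x, m)]
effects = mediation_effects(x, m, y)
print({name: round(value, 3) for name, value in effects.items()})
```

The estimates should land close to the simulated values (indirect effect near 0.5 × 0.8 = 0.4). A full SEM additionally estimates all paths jointly and reports fit indices, which this sketch does not attempt.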
### Step 4 — ML Analysis (10–15 min)
cd 04_ML_Analysis
python xgboost_training.py
python shap_analysis.py
python partial_dependence_plots.py
Outputs
- XGBoost model
- SHAP summary plot
- PDP plots
- Feature rankings
### Step 5 — Cross-Method Comparison (~1 min)
cd 05_Cross_Method_Comparison
python triangulation_analysis.py
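One simple way to compare factor rankings across methods is a Spearman rank correlation between two sets of importance scores. The sketch below is an illustration of that idea only; whether `triangulation_analysis.py` uses this statistic is not stated here. The function names are hypothetical and the scores are made-up placeholders, not results from the manuscript.

```python
def ranks(scores):
    """Map each factor to its rank (1 = most important); assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two importance dictionaries."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Placeholder importance scores (made up for this example).
anova_scores = {
    "Temperature_C": 0.50,
    "Time_min": 0.30,
    "Catalyst_Concentration_M": 0.15,
    "Acid_Concentration_M": 0.05,
}
shap_scores = {
    "Temperature_C": 4.1,
    "Time_min": 1.2,
    "Catalyst_Concentration_M": 2.0,
    "Acid_Concentration_M": 0.3,
}
rho = spearman_rho(anova_scores, shap_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of 1 means both methods rank the factors identically, and -1 means the rankings are exactly reversed. The closed-form expression used here is only valid when there are no tied scores.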
### Step 6 — Generate All Figures (~5 min)
cd 06_Visualizations
python generate_all_figures.py
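The six steps above can also be chained from a single script. The following is a minimal sketch, assuming each script is meant to be run from inside its own directory (as the `cd` commands suggest) and that the script is launched from the repository root. The names `PIPELINE` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys
import time
from pathlib import Path

# (directory, script) pairs in the order given in the steps above.
PIPELINE = [
    ("01_Data_Generation", "generate_kinetic_data.py"),
    ("02_ANOVA_Analysis", "anova_main_effects.py"),
    ("03_SEM_Analysis", "sem_model_comparison.py"),
    ("04_ML_Analysis", "xgboost_training.py"),
    ("04_ML_Analysis", "shap_analysis.py"),
    ("04_ML_Analysis", "partial_dependence_plots.py"),
    ("05_Cross_Method_Comparison", "triangulation_analysis.py"),
    ("06_Visualizations", "generate_all_figures.py"),
]


def run_pipeline(repo_root=".", dry_run=False):
    """Run each script from its own directory; stop at the first failure."""
    root = Path(repo_root).resolve()
    timings = []
    for directory, script in PIPELINE:
        workdir = root / directory
        command = [sys.executable, script]
        if dry_run:
            # Only show what would be executed.
            print(f"[dry run] cd {workdir} && {' '.join(command)}")
            continue
        start = time.perf_counter()
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(command, cwd=workdir, check=True)
        timings.append((script, time.perf_counter() - start))
    return timings


# Preview the commands without running anything.
run_pipeline(dry_run=True)
```

Calling `run_pipeline()` without `dry_run` executes the scripts and returns a list of `(script, seconds)` pairs, which can be compared against the runtimes quoted in this README.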
## Reproducing Specific Tables
python 02_ANOVA_Analysis/anova_main_effects.py --output-table # Table 2
python 03_SEM_Analysis/sem_model_comparison.py --save-latex # Table 3
python 04_ML_Analysis/model_comparison.py --export-csv # Table 4
python 05_Cross_Method_Comparison/triangulation_analysis.py --format=latex # Table 5
## Computational Performance
### Hardware used
- Intel Core i7-10700K
- 32 GB RAM
- Windows 10 / Ubuntu 20.04
### Pipeline runtime
- Data generation: 30 s
- ANOVA: 2 min
- SEM: 5 min
- ML training: 15 min
- Total: ~25 minutes
## Troubleshooting
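- ModuleNotFoundError: semopy
  Install the missing package:
  pip install semopy
- XGBoost memory issues
  Reduce the number of CV folds and search iterations:
  cv_folds = 3
  n_iter = 20
- Slow SHAP computation
  Use TreeExplainer, which is optimized for tree-based models:
  explainer = shap.TreeExplainer(model)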
## Extending the Code
### Use with Experimental Data
Replace the synthetic data generator with a call that loads your own CSV file:
import pandas as pd
df = pd.read_csv('your_experimental_data.csv')
Expected columns:
['Temperature_C','Acid_Concentration_M','Catalyst_Concentration_M','Time_min','Yield_pct']
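Before running the pipeline on experimental data, it may help to confirm that the file has the expected columns listed above. The sketch below is one way to do that; the helper names `missing_columns` and `check_columns` are hypothetical and not part of the repository.

```python
EXPECTED_COLUMNS = [
    "Temperature_C",
    "Acid_Concentration_M",
    "Catalyst_Concentration_M",
    "Time_min",
    "Yield_pct",
]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def check_columns(columns):
    """Raise a ValueError naming any expected column that is missing."""
    missing = missing_columns(columns)
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")


# Example with an incomplete header (made up for illustration).
example_header = ["Temperature_C", "Time_min", "Yield_pct"]
print(missing_columns(example_header))
```

With pandas, the same check can be applied right after loading by calling `check_columns(df.columns)`.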
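### Add More ML Models
import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor

models = {
    'XGBoost': xgb.XGBRegressor(),
    'Random Forest': RandomForestRegressor(),
    'Neural Network': MLPRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
}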
## Citation
Please cite this work as:
@article{Rababah2025,
title={Beyond ANOVA: A Structural Equation Modeling and Ensemble Machine Learning Approach to Batch Reactor Process Optimization},
author={Anfal Rababah},
journal={Zenodo},
year={2025},
doi={10.5281/zenodo.17623232}
}
## License
MIT License or CC BY 4.0.
Copyright (c) 2025 Anfal Rababah
## Contact
Email: Anfal0Rababah@gmail.com
GitHub Issues: Available in this repository
## Acknowledgments
NumPy, SciPy, pandas
Statsmodels
semopy
XGBoost + SHAP