Machine learning classification pipeline predicting cardiac disease severity (healthy / mild / moderate-severe) from 920 clinical patient records with 16 features. Applied ensemble methods (LightGBM, XGBoost), handled class imbalance and missing data (66% in coronary arteries, 53% in thalassemia), and used SHAP for feature interpretability.
- Xuyang Chen
- Xinghao Huang
- Kaiyue Wang
- Yiwen Wang
- Junyi Yao
- Boyi Zhang
This end-to-end machine learning project investigates whether models can predict heart disease severity from clinical indicators, with the aim of supporting early intervention and risk stratification. The pipeline covers exploratory data analysis, feature engineering, model training, hyperparameter tuning, and interpretability analysis.
Key metrics:
- Dataset: 920 patient records, 16 clinical features, 3 target classes
- Best model: Tuned LightGBM
- Primary metric: Macro F1 score (optimized for multiclass imbalance)
- Top predictors: ST depression, chest pain type, age, exercise-induced angina
CardioSight/
├── analysis/
│ ├── EDA.ipynb # Exploratory data analysis with statistical summaries
│ └── modelling.ipynb # Model training, tuning, evaluation, and SHAP analysis
├── dataset/
│ └── heart_disease.csv # UCI Heart Disease dataset (920 records, 16 features)
├── visualizations/ # Output artifacts (PNG plots, interactive Altair charts)
├── requirements.txt # Python dependencies with version pinning
└── README.md # This file
Source: UCI Heart Disease Dataset via Kaggle
Specifications:
- Size: 920 patient records
- Features: 16 clinical indicators
- Target: `num`, mapped to 3 severity classes
| Class | Label | Definition |
|---|---|---|
| 0 | No disease | Healthy baseline |
| 1 | Mild | Mild cardiovascular disease |
| 2 | Moderate–Severe | Significant heart disease requiring intervention |
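The exact collapse of `num` into three classes is not spelled out here. The following is a minimal sketch, assuming the original UCI coding of `num` as integers 0 to 4 and assuming that 0 stays "no disease", 1 becomes "mild", and 2 to 4 are merged into "moderate-severe". `SEVERITY_MAP` and `map_severity` are hypothetical names; the notebooks hold the mapping actually used.

```python
import pandas as pd

# Assumed mapping (not confirmed by the README): UCI `num` values 2, 3 and 4
# are merged into a single "moderate-severe" class.
SEVERITY_MAP = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2}


def map_severity(num: pd.Series) -> pd.Series:
    """Collapse the 0-4 `num` target into 3 severity classes (hypothetical helper)."""
    return num.map(SEVERITY_MAP).astype("int64")


example = pd.Series([0, 1, 2, 3, 4], name="num")
print(map_severity(example).tolist())  # [0, 1, 2, 2, 2]
```

Merging the upper grades is one common way to keep each class large enough to learn from, at the cost of losing the distinction between the more severe grades.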
| Feature | Type | Range/Values | Description |
|---|---|---|---|
| `age` | Numeric | 29–77 years | Patient age |
| `sex` | Binary | 0, 1 | 0 = female, 1 = male |
| `cp` | Categorical | 1–4 | Chest pain type (4 categories) |
| `trestbps` | Numeric | 94–200 mm Hg | Resting blood pressure |
| `chol` | Numeric | 126–564 mg/dl | Serum cholesterol |
| `fbs` | Binary | 0, 1 | Fasting blood sugar > 120 mg/dl |
| `restecg` | Categorical | 0–2 | Resting electrocardiogram results |
| `thalch` | Numeric | 60–202 bpm | Maximum heart rate achieved |
| `exang` | Binary | 0, 1 | Exercise-induced angina |
| `oldpeak` | Numeric | 0–6.2 | ST depression induced by exercise |
| `slope` | Categorical | 1–3 | Slope of peak exercise ST segment |
| `ca` | Numeric | 0–4 | Major vessels coloured by fluoroscopy (66% missing) |
| `thal` | Categorical | 3–7 | Thalassemia type (53% missing) |
Data Quality Notes:
- `ca` (coronary arteries): 611 missing values (~66%)
- `thal` (thalassemia): 486 missing values (~53%)
- Both features retained due to clinical significance; addressed via statistical imputation
Imputation strategies:
- Numeric features: mean imputation
- Categorical features: most-frequent imputation
- Binary features: most-frequent imputation
Feature scaling and encoding:
- Numeric features: standardization (zero mean, unit variance)
- Categorical features: one-hot encoding
- Binary features: passthrough (no transformation after imputation)
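The imputation and scaling rules above can be sketched as a scikit-learn `ColumnTransformer`. This is an illustration, not the notebook's code: the toy frame uses one representative column per group (the remaining columns from the feature table would join the matching list), and `sparse_threshold=0` is chosen here only to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One representative column per group (assumed grouping, based on the feature table).
numeric_cols = ["oldpeak"]
categorical_cols = ["cp"]
binary_cols = ["exang"]

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric: mean imputation, then standardization.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical: most-frequent imputation, then one-hot encoding.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
        # Binary: most-frequent imputation, values otherwise passed through.
        ("bin", SimpleImputer(strategy="most_frequent"), binary_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Toy data with one missing value per column.
toy = pd.DataFrame({
    "oldpeak": [0.0, 1.0, np.nan, 3.0],
    "cp": [1.0, 4.0, 4.0, np.nan],
    "exang": [0.0, 1.0, 1.0, np.nan],
})

X_toy = preprocessor.fit_transform(toy)
print(X_toy.shape)  # (4, 4): 1 scaled numeric + 2 one-hot + 1 binary column
```

Fitting the transformer on the training split only, then applying it to the test split, keeps imputation statistics from leaking across the split.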
Train/test split:
- Stratified 70/30 split
- Train: 644 samples | Test: 276 samples
- Stratification preserves class distribution across splits
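A stratified 70/30 split of 920 rows can be reproduced with `train_test_split`. The features and labels below are synthetic stand-ins, and the class proportions and `random_state` are made-up values for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(920, 3))  # synthetic stand-in features
# Synthetic 3-class labels; these proportions are invented for the example.
y = rng.choice([0, 1, 2], size=920, p=[0.45, 0.30, 0.25])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape[0], X_test.shape[0])  # 644 276

# Class shares should be nearly identical in both splits.
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```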
Two domain-informed composite features created:
| Feature | Formula | Clinical Rationale |
|---|---|---|
| `risk_score` | cholesterol + resting_bp | Combines two primary CVD risk factors |
| `cardiac_stress` | resting_bp × max_heart_rate | Captures hemodynamic load during exercise |
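The two composite features reduce to simple column arithmetic. The sketch below assumes that "cholesterol", "resting_bp", and "max_heart_rate" in the formulas correspond to the `chol`, `trestbps`, and `thalch` columns; `add_composite_features` is a hypothetical helper name.

```python
import pandas as pd


def add_composite_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the two composite features appended."""
    out = df.copy()
    # Sum of serum cholesterol and resting blood pressure.
    out["risk_score"] = out["chol"] + out["trestbps"]
    # Product of resting blood pressure and maximum heart rate.
    out["cardiac_stress"] = out["trestbps"] * out["thalch"]
    return out


toy = pd.DataFrame({
    "chol": [200.0, 250.0],
    "trestbps": [120.0, 140.0],
    "thalch": [150.0, 160.0],
})
result = add_composite_features(toy)
print(result[["risk_score", "cardiac_stress"]])
# risk_score: 320.0, 390.0; cardiac_stress: 18000.0, 22400.0
```

Because both features are built from columns that may contain missing values, computing them after imputation avoids propagating NaNs into the new columns.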
| Model | Type | Purpose |
|---|---|---|
| DummyClassifier | Baseline | Stratified random baseline for comparison |
| Logistic Regression | Linear model | Linear baseline; surprisingly strong (~0.64 macro F1) |
| Random Forest | Ensemble | Captures non-linear patterns, handles feature interactions |
| XGBoost | Gradient boosting | Sequential optimization with regularization |
| LightGBM (tuned) | Gradient boosting | Best performer; optimized hyperparameters via grid search |
Evaluation methodology:
- Primary metric: Macro F1 score (unweighted mean of per-class F1 across the 3 classes, so minority classes count as much as the majority class)
- Cross-validation: stratified k-fold (fold count specified in the notebooks)
- Hyperparameter tuning: grid search with cross-validation for LightGBM
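The methodology above can be sketched with `GridSearchCV`, scored by macro F1 over stratified folds. To keep the example runnable with scikit-learn alone, a `RandomForestClassifier` stands in for LightGBM; `LGBMClassifier` follows the scikit-learn estimator interface, so it could be substituted. The data, the fold count, and the parameter grid are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the preprocessed training set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# Fold count is an assumption; the notebooks define the real value.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Illustrative grid, not the grid used in the project.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # stand-in for the tuned LightGBM model
    param_grid=param_grid,
    scoring="f1_macro",  # select hyperparameters by macro F1
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean macro F1 across folds for the best setting
```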
| Model | Macro F1 Score |
|---|---|
| DummyClassifier (baseline) | ~0.34 |
| Logistic Regression | ~0.64 |
| Random Forest | — |
| XGBoost | — |
| LightGBM (tuned) | Best |
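For readers comparing the scores in the table, macro F1 is the plain average of the per-class F1 values. The worked example below computes it by hand on invented labels; `macro_f1` is a hypothetical helper that mirrors the standard definition, not code from the notebooks.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented labels: class 0 is the majority, class 2 the minority.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# Per-class F1: class 0 = 0.75, class 1 = 0.80, class 2 = 0.667
print(round(macro_f1(y_true, y_pred), 4))  # 0.7389
```

In this example accuracy is 0.75, while macro F1 is lower because the weakest class pulls the average down regardless of its size.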
Model interpretation via SHAP (SHapley Additive exPlanations) identified top predictive features:
- ST depression (`oldpeak`): strongest single predictor
  - Continuous measure of exercise-induced ischemia
  - Clear separation between severity classes
- Chest pain type (`cp`): symptom classification with high signal
  - Different pain types strongly stratify disease severity
  - Categorical feature with strong discriminative power
- Age: demographic risk factor
  - Older patients skew toward moderate/severe categories
  - Non-linear relationship with severity
- Exercise-induced angina (`exang`): binary clinical indicator
  - Binary flag indicating chest pain during exertion
  - Strong predictor of disease presence and severity
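To make concrete what a SHAP value attributes to each feature, the sketch below computes exact Shapley values for a tiny, made-up risk function by averaging marginal contributions over every feature ordering. It does not call the `shap` package used in the notebooks, which estimates these quantities efficiently for real models; `toy_risk`, its coefficients, and the baseline are invented for illustration.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    features = list(instance)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)  # start with every feature at its baseline value
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(x):
    """Made-up risk score with an interaction term (not the project's model)."""
    return 0.5 * x["oldpeak"] + 0.3 * x["exang"] + 0.2 * x["oldpeak"] * x["exang"]


baseline = {"oldpeak": 0.0, "exang": 0.0}
patient = {"oldpeak": 2.0, "exang": 1.0}

phi = shapley_values(toy_risk, patient, baseline)
print(phi)  # approximately {'oldpeak': 1.2, 'exang': 0.5}
```

The two attributions sum to the gap between the patient's score (1.7) and the baseline score (0.0), and the interaction term is shared equally between the two features.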
All outputs saved in visualizations/:
- `eda_plot1–6.png`: Feature distributions, correlations, severity class separation
- `eda_plot4–5.html`: Interactive Altair visualizations for exploration
- `ml_confusion_matrix.png`: Confusion matrix for best model (LightGBM)
- `ml_shap_feature_importance.png`: SHAP summary plot showing feature contributions
# Clone repository
git clone https://github.com/z-boyi/CardioSight.git
cd CardioSight
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook

- `analysis/EDA.ipynb`: Exploratory data analysis
  - Feature distributions, missing data patterns
  - Correlation analysis, severity class visualization
  - Statistical summaries by class
- `analysis/modelling.ipynb`: Model training and evaluation
  - Preprocessing pipeline setup
  - Baseline and production model training
  - Hyperparameter tuning
  - SHAP interpretability analysis
  - Results visualization and comparison
| Category | Technologies |
|---|---|
| Data | pandas, numpy |
| ML | scikit-learn, XGBoost, LightGBM |
| Interpretability | SHAP |
| Visualization | Altair, Matplotlib, Seaborn |
| Environment | Jupyter, Python 3.8+ |
If you use this work, please cite the original dataset:
Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X
- Data Science: EDA, feature engineering, statistical analysis
- Machine Learning: Supervised learning, classification, ensemble methods, hyperparameter optimization
- Data Quality: Missing data imputation, class imbalance handling, data preprocessing
- Model Evaluation: Cross-validation, multiple metrics (Macro F1, accuracy, confusion matrices)
- Model Interpretability: SHAP analysis, feature importance
- Tools & Languages: Python, pandas, NumPy, scikit-learn, Jupyter, Git