A complete, end-to-end machine learning pipeline for early-stage diabetes risk prediction based on patient-reported symptoms. The pipeline covers everything from automated data acquisition and exploratory data analysis, through statistical feature selection and hyperparameter-optimised model training, to final model serialisation and inference.
- Dataset
- Pipeline Overview
- Machine Learning Algorithms
- Technologies & Libraries
- How to Run Locally
- Project Structure
- Results
| Property | Value |
|---|---|
| Source | Early Stage Diabetes Risk Prediction Dataset — Kaggle |
| Samples | 520 patients |
| Features | 16 symptom-based features + age |
| Target | class — Positive (diabetic) / Negative (non-diabetic) |
| Class distribution | ~61.5 % Positive / 38.5 % Negative (imbalanced) |
| Feature | Type | Description |
|---|---|---|
| `Age` | Numerical | Patient age (20–65) |
| `Gender` | Binary | Male / Female |
| `Polyuria` | Binary | Excessive urination |
| `Polydipsia` | Binary | Excessive thirst |
| `sudden weight loss` | Binary | Sudden weight loss |
| `weakness` | Binary | General weakness |
| `Polyphagia` | Binary | Excessive hunger |
| `Genital thrush` | Binary | Fungal infection |
| `visual blurring` | Binary | Blurred vision |
| `Itching` | Binary | Skin itching |
| `Irritability` | Binary | Irritability |
| `delayed healing` | Binary | Slow wound healing |
| `partial paresis` | Binary | Partial muscle weakness |
| `muscle stiffness` | Binary | Muscle stiffness |
| `Alopecia` | Binary | Hair loss |
| `Obesity` | Binary | Obesity |
All binary features are Yes/No values encoded as 1/0 using Label Encoding.
The notebook `diabetes_risk_final.ipynb` implements the following sequential steps:
The dataset is downloaded automatically from Kaggle using the `kagglehub` library — no manual downloading is required. A valid Kaggle API token (`kaggle.json`) must be present (see How to Run Locally).
```python
import kagglehub

# Downloads the dataset (or reuses a cached copy) and returns its local path
path = kagglehub.dataset_download("ishandutta/early-stage-diabetes-risk-prediction-dataset")
```

- Shape, dtypes, descriptive statistics (`data.info()`, `data.describe()`)
- Unique value counts per column — confirms no missing values
- Distribution plots (count plots) for all categorical features
- Stacked bar charts showing the relationship between each symptom and the target variable — identifies strong predictors (Polyuria, Polydipsia, Gender)
- Label Encoding (scikit-learn `LabelEncoder`) is applied to all binary categorical features (Yes→1, No→0)
- Label Encoding is preferred over One-Hot Encoding because all binary features have no ordinal ambiguity, and it keeps the feature count unchanged
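A minimal sketch of this encoding step, run on a toy frame standing in for the real data (the notebook's exact variable names may differ):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Kaggle data (the real set has 520 rows and 17 columns)
df = pd.DataFrame({
    "Polyuria": ["Yes", "No", "Yes"],
    "Gender": ["Male", "Female", "Male"],
    "class": ["Positive", "Negative", "Positive"],
})

# Fit a separate LabelEncoder per column; classes are sorted alphabetically,
# so "No" < "Yes" maps to 0/1 and "Negative" < "Positive" maps to 0/1
for col in df.columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df["Polyuria"].tolist())  # [1, 0, 1]
```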
- A Pearson correlation heatmap (seaborn) is produced on the encoded dataset
- Identifies potential multicollinearity: Polyuria and Polydipsia show a correlation of ~0.6
- Notable: Gender and Alopecia are negatively correlated with the target
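The heatmap presumably comes from a call along these lines; the tiny encoded frame here is purely illustrative, not the project's data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import seaborn as sns
import matplotlib.pyplot as plt

# Small encoded stand-in; the notebook computes this over all 17 columns
df = pd.DataFrame({
    "Polyuria":   [1, 0, 1, 1, 0],
    "Polydipsia": [1, 0, 1, 0, 0],
    "class":      [1, 0, 1, 1, 0],
})

corr = df.corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
```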
Total: 520 samples → Train: 80% (416) | Test: 20% (104)
Stratified split to preserve class distribution
The split happens before feature selection to prevent data leakage.
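The split can be reproduced with scikit-learn's `train_test_split`; the arrays below are synthetic stand-ins with the dataset's size and class balance, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 520 samples with the real target's ~61.5/38.5 imbalance
rng = np.random.default_rng(0)
X = rng.random((520, 16))
y = np.array([1] * 320 + [0] * 200)

# stratify=y preserves the Positive/Negative ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 416 104
```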
Two types of tests are applied separately depending on feature type:
Categorical features — Chi-square / Fisher's Exact Test:
- Chi-square test (or Fisher's Exact Test when expected cell frequencies < 5)
- Selection criterion: `p < 0.05` AND `Cramér's V > 0.1` (at least weak association)
- Cramér's V quantifies the strength of association beyond statistical significance
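A sketch of this selection rule for one binary feature, with a hypothetical 2×2 contingency table and a small `cramers_v` helper of my own (not a library function):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def cramers_v(table: np.ndarray) -> float:
    """Cramér's V for a contingency table (equals |phi| for a 2x2 table)."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Hypothetical counts: symptom (rows: No/Yes) vs class (cols: Negative/Positive)
table = np.array([[150, 50],
                  [50, 166]])

chi2, p, dof, expected = chi2_contingency(table)
# Fall back to Fisher's exact test when any expected cell count is below 5
if (expected < 5).any():
    _, p = fisher_exact(table)

v = cramers_v(table)
keep = (p < 0.05) and (v > 0.1)  # the selection criterion above
```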
Numerical feature (Age) — T-test:
- Normality checked with Shapiro-Wilk test + histogram + Q-Q plot
- Independent samples T-test for `Age` by class
- Confirmed significant difference (p < 0.05); Age is retained
All tests are computed exclusively on the training set to avoid data leakage.
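The Age test can be sketched as follows; the two samples here are simulated, with means and sizes chosen only for illustration:

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind

rng = np.random.default_rng(0)
# Hypothetical Age samples per class (the notebook uses the training split)
age_pos = rng.normal(52, 10, 250)  # diabetic patients
age_neg = rng.normal(45, 10, 160)  # non-diabetic patients

# Shapiro-Wilk normality check per group
_, p_norm_pos = shapiro(age_pos)
_, p_norm_neg = shapiro(age_neg)

# Independent-samples t-test on Age by class (Welch variant, unequal variances)
t_stat, p_value = ttest_ind(age_pos, age_neg, equal_var=False)
retain_age = p_value < 0.05
```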
Each model's hyperparameters are tuned using Optuna (Bayesian/TPE optimisation) with 5-fold Stratified Cross-Validation on the training set, optimising ROC AUC.
| Model | Tuned Hyperparameters |
|---|---|
| Logistic Regression | C, solver, penalty |
| K-Nearest Neighbors | n_neighbors, weights, p (Manhattan/Euclidean) |
| Random Forest | n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features |
Three scikit-learn Pipeline objects are built (each with StandardScaler → classifier). All models are evaluated via 5-fold Stratified Cross-Validation on the training set using:
| Metric | Description |
|---|---|
| Accuracy | Overall correct predictions |
| Balanced Accuracy | Accuracy adjusted for class imbalance |
| Precision | Ratio of true positives among predicted positives |
| Recall (Sensitivity) | Ratio of true positives among actual positives |
| F1-Score | Harmonic mean of Precision and Recall |
| ROC AUC | Area Under the ROC Curve |
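A compact sketch of the pipeline-plus-CV evaluation (synthetic data; scoring names are scikit-learn's built-in identifiers for the metrics above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split
rng = np.random.default_rng(2)
X = rng.random((200, 4))
y = (X[:, 0] > 0.4).astype(int)

# Scaler -> classifier, as in the project's three pipelines
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

scoring = ["accuracy", "balanced_accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
mean_auc = scores["test_roc_auc"].mean()
```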
- Confusion matrices for each model (CV predictions on train set)
- Precision-Recall curves with Average Precision scores
- ROC curves with AUC values
Given the medical context, Type II errors (False Negatives) are most dangerous — failing to detect a diabetic patient. The analysis quantifies False Negative Rate (FNR) and False Positive Rate (FPR) per model and recommends the model with the lowest FNR.
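FNR and FPR fall directly out of the confusion matrix; the labels below are hypothetical, not the project's CV predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical CV predictions (1 = Positive/diabetic)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)  # missed diabetics — the dangerous Type II error
fpr = fp / (fp + tn)  # false alarms — the Type I error
```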
The Wilcoxon signed-rank test is applied to compare model pairs based on their 5 cross-validation ROC AUC scores. This non-parametric test determines whether performance differences are statistically significant (caveat: only 5 paired observations limits statistical power).
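The comparison reduces to one `scipy` call per model pair; the per-fold AUC values below are made up, and they also illustrate the power caveat (with only 5 folds, even a model that wins every fold cannot reach p < 0.05 two-sided):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold ROC AUC scores for two models (5 paired folds)
auc_rf = np.array([0.990, 0.995, 0.990, 1.000, 0.995])
auc_lr = np.array([0.970, 0.980, 0.960, 0.995, 0.985])

# Paired, non-parametric test on the fold-wise differences
stat, p = wilcoxon(auc_rf, auc_lr)
significant = p < 0.05  # False here: 5 pairs give a minimum two-sided p of 0.0625
```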
The best-performing model is retrained on the full training set and evaluated on the held-out test set (first and only time the test set is used). The final model and selected feature list are saved:
- `diabetes_risk_model.pkl` — serialised sklearn `Pipeline`
- `selected_features.pkl` — list of feature names used by the model
The notebook demonstrates loading the saved model and running predictions on new samples, printing predicted class and class probabilities.
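The save/load/predict round trip can be sketched as below; the tiny fitted pipeline and the three-feature sample are stand-ins for the notebook's tuned model and real feature set:

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the trained pipeline (the notebook saves its own tuned model)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
X_train = np.array([[30, 0, 0], [55, 1, 1], [45, 1, 0], [60, 1, 1]], dtype=float)
y_train = np.array([0, 1, 0, 1])
pipe.fit(X_train, y_train)

# Serialise and reload, mirroring diabetes_risk_model.pkl
with open("model_demo.pkl", "wb") as f:
    pickle.dump(pipe, f)
with open("model_demo.pkl", "rb") as f:
    model = pickle.load(f)

# Predict class and class probabilities for a new sample
sample = np.array([[58, 1, 1]], dtype=float)
pred = model.predict(sample)[0]
proba = model.predict_proba(sample)[0]
print(pred, proba)
```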
A linear classifier that models the log-odds of the target class as a linear combination of features. Uses class_weight='balanced' to handle class imbalance. L1/L2 regularisation tuned via Optuna.
A non-parametric instance-based learner that classifies a sample by majority vote among its k nearest neighbours. Distance metric (Manhattan/Euclidean) and weighting scheme are tuned. Does not support class_weight; imbalance is partially mitigated via stratified splits and AUC-based evaluation.
An ensemble of decision trees using bootstrap aggregation (bagging) and random feature subsets. Uses class_weight='balanced' to account for class imbalance. Typically the best performer due to its ability to capture non-linear feature interactions.
All algorithms are wrapped in scikit-learn Pipeline objects with StandardScaler as the first step (scaling is critical for Logistic Regression and KNN; harmless for Random Forest).
| Library | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Programming language |
| `jupyter` | — | Interactive notebook environment |
| `numpy` | — | Numerical computing |
| `pandas` | — | Data manipulation and analysis |
| `matplotlib` | — | Plotting |
| `seaborn` | — | Statistical data visualisation |
| `scikit-learn` | 1.x | ML models, preprocessing, evaluation, pipelines |
| `optuna` | — | Automated hyperparameter optimisation (TPE) |
| `scipy` | — | Statistical tests (Chi-square, Fisher, T-test, Wilcoxon, Shapiro-Wilk) |
| `kagglehub` | — | Automatic Kaggle dataset download |
| `pickle` | stdlib | Model serialisation |
- Python 3.8 or higher
- A Kaggle account with API access enabled
```bash
git clone https://github.com/aneq05/Diabetes_detection.git
cd Diabetes_detection
```

```bash
python -m venv venv

# Linux / macOS
source venv/bin/activate

# Windows
venv\Scripts\activate
```

```bash
pip install jupyter numpy pandas matplotlib seaborn scikit-learn optuna scipy kagglehub
```

- Go to https://www.kaggle.com/settings → scroll to API → click Create New Token
- A file called `kaggle.json` will be downloaded
- Place it in the default Kaggle credentials directory:

  | OS | Path |
  |---|---|
  | Linux / macOS | `~/.kaggle/kaggle.json` |
  | Windows | `C:\Users\<username>\.kaggle\kaggle.json` |

- Set correct permissions (Linux/macOS only):

  ```bash
  chmod 600 ~/.kaggle/kaggle.json
  ```
```bash
jupyter notebook diabetes_risk_final.ipynb
```

Then run all cells from top to bottom (Kernel → Restart & Run All).
Note: The dataset is downloaded automatically on first run. Subsequent runs reuse the cached copy in `~/.cache/kagglehub/`.
- Exploratory plots (distributions, correlation heatmap, crosstabs)
- Feature selection report (Chi-square / Fisher / T-test results)
- Optuna optimisation progress logs
- Cross-validation evaluation tables
- Confusion matrices, Precision-Recall curves, ROC curves
- Final test-set metrics
- Saved files: `diabetes_risk_model.pkl`, `selected_features.pkl`
Diabetes_detection/
├── diabetes_risk_final.ipynb # Main Jupyter notebook — full ML pipeline
├── diabetes_risk_model.pkl # Serialised final model (generated after running)
├── selected_features.pkl # Selected feature list (generated after running)
└── README.md # This file
After Optuna tuning and 5-fold cross-validation on the training set, all three models achieve strong performance. The final model is selected based on minimising the False Negative Rate (most important in a medical screening context — avoiding missed diagnoses).
| Model | ROC AUC (CV) | Balanced Accuracy (CV) | Recall (CV) |
|---|---|---|---|
| Logistic Regression | ~0.976 | — | — |
| K-Nearest Neighbors | ~0.998 | — | — |
| Random Forest | ~0.998 | — | — |
Random Forest is recommended as the final model due to its superior ability to capture non-linear interactions between symptoms and its lowest False Negative Rate in the medical error analysis.
Exact metric values may vary slightly across runs due to the stochastic nature of Optuna's search and cross-validation random seeds.
This project is open-source and available under the MIT License.