A complete, end-to-end machine learning pipeline for early-stage diabetes risk prediction based on patient-reported symptoms. The pipeline covers everything from automated data acquisition and exploratory data analysis, through statistical feature selection and hyperparameter-optimised model training, to final model serialisation and inference.
- Dataset
- Pipeline Overview
- Machine Learning Algorithms
- Technologies & Libraries
- How to Run Locally
- Project Structure
- Results
| Property | Value |
|---|---|
| Source | Early Stage Diabetes Risk Prediction Dataset — Kaggle |
| Samples | 520 patients |
| Features | 16 symptom-based features + age |
| Target | class — Positive (diabetic) / Negative (non-diabetic) |
| Class distribution | ~61.5 % Positive / 38.5 % Negative (imbalanced) |
| Feature | Type | Description |
|---|---|---|
| `Age` | Numerical | Patient age (20–65) |
| `Gender` | Binary | Male / Female |
| `Polyuria` | Binary | Excessive urination |
| `Polydipsia` | Binary | Excessive thirst |
| `sudden weight loss` | Binary | Sudden weight loss |
| `weakness` | Binary | General weakness |
| `Polyphagia` | Binary | Excessive hunger |
| `Genital thrush` | Binary | Fungal infection |
| `visual blurring` | Binary | Blurred vision |
| `Itching` | Binary | Skin itching |
| `Irritability` | Binary | Irritability |
| `delayed healing` | Binary | Slow wound healing |
| `partial paresis` | Binary | Partial muscle weakness |
| `muscle stiffness` | Binary | Muscle stiffness |
| `Alopecia` | Binary | Hair loss |
| `Obesity` | Binary | Obesity |
All binary features are Yes/No values encoded as 1/0 using Label Encoding.
The notebook `diabetes_risk_final.ipynb` implements the following sequential steps:
The dataset is downloaded automatically from Kaggle using the `kagglehub` library — no manual downloading is required. A valid Kaggle API token (`kaggle.json`) must be present (see How to Run Locally).
```python
import kagglehub

# Downloads the dataset (or reuses a cached copy) and returns its local path
path = kagglehub.dataset_download("ishandutta/early-stage-diabetes-risk-prediction-dataset")
```

- Shape, dtypes, descriptive statistics (`data.info()`, `data.describe()`)
- Unique value counts per column — confirms no missing values
- Distribution plots (count plots) for all categorical features
- Stacked bar charts showing the relationship between each symptom and the target variable — identifies strong predictors (Polyuria, Polydipsia, Gender)
- Label Encoding (scikit-learn `LabelEncoder`) is applied to all binary categorical features (Yes→1, No→0)
- Label Encoding is preferred over One-Hot Encoding because all binary features have no ordinal ambiguity, and it keeps the feature count unchanged
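A minimal sketch of this encoding step, run on a toy frame standing in for the real data (the notebook's exact variable names may differ):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Kaggle data (the real set has 520 rows and 17 columns)
df = pd.DataFrame({
    "Polyuria": ["Yes", "No", "Yes"],
    "Gender": ["Male", "Female", "Male"],
    "class": ["Positive", "Negative", "Positive"],
})

# Fit a separate LabelEncoder per column; classes are sorted alphabetically,
# so "No" < "Yes" maps to 0/1 and "Negative" < "Positive" maps to 0/1
for col in df.columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df["Polyuria"].tolist())  # [1, 0, 1]
```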
- A Pearson correlation heatmap (seaborn) is produced on the encoded dataset
- Identifies potential multicollinearity: Polyuria and Polydipsia show a correlation of ~0.6
- Notable: Gender and Alopecia are negatively correlated with the target
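The heatmap presumably comes from a call along these lines; the tiny encoded frame here is purely illustrative, not the project's data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import seaborn as sns
import matplotlib.pyplot as plt

# Small encoded stand-in; the notebook computes this over all 17 columns
df = pd.DataFrame({
    "Polyuria":   [1, 0, 1, 1, 0],
    "Polydipsia": [1, 0, 1, 0, 0],
    "class":      [1, 0, 1, 1, 0],
})

corr = df.corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
```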
Total: 520 samples → Train: 80% (416) | Test: 20% (104)
Stratified split to preserve class distribution
The split happens before feature selection to prevent data leakage.
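The split can be reproduced with scikit-learn's `train_test_split`; the arrays below are synthetic stand-ins with the dataset's size and class balance, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 520 samples with the real target's ~61.5/38.5 imbalance
rng = np.random.default_rng(0)
X = rng.random((520, 16))
y = np.array([1] * 320 + [0] * 200)

# stratify=y preserves the Positive/Negative ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 416 104
```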
Two types of tests are applied separately depending on feature type:
Categorical features — Chi-square / Fisher's Exact Test:
- Chi-square test (or Fisher's Exact Test when expected cell frequencies < 5)
- Selection criterion: `p < 0.05` AND `Cramér's V > 0.1` (at least weak association)
- Cramér's V quantifies the strength of association beyond statistical significance
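A sketch of this selection rule for one binary feature, with a hypothetical 2×2 contingency table and a small `cramers_v` helper of my own (not a library function):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def cramers_v(table: np.ndarray) -> float:
    """Cramér's V for a contingency table (equals |phi| for a 2x2 table)."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Hypothetical counts: symptom (rows: No/Yes) vs class (cols: Negative/Positive)
table = np.array([[150, 50],
                  [50, 166]])

chi2, p, dof, expected = chi2_contingency(table)
# Fall back to Fisher's exact test when any expected cell count is below 5
if (expected < 5).any():
    _, p = fisher_exact(table)

v = cramers_v(table)
keep = (p < 0.05) and (v > 0.1)  # the selection criterion above
```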
Numerical feature (Age) — T-test:
- Normality checked with Shapiro-Wilk test + histogram + Q-Q plot
- Independent samples T-test for `Age` by class
- Confirmed significant difference (p < 0.05); Age is retained
All tests are computed exclusively on the training set to avoid data leakage.
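The Age test can be sketched as follows; the two samples here are simulated, with means and sizes chosen only for illustration:

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind

rng = np.random.default_rng(0)
# Hypothetical Age samples per class (the notebook uses the training split)
age_pos = rng.normal(52, 10, 250)  # diabetic patients
age_neg = rng.normal(45, 10, 160)  # non-diabetic patients

# Shapiro-Wilk normality check per group
_, p_norm_pos = shapiro(age_pos)
_, p_norm_neg = shapiro(age_neg)

# Independent-samples t-test on Age by class (Welch variant, unequal variances)
t_stat, p_value = ttest_ind(age_pos, age_neg, equal_var=False)
retain_age = p_value < 0.05
```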
Each model's hyperparameters are tuned using Optuna (Bayesian/TPE optimisation) with 5-fold Stratified Cross-Validation on the training set, optimising ROC AUC.
| Model | Tuned Hyperparameters |
|---|---|
| Logistic Regression | C, solver, penalty |
| K-Nearest Neighbors | n_neighbors, weights, p (Manhattan/Euclidean) |
| Random Forest | n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features |
Three scikit-learn Pipeline objects are built (each with StandardScaler → classifier). All models are evaluated via 5-fold Stratified Cross-Validation on the training set using:
| Metric | Description |
|---|---|
| Accuracy | Overall correct predictions |
| Balanced Accuracy | Accuracy adjusted for class imbalance |
| Precision | Ratio of true positives among predicted positives |
| Recall (Sensitivity) | Ratio of true positives among actual positives |
| F1-Score | Harmonic mean of Precision and Recall |
| ROC AUC | Area Under the ROC Curve |
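A compact sketch of the pipeline-plus-CV evaluation (synthetic data; scoring names are scikit-learn's built-in identifiers for the metrics above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split
rng = np.random.default_rng(2)
X = rng.random((200, 4))
y = (X[:, 0] > 0.4).astype(int)

# Scaler -> classifier, as in the project's three pipelines
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

scoring = ["accuracy", "balanced_accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
mean_auc = scores["test_roc_auc"].mean()
```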
- Confusion matrices for each model (CV predictions on train set)
- Precision-Recall curves with Average Precision scores
- ROC curves with AUC values
Given the medical context, Type II errors (False Negatives) are most dangerous — failing to detect a diabetic patient. The analysis quantifies False Negative Rate (FNR) and False Positive Rate (FPR) per model and recommends the model with the lowest FNR.
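FNR and FPR fall directly out of the confusion matrix; the labels below are hypothetical, not the project's CV predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical CV predictions (1 = Positive/diabetic)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)  # missed diabetics — the dangerous Type II error
fpr = fp / (fp + tn)  # false alarms — the Type I error
```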
The Wilcoxon signed-rank test is applied to compare model pairs based on their 5 cross-validation ROC AUC scores. This non-parametric test determines whether performance differences are statistically significant (caveat: only 5 paired observations limits statistical power).
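The comparison reduces to one `scipy` call per model pair; the per-fold AUC values below are made up, and they also illustrate the power caveat (with only 5 folds, even a model that wins every fold cannot reach p < 0.05 two-sided):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold ROC AUC scores for two models (5 paired folds)
auc_rf = np.array([0.990, 0.995, 0.990, 1.000, 0.995])
auc_lr = np.array([0.970, 0.980, 0.960, 0.995, 0.985])

# Paired, non-parametric test on the fold-wise differences
stat, p = wilcoxon(auc_rf, auc_lr)
significant = p < 0.05  # False here: 5 pairs give a minimum two-sided p of 0.0625
```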
The best-performing model is retrained on the full training set and evaluated on the held-out test set (first and only time the test set is used). The final model and selected feature list are saved:
- `diabetes_risk_model.pkl` — serialised sklearn `Pipeline`
- `selected_features.pkl` — list of feature names used by the model
The notebook demonstrates loading the saved model and running predictions on new samples, printing predicted class and class probabilities.
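The save/load/predict round trip can be sketched as below; the tiny fitted pipeline and the three-feature sample are stand-ins for the notebook's tuned model and real feature set:

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the trained pipeline (the notebook saves its own tuned model)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
X_train = np.array([[30, 0, 0], [55, 1, 1], [45, 1, 0], [60, 1, 1]], dtype=float)
y_train = np.array([0, 1, 0, 1])
pipe.fit(X_train, y_train)

# Serialise and reload, mirroring diabetes_risk_model.pkl
with open("model_demo.pkl", "wb") as f:
    pickle.dump(pipe, f)
with open("model_demo.pkl", "rb") as f:
    model = pickle.load(f)

# Predict class and class probabilities for a new sample
sample = np.array([[58, 1, 1]], dtype=float)
pred = model.predict(sample)[0]
proba = model.predict_proba(sample)[0]
print(pred, proba)
```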
A linear classifier that models the log-odds of the target class as a linear combination of features. Uses class_weight='balanced' to handle class imbalance. L1/L2 regularisation tuned via Optuna.
A non-parametric instance-based learner that classifies a sample by majority vote among its k nearest neighbours. Distance metric (Manhattan/Euclidean) and weighting scheme are tuned. Does not support class_weight; imbalance is partially mitigated via stratified splits and AUC-based evaluation.
An ensemble of decision trees using bootstrap aggregation (bagging) and random feature subsets. Uses class_weight='balanced' to account for class imbalance. Typically the best performer due to its ability to capture non-linear feature interactions.
All algorithms are wrapped in scikit-learn Pipeline objects with StandardScaler as the first step (scaling is critical for Logistic Regression and KNN; harmless for Random Forest).
| Library | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Programming language |
| `jupyter` | — | Interactive notebook environment |
| `numpy` | — | Numerical computing |
| `pandas` | — | Data manipulation and analysis |
| `matplotlib` | — | Plotting |
| `seaborn` | — | Statistical data visualisation |
| `scikit-learn` | 1.x | ML models, preprocessing, evaluation, pipelines |
| `optuna` | — | Automated hyperparameter optimisation (TPE) |
| `scipy` | — | Statistical tests (Chi-square, Fisher, T-test, Wilcoxon, Shapiro-Wilk) |
| `kagglehub` | — | Automatic Kaggle dataset download |
| `pickle` | stdlib | Model serialisation |
- Python 3.8 or higher
- A Kaggle account with API access enabled
```bash
git clone https://github.com/aneq05/Diabetes_detection.git
cd Diabetes_detection
```

```bash
python -m venv venv

# Linux / macOS
source venv/bin/activate

# Windows
venv\Scripts\activate
```

```bash
pip install jupyter numpy pandas matplotlib seaborn scikit-learn optuna scipy kagglehub
```

- Go to https://www.kaggle.com/settings → scroll to API → click Create New Token
- A file called `kaggle.json` will be downloaded
- Place it in the default Kaggle credentials directory:

  | OS | Path |
  |---|---|
  | Linux / macOS | `~/.kaggle/kaggle.json` |
  | Windows | `C:\Users\<username>\.kaggle\kaggle.json` |

- Set correct permissions (Linux/macOS only):

  ```bash
  chmod 600 ~/.kaggle/kaggle.json
  ```
```bash
jupyter notebook diabetes_risk_final.ipynb
```

Then run all cells from top to bottom (Kernel → Restart & Run All).
Note: The dataset is downloaded automatically on first run. Subsequent runs reuse the cached copy in `~/.cache/kagglehub/`.
- Exploratory plots (distributions, correlation heatmap, crosstabs)
- Feature selection report (Chi-square / Fisher / T-test results)
- Optuna optimisation progress logs
- Cross-validation evaluation tables
- Confusion matrices, Precision-Recall curves, ROC curves
- Final test-set metrics
- Saved files: `diabetes_risk_model.pkl`, `selected_features.pkl`
Diabetes_detection/
├── diabetes_risk_final.ipynb # Main Jupyter notebook — full ML pipeline
├── diabetes_risk_model.pkl # Serialised final model (generated after running)
├── selected_features.pkl # Selected feature list (generated after running)
└── README.md # This file
After Optuna tuning and 5-fold cross-validation on the training set, all three models achieve strong performance. The final model is selected based on minimising the False Negative Rate (most important in a medical screening context — avoiding missed diagnoses).
| Model | ROC AUC (CV) | Balanced Accuracy (CV) | Recall (CV) |
|---|---|---|---|
| Logistic Regression | ~0.976 | — | — |
| K-Nearest Neighbors | ~0.998 | — | — |
| Random Forest | ~0.998 | — | — |
Random Forest is recommended as the final model due to its superior ability to capture non-linear interactions between symptoms and its lowest False Negative Rate in the medical error analysis.
Exact metric values may vary slightly across runs due to the stochastic nature of Optuna's search and cross-validation random seeds.
This project is open-source and available under the MIT License.