🩺 Diabetes Risk Detection — ML Pipeline


A complete, end-to-end machine learning pipeline for early-stage diabetes risk prediction based on patient-reported symptoms. The pipeline covers everything from automated data acquisition and exploratory data analysis, through statistical feature selection and hyperparameter-optimised model training, to final model serialisation and inference.



📊 Dataset

| Property | Value |
|---|---|
| Source | Early Stage Diabetes Risk Prediction Dataset — Kaggle |
| Samples | 520 patients |
| Features | 16 symptom-based features + age |
| Target | `class`: Positive (diabetic) / Negative (non-diabetic) |
| Class distribution | ~61.5 % Positive / 38.5 % Negative (imbalanced) |

Features

| Feature | Type | Description |
|---|---|---|
| Age | Numerical | Patient age (20–65) |
| Gender | Binary | Male / Female |
| Polyuria | Binary | Excessive urination |
| Polydipsia | Binary | Excessive thirst |
| sudden weight loss | Binary | Sudden weight loss |
| weakness | Binary | General weakness |
| Polyphagia | Binary | Excessive hunger |
| Genital thrush | Binary | Fungal infection |
| visual blurring | Binary | Blurred vision |
| Itching | Binary | Skin itching |
| Irritability | Binary | Irritability |
| delayed healing | Binary | Slow wound healing |
| partial paresis | Binary | Partial muscle weakness |
| muscle stiffness | Binary | Muscle stiffness |
| Alopecia | Binary | Hair loss |
| Obesity | Binary | Obesity |

All binary features are Yes/No values encoded as 1/0 using Label Encoding.


🔄 Pipeline Overview

The notebook diabetes_risk_final.ipynb implements the following sequential steps:

1. Data Acquisition

The dataset is downloaded automatically from Kaggle using the kagglehub library — no manual downloading is required. A valid Kaggle API token (kaggle.json) must be present (see How to Run Locally).

```python
import kagglehub

path = kagglehub.dataset_download("ishandutta/early-stage-diabetes-risk-prediction-dataset")
```

2. Exploratory Data Analysis (EDA)

  • Shape, dtypes, descriptive statistics (data.info(), data.describe())
  • Unique value counts per column — confirms no missing values
  • Distribution plots (count plots) for all categorical features
  • Stacked bar charts showing the relationship between each symptom and the target variable — identifies strong predictors (Polyuria, Polydipsia, Gender)
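The EDA steps above can be sketched with a tiny stand-in frame that mimics the dataset's Yes/No layout (column names follow the real data; the values here are made up for illustration):

```python
import pandas as pd

# Tiny stand-in with the dataset's Yes/No layout (values are illustrative)
data = pd.DataFrame({
    "Polyuria":   ["Yes", "No", "Yes", "Yes"],
    "Polydipsia": ["Yes", "No", "No",  "Yes"],
    "class":      ["Positive", "Negative", "Negative", "Positive"],
})

data.info()                # shape, dtypes, non-null counts
print(data.nunique())      # binary columns -> 2 unique values
print(data.isna().sum())   # confirms no missing values

# Symptom vs. target crosstab -- the basis of the stacked bar charts
ct = pd.crosstab(data["Polyuria"], data["class"], normalize="index")
print(ct)
```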

3. Feature Encoding

  • Label Encoding (scikit-learn LabelEncoder) is applied to all binary categorical features (Yes→1, No→0)
  • Label Encoding is preferred over One-Hot Encoding here: with strictly binary features there is no ordinal ambiguity to introduce, and the feature count stays unchanged
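A minimal sketch of this step on toy data (one `LabelEncoder` per column; `classes_` are sorted alphabetically, so No→0 / Yes→1 and Negative→0 / Positive→1):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "Polyuria": ["Yes", "No", "Yes"],
    "Gender":   ["Male", "Female", "Male"],
    "class":    ["Positive", "Negative", "Positive"],
})

# Fit one encoder per column and keep it, so labels can be
# decoded again (inverse_transform) at inference time
encoders = {}
for col in data.columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    encoders[col] = le

print(data)
```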

4. Correlation Analysis

  • A Pearson correlation heatmap (seaborn) is produced on the encoded dataset
  • Identifies potential multicollinearity: Polyuria and Polydipsia show a correlation of ~0.6
  • Notable: Gender and Alopecia are negatively correlated with the target
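The heatmap's input is simply `DataFrame.corr()` on the encoded frame; a toy sketch with made-up 0/1 values:

```python
import pandas as pd

# Toy encoded frame; in the notebook this is the full Label-Encoded dataset
encoded = pd.DataFrame({
    "Polyuria":   [1, 1, 0, 0, 1, 0],
    "Polydipsia": [1, 1, 0, 1, 1, 0],
    "class":      [1, 1, 0, 0, 1, 0],
})

corr = encoded.corr(method="pearson")
print(corr.round(2))

# To reproduce the heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```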

5. Train / Test Split

Total: 520 samples → Train: 80% (416) | Test: 20% (104)
Stratified split to preserve class distribution

The split happens before feature selection to prevent data leakage.
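A sketch of the split, using stand-in data with the same size and class balance (the `random_state` value is an assumption for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(520, 16))   # stand-in for 16 encoded symptoms
y = np.array([1] * 320 + [0] * 200)      # ~61.5 % positive, as in the dataset

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,     # 104 of 520 samples held out
    stratify=y,         # preserve the class ratio in both splits
    random_state=42,    # assumed seed
)
print(len(X_train), len(X_test))
```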

6. Statistical Feature Selection (on training set only)

Two types of tests are applied separately depending on feature type:

Categorical features — Chi-square / Fisher's Exact Test:

  • Chi-square test (or Fisher's Exact Test when expected cell frequencies < 5)
  • Selection criterion: p < 0.05 AND Cramér's V > 0.1 (at least weak association)
  • Cramér's V quantifies the strength of association beyond statistical significance

Numerical feature (Age) — T-test:

  • Normality checked with Shapiro-Wilk test + histogram + Q-Q plot
  • Independent samples T-test for Age by class
  • Confirmed significant difference (p < 0.05); Age is retained

All tests are computed exclusively on the training set to avoid data leakage.

7. Hyperparameter Optimisation with Optuna

Each model's hyperparameters are tuned using Optuna (Bayesian/TPE optimisation) with 5-fold Stratified Cross-Validation on the training set, optimising ROC AUC.

| Model | Tuned Hyperparameters |
|---|---|
| Logistic Regression | `C`, `solver`, `penalty` |
| K-Nearest Neighbors | `n_neighbors`, `weights`, `p` (Manhattan/Euclidean) |
| Random Forest | `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features` |

8. Model Training & Cross-Validation Evaluation

Three scikit-learn Pipeline objects are built (each with StandardScaler → classifier). All models are evaluated via 5-fold Stratified Cross-Validation on the training set using:

| Metric | Description |
|---|---|
| Accuracy | Overall correct predictions |
| Balanced Accuracy | Accuracy adjusted for class imbalance |
| Precision | Ratio of true positives among predicted positives |
| Recall (Sensitivity) | Ratio of true positives among actual positives |
| F1-Score | Harmonic mean of Precision and Recall |
| ROC AUC | Area Under the ROC Curve |
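A minimal sketch of the evaluation step (synthetic data stands in for the selected training features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=16,
                           weights=[0.4, 0.6], random_state=0)

# Scaler first, classifier second -- one Pipeline per model in the notebook
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

scoring = ["accuracy", "balanced_accuracy", "precision",
           "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
for m in scoring:
    print(f"{m:17s} {scores['test_' + m].mean():.3f}")
```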

9. Visualisations

  • Confusion matrices for each model (CV predictions on train set)
  • Precision-Recall curves with Average Precision scores
  • ROC curves with AUC values

10. Type I / Type II Error Analysis

Given the medical context, Type II errors (False Negatives) are most dangerous — failing to detect a diabetic patient. The analysis quantifies False Negative Rate (FNR) and False Positive Rate (FPR) per model and recommends the model with the lowest FNR.
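The two error rates fall straight out of the confusion matrix; a sketch with made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])   # illustrative labels
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])   # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)   # Type II: missed diabetic patients (most dangerous)
fpr = fp / (fp + tn)   # Type I: healthy patients flagged as diabetic
print(f"FNR={fnr:.3f}, FPR={fpr:.3f}")
```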

11. Statistical Model Comparison

The Wilcoxon signed-rank test is applied to compare model pairs based on their 5 cross-validation ROC AUC scores. This non-parametric test determines whether performance differences are statistically significant (caveat: only 5 paired observations limits statistical power).
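The comparison itself is a one-liner with `scipy.stats.wilcoxon`; the per-fold AUC values below are illustrative, and the result shows exactly why power is limited with 5 folds:

```python
from scipy.stats import wilcoxon

# Per-fold ROC AUC scores for two models (illustrative values)
auc_rf  = [0.990, 0.985, 0.998, 0.992, 0.980]
auc_knn = [0.970, 0.960, 0.990, 0.980, 0.965]

stat, p = wilcoxon(auc_rf, auc_knn)
print(f"statistic={stat}, p={p:.4f}")
# With 5 paired folds, the smallest attainable two-sided p is 2/32 = 0.0625,
# so p < 0.05 can never be reached -- hence the caveat about power.
```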

12. Final Model Training & Serialisation

The best-performing model is retrained on the full training set and evaluated on the held-out test set (first and only time the test set is used). The final model and selected feature list are saved:

```
diabetes_risk_model.pkl   — serialised sklearn Pipeline
selected_features.pkl     — list of feature names used by the model
```

13. Inference Example

The notebook demonstrates loading the saved model and running predictions on new samples, printing predicted class and class probabilities.
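A self-contained sketch of the save/load/predict round trip (a small pipeline fitted on synthetic data stands in for the tuned final model):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the final model the notebook serialises
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))]).fit(X, y)

with open("diabetes_risk_model.pkl", "wb") as f:
    pickle.dump(pipe, f)

# --- inference on a "new" patient row, as at the end of the notebook ---
with open("diabetes_risk_model.pkl", "rb") as f:
    model = pickle.load(f)

sample = X[:1]
print("predicted class:", model.predict(sample)[0])
print("probabilities:  ", model.predict_proba(sample)[0])
```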


🤖 Machine Learning Algorithms

Logistic Regression

A linear classifier that models the log-odds of the target class as a linear combination of features. Uses class_weight='balanced' to handle class imbalance. L1/L2 regularisation tuned via Optuna.

K-Nearest Neighbors (KNN)

A non-parametric instance-based learner that classifies a sample by majority vote among its k nearest neighbours. Distance metric (Manhattan/Euclidean) and weighting scheme are tuned. Does not support class_weight; imbalance is partially mitigated via stratified splits and AUC-based evaluation.

Random Forest

An ensemble of decision trees using bootstrap aggregation (bagging) and random feature subsets. Uses class_weight='balanced' to account for class imbalance. Typically the best performer due to its ability to capture non-linear feature interactions.

All algorithms are wrapped in scikit-learn Pipeline objects with StandardScaler as the first step (scaling is critical for Logistic Regression and KNN; harmless for Random Forest).


🛠️ Technologies & Libraries

| Library | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Programming language |
| jupyter | | Interactive notebook environment |
| numpy | | Numerical computing |
| pandas | | Data manipulation and analysis |
| matplotlib | | Plotting |
| seaborn | | Statistical data visualisation |
| scikit-learn | 1.x | ML models, preprocessing, evaluation, pipelines |
| optuna | | Automated hyperparameter optimisation (TPE) |
| scipy | | Statistical tests (Chi-square, Fisher, T-test, Wilcoxon, Shapiro-Wilk) |
| kagglehub | | Automatic Kaggle dataset download |
| pickle | stdlib | Model serialisation |

🚀 How to Run Locally

Prerequisites

1. Clone the repository

```bash
git clone https://github.com/aneq05/Diabetes_detection.git
cd Diabetes_detection
```

2. Create and activate a virtual environment (recommended)

```bash
python -m venv venv
# Linux / macOS
source venv/bin/activate
# Windows
venv\Scripts\activate
```

3. Install dependencies

```bash
pip install jupyter numpy pandas matplotlib seaborn scikit-learn optuna scipy kagglehub
```

4. Configure Kaggle API credentials

  1. Go to https://www.kaggle.com/settings → scroll to API → click Create New Token

  2. A file called kaggle.json will be downloaded

  3. Place it in the default Kaggle credentials directory:

    | OS | Path |
    |---|---|
    | Linux / macOS | `~/.kaggle/kaggle.json` |
    | Windows | `C:\Users\<username>\.kaggle\kaggle.json` |
  4. Set correct permissions (Linux/macOS only):

    ```bash
    chmod 600 ~/.kaggle/kaggle.json
    ```

5. Launch the notebook

```bash
jupyter notebook diabetes_risk_final.ipynb
```

Then run all cells from top to bottom (Kernel → Restart & Run All).

Note: The dataset is downloaded automatically on first run. Subsequent runs reuse the cached copy in ~/.cache/kagglehub/.

Expected outputs

  • Exploratory plots (distributions, correlation heatmap, crosstabs)
  • Feature selection report (Chi-square / Fisher / T-test results)
  • Optuna optimisation progress logs
  • Cross-validation evaluation tables
  • Confusion matrices, Precision-Recall curves, ROC curves
  • Final test-set metrics
  • Saved files: diabetes_risk_model.pkl, selected_features.pkl

📁 Project Structure

```
Diabetes_detection/
├── diabetes_risk_final.ipynb   # Main Jupyter notebook — full ML pipeline
├── diabetes_risk_model.pkl     # Serialised final model (generated after running)
├── selected_features.pkl       # Selected feature list (generated after running)
└── README.md                   # This file
```

📈 Results

After Optuna tuning and 5-fold cross-validation on the training set, all three models achieve strong performance. The final model is selected based on minimising the False Negative Rate (most important in a medical screening context — avoiding missed diagnoses).

| Model | ROC AUC (CV) | Balanced Accuracy (CV) | Recall (CV) |
|---|---|---|---|
| Logistic Regression | ~0.976 | | |
| K-Nearest Neighbors | ~0.998 | | |
| Random Forest | ~0.998 | | |

Random Forest is recommended as the final model due to its superior ability to capture non-linear interactions between symptoms and its lowest False Negative Rate in the medical error analysis.

Exact metric values may vary slightly across runs due to the stochastic nature of Optuna's search and cross-validation random seeds.


📄 License

This project is open-source and available under the MIT License.
