🤖 Machine Learning with Python

A collection of end-to-end machine learning classification projects built with Python and scikit-learn, covering two real-world datasets: one from healthcare and one environmental. Each notebook walks through the complete ML pipeline, from raw data exploration and preprocessing to training, evaluating, and comparing 7 classification models.


📂 Repository Structure

Machine_learning_with_python/
├── Diabetes_Preprocessing_7_models.ipynb        # Diabetes prediction — full ML pipeline
├── water_potability_prediction_7_models.ipynb   # Water potability prediction — full ML pipeline
└── water_potability.csv                         # Dataset for the water potability project

📓 Projects Overview

🩺 Project 1 — Diabetes Prediction

Notebook: Diabetes_Preprocessing_7_models.ipynb

Predicts whether a patient is diabetic based on diagnostic measurements. Uses the well-known Pima Indians Diabetes Dataset.

Dataset Features

| Feature | Description |
|---|---|
| Pregnancies | Number of pregnancies |
| Glucose | Plasma glucose concentration (2-hour OGTT) |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (μU/mL) |
| BMI | Body mass index (weight in kg / (height in m)²) |
| DiabetesPedigreeFunction | Diabetes pedigree function (genetic risk) |
| Age | Age in years |
| Outcome | Target (1: Diabetic, 0: Non-diabetic) |

Pipeline Steps

  • Exploratory Data Analysis (EDA) — distributions, correlations, class balance
  • Handling biologically invalid zero values (e.g., zero Glucose, zero BMI) → replaced with NaN
  • Missing value imputation (mean/median)
  • Feature scaling — StandardScaler / MinMaxScaler
  • Train/test split
  • Training and evaluating 7 classification models
  • Comparing model performance with metrics and visualizations
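
The zero-handling and imputation steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the sample values are made up, the `zero_invalid` column list is an assumption, and median imputation is only one of the two options (mean/median) the list mentions.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample using the dataset's column names (values are illustrative only).
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero cannot be a real measurement (assumed list).
zero_invalid = ["Glucose", "BloodPressure", "BMI"]

# Step 1: mark biologically invalid zeros as missing.
df[zero_invalid] = df[zero_invalid].replace(0, np.nan)

# Step 2: fill each missing value with that column's median.
df[zero_invalid] = df[zero_invalid].fillna(df[zero_invalid].median())

print(df)
```

Pregnancies is deliberately left out of `zero_invalid`, since zero is a valid value there.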

💧 Project 2 — Water Potability Prediction

Notebook: water_potability_prediction_7_models.ipynb
Dataset: water_potability.csv

Predicts whether water is safe to drink based on physicochemical measurements.

Dataset Features

| Feature | Description |
|---|---|
| ph | pH level of water (0–14) |
| Hardness | Water hardness (mg/L) |
| Solids | Total dissolved solids (ppm) |
| Chloramines | Chloramines concentration (ppm) |
| Sulfate | Sulfate concentration (mg/L) |
| Conductivity | Electrical conductivity (μS/cm) |
| Organic_carbon | Organic carbon (ppm) |
| Trihalomethanes | Trihalomethanes concentration (μg/L) |
| Turbidity | Water turbidity (NTU) |
| Potability | Target (1: Potable, 0: Not potable) |

Pipeline Steps

  • EDA — feature distributions, missing values analysis, class imbalance
  • Missing value imputation
  • Feature scaling and normalization
  • Train/test split
  • Training and evaluating 7 classification models
  • Model comparison with evaluation metrics
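
One way to combine the imputation, scaling, and split steps above is sketched below. It is an assumption-laden sketch and not taken from the notebook: the data is synthetic, the median strategy and `StandardScaler` are illustrative choices, and the split is done first so that the imputer and scaler are fitted on the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the real water_potability.csv).
rng = np.random.default_rng(0)
X = rng.normal(loc=7.0, scale=1.5, size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan      # knock out roughly 10% of the values
y = rng.integers(0, 2, size=100)           # synthetic binary target

# Split first, keeping the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Imputer and scaler learn their statistics from the training split only.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)  # reuse the training statistics

print(X_train_prep.shape, X_test_prep.shape)
```

Fitting on the training split only keeps information from the test rows out of the preprocessing statistics.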

🧠 Models Used (Both Projects)

Both notebooks train and compare the following 7 classifiers using scikit-learn:

| # | Model | Type |
|---|---|---|
| 1 | Logistic Regression | Linear |
| 2 | K-Nearest Neighbors (KNN) | Instance-based |
| 3 | Support Vector Machine (SVM) | Kernel-based |
| 4 | Decision Tree | Tree-based |
| 5 | Random Forest | Ensemble (Bagging) |
| 6 | Gradient Boosting | Ensemble (Boosting) |
| 7 | Naive Bayes | Probabilistic |
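
As a rough sketch of how the seven classifiers might be trained and compared in one loop, the example below uses synthetic data from `make_classification`. The hyperparameters, the `models` dictionary, and the choice of `GaussianNB` as the Naive Bayes variant are assumptions for illustration; the notebooks may configure things differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for either dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One entry per model in the table above (settings are illustrative defaults).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),  # assumed variant
}

# Fit each model and record its test-set accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)

# Print models from best to worst accuracy.
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.3f}")
```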

📊 Evaluation Metrics

Each model is evaluated using:

  • Accuracy — overall correct predictions
  • Precision — how many predicted positives are actually positive
  • Recall — how many actual positives were correctly identified
  • F1-Score — harmonic mean of Precision and Recall
  • Confusion Matrix — visualizes TP, TN, FP, FN
  • ROC curve and AUC — how well the model separates the two classes across decision thresholds
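
To make the relationship between these metrics concrete, here is a small worked sketch that derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. The helper `classification_metrics` and the counts passed to it are hypothetical; they are not results from either notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute basic binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 40 true positives, 45 true negatives, 10 false positives, 5 false negatives.
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name:<10} {value:.3f}")
```

In practice, `sklearn.metrics` provides `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`, which compute the same quantities directly from label arrays.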

🛠️ Tech Stack

| Library | Purpose |
|---|---|
| pandas | Data loading, manipulation, preprocessing |
| numpy | Numerical operations |
| matplotlib | Data visualization and plots |
| seaborn | Statistical visualizations |
| scikit-learn | ML models, preprocessing, evaluation metrics |
| Jupyter Notebook | Interactive development environment |

⚙️ Setup & Installation

1. Clone the repository:

git clone https://github.com/Mortezamohasebati/Machine_learning_with_python.git
cd Machine_learning_with_python

2. Install required packages:

pip install pandas numpy matplotlib seaborn scikit-learn jupyter

3. Launch Jupyter Notebook:

jupyter notebook

4. Open either notebook and run all cells (Kernel → Restart & Run All)


🔄 ML Pipeline Overview

Both projects follow this standard pipeline:

Raw Data
   │
   ▼
Exploratory Data Analysis (EDA)
   │
   ▼
Data Preprocessing
 ├─ Handle missing / invalid values
 ├─ Feature scaling (StandardScaler / MinMaxScaler)
 └─ Train / Test split (typically 80/20)
   │
   ▼
Model Training (7 classifiers)
   │
   ▼
Evaluation & Comparison
 ├─ Accuracy, Precision, Recall, F1
 ├─ Confusion Matrix
 └─ ROC curve and AUC
   │
   ▼
Best Model Selection

📚 Concepts Covered

  • Binary classification
  • Exploratory Data Analysis (EDA)
  • Data cleaning — handling missing and invalid values
  • Feature scaling — StandardScaler, MinMaxScaler
  • Supervised learning — 7 classification algorithms
  • Model evaluation — Accuracy, Precision, Recall, F1, ROC-AUC
  • Confusion Matrix interpretation
  • Comparing multiple models on the same dataset

📜 License

This project is open source and free to use for educational and research purposes.