This repository contains a machine learning project for classifying mangoes as "Ripe" or "Rotten" based on image analysis. Using a supervised learning approach, various models are trained and evaluated on a dataset of mango images to distinguish between the two classes based on visual features.
The goal of this project is to build a classifier that can accurately determine whether a mango is ripe or rotten from an image. The project uses a dataset from Kaggle, focusing on its 2,100 images labeled as Ripe or Rotten.
The analysis involves several stages:
- Data Preprocessing: Images are processed to extract meaningful features.
- Feature Extraction: Statistical, color, and texture features are extracted from each image.
- Model Training: Multiple classification models are trained on the extracted features.
- Evaluation: Models are evaluated using various metrics to compare their performance.
The repository is organized as follows:
```
mangifera/
├── requirements.txt          # Project dependencies
├── notebooks/
│   └── mangifera.ipynb       # Main Jupyter notebook with analysis and visualizations
└── src/
    ├── data/                 # Modules for data loading, processing, and visualization
    │   ├── features.py       # Feature extraction and scaling
    │   ├── graphic.py        # Visualization functions
    │   ├── processed.py      # Preprocessing pipeline orchestrator
    │   └── raw.py            # Raw data loading from Kaggle
    └── model/                # Modules for different classification models
        ├── classification.py # Base classification model class
        ├── dnn.py            # Deep Neural Network (Keras)
        ├── forest.py         # Random Forest Classifier
        ├── neural.py         # MLP Neural Network
        ├── regression.py     # Ridge Regression
        └── tree.py           # Decision Tree Classifier
```
- The dataset is automatically downloaded from Kaggle using `kagglehub`.
- Image paths and labels are organized into CSV files for the training and validation sets, separating the `Ripe` and `Rotten` classes.
- Label Encoding: The categorical labels 'Ripe' and 'Rotten' are encoded into binary values (0 and 1) using `LabelEncoder`.
- Feature Scaling: All extracted features are scaled to the [0, 1] range using `MinMaxScaler` to ensure a consistent scale across features.
- Data Balancing: To address the imbalanced nature of the dataset (more `Rotten` images than `Ripe`), class weights are calculated and applied during model training.
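The preprocessing steps above can be sketched with scikit-learn. This is a minimal illustration on toy data, not the project's actual code; the variable names are placeholders:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-ins for the real labels and extracted features
labels = np.array(["Ripe", "Rotten", "Rotten", "Ripe", "Rotten"])
features = np.array([[10.0, 200.0], [5.0, 50.0], [8.0, 120.0],
                     [12.0, 180.0], [6.0, 90.0]])

# Encode 'Ripe'/'Rotten' into 0/1
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Scale every feature column to the [0, 1] range
scaler = MinMaxScaler()
X = scaler.fit_transform(features)

# Class weights to compensate for the Ripe/Rotten imbalance
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))
```

The `class_weight` dictionary can then be passed to estimators (or to Keras's `fit`) so that errors on the minority class are penalized more heavily.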
A comprehensive set of features is extracted from each image to capture its visual characteristics:
- RGB Statistics: Mean and standard deviation for each of the R, G, and B color channels.
- RGB Histograms: 256-bin histograms for each color channel, concatenated into a single feature vector.
- Haralick Texture Features: Calculated from the Gray-Level Co-occurrence Matrix (GLCM), these include:
- Contrast
- Correlation
- Energy
- Homogeneity
- Principal Component Analysis (PCA) is applied to reduce the dimensionality of the feature set while retaining 95% of the variance.
- The models are trained and evaluated on both the full feature set and the PCA-reduced feature set to compare performance.
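The statistical and histogram features, together with the PCA reduction, can be sketched as follows. This runs on synthetic images, omits the Haralick/GLCM texture features for brevity, and the `extract_features` helper is a hypothetical name, not the project's API:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a batch of RGB images, shape (N, H, W, 3)
images = rng.integers(0, 256, size=(20, 32, 32, 3), dtype=np.uint8)

def extract_features(img):
    """Per-channel mean/std plus a 256-bin histogram, concatenated."""
    stats, hists = [], []
    for c in range(3):
        channel = img[..., c].astype(float)
        stats += [channel.mean(), channel.std()]
        hist, _ = np.histogram(channel, bins=256, range=(0, 256))
        hists.append(hist / hist.sum())  # normalize the histogram
    return np.concatenate([stats, *hists])

# 6 statistics + 3 x 256 histogram bins = 774 features per image
X = np.stack([extract_features(img) for img in images])

# Keep the smallest number of components explaining >= 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

Passing a float to `n_components` is what makes scikit-learn pick the component count automatically from the target variance ratio.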
Several classification models were trained and evaluated to find the best-performing one for this task. Hyperparameter tuning was performed using GridSearchCV.
- Ridge Regression (`Ridge`): A regularized linear model used as a baseline.
- Decision Tree (`DecisionTreeClassifier`): A non-linear model that splits data based on feature values.
- Random Forest (`RandomForestClassifier`): An ensemble of decision trees that improves robustness and reduces overfitting.
- MLP Classifier (`MLPClassifier`): A multi-layer perceptron (feed-forward neural network) from scikit-learn.
- Deep Neural Network (DNN): A sequential model built with Keras/TensorFlow, featuring multiple dense layers with `BatchNormalization` and `Dropout` for regularization.
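The `GridSearchCV` tuning can be sketched for the Random Forest as below. The data and parameter grid here are illustrative assumptions, not the project's actual search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy dataset standing in for the extracted image features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hypothetical parameter grid; the real one may differ
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
best_model = search.best_estimator_
```

`class_weight="balanced"` applies the same imbalance correction described in the preprocessing section directly inside the estimator.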
- The models trained on the full extracted feature set significantly outperformed those trained on PCA-reduced data. This suggests that the feature reduction process, while efficient, removed information crucial for classification.
- The Deep Neural Network (DNN) and MLP Classifier achieved the highest validation accuracies, reaching up to 99% and 98%, respectively.
- The Random Forest model also performed exceptionally well, with a validation accuracy of 97%.
- Analysis of the confusion matrices showed that while models were highly accurate in identifying rotten mangoes, they occasionally struggled with ripe ones, indicating the class imbalance still had a minor effect on predictions.
- Simpler models like Ridge Regression and Decision Trees were less effective, highlighting the complexity of the classification task.
- Python 3.8+
- An environment with the packages listed in `requirements.txt`.

1. Clone the repository:

   ```
   git clone https://github.com/dejesusbg/mangifera.git
   cd mangifera
   ```

2. Install the required dependencies:

   ```
   pip install -r requirements.txt
   ```

3. Set up Kaggle API credentials so the script can download the dataset, following the instructions in the Kaggle API documentation.
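For reference, a typical way to install the Kaggle API token (assuming `kaggle.json` has been downloaded from your Kaggle account settings; paths may differ on your system):

```shell
# Place the API token where the Kaggle client looks for it
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json  # the API rejects world-readable tokens
```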
The entire analysis, from data download to model evaluation, can be reproduced by running the `notebooks/mangifera.ipynb` notebook in a Jupyter environment. The notebook is structured to be executed cell-by-cell and includes detailed explanations and visualizations for each step.
