This repository contains a machine learning project for classifying mangoes as "Ripe" or "Rotten" based on image analysis. Using a supervised learning approach, various models are trained and evaluated on a dataset of mango images to distinguish between the two classes based on visual features.
The goal of this project is to build a classifier that can accurately determine whether a mango is ripe or rotten from an image. The project uses a dataset from Kaggle, focusing on its 2,100 images labeled as Ripe or Rotten.
The analysis involves several stages:
- Data Preprocessing: Images are processed to extract meaningful features.
- Feature Extraction: Statistical, color, and texture features are extracted from each image.
- Model Training: Multiple classification models are trained on the extracted features.
- Evaluation: Models are evaluated using various metrics to compare their performance.
The repository is organized as follows:
```
mangifera/
├── requirements.txt          # Project dependencies
├── notebooks/
│   └── mangifera.ipynb       # Main Jupyter notebook with analysis and visualizations
└── src/
    ├── data/                 # Modules for data loading, processing, and visualization
    │   ├── features.py       # Feature extraction and scaling
    │   ├── graphic.py        # Visualization functions
    │   ├── processed.py      # Preprocessing pipeline orchestrator
    │   └── raw.py            # Raw data loading from Kaggle
    └── model/                # Modules for different classification models
        ├── classification.py # Base classification model class
        ├── dnn.py            # Deep Neural Network (Keras)
        ├── forest.py         # Random Forest Classifier
        ├── neural.py         # MLP Neural Network
        ├── regression.py     # Ridge Regression
        └── tree.py           # Decision Tree Classifier
```
- The dataset is automatically downloaded from Kaggle using `kagglehub`.
- Image paths and labels are organized into CSV files for the training and validation sets, separating the `Ripe` and `Rotten` classes.
- Label Encoding: The categorical labels 'Ripe' and 'Rotten' are encoded into binary values (0 and 1) using `LabelEncoder`.
- Feature Scaling: All extracted features are scaled to the [0, 1] range using `MinMaxScaler` to ensure a consistent scale across features.
- Data Balancing: To address the imbalanced nature of the dataset (more `Rotten` images than `Ripe`), class weights are calculated and applied during model training.
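The preprocessing steps above can be sketched with scikit-learn. This is a minimal illustration on toy data, not the project's actual code; the variable names are placeholders:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-ins for the real labels and extracted features
labels = np.array(["Ripe", "Rotten", "Rotten", "Ripe", "Rotten"])
features = np.array([[10.0, 200.0], [5.0, 50.0], [8.0, 120.0],
                     [12.0, 180.0], [6.0, 90.0]])

# Encode 'Ripe'/'Rotten' into 0/1
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Scale every feature column to the [0, 1] range
scaler = MinMaxScaler()
X = scaler.fit_transform(features)

# Class weights to compensate for the Ripe/Rotten imbalance
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))
```

The `class_weight` dictionary can then be passed to estimators (or to Keras's `fit`) so that errors on the minority class are penalized more heavily.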
A comprehensive set of features is extracted from each image to capture its visual characteristics:
- RGB Statistics: Mean and standard deviation for each of the R, G, and B color channels.
- RGB Histograms: 256-bin histograms for each color channel, concatenated into a single feature vector.
- Haralick Texture Features: Calculated from the Gray-Level Co-occurrence Matrix (GLCM), these include:
- Contrast
- Correlation
- Energy
- Homogeneity
- Principal Component Analysis (PCA) is applied to reduce the dimensionality of the feature set while retaining 95% of the variance.
- The models are trained and evaluated on both the full feature set and the PCA-reduced feature set to compare performance.
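The statistical and histogram features, together with the PCA reduction, can be sketched as follows. This runs on synthetic images, omits the Haralick/GLCM texture features for brevity, and the `extract_features` helper is a hypothetical name, not the project's API:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a batch of RGB images, shape (N, H, W, 3)
images = rng.integers(0, 256, size=(20, 32, 32, 3), dtype=np.uint8)

def extract_features(img):
    """Per-channel mean/std plus a 256-bin histogram, concatenated."""
    stats, hists = [], []
    for c in range(3):
        channel = img[..., c].astype(float)
        stats += [channel.mean(), channel.std()]
        hist, _ = np.histogram(channel, bins=256, range=(0, 256))
        hists.append(hist / hist.sum())  # normalize the histogram
    return np.concatenate([stats, *hists])

# 6 statistics + 3 x 256 histogram bins = 774 features per image
X = np.stack([extract_features(img) for img in images])

# Keep the smallest number of components explaining >= 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

Passing a float to `n_components` is what makes scikit-learn pick the component count automatically from the target variance ratio.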
Several classification models were trained and evaluated to find the best-performing one for this task. Hyperparameter tuning was performed using GridSearchCV.
- Ridge Regression (`Ridge`): A regularized linear model used as a baseline.
- Decision Tree (`DecisionTreeClassifier`): A non-linear model that splits data based on feature values.
- Random Forest (`RandomForestClassifier`): An ensemble of decision trees that improves robustness and reduces overfitting.
- MLP Classifier (`MLPClassifier`): A multi-layer perceptron (feed-forward neural network) from scikit-learn.
- Deep Neural Network (DNN): A sequential model built with Keras/TensorFlow, featuring multiple dense layers with `BatchNormalization` and `Dropout` for regularization.
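The `GridSearchCV` tuning can be sketched for the Random Forest as below. The data and parameter grid here are illustrative assumptions, not the project's actual search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy dataset standing in for the extracted image features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hypothetical parameter grid; the real one may differ
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
best_model = search.best_estimator_
```

`class_weight="balanced"` applies the same imbalance correction described in the preprocessing section directly inside the estimator.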
- The models trained on the full extracted feature set significantly outperformed those trained on PCA-reduced data. This suggests that the feature reduction process, while efficient, removed information crucial for classification.
- The Deep Neural Network (DNN) and MLP Classifier achieved the highest validation accuracies, reaching up to 99% and 98%, respectively.
- The Random Forest model also performed exceptionally well, with a validation accuracy of 97%.
- Analysis of the confusion matrices showed that while models were highly accurate in identifying rotten mangoes, they occasionally struggled with ripe ones, indicating the class imbalance still had a minor effect on predictions.
- Simpler models like Ridge Regression and Decision Trees were less effective, highlighting the complexity of the classification task.
- Python 3.8+
- An environment with the packages listed in `requirements.txt`.

1. Clone the repository:

   ```
   git clone https://github.com/dejesusbg/mangifera.git
   cd mangifera
   ```

2. Install the required dependencies:

   ```
   pip install -r requirements.txt
   ```

3. Set up Kaggle API credentials so the script can download the dataset, following the instructions in the Kaggle API documentation.
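For reference, a typical way to install the Kaggle API token (assuming `kaggle.json` has been downloaded from your Kaggle account settings; paths may differ on your system):

```shell
# Place the API token where the Kaggle client looks for it
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json  # the API rejects world-readable tokens
```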
The entire analysis, from data download to model evaluation, can be reproduced by running the `notebooks/mangifera.ipynb` notebook in a Jupyter environment. The notebook is structured to be executed cell-by-cell and includes detailed explanations and visualizations for each step.
