A full end-to-end supervised classification pipeline using scikit-learn, supporting multiple datasets and 7 classifiers with cross-validation, optional hyperparameter tuning, and rich output artefacts.
# path
The path for the project is: C:\Users\prana\Decode_Labs_Data_Classification
# Install requirements
pip install -r requirements.txt
# Run on Iris (default)
python main.py
# Run on Wine dataset with only two models
python main.py --dataset wine --models random_forest svm
# Run on Breast Cancer with hyperparameter tuning + 10-fold CV
python main.py --dataset breast_cancer --tune --cv 10
# Custom CSV (last column = target)
python main.py --dataset data/my_data.csv
# Run tests
python -m pytest test_classification.py -v

project2_data_classification/
│
├── data/
│   └── my_data.csv
├── main.py                  ← CLI entry point & pipeline orchestrator
├── data_loader.py           ← Dataset loading, validation, train/test split, scaling
├── models.py                ← Registry of 7 classifiers with hyper-parameter grids
├── trainer.py               ← Fit, cross-validate, tune (GridSearchCV), evaluate
├── evaluator.py             ← Multi-model comparison table + charts
├── test_classification.py   ← 20+ unit & integration tests (pytest)
├── requirements.txt
└── results/                 ← Auto-created at runtime
    ├── comparison.csv
    ├── results.json
    ├── model_comparison.png
    └── cm_<model>.png       (one per model)
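The commands above imply a small command-line interface in main.py. The following is a minimal sketch of how such a parser could be written with argparse, not the project's actual code. The flags `--dataset`, `--models`, `--tune`, and `--cv` come from the commands shown earlier; the default of 5 folds and every model key other than `random_forest` and `svm` are assumptions made for illustration.

```python
import argparse

# "random_forest" and "svm" appear in the example commands;
# the other five keys are assumed names, not confirmed by the project.
MODEL_NAMES = [
    "logistic_regression", "decision_tree", "random_forest",
    "gradient_boosting", "svm", "knn", "naive_bayes",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags used in the example commands."""
    parser = argparse.ArgumentParser(description="Run the classification pipeline.")
    # Either a built-in dataset name (iris, wine, breast_cancer) or a CSV path.
    parser.add_argument("--dataset", default="iris")
    # One or more model keys; when omitted, every model is selected.
    parser.add_argument("--models", nargs="+", choices=MODEL_NAMES, default=MODEL_NAMES)
    # Hyperparameter tuning is off unless the flag is given.
    parser.add_argument("--tune", action="store_true", help="enable hyperparameter tuning")
    # Number of cross-validation folds (the default of 5 is an assumption).
    parser.add_argument("--cv", type=int, default=5, help="number of cross-validation folds")
    return parser
```

Calling `build_parser().parse_args(["--dataset", "wine", "--models", "random_forest", "svm"])` would yield a namespace with `dataset="wine"` and a two-element `models` list, mirroring the Wine command above.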
The Data Classification Pipeline is a machine learning system that automates the classification of structured datasets using multiple supervised learning algorithms. It performs data preprocessing, feature scaling, model training, cross-validation, performance evaluation, and visualization of classification results.
The objective of this project is to create a scalable and reusable classification framework capable of evaluating multiple machine learning models and identifying the best-performing classifier for a given dataset.
- Python 3.11
- Pandas
- NumPy
- Scikit-Learn
- Matplotlib
- Joblib
- Logging Framework
- Dataset Loading
- Data Validation
- Data Preprocessing
- Train-Test Splitting
- Feature Scaling
- Model Training
- Cross Validation
- Model Evaluation
- Confusion Matrix Generation
- Results Visualization and Storage
- Analyzed project requirements and objectives.
- Designed overall system architecture.
- Planned modular folder structure.
- Selected machine learning algorithms for implementation.
- Defined dataset handling strategy.
- Project architecture finalized.
- Development roadmap established.
- Implemented dataset loading functionality.
- Added CSV dataset support.
- Developed automatic target column identification.
- Created dataset metadata extraction methods.
- Implemented feature-target separation logic.
- Functional Data Loader module completed.
- Dataset preprocessing workflow initiated.
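The data loader items above (CSV support, target column identification, metadata extraction, feature-target separation) can be sketched as follows. This is an illustrative outline rather than the contents of data_loader.py: the function name `load_csv_dataset` and the metadata keys are hypothetical, and the only rule taken from the project is that the last column is treated as the target.

```python
import io

import pandas as pd


def load_csv_dataset(source):
    """Read a CSV and split it into features X and target y (last column)."""
    df = pd.read_csv(source)
    # Basic validation: need at least one feature column plus the target.
    if df.shape[1] < 2:
        raise ValueError("Need at least one feature column and one target column.")
    if df.empty:
        raise ValueError("Dataset has no rows.")
    X = df.iloc[:, :-1]   # every column except the last
    y = df.iloc[:, -1]    # last column is the target
    metadata = {
        "n_samples": len(df),
        "n_features": X.shape[1],
        "feature_names": list(X.columns),
        "target_name": y.name,
        "classes": sorted(y.unique().tolist()),
    }
    return X, y, metadata


# Tiny in-memory CSV standing in for a file such as data/my_data.csv.
sample = io.StringIO(
    "sepal_len,sepal_wid,species\n"
    "5.1,3.5,setosa\n"
    "6.2,2.9,versicolor\n"
    "5.9,3.0,virginica\n"
)
X, y, meta = load_csv_dataset(sample)
print(meta)
```

Because `pd.read_csv` accepts both paths and file-like objects, the same function works on a real file path and on the in-memory sample used here.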
- Implemented train-test splitting mechanism.
- Added feature scaling using StandardScaler.
- Developed preprocessing pipeline.
- Configured reproducibility using random states.
- Integrated target label encoding functionality.
- Complete preprocessing module developed.
- Standardized feature transformation workflow established.
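The preprocessing steps listed above (label encoding, train-test split with a fixed random state, StandardScaler) might be combined roughly as shown below. This is a sketch under assumptions: the function name `preprocess`, the seed value 42, the 80/20 split, and the use of stratification are illustrative choices, not details confirmed by the project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

RANDOM_STATE = 42  # assumed seed; any fixed integer makes the split reproducible


def preprocess(X, y, test_size=0.2):
    """Encode labels, split the data, and scale features."""
    # Map class labels (strings or ints) to integers 0..n_classes-1.
    encoder = LabelEncoder()
    y_encoded = encoder.fit_transform(y)

    # Stratify so both splits keep the original class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_encoded, test_size=test_size,
        random_state=RANDOM_STATE, stratify=y_encoded,
    )

    # Fit the scaler on the training split only, then reuse its statistics
    # on the test split so no test information leaks into training.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test, scaler, encoder


iris = load_iris()
labels = iris.target_names[iris.target]  # string labels, to exercise the encoder
X_train, X_test, y_train, y_test, scaler, encoder = preprocess(iris.data, labels)
print(X_train.shape, X_test.shape, list(encoder.classes_))
```

Returning the fitted `scaler` and `encoder` lets later steps transform new samples and map predicted integers back to the original class names with `encoder.inverse_transform`.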
- Integrated Logistic Regression.
- Integrated Decision Tree Classifier.
- Integrated Random Forest Classifier.
- Integrated Gradient Boosting Classifier.
- Integrated Support Vector Machine (SVM).
- Integrated K-Nearest Neighbors (KNN).
- Integrated Naive Bayes Classifier.
- Multi-model classification framework completed.
- Automated model evaluation pipeline prepared.
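One way to organise the seven classifiers above is a dictionary that pairs each estimator with a parameter grid, which GridSearchCV can then search when tuning is enabled. The sketch below is an assumption about how models.py and the tuning step could look: the registry name, the keys (apart from `random_forest` and `svm`, which appear in the CLI examples), the grids, the seed, and the choice of `GaussianNB` as the Naive Bayes variant are all illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # assumed seed for the estimators that accept random_state

# name -> (unfitted estimator, hyperparameter grid); grids are illustrative only.
MODEL_REGISTRY = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=SEED), {"max_depth": [None, 3, 5]}),
    "random_forest": (RandomForestClassifier(random_state=SEED), {"n_estimators": [50, 100]}),
    "gradient_boosting": (GradientBoostingClassifier(random_state=SEED),
                          {"learning_rate": [0.05, 0.1]}),
    "svm": (SVC(random_state=SEED), {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "naive_bayes": (GaussianNB(), {"var_smoothing": [1e-9, 1e-8]}),
}


def tune(name, X, y, cv=5):
    """Grid-search one registered model and return the best fit found."""
    estimator, grid = MODEL_REGISTRY[name]
    # GridSearchCV clones the estimator, so the registry entry stays unfitted.
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, search.best_score_


X, y = load_iris(return_X_y=True)
best_model, best_params, best_score = tune("knn", X, y)
print(best_params, round(best_score, 3))
```

Keeping the grids next to the estimators means adding a model is a one-line change, and a flag such as `--tune` only has to decide whether `tune` or a plain `fit` is called.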
- Implemented model evaluation metrics.
- Added Accuracy Score calculations.
- Added Cross Validation support.
- Developed Confusion Matrix generation.
- Created visualization export functionality.
- Configured results storage mechanism.
- Evaluation framework completed.
- Visualization module operational.
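The evaluation steps above (accuracy, cross-validation, confusion matrices, a comparison table) can be illustrated with the sketch below. It is not the project's evaluator.py: the function name `evaluate`, the column names, and the two models chosen for the demonstration are assumptions, and the plotting and file-writing steps are left out to keep the example short.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(models, X_train, X_test, y_train, y_test, cv=5):
    """Cross-validate, fit, and test each model; return a table and matrices."""
    rows, matrices = [], {}
    for name, model in models.items():
        # Cross-validation runs on the training split only.
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "model": name,
            "cv_mean": cv_scores.mean(),
            "cv_std": cv_scores.std(),
            "test_accuracy": accuracy_score(y_test, y_pred),
        })
        # Rows are true classes, columns are predicted classes.
        matrices[name] = confusion_matrix(y_test, y_pred)
    table = pd.DataFrame(rows).sort_values("test_accuracy", ascending=False)
    return table.reset_index(drop=True), matrices


X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "naive_bayes": GaussianNB(),
}
table, matrices = evaluate(models, X_train, X_test, y_train, y_test)
print(table)
```

From here, `table.to_csv(...)` could produce a file like results/comparison.csv, and `sklearn.metrics.ConfusionMatrixDisplay` could render each matrix to a PNG such as cm_<model>.png.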
- Fixed LabelEncoder initialization issue.
- Corrected target encoding workflow.
- Resolved logging configuration errors.
- Fixed dataset loading exceptions.
- Addressed train-test split edge cases.
- Performed end-to-end pipeline testing.
- Generated confusion matrix reports.
- Verified successful execution across all classification models.
- Stable and fully functional classification pipeline.
- Successful generation of evaluation reports and visualizations.
- CSV Dataset Support
- Automatic Target Detection
- Label Encoding
- Feature Scaling
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- Naive Bayes Classifier
- Accuracy Measurement
- Cross Validation
- Model Comparison
- Confusion Matrix Visualization
- Automated Results Storage
- Performance Charts
- Confusion Matrix Export
- Logging System
- Successfully loaded and processed datasets.
- Trained multiple machine learning classification models.
- Generated evaluation metrics for performance comparison.
- Created confusion matrix visualizations for each model.
- Automatically stored outputs in the results directory.
- Achieved successful end-to-end execution of the classification workflow.
- XGBoost Integration
- LightGBM Integration
- Hyperparameter Optimization using Optuna
- SHAP Explainable AI
- Streamlit Dashboard
- FastAPI Deployment
- Docker Containerization
- MLflow Experiment Tracking
- Automated Model Selection
- PDF Report Generation
The Data Classification Pipeline demonstrates an end-to-end machine learning workflow covering dataset preprocessing, model training, evaluation, and visualization. Through systematic development, debugging, and optimization during Week 2, the project evolved into a stable classification framework for structured datasets, with a modular design that leaves room for future enhancements.