Data Preprocessing in Machine Learning

Overview

Data preprocessing is a fundamental step in machine learning (ML): it cleans, transforms, and prepares raw data for analysis and modeling. This project demonstrates essential preprocessing techniques using Python and the scikit-learn library.

Features

- Handling missing data using SimpleImputer.
- Encoding categorical variables (both independent and dependent variables).
- Splitting the dataset into training and test sets.
- Applying feature scaling (standardization with StandardScaler) so that features measured on different scales contribute comparably to a model.

Prerequisites

Before running the code, ensure you have the following libraries installed:

pip install numpy pandas scikit-learn matplotlib

Usage

Import Required Libraries
- NumPy for numerical operations
- Pandas for data handling
- Matplotlib for visualization (optional; intended for future expansion)
- Scikit-learn for preprocessing functions

Load the Dataset

import numpy as np
import pandas as pd

dataset = pd.read_csv('Data.csv')
# Features: every column except the last; target: the last column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
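
Before imputing, it can help to see which columns actually contain missing values. The sketch below is illustrative only: the contents of Data.csv are not shown in this README, so the column names and rows are hypothetical, and the CSV is read from an in-memory string rather than from disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv (the real file's columns may differ).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,,Yes
Germany,,54000,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Same slicing as in the loading step: all but the last column are features.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Count missing entries per column; empty CSV fields are read as NaN.
missing_per_column = dataset.isna().sum()
print(missing_per_column)
print(X.shape, y.shape)
```

Columns with a non-zero count are the ones the imputation step needs to cover.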

Handle Missing Data

import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN entries in columns 1 and 2 (assumed numeric) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
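
To make the effect of mean imputation concrete, here is a minimal sketch on a small hypothetical array (the values are invented for illustration and are not taken from Data.csv). Each NaN is replaced by the mean of the non-missing values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one missing value each.
values = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

# statistics_ holds the per-column means learned during fit.
column_means = imputer.statistics_
print(column_means)
print(filled)
```

Other strategies supported by SimpleImputer include 'median', 'most_frequent', and 'constant'; the mean is simply the one this project uses.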

Encode Categorical Data

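from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encode column 0; the encoded columns come first, the remaining columns pass through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Encode the target labels as integers
le = LabelEncoder()
y = le.fit_transform(y)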
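
The column layout after encoding matters for the later scaling step, so the sketch below shows it on hypothetical data (the category names and values are assumptions, not the contents of Data.csv). With three distinct categories in column 0, the output has three one-hot columns first, followed by the untouched numeric columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical features: a category in column 0, then two numeric columns.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)
y_demo = np.array(["No", "Yes", "No", "No"])

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X_demo))

# Categories are sorted, so the one-hot columns follow this order.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(X_encoded)

# LabelEncoder maps the sorted class labels to 0, 1, ...
le = LabelEncoder()
y_encoded = le.fit_transform(y_demo)
print(le.classes_, y_encoded)
```

If the encoded column had a different number of categories, the number of leading one-hot columns would change accordingly, and any later column index (such as the one used for scaling) would need to be adjusted.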

Split Dataset into Training and Test Sets

from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
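
As a quick check of what the split produces, the following sketch uses a hypothetical 10-row array (not the project's data). With test_size=0.2, two rows go to the test set and eight to the training set, and repeating the call with the same random_state yields the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, binary target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)

# Same seed, same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(X_test, X_test_again))
```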

Feature Scaling

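from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set only, then reuse its mean and standard deviation for the test set
# Scaling starts at column 3 so the one-hot columns before it keep their 0/1 values
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])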
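
The sketch below illustrates the scaling step on hypothetical, already-encoded data (three one-hot columns followed by two numeric columns; the values are invented). It shows that the scaled training columns end up with mean close to 0 and standard deviation close to 1, while the test set is transformed with the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical encoded data: three one-hot columns, then age and salary.
X_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])
X_test = np.array([
    [1.0, 0.0, 0.0, 35.0, 58000.0],
])

# Keep the raw training means to compare with what the scaler learns.
train_means = X_train[:, 3:].mean(axis=0)

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # learn mean/std, then scale
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the training mean/std

print(X_train[:, 3:].mean(axis=0))  # close to 0
print(X_train[:, 3:].std(axis=0))   # close to 1
print(X_test)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing, which is why the test set only goes through transform.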

Output

- A preprocessed dataset ready to be passed to ML models.
- Numerical features standardized to zero mean and unit variance.
- Categorical data encoded as numbers so that models can consume it.

Author

This project was created as part of learning and implementing data preprocessing techniques in machine learning.

License

This project is licensed under the MIT License - see the LICENSE file for details.
