MazenS0liman/Data-Engineering-Project

Data-Engineering-Project

Image 1

Project Description

This project is split into four milestones that together build an ETL pipeline for a fintech dataset, with the following main objectives:

  1. Data cleaning, transformation, and feature engineering using Pandas and PySpark
  2. Loading data into a Postgres database
  3. Stream processing using Kafka and Zookeeper
  4. Visualizing data in a dashboard and creating a DAG in Airflow

Milestone 1

Image 2

Objective:

  1. Perform exploratory data analysis (EDA) with visualization and extract additional data.
  2. Clean the data by tidying column names and handling inconsistent data, missing data, and outliers.
  3. Introduce new features, encoding, and normalization.
  4. Create a lookup table whose values can later be used to reverse all imputed values to their original values.
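
The cleaning and lookup-table objectives above can be illustrated with a minimal Pandas sketch. It is not the project's actual code: the sample data, the column names (`Loan Amount`, `Emp Title`), the helper names, and the choice of median/mode imputation are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the real fintech dataset and its columns are not shown here.
raw = pd.DataFrame({
    "Loan Amount": [1000.0, None, 3000.0],
    "Emp Title": ["Engineer", None, "Engineer"],
})

def tidy_columns(df):
    """Lower-case column names and replace spaces with underscores."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out

def impute_with_lookup(df, column, fill_value, lookup_rows):
    """Fill missing values in `column` and log the substitution in `lookup_rows`."""
    out = df.copy()
    if out[column].isna().any():
        # Record what was replaced so the imputation can be traced back later.
        lookup_rows.append(
            {"column": column, "original": "missing", "imputed": fill_value}
        )
        out[column] = out[column].fillna(fill_value)
    return out

lookup_rows = []
clean = tidy_columns(raw)
# Assumed strategy: median for numeric columns, most frequent value for text columns.
clean = impute_with_lookup(
    clean, "loan_amount", clean["loan_amount"].median(), lookup_rows
)
clean = impute_with_lookup(
    clean, "emp_title", clean["emp_title"].mode().iloc[0], lookup_rows
)
lookup_table = pd.DataFrame(lookup_rows)
```

Each row of `lookup_table` names a column, the original state, and the value that replaced it, which is the information needed to trace an imputed value back to its origin.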

Diagram

Image 3

Milestone 2

Image 4

Objective:

  1. Utilize Docker to create a container that performs the tasks implemented in Milestone 1.
  2. Save the cleaned dataset in a Postgres database.
  3. Receive a data stream using Kafka and Zookeeper, process each message, and save it to the database.
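
The per-message step of the streaming objective might look roughly like the sketch below. To keep it runnable without any services, the Kafka consumer is replaced by a plain list of byte strings and Postgres is replaced by an in-memory SQLite database. The table name `loans`, the message fields, and the helper functions are hypothetical and not taken from the project.

```python
import json
import sqlite3

def process_message(raw_bytes):
    """Decode one JSON message and normalise its fields."""
    record = json.loads(raw_bytes.decode("utf-8"))
    return {
        "loan_id": int(record["loan_id"]),
        "grade": str(record["grade"]).strip().upper(),
        "loan_amount": float(record["loan_amount"]),
    }

def save_record(conn, record):
    """Insert one processed record into the (hypothetical) loans table."""
    conn.execute(
        "INSERT INTO loans (loan_id, grade, loan_amount) "
        "VALUES (:loan_id, :grade, :loan_amount)",
        record,
    )
    conn.commit()

# SQLite stands in for Postgres here so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, grade TEXT, loan_amount REAL)"
)

# Stand-in for messages that would be read from a Kafka topic.
incoming = [b'{"loan_id": 1, "grade": " b ", "loan_amount": "2500"}']
for message in incoming:
    save_record(conn, process_message(message))

rows = conn.execute("SELECT loan_id, grade, loan_amount FROM loans").fetchall()
```

In a real deployment the loop would iterate over a Kafka consumer and the connection would point at the Postgres container. Keeping decoding and saving in separate functions lets the processing logic be tested without either service.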

Diagram

Image 5

Image 6

Milestone 3

Image 7

Objective:

This milestone focuses on gaining hands-on experience with PySpark by implementing the following:

  1. Load the dataset
  2. Perform some simple cleaning
    • Rename columns
    • Detect missing values
    • Handle missing values
    • Check for remaining missing values
  3. Perform some analysis on the dataset
  4. Add new columns with feature engineering
  5. Encode categorical columns
  6. Create a lookup table for encoding only
  7. Save the cleaned dataset and lookup table
  8. Save the output into a Postgres database

Milestone 4

Image 8

Objective:

  1. Create an ETL pipeline using Airflow
  2. Create a dashboard for the output data that gives insights into the following five questions:
    • What is the trend of loan issuance over the months for each year?
    • What is the percentage distribution of loan grades in the dataset?
    • What is the distribution of loan amounts across different grades?
    • Which states have the highest average loan amount?
    • How does the loan amount relate to annual income across states?
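
Several of the dashboard questions reduce to simple aggregations. The Pandas sketch below shows one possible way to compute three of them. The sample rows and the column names (`issue_date`, `grade`, `state`, `loan_amount`) are assumptions for illustration, and the project's dashboard may compute these differently.

```python
import pandas as pd

# Hypothetical sample rows; the real cleaned dataset is not shown here.
loans = pd.DataFrame({
    "issue_date": pd.to_datetime(
        ["2018-01-15", "2018-01-20", "2018-02-03", "2019-01-10"]
    ),
    "grade": ["A", "B", "A", "A"],
    "state": ["CA", "NY", "CA", "NY"],
    "loan_amount": [1000.0, 3000.0, 2000.0, 5000.0],
})

# Loan issuance trend: number of loans per month within each year.
issuance_trend = (
    loans.groupby([
        loans["issue_date"].dt.year.rename("year"),
        loans["issue_date"].dt.month.rename("month"),
    ])
    .size()
    .rename("loan_count")
    .reset_index()
)

# Percentage distribution of loan grades.
grade_share = loans["grade"].value_counts(normalize=True).mul(100)

# States ranked by average loan amount, highest first.
state_avg = (
    loans.groupby("state")["loan_amount"].mean().sort_values(ascending=False)
)
```

Each result is a small table or series that a dashboard chart (line, pie, or bar) could consume directly.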

Diagram

Image 9

Video

Showcase.mp4

About

This repository focuses on creating an ETL (Extract, Transform, Load) pipeline for a fintech dataset, with the goal of processing and visualizing financial data using tools and technologies such as Pandas, PySpark, Kafka, Postgres, Docker, and Airflow.
