This project is split into four milestones that together build an ETL pipeline for a fintech dataset, with the following main objectives:
- Data Cleaning, Transformation & Feature Engineering using Pandas and PySpark
- Loading Data into a Postgres Database
- Stream Processing using Kafka & Zookeeper
- Visualizing Data in a Dashboard and creating a DAG in Airflow
- Perform exploratory data analysis (EDA) with visualization and extract additional data.
- Perform data cleaning by tidying column names and handling inconsistent data, missing data and outliers.
- Introduce new features and apply encoding and normalization.
- Create a lookup table whose values can later be used to revert all imputed values to their original values.
- Utilize Docker to create a container that performs the tasks implemented in Milestone 1.
- Save the cleaned dataset in a Postgres database.
- Receive a data stream using Kafka & Zookeeper, process each message, then save it to the database.
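
The lookup-table objective listed above can be sketched roughly as follows. This is a minimal illustration in Pandas, not the project's actual implementation: the helper names `impute_with_lookup` and `revert_imputation`, the lookup layout (`column`, `row_index`, `original`, `imputed`) and the sample columns `emp_title` and `annual_inc` are all assumptions made for the example.

```python
import pandas as pd


def impute_with_lookup(df, fill_values):
    """Fill missing values and record every change in a lookup table.

    `fill_values` maps a column name to the value used for imputation.
    The lookup layout (column, row_index, original, imputed) is an assumption.
    """
    out = df.copy()
    records = []
    for column, fill_value in fill_values.items():
        # Collect the row labels that are missing before changing anything.
        for idx in out.index[out[column].isna()]:
            records.append({
                "column": column,
                "row_index": idx,
                "original": out.at[idx, column],
                "imputed": fill_value,
            })
            out.at[idx, column] = fill_value
    lookup = pd.DataFrame(
        records, columns=["column", "row_index", "original", "imputed"]
    )
    return out, lookup


def revert_imputation(df, lookup):
    """Restore the original (missing) values recorded in the lookup table."""
    out = df.copy()
    for row in lookup.itertuples(index=False):
        out.at[row.row_index, row.column] = row.original
    return out


# Hypothetical sample data; the real dataset's columns may differ.
df = pd.DataFrame({
    "emp_title": ["Engineer", None, "Teacher"],
    "annual_inc": [50000.0, None, 72000.0],
})

cleaned, lookup = impute_with_lookup(
    df,
    {"emp_title": "missing", "annual_inc": df["annual_inc"].median()},
)
restored = revert_imputation(cleaned, lookup)
```

Recording the row index alongside each imputed value keeps the reversal unambiguous even when the imputed value (for example a median) also occurs as a genuine value elsewhere in the column.
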
This milestone focuses on getting hands-on experience with PySpark by implementing the following:
- Loading the dataset
- Perform some simple cleaning
- Column renaming
- Detect missing values
- Handle missing values
- Check for remaining missing values
- Perform some analysis on the dataset
- Add new columns with feature engineering
- Encode categorical columns
- Create a lookup table for encoding only
- Saving the cleaned dataset and lookup table
- Saving the output into a Postgres database
- Create an ETL pipeline using Airflow
- Creating a dashboard for the output data, where the aim is to give insights on the following 5 questions:
  - What is the trend of loan issuance over the months for each year?
  - What is the percentage distribution of loan grades in the dataset?
  - What is the distribution of loan amounts across different grades?
  - Which states have the highest average loan amount?
  - How does the loan amount relate to annual income across states?
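
As a rough illustration of the aggregations behind some of the dashboard questions, the sketch below computes three of them with Pandas. The column names (`issue_date`, `grade`, `loan_amount`, `state`) and the sample rows are assumptions for the example, and "trend of loan issuance" is interpreted here as the number of loans issued per month, which may differ from the project's own definition.

```python
import pandas as pd

# Hypothetical sample rows; the real cleaned dataset may use other column names.
loans = pd.DataFrame({
    "issue_date": pd.to_datetime(
        ["2018-01-15", "2018-01-20", "2018-02-03", "2019-01-10"]
    ),
    "grade": ["A", "B", "A", "C"],
    "loan_amount": [10000.0, 20000.0, 15000.0, 5000.0],
    "state": ["CA", "NY", "CA", "TX"],
})

# Loan issuance trend: number of loans per month within each year.
monthly_trend = (
    loans.groupby([
        loans["issue_date"].dt.year.rename("year"),
        loans["issue_date"].dt.month.rename("month"),
    ])
    .size()
    .rename("loan_count")
    .reset_index()
)

# Percentage distribution of loan grades.
grade_pct = loans["grade"].value_counts(normalize=True).mul(100).round(1)

# States ranked by average loan amount, highest first.
avg_by_state = (
    loans.groupby("state")["loan_amount"].mean().sort_values(ascending=False)
)

print(monthly_trend)
print(grade_pct)
print(avg_by_state)
```

Each result is a small table that a dashboard library can plot directly, for example a line chart for `monthly_trend` and a bar chart for `avg_by_state`.
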








