
End-to-End Real-Time E-commerce Data Pipeline

Tech stack: Python · Apache Kafka · Google BigQuery · dbt · Docker

📝 Project Overview

This project demonstrates the construction of an end-to-end streaming data pipeline that simulates and processes user events from an e-commerce platform in real time. Raw data from the source is streamed using Apache Kafka, stored in Google BigQuery, and transformed into a clean, analytics-ready data model using dbt (Data Build Tool).

This project is designed to showcase a fundamental understanding of modern data architecture, real-time data ingestion, and efficient data modeling principles.


✨ Key Features

  • Real-Time Data Ingestion: Utilizes Kafka to handle a continuous stream of data, mimicking a production environment where events occur every second.
  • Scalable Architecture: Leverages Google BigQuery as a serverless, highly scalable data warehouse capable of handling data volumes from small to massive.
  • Modern Data Transformation: Implements data transformation best practices with dbt, separating raw data from analytics-ready data and building tested, documented models.
  • Reproducible Environment: Uses Docker to run Kafka, ensuring a consistent and easy-to-set-up development environment.

🏗️ Pipeline Architecture

The pipeline consists of several key components that work sequentially to process the data:

(Architecture diagram: Real-Time Data Streaming and Transformation Pipeline)

  1. Python Producer: A script that reads data from a CSV file and streams each row as a JSON message to a Kafka topic (a minimal producer sketch follows this list).
  2. Apache Kafka: Acts as a reliable message broker, receiving the data stream from the producer and making it available to the consumer.
  3. Python Consumer: A script that subscribes to the Kafka topic, consumes messages in real time, and loads them into a raw table in BigQuery (a matching consumer sketch also follows this list).
  4. Google BigQuery: Serves as the Data Warehouse with two layers:
    • ecommerce_raw Dataset: A landing zone for raw data directly from Kafka.
    • ecommerce_analytics Dataset: An analytics zone containing clean views and tables transformed by dbt.
  5. dbt (Data Build Tool): Fetches data from the raw zone, then cleans, transforms, and aggregates it into data models ready for business analysis.
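
As a rough illustration of step 1, a producer along these lines reads the CSV and publishes each row as a JSON message. This is a minimal sketch assuming the kafka-python package and a local broker; the CSV path, topic name, broker address, and one-second delay are placeholders, not necessarily what kafka_producer/producer.py does.

    # Minimal producer sketch (assumes kafka-python and a local broker on port 9092).
    # The CSV path, topic name, and per-row delay are illustrative placeholders.
    import csv
    import json
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    with open("data/ecommerce_events.csv", newline="") as f:
        for row in csv.DictReader(f):
            producer.send("ecommerce_events", value=row)  # one JSON message per CSV row
            time.sleep(1)  # mimic roughly one event per second

    producer.flush()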
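
Similarly for step 3, a matching consumer subscribes to the topic and streams each message into the raw BigQuery table. Again a sketch, assuming kafka-python and google-cloud-bigquery; the topic, project, dataset, and table names are placeholders for whatever kafka_consumer/consumer.py actually targets.

    # Minimal consumer sketch (assumes kafka-python and google-cloud-bigquery).
    # Topic and table names are placeholders, not the repo's exact identifiers.
    import json

    from kafka import KafkaConsumer
    from google.cloud import bigquery

    consumer = KafkaConsumer(
        "ecommerce_events",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    client = bigquery.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
    table_id = "your-gcp-project.ecommerce_raw.events_raw"  # placeholder table

    for message in consumer:
        errors = client.insert_rows_json(table_id, [message.value])  # streaming insert
        if errors:
            print(f"BigQuery insert errors: {errors}")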

🛠️ Setup & Installation

To run this project in a local environment, follow these steps:

Prerequisites

  • Git and Python 3 with pip
  • Docker and Docker Compose (to run Kafka locally)
  • A Google Cloud project with BigQuery enabled and permission to create a Service Account
  • dbt with the BigQuery adapter (dbt-bigquery)

Installation Steps

  1. Clone the Repository

    git clone https://github.com/aDJi2003/streaming-ecommerce-analytics.git
    cd streaming-ecommerce-analytics
  2. Configure Google Cloud

    • Create a Service Account in GCP with the BigQuery Data Editor and BigQuery Job User roles.
    • Download the Service Account key as a JSON file.
    • IMPORTANT: Save this JSON file in the project's root directory, but NEVER commit it to Git. The .gitignore file is already configured to ignore it.
    • Either rename your JSON file to match the path in the scripts or update the path inside kafka_consumer/consumer.py and ~/.dbt/profiles.yml (see the authentication sketch after these steps).
  3. Set Up Python Environment

    • (Optional but recommended) Create and activate a virtual environment:
      python -m venv venv
      source venv/bin/activate  # or `venv\Scripts\activate` on Windows
    • Install all required dependencies:
      pip install -r requirements.txt
  4. Set Up dbt

    • Initialize your dbt profile. dbt will ask for the location of your JSON key file.
      dbt init
    • Ensure your ~/.dbt/profiles.yml file is correctly configured to point to your GCP project and the ecommerce_analytics dataset.
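
As a reference for step 2, the BigQuery client can pick up the Service Account key either from the GOOGLE_APPLICATION_CREDENTIALS environment variable or from an explicit path in code. The file name below is a placeholder for your own key file.

    # Two common ways to point the BigQuery client at a Service Account key.
    # "service_account_key.json" is a placeholder for the file saved in the project root.
    import os

    from google.cloud import bigquery

    # Option 1: set the environment variable; Client() picks it up automatically.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service_account_key.json"
    client = bigquery.Client()

    # Option 2: pass the key file path explicitly.
    client = bigquery.Client.from_service_account_json("service_account_key.json")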

🚀 How to Run the Pipeline

Execute the following commands from the project's root directory, using a separate terminal for each step.

  1. Start Kafka (in Terminal 1). Wait about 30-60 seconds to allow Kafka to fully initialize.

    docker-compose up -d
  2. Run the Consumer (in Terminal 2). The consumer must be running first so it is ready to receive messages.

    python kafka_consumer/consumer.py
  3. Run the Producer (in Terminal 3). Once the consumer is ready, run the producer to start sending data.

    python kafka_producer/producer.py
  4. Run the dbt Transformations (in Terminal 1 or 4). After the data has been loaded into BigQuery, run dbt to perform the transformations (a quick verification query follows these steps).

    cd dbt_ecommerce
    dbt run
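
As mentioned in step 4, a quick row count against the raw landing table is an easy way to confirm that the consumer actually loaded data before judging the dbt output. The dataset and table names below are assumptions based on the layers described above.

    # Quick sanity check: count the rows that the consumer loaded into the raw table.
    # The dataset and table names are placeholders based on the layers described above.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = "SELECT COUNT(*) AS row_count FROM `ecommerce_raw.events_raw`"

    for row in client.query(sql).result():
        print(f"Rows in raw table: {row.row_count}")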

❤️ Closing Note

Made with ☕ for data engineering.
