This project demonstrates the construction of an end-to-end streaming data pipeline that simulates and processes user events from an e-commerce platform in real-time. Raw data from the source is streamed using Apache Kafka, stored in Google BigQuery, and transformed into a clean, analytics-ready data model using dbt (Data Build Tool).
This project is designed to showcase a fundamental understanding of modern data architecture, real-time data ingestion, and efficient data modeling principles.
- Real-Time Data Ingestion: Utilizes Kafka to handle a continuous stream of data, mimicking a production environment where events occur every second.
- Scalable Architecture: Leverages Google BigQuery as a serverless, highly scalable data warehouse capable of handling data volumes from small to massive.
- Modern Data Transformation: Implements data transformation best practices with dbt, separating raw data from analytics-ready data and building tested, documented models.
- Reproducible Environment: Uses Docker to run Kafka, ensuring a consistent and easy-to-set-up development environment.
The pipeline consists of several key components that work sequentially to process the data:
- Python Producer: A script that reads data from a CSV file and streams each row as a JSON message to a Kafka topic (see the producer sketch after this list).
- Apache Kafka: Acts as a reliable message broker, receiving the data stream from the producer and making it available to the consumer.
- Python Consumer: A script that subscribes to the Kafka topic, consumes messages in real-time, and loads them into a raw table in BigQuery (see the consumer sketch after this list).
- Google BigQuery: Serves as the Data Warehouse with two layers:
  - `ecommerce_raw` dataset: A landing zone for raw data arriving directly from Kafka.
  - `ecommerce_analytics` dataset: An analytics zone containing clean views and tables transformed by dbt.
- dbt (Data Build Tool): Fetches data from the raw zone, then cleans, transforms, and aggregates it into data models ready for business analysis.
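To make the producer's role concrete, here is a minimal sketch of streaming CSV rows to a Kafka topic as JSON using `kafka-python`. The topic name, broker address, and CSV path below are placeholder assumptions; the project's actual logic lives in `kafka_producer/producer.py`.

```python
import csv
import json
import time

from kafka import KafkaProducer  # kafka-python

# Placeholder names -- adjust to match kafka_producer/producer.py.
TOPIC = "ecommerce_events"
BOOTSTRAP_SERVERS = "localhost:9092"
CSV_PATH = "data/events.csv"

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(CSV_PATH, newline="") as f:
    for row in csv.DictReader(f):
        # Each CSV row becomes one JSON message on the topic.
        producer.send(TOPIC, value=row)
        time.sleep(1)  # simulate a real-time stream, roughly one event per second

producer.flush()
```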
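And a matching sketch of the consumer side, which subscribes to the same topic and streams each message into the raw BigQuery table. The table ID and key-file path are placeholder assumptions; see `kafka_consumer/consumer.py` for the real implementation.

```python
import json

from kafka import KafkaConsumer    # kafka-python
from google.cloud import bigquery  # google-cloud-bigquery

# Placeholder names -- adjust to match kafka_consumer/consumer.py.
TOPIC = "ecommerce_events"
BOOTSTRAP_SERVERS = "localhost:9092"
TABLE_ID = "your-gcp-project.ecommerce_raw.events"  # raw landing table
KEY_PATH = "service_account.json"                   # JSON key in the project root

client = bigquery.Client.from_service_account_json(KEY_PATH)

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Stream each event into the raw BigQuery table as it arrives.
    errors = client.insert_rows_json(TABLE_ID, [message.value])
    if errors:
        print(f"BigQuery insert errors: {errors}")
```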
To run this project in a local environment, make sure the following prerequisites are in place, then follow the setup steps below:
- Git
- Docker & Docker Compose
- Python 3.8+
- A Google Cloud Platform (GCP) account with the BigQuery API enabled.
- Clone the Repository

  ```bash
  git clone https://github.com/aDJi2003/streaming-ecommerce-analytics.git
  cd streaming-ecommerce-analytics
  ```

- Configure Google Cloud
  - Create a Service Account in GCP with the `BigQuery Data Editor` and `BigQuery Job User` roles.
  - Download the Service Account key as a JSON file.
  - IMPORTANT: Save this JSON file in the project's root directory, but NEVER commit it to Git. The `.gitignore` file is already configured to ignore it.
  - Either rename your JSON file to match the path expected by the scripts or update the path inside `kafka_consumer/consumer.py` and `~/.dbt/profiles.yml`.
- Set Up Python Environment
  - (Optional but recommended) Create and activate a virtual environment:

    ```bash
    python -m venv venv
    source venv/bin/activate  # or `venv\Scripts\activate` on Windows
    ```

  - Install all required dependencies:

    ```bash
    pip install -r requirements.txt
    ```
- Set Up dbt
  - Initialize your dbt profile. dbt will ask for the location of your JSON key file.

    ```bash
    dbt init
    ```

  - Ensure your `~/.dbt/profiles.yml` file is correctly configured to point to your GCP project and the `ecommerce_analytics` dataset.
Execute the following commands from the project's root directory, using a separate terminal for each step.
- Start Kafka (in Terminal 1). Wait about 30-60 seconds to allow Kafka to fully initialize.

  ```bash
  docker-compose up -d
  ```

- Run the Consumer (in Terminal 2). The consumer must be running first so it is ready to receive messages.

  ```bash
  python kafka_consumer/consumer.py
  ```

- Run the Producer (in Terminal 3). Once the consumer is ready, run the producer to start sending data.

  ```bash
  python kafka_producer/producer.py
  ```

- Run the dbt Transformations (in Terminal 1 or 4). After the data has been loaded into BigQuery, run dbt to perform the transformations.

  ```bash
  cd dbt_ecommerce
  dbt run
  ```
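If you want to verify that events actually landed before and after the dbt run, a quick check from Python is sketched below; the project ID and raw table name are placeholder assumptions, so substitute the ones your consumer writes to.

```python
from google.cloud import bigquery

# Placeholder project and table names -- replace with your own.
client = bigquery.Client.from_service_account_json("service_account.json")
query = "SELECT COUNT(*) AS row_count FROM `your-gcp-project.ecommerce_raw.events`"

for row in client.query(query).result():
    print(f"Rows landed in the raw table: {row.row_count}")
```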
Made with ☕ for data engineering.
