Netflix-Azure-Data-Engineering-Project

Introduction: In this project, I built a highly scalable data pipeline for processing and analyzing the Netflix dataset with the Azure data engineering stack. The goal was an end-to-end ETL solution that efficiently ingests, transforms, and visualizes data using Databricks, Delta Live Tables (DLT), Azure Synapse, and Power BI.

🔧 Tech Stack & Tools Used:
✅ Databricks: data processing, transformation, and orchestration
✅ Delta Live Tables (DLT): incremental data processing with Autoloader
✅ Azure Data Factory (ADF): data ingestion & workflow automation
✅ Azure Data Lake Gen2: storage layer for Bronze, Silver, and Gold tables
✅ Azure Synapse Analytics: warehouse for querying structured data
✅ Power BI: interactive dashboards and visualizations
✅ GitHub: version control & collaboration
✅ Azure Key Vault: secure storage of credentials
✅ dbutils: Databricks utilities for handling widgets, secrets, and storage

📂 Data Pipeline Architecture:
📌 Ingestion Layer: data is loaded incrementally from Azure Data Lake using Databricks Autoloader (see the sketch below).
📌 Bronze Layer (Raw Data Store): stores raw ingested data in Delta format.
📌 Silver Layer (Transformations & Cleansing): applies validations, deduplication, and aggregations.
📌 Gold Layer (Star Schema): structures the data for analytics & reporting.
📌 Orchestration: Databricks Workflows and Azure Data Factory schedule the jobs.
📌 Visualization: Power BI connects to Azure Synapse for interactive reporting.
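A minimal sketch of the Autoloader ingestion into the Bronze layer. The paths, storage account, and CSV source format are illustrative assumptions, not taken from this repo:

```python
# Illustrative ADLS Gen2 paths -- replace with real container/account names.
raw_path = "abfss://raw@<storage_account>.dfs.core.windows.net/netflix/"
bronze_path = "abfss://bronze@<storage_account>.dfs.core.windows.net/netflix_titles/"
checkpoint = bronze_path + "_checkpoint"

df = (spark.readStream
      .format("cloudFiles")                              # Databricks Autoloader
      .option("cloudFiles.format", "csv")                # assumed source format
      .option("cloudFiles.schemaLocation", checkpoint)   # schema inference & evolution
      .load(raw_path))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", checkpoint)             # tracks already-ingested files
   .trigger(availableNow=True)                           # process new files, then stop
   .outputMode("append")
   .start(bronze_path))
```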

🔹 ⚡ Key Implementations & Optimization Techniques

🔹 Incremental Loading:
✅ Used Databricks Autoloader for continuous ingestion with checkpointing.
✅ Ensured idempotency to avoid duplicate data loads (see the merge sketch below).
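One common way to make re-processed batches idempotent is a Delta MERGE inside foreachBatch, shown here as an alternative to the plain append sink above. This reuses df, bronze_path, and checkpoint from the ingestion sketch; treating show_id as the natural key is an assumption:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Keep one row per key per micro-batch, otherwise MERGE raises an error.
    deduped = batch_df.dropDuplicates(["show_id"])        # assumed natural key
    target = DeltaTable.forPath(spark, bronze_path)
    (target.alias("t")
           .merge(deduped.alias("s"), "t.show_id = s.show_id")
           .whenMatchedUpdateAll()                        # re-delivered rows update...
           .whenNotMatchedInsertAll()                     # ...instead of duplicating
           .execute())

(df.writeStream
   .foreachBatch(upsert_batch)
   .option("checkpointLocation", checkpoint)
   .trigger(availableNow=True)
   .start())
```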

🔹 Delta Live Tables for Streaming & Batch Processing
✅ Implemented Change Data Capture (CDC) on Delta tables.
✅ Used @dlt.table and @dlt.expect_all_or_drop() to enforce data quality checks (sketched below).
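A hedged sketch of the DLT pattern above; the table and column names (netflix_titles_bronze, show_id, type, date_added) are illustrative:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleansed Netflix titles (Silver)")
@dlt.expect_all_or_drop({
    # Rows failing ANY of these expectations are dropped from the table.
    "valid_show_id": "show_id IS NOT NULL",
    "valid_type": "type IN ('Movie', 'TV Show')",
})
def netflix_titles_silver():
    return (
        dlt.read_stream("netflix_titles_bronze")          # assumed upstream DLT table
           .dropDuplicates(["show_id"])
           .withColumn("date_added", F.to_date("date_added", "MMMM d, yyyy"))
    )
```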

🔹 Orchestration with Conditional Execution
✅ Implemented ForEach & If/Else conditions to run specific jobs based on the day of execution.
✅ Leveraged dbutils.jobs.taskValues.set() for cross-task communication (sketch below).
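A sketch of the cross-task handoff, assuming a two-task Databricks Workflow where an upstream task named "lookup" decides the branch (the task and key names are hypothetical):

```python
from datetime import datetime

# Upstream "lookup" task: publish a flag for downstream If/Else branching.
weekday = datetime.today().strftime("%A")
dbutils.jobs.taskValues.set(key="run_full_load", value=(weekday == "Sunday"))

# Downstream task: read the flag set by "lookup"; the workflow's If/Else
# condition (or an ADF If Condition activity) branches on this value.
run_full_load = dbutils.jobs.taskValues.get(
    taskKey="lookup", key="run_full_load", default=False, debugValue=False
)
```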

🔹 Optimizations using Databricks SQL
✅ Used OPTIMIZE & ZORDER BY to compact small files and cluster data for faster queries and file pruning (see below).
✅ Relied on Delta data skipping to reduce scan times on large datasets.
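How the optimization step might look when run from a notebook; the table and column names are illustrative:

```python
# Compact small files and co-locate rows by a frequently filtered column,
# so min/max file statistics let the reader skip irrelevant files.
spark.sql("OPTIMIZE gold.netflix_titles ZORDER BY (release_year)")

# Data skipping then prunes files whose stats exclude the predicate.
recent = spark.sql("SELECT * FROM gold.netflix_titles WHERE release_year >= 2020")
```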

🔹 Security & Access Control
✅ Configured Azure Key Vault to store secrets & credentials securely.
✅ Used dbutils.secrets.get() to retrieve credentials at runtime (sketch below).
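A minimal sketch of reading a Key Vault-backed secret; the scope and key names are assumptions:

```python
# "kv-scope" is an assumed Databricks secret scope backed by Azure Key Vault.
storage_key = dbutils.secrets.get(scope="kv-scope", key="adls-access-key")

# Authenticate Spark against ADLS Gen2 without hard-coding the credential;
# secret values are redacted if printed in notebook output.
spark.conf.set(
    "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
    storage_key,
)
```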

🔹 Outcome & Impact
🔥 The end-to-end ETL pipeline processes the Netflix dataset efficiently and scales with data volume.
🔥 Automated workflow execution with Databricks Workflows & ADF reduces manual effort.
🔥 Optimized queries & the star-schema warehouse design speed up analytics in Power BI.
🔥 Azure Key Vault integration keeps credentials out of code and notebooks.

About

Built an incremental data processing pipeline using Azure Data Factory, Databricks, and Delta Live Tables for real-time transformations and analytics. Data flows from GitHub to Databricks via AutoLoader, stored in Data Lake Gen2, transformed into a Star Schema, and served in Azure Synapse for Power BI reporting.
