FMCG Data Engineering Pipeline — Lakehouse Architecture on Databricks

Overview

This project simulates a real-world scenario in the FMCG (Fast-Moving Consumer Goods) industry: a large retail company acquires a smaller competitor and needs to consolidate data from both entities into a single, unified system.

The goal was to build a production-style, end-to-end ETL pipeline that ingests raw data from both companies, resolves inconsistencies, and delivers clean, analytics-ready datasets — all on a Lakehouse architecture using Databricks and Amazon S3.

This project was built as part of the Codebasics Data Engineering course by Dhaval Patel.

Architecture

The pipeline is structured around the Medallion Architecture and follows a two-company consolidation design.

The child company runs a full, self-contained pipeline: raw data is ingested from Amazon S3 into a Bronze layer, transformed through a Silver layer, and aggregated into its own Gold layer. That Gold layer is then merged into the parent company's Gold layer, which is managed under Unity Catalog. The unified Gold layer is what powers the Databricks dashboards.

Orchestration across the entire flow is handled by Lakeflow Jobs.

                        ┌─────────────────────────────────┐
                        │         Parent Company           │
                        │                                  │
                        │   Unity Catalog                  │
                        │        │                         │
                        │        ▼                         │
                        │   ┌─────────┐                   │
                        │   │  Gold   │ ──► Dashboards     │
                        │   └────▲────┘                   │
                        │        │ Merge                   │
                        └────────┼────────────────────────┘
                                 │
┌────────────────────────────────┼────────────────────────┐
│  Child Company                 │                         │
│                                │                         │
│  S3 (Raw Data)                 │                         │
│      │                         │                         │
│      ▼                         │                         │
│  Lakeflow Jobs                 │                         │
│      │                         │                         │
│      ▼        ▼        ▼       │                         │
│  [Bronze] → [Silver] → [Gold] ─┘                         │
│   Raw      Transform  Business                           │
└──────────────────────────────────────────────────────────┘

Layer Breakdown

Bronze: Raw Ingestion

Raw data from the child company lands in Amazon S3 and is ingested into the Bronze layer as-is, preserving its original state for auditability and reprocessing.
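As an illustration of the as-is principle, here is a minimal sketch in plain Python. It stands in for the Spark ingestion job, which this README does not show. Every value is kept as the original string and only audit metadata is appended. The column names, the file name, and the `_source_file` and `_ingested_at` fields are assumptions made for this example.

```python
import csv
import io
from datetime import datetime, timezone


def ingest_bronze(raw_text, source_file):
    """Read CSV text without cleaning it; only append audit metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        row = dict(record)  # values stay as the original strings
        row["_source_file"] = source_file  # hypothetical audit column
        row["_ingested_at"] = ingested_at  # hypothetical audit column
        rows.append(row)
    return rows


# Hypothetical raw extract: note the untrimmed name and the missing amount.
raw_csv = "order_id,customer,amount\n101, acme ,250.0\n102,Globex,\n"
bronze_rows = ingest_bronze(raw_csv, "orders_2025_01.csv")
```

Leaving the stray whitespace and the empty amount untouched is deliberate: any cleanup is deferred to the Silver layer, so the Bronze copy can always be reprocessed from its original state.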

Silver: Transformation & Standardization

Data is cleaned, validated, and standardized at this layer. This includes resolving the schema differences, inconsistent formats, and duplicate records that arose because the child company operated separately.
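The three kinds of fixes named here (schema differences, inconsistent formats, duplicates) could look roughly like the sketch below. It is plain Python rather than the project's Spark code, and the column mapping, date formats, and sample rows are all hypothetical.

```python
from datetime import datetime

# Hypothetical mapping from the child company's column names to a shared schema.
COLUMN_MAP = {"cust_id": "customer_id", "cust_name": "customer_name", "dt": "order_date"}
# Hypothetical date formats found in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_silver(bronze_rows):
    """Rename columns, standardize formats, and drop invalid and duplicate rows."""
    seen = set()
    silver = []
    for raw in bronze_rows:
        row = {COLUMN_MAP.get(key, key): value for key, value in raw.items()}
        row["customer_id"] = row["customer_id"].strip()
        row["customer_name"] = row["customer_name"].strip().title()
        row["order_date"] = parse_date(row["order_date"])
        if not row["customer_id"] or row["order_date"] is None:
            continue  # validation: skip rows without a key or a parseable date
        key = (row["customer_id"], row["order_date"])
        if key in seen:
            continue  # the same record was already kept
        seen.add(key)
        silver.append(row)
    return silver


bronze_rows = [
    {"cust_id": "C1 ", "cust_name": "acme foods", "dt": "2025-01-05"},
    {"cust_id": "C1", "cust_name": "ACME FOODS", "dt": "05/01/2025"},  # duplicate
    {"cust_id": "C2", "cust_name": "globex", "dt": "not a date"},  # invalid
]
silver_rows = to_silver(bronze_rows)
```

In this sketch duplicates are detected only after standardization, because the two `C1` rows look different until their IDs and dates share one format.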

Gold (Child): Business-Ready

Aggregated and enriched datasets are produced from the child company's data and structured to be compatible with the parent company's schema.
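A minimal sketch of such an aggregation, again in plain Python rather than Spark. The monthly sales-per-product grain, the `company` column, and every field name are assumptions for illustration; the README does not specify the actual Gold schema.

```python
from collections import defaultdict


def to_gold(silver_rows, company):
    """Aggregate order lines into monthly sales per product."""
    totals = defaultdict(float)
    for row in silver_rows:
        month = row["order_date"][:7]  # "YYYY-MM" taken from an ISO date
        totals[(row["product_id"], month)] += row["quantity"] * row["unit_price"]
    return [
        {"company": company, "product_id": product_id, "month": month, "sales": round(sales, 2)}
        for (product_id, month), sales in sorted(totals.items())
    ]


silver_rows = [
    {"product_id": "P1", "order_date": "2025-01-05", "quantity": 2, "unit_price": 10.0},
    {"product_id": "P1", "order_date": "2025-01-20", "quantity": 1, "unit_price": 10.0},
    {"product_id": "P2", "order_date": "2025-02-01", "quantity": 3, "unit_price": 4.5},
]
gold_rows = to_gold(silver_rows, company="child")
```

Tagging each row with its source company is one way to keep the child's output compatible with a shared parent table while still allowing per-company reporting.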

Gold (Parent): Unified Analytics Layer

The child company's Gold data is merged into the parent company's Gold layer, managed via Unity Catalog. This unified layer is the single source of truth for all reporting and dashboards.
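The merge can be thought of as an upsert: rows whose key already exists in the parent table are updated, and the rest are inserted. The sketch below shows that logic in plain Python with hypothetical key columns and sample rows. On Databricks an upsert like this is commonly written with Delta Lake's `MERGE INTO` statement, but this README does not say which approach the project uses.

```python
def merge_into(target_rows, source_rows, key_columns):
    """Upsert: update target rows whose key matches, insert the others."""
    merged = {tuple(row[c] for c in key_columns): dict(row) for row in target_rows}
    for row in source_rows:
        key = tuple(row[c] for c in key_columns)
        merged[key] = {**merged.get(key, {}), **row}
    return list(merged.values())


# Hypothetical parent table that already holds one stale child row.
parent_gold = [
    {"company": "parent", "product_id": "P1", "month": "2025-01", "sales": 500.0},
    {"company": "child", "product_id": "P1", "month": "2025-01", "sales": 25.0},
]
child_gold = [
    {"company": "child", "product_id": "P1", "month": "2025-01", "sales": 30.0},
    {"company": "child", "product_id": "P2", "month": "2025-02", "sales": 13.5},
]
keys = ["company", "product_id", "month"]
unified_gold = merge_into(parent_gold, child_gold, keys)
```

Because rows are matched on a key, running the same merge twice leaves the table unchanged, which makes scheduled reruns safe.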

Tech Stack

Tool / Technology         Purpose
Databricks                Processing, orchestration, and dashboards
Apache Spark              Distributed data transformation
Amazon S3                 Raw and processed data storage
Unity Catalog             Data governance and catalog for parent company
Lakeflow Jobs             Pipeline scheduling and orchestration
Python                    Pipeline logic and transformations
SQL                       Data querying and aggregation
Medallion Architecture    Layered data organization pattern

Pipeline Workflow

1. Raw data from both companies is uploaded to Amazon S3
2. Databricks ingests the raw files into Bronze Delta tables
3. Spark jobs clean and standardize the data into the Silver layer
4. Business logic is applied to produce unified Gold tables
5. Databricks dashboards connect to the Gold layer for reporting
6. The entire pipeline runs as a scheduled Lakeflow Job
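The ordering constraint in the workflow above can be sketched as a small dependency graph. This uses Python's standard-library `graphlib` module purely to illustrate task ordering; it is not the Lakeflow Jobs API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "bronze_ingest": set(),
    "silver_transform": {"bronze_ingest"},
    "gold_child": {"silver_transform"},
    "merge_into_parent_gold": {"gold_child"},
    "refresh_dashboard": {"merge_into_parent_gold"},
}

# A valid execution order in which every task follows its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
```

An orchestrator applies the same idea when it schedules a job: a task starts only after the tasks it depends on have succeeded.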

Getting Started

Prerequisites

- A Databricks workspace (Community Edition works for exploring the notebooks, but Unity Catalog and scheduled jobs are likely to need a full workspace)
- An Amazon S3 bucket with appropriate IAM permissions
- The raw dataset files loaded into S3

Steps

1. Clone this repository
2. Import the notebooks into your Databricks workspace
3. Configure your S3 connection in the Databricks cluster settings (or via a mounted storage path)
4. Run the notebooks in order: Bronze → Silver → Gold
5. Set up a Lakeflow Job to run the pipeline on a schedule
6. Connect the Databricks dashboard to the Gold layer tables

Business Use Case

This pipeline enables the acquiring retail company to:

- Get a unified, consistent view of customers, sales, and inventory across both companies
- Identify data quality issues introduced by the acquisition early in the pipeline
- Support finance, operations, and leadership with reliable, up-to-date reporting
- Scale the pipeline as more data sources are added post-acquisition

Acknowledgements

Built as part of the Codebasics Data Engineering course by Dhaval Patel.

