The modern data engineering landscape is undergoing a radical transformation. For practitioners grounded in the robust, open-source traditions of the Hadoop ecosystem—leveraging tools like MySQL, Hive, PySpark, Oozie, Airflow, and ClickHouse—the transition to the Azure cloud represents both a technological translation and a philosophical shift. The move is from a world of statically provisioned clusters and tightly coupled storage-compute architectures to one of ephemeral, serverless resources, decoupled storage, and identity-based security perimeters.
This research report provides an exhaustive analysis tailored specifically for the experienced data engineer seeking to translate deep on-premises or IaaS (Infrastructure as a Service) expertise into the Azure PaaS (Platform as a Service) lexicon. The objective is twofold: to provide a rigorous theoretical mapping of legacy tools to their cloud-native counterparts, and to detail a comprehensive, execution-ready "One Day" hands-on project. This project is designed to be cost-neutral, leveraging Azure’s free tier and trial capabilities to build an end-to-end pipeline—ingesting unformatted data, transforming it via Spark, and serving it through serverless SQL—thereby creating a tangible portfolio asset for immediate job market relevance.
To effectively operate within Azure, one must first dismantle the mental models of the legacy Hadoop stack and reconstruct them according to cloud-native principles. The transition is not merely about swapping tool A for tool B; it involves fundamentally rethinking how data is stored, processed, and secured.
In a traditional Hadoop cluster (HDFS), data locality is paramount. The NameNode and DataNodes work in concert to ensure that computation (MapReduce/Spark) occurs on the same physical hardware where the data resides to minimize network I/O. This architecture, while efficient for spinning disks and limited bandwidth, is rigid. Scaling storage requires adding compute nodes, and scaling compute requires rebalancing storage.
Azure, like other hyperscale clouds, enforces a strict decoupling of these resources.
- Storage is Infinite and Remote: Azure Data Lake Storage Gen2 (ADLS Gen2) serves as the HDFS equivalent. It is a highly durable, globally available object store that supports the HDFS API (ABFS driver). It scales independently of compute, allowing for petabytes of storage without a single active CPU core.1
- Compute is Ephemeral and Stateless: Compute resources—whether Azure Databricks clusters, Synapse Spark pools, or Data Factory pipelines—are provisioned on-demand. They mount the storage layer over the high-speed Azure backbone network. This allows for "pause and resume" economics, where you stop paying for compute the moment the job finishes, a stark contrast to the "always-on" Oozie/Airflow worker nodes of the past.2
In legacy environments, security was often network-centric (firewalls, VPNs) or relied on Kerberos ticketing systems that were notoriously difficult to manage. In Azure, Microsoft Entra ID (formerly Azure Active Directory) is the control plane. Access to data in ADLS Gen2 is not just about network visibility; it requires an OAuth token or a Managed Identity with specific Role-Based Access Control (RBAC) assignments (e.g., "Storage Blob Data Contributor"). For the data engineer, this means shifting focus from managing keytab files to managing Service Principals and System-Assigned Identities.4
The following table articulates the direct transition path for the tools mentioned in your background.
| Legacy/Hadoop Tool | Azure Native Equivalent | Azure "Evolutionary" Equivalent | Rationale & Architectural Shift |
|---|---|---|---|
| MySQL / RDBMS | Azure SQL Database | Azure Cosmos DB | Azure SQL is the PaaS version of SQL Server/MySQL. Cosmos DB offers global distribution for NoSQL workloads. |
| HDFS | Azure Blob Storage | ADLS Gen2 | ADLS Gen2 adds a hierarchical namespace (folders) to object storage, enabling atomic operations essential for Big Data. |
| Hive / Impala | Synapse Dedicated Pool | Synapse Serverless SQL | Serverless SQL allows querying data directly in the lake (Parquet/Delta) using T-SQL without provisioning a warehouse. |
| PySpark (YARN) | HDInsight | Azure Databricks | Databricks is the optimized, managed Spark platform. HDInsight is legacy (Hortonworks on Azure). |
| Oozie / Airflow | Azure Data Factory (ADF) | ADF + Managed Airflow | ADF is the native orchestrator. It uses JSON-based pipelines rather than Python/XML, though it can trigger Airflow DAGs. |
| ClickHouse | Synapse Dedicated Pool | Azure Data Explorer (Kusto) | For real-time, high-ingest telemetry, Data Explorer is the closest to ClickHouse. For warehousing, Synapse is standard. |
| Tableau/Looker | Power BI | Power BI (DirectQuery) | Power BI is deeply integrated, offering cost advantages and native connectivity to Synapse/ADLS. |
Before commencing the project, it is critical to understand the deep technical nuances of the specific services we will employ: Azure Data Factory, Azure Databricks, and Azure Synapse Analytics.
You are accustomed to defining dependencies in code (Airflow DAGs) or XML (Oozie). Azure Data Factory (ADF) introduces a visual, metadata-driven paradigm.
Airflow often requires the installation of libraries (pip install boto3, pip install pandas) on worker nodes to interact with external systems. ADF inverts this. It uses a concept called the Integration Runtime (IR).
- Azure IR: A fully serverless compute infrastructure managed by Microsoft. It automatically scales to move data between cloud endpoints (e.g., HTTP source to ADLS).
- Self-Hosted IR: A software agent installed on a private VM or on-premise server. This is the bridge that allows the cloud to reach into a private VNet or an on-premise MySQL database without opening corporate firewalls.
For your "One Day" project, the Azure IR will be sufficient, utilizing its massive bandwidth to ingest data from public HTTP endpoints.5
ADF separates orchestration (Control Flow) from transformation (Data Flow).
- Control Flow: Handles the logic—
If Condition,For Each,Wait,Web Hook. This is analogous to Airflow Operators. - Mapping Data Flows: A visual, code-free interface for data transformation (Joins, Aggregates, Pivots) that ADF compiles into Spark jobs running on backend clusters. While powerful, for a PySpark veteran, these can feel restrictive. Therefore, our project will use ADF strictly for Orchestration and Ingestion (Copy Activity), while delegating Transformation to Databricks.7
Your experience with PySpark is your greatest asset. However, running Spark on Databricks differs from running it on bare-metal YARN.
Databricks uses a proprietary execution engine called Photon, written in C++, which vectorizes query execution. This provides significant performance gains over standard open-source Spark.
- Pricing Model: Databricks charges in Databricks Units (DBUs), a normalized unit of processing power per hour.
- The "Free" Nuance: The Azure Databricks "Trial" (Premium tier) gives you free DBUs for 14 days. However, the Virtual Machines (VMs) that the software runs on are provisioned in your Azure subscription and will cost money (or consume your Azure Free Credit). This distinction is vital to avoid unexpected bills.9
Traditionally, Hive Metastore (HMS) managed table metadata. Databricks introduces Unity Catalog, a centralized governance layer that manages permissions, lineage, and discovery across all workspaces. For our project, we will use the local Hive Metastore compatibility mode for simplicity, but acknowledge Unity Catalog as the enterprise standard you will discuss in interviews.11
The "Lakehouse" paradigm is the convergence of the Data Warehouse (structured, ACID transactions, SQL) and the Data Lake (cheap, unstructured, massive scale).
Standard Parquet files (which you likely used with Hive) are immutable. Updating a row requires rewriting the file. Delta Lake, an open-source storage layer that Databricks defaults to, brings ACID transactions to Parquet. It creates a _delta_log folder that tracks changes (JSON transaction logs), enabling features like "Time Travel" (querying data as it existed yesterday) and MERGE statements (Upserts). This solves the "Lambda Architecture" complexity where batch and speed layers had to be maintained separately.2
Your background involves ClickHouse or Hive for querying. Synapse Serverless is a query-as-a-service engine. It does not store data. Instead, it spins up stateless compute nodes that read Delta/Parquet files from ADLS Gen2, process the query, and return the result.
- Cost Implication: You are charged ~$5 USD per Terabyte of data scanned. This incentivizes efficient data engineering—partitioning your data (e.g., by Date) becomes a direct cost-saving measure, not just a performance tuner.12
To simulate a complete enterprise environment within a single day and zero cost, we must be strategic. We will build a pipeline that ingests raw data from a public source, refines it using the Medallion Architecture (Bronze/Silver/Gold), and serves it to a BI dashboard.
- Source: Public GitHub Repository (Simulating a REST API vendor).
- Dataset: NYC Taxi Data or AdventureWorks (CSV/JSON).
- Ingestion (ADF):
- Pipeline triggers an HTTP Connector to fetch data.
- Lands raw files in ADLS Gen2 (
bronzecontainer).
- Processing (Databricks):
- Notebook reads from
bronze. - Cleans and converts to Delta format.
- Writes to
silvercontainer. - Aggregates business KPIs and writes to
goldcontainer.
- Notebook reads from
- Serving (Synapse):
- Serverless SQL View created over the
goldDelta tables.
- Serverless SQL View created over the
- Visualization (Power BI):
- Connects to Synapse via DirectQuery to visualize the KPIs.
Azure's pricing can be opaque. Here is the strategy to remain within the $200 free credit:
- Region: Use
East USorEast US 2. These regions often have the lowest rates and widest service availability. - Databricks: Use the 14-Day Premium Trial to waive DBU costs. Use
Standard_DS3_v2instances (4 core, 14GB RAM) for the cluster, and set Auto-termination to 15 minutes. This ensures you only pay for the VM when actively running the job. - Synapse: Do not create a Dedicated SQL Pool (formerly SQL DW). The Serverless pool is auto-provisioned and charges only for queries run.
- Storage: Use LRS (Locally Redundant Storage) rather than GRS/RA-GRS.
- Clean Up: At the end of the day, delete the entire Resource Group.
This section is the operational manual. It assumes you have created a Microsoft account and activated the Azure Free Account ($200 credit).
We begin by laying the bedrock: the Resource Group and Storage Account.
Step 1.1: The Resource Group
- Navigate to the Azure Portal (portal.azure.com).
- Search for Resource groups.
- Click Create.
- Name:
rg-data-migration-lab-001 - Region:
East US
- Name:
- Insight: In a real job, this would be defined in Terraform or Bicep. The Resource Group is the lifecycle boundary; deleting it deletes everything inside.
Step 1.2: The Data Lake (ADLS Gen2)
- Search for Storage accounts -> Create.
- Resource Group:
rg-data-migration-lab-001 - Name:
sadatalab[yourname]001(must be lowercase and globally unique). - Region:
East US. - Performance:
Standard. - Redundancy:
Locally-redundant storage (LRS). - Advanced Tab (CRITICAL): Check the box Enable hierarchical namespace. This is the switch that turns a Blob store into a Data Lake.
- Click Review + create.
Step 1.3: Container Architecture
- Once deployed, go to the Storage Account resource.
- Navigate to Data storage -> Containers.
- Create three containers:
bronze(Raw landing zone)silver(Cleaned/Enriched data)gold(Aggregated/Business-ready data)
- Analysis: This structure allows for granular security policies. In production, the
goldcontainer might be read-only for analysts, whilebronzeis restricted to the ETL service principal.
Step 1.4: Azure Data Factory (ADF)
- Search for Data factories -> Create.
- Name:
adf-migration-lab-[yourname]. - Version:
V2. - Git Configuration: Select "Configure Git later". (We will skip CI/CD for this one-day sprint).
- Click Create.
Step 1.5: Azure Databricks
- Search for Azure Databricks -> Create.
- Name:
dbx-migration-lab-001. - Pricing Tier: Select Premium (14-Days Free DBUs). Do not select Standard, as it lacks Role-Based Access Control features we may want to explore.
- Click Create.
Step 1.6: Azure Synapse Analytics
- Search for Azure Synapse Analytics -> Create.
- Resource Group:
rg-data-migration-lab-001. - Workspace Name:
syn-migration-lab-[yourname]. - Data Lake Storage Gen2:
- Account name: Select
sadatalab[yourname]001. - File system name: Select
usersor create new.
- Account name: Select
- Security: Allow the workspace Managed Identity to access the Data Lake.
- Click Create.
We will ingest a CSV file from a public GitHub repository. This mimics an external API integration. We will use the NYC Taxi Trip Data (Yellow Taxi) sample often hosted in public repos or Azure Open Datasets.
Step 2.1: The Linked Services
- Open the ADF resource and click Launch Studio.
- Go to the Manage icon (toolbox on the left) -> Linked Services.
- Source (HTTP):
- Click New -> HTTP.
- Base URL:
https://raw.githubusercontent.com/Sivaprasad-V/Adventure-Works-Azure-Data-Engineering-Project/main/Data/(Or any raw git URL containing CSV data). - Authentication:
Anonymous. - Name it
ls_http_github.
- Sink (ADLS Gen2):
- Click New -> Azure Data Lake Storage Gen2.
- Authentication method:
Managed Identity. - Subscription: Select yours.
- Storage account name:
sadatalab[yourname]001. - Troubleshooting: If connection fails, ensure the ADF Managed Identity has the Storage Blob Data Contributor role on the storage account (Access Control IAM tab in Storage Account).
- Name it
ls_adls_main.
Step 2.2: The Datasets
- Go to the Author icon (pencil) -> Datasets.
- Source Dataset:
- New -> HTTP -> CSV.
- Linked Service:
ls_http_github. - Relative URL:
Sales.csv(or specific file path). - "First row as header": Checked.
- Sink Dataset:
- New -> ADLS Gen2 -> CSV (or Binary).
- Linked Service:
ls_adls_main. - File path:
bronze/sales_raw/sales.csv.
Step 2.3: The Pipeline
- Pipelines -> New pipeline.
- Name:
pl_ingest_sales. - Drag the Copy Data activity onto the canvas.
- Source Tab: Select the HTTP source dataset.
- Sink Tab: Select the ADLS sink dataset.
- Debug: Click the Debug button at the top.
- Watch the "Output" tab. You will see the status transition from Queued -> In Progress -> Succeeded.
- Deep Dive: Click the "eyeglass" icon on the success row. It shows the Data Transfer Details: throughput, number of files, and data read/written. This is the equivalent of analyzing the MapReduce counters in Hadoop.
- Verification: Go to your Storage Account in the portal (or use Azure Storage Explorer). Open the
bronzecontainer. Verifysales.csvexists.
Now we enter the Spark domain. We will mount the storage and perform transformations.
Step 3.1: Cluster Setup
- Open the Databricks workspace.
- Compute -> Create compute.
- Policy:
Unrestricted. - Databricks Runtime Version:
12.2 LTS(Spark 3.3.2, Scala 2.12). Avoid the absolute newest unless necessary; LTS is stable. - Node Type:
Standard_DS3_v2. - Workers: Fixed (1 worker is enough for this data).
- Terminate after:
15minutes.
Step 3.2: Mounting ADLS (The Secure Way)
In a real interview, you will be asked how to connect securely. Do not say "Access Keys." The answer is Service Principals.
-
Create Service Principal (SPN):
- In Azure Portal, search Microsoft Entra ID.
- App registrations -> New registration. Name:
spn-databricks-lab. - Copy the Application (client) ID and Directory (tenant) ID.
- Certificates & secrets -> New client secret. Copy the Value immediately (you won't see it again).
-
Grant Permissions:
- Go to your Storage Account (
sadatalab...). - Access Control (IAM) -> Add role assignment.
- Role: Storage Blob Data Contributor.
- Assign to:
spn-databricks-lab.
- Go to your Storage Account (
-
The Notebook Code:
Create a new Notebook in Databricks and run this initialization cell:
# Configuration for Service Principal
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<YOUR_CLIENT_ID>",
"fs.azure.account.oauth2.client.secret": "<YOUR_CLIENT_SECRET>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<YOUR_TENANT_ID>/oauth2/token"
}
# Mount the containers
def mount_container(container_name):
mount_point = f"/mnt/{container_name}"
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
dbutils.fs.mount(
source = f"abfss://{container_name}@<YOUR_STORAGE_ACCOUNT>.dfs.core.windows.net/",
mount_point = mount_point,
extra_configs = configs
)
print(f"Mounted {container_name}")
else:
print(f"{container_name} already mounted")
mount_container("bronze")
mount_container("silver")
mount_container("gold")Step 3.3: The Bronze to Silver Transformation
We will read the raw CSV, clean types, and write to Delta.
from pyspark.sql.functions import col, current_timestamp, to_date
# Read Bronze
df_raw = spark.read.csv("/mnt/bronze/sales_raw/sales.csv", header=True, inferSchema=True)
# Transform
df_silver = df_raw \
.withColumnRenamed("OrderDate", "order_date") \
.withColumnRenamed("SalesAmount", "sales_amount") \
.withColumn("ingestion_date", current_timestamp()) \
.filter(col("sales_amount").isNotNull())
# Write to Silver (Delta Format)
df_silver.write.format("delta").mode("overwrite").save("/mnt/silver/sales_clean")Step 3.4: The Silver to Gold Aggregation
Create business-level aggregates.
# Aggregate
df_gold = df_silver.groupBy("ProductCategory", "order_date") \
.sum("sales_amount") \
.withColumnRenamed("sum(sales_amount)", "total_sales")
# Write to Gold
df_gold.write.format("delta").mode("overwrite").save("/mnt/gold/sales_by_category")Insight: Unlike Hive where you might run MSCK REPAIR TABLE, Delta handles the file management. The _delta_log folder now contains the transaction history.
Step 3.5: Orchestrating the Notebook from ADF
- Go back to ADF Studio.
- Manage -> Linked Services -> New -> Azure Databricks.
- Select your workspace.
- Authentication: You need a token. In Databricks Workspace -> Top Right User Icon -> User Settings -> Developer -> Generate New Token. Paste this into ADF.
- Select "Existing Interactive Cluster" and pick the cluster you created.
- Go to your Pipeline (
pl_ingest_sales). - Drag a Databricks Notebook activity onto the canvas.
- Connect the Success (Green) output of the Copy Activity to the Notebook Activity.
- Settings: Browse and select the notebook you just wrote.
- Debug Run: Execute the pipeline. ADF will first copy the file, then trigger the Spark job.
The data is in the lake. Now we make it queryable for BI tools.
Step 4.1: The Serverless View
- Open Synapse Studio.
- Develop (Paper icon) -> SQL Scripts -> New SQL Script.
- Ensure "Connect to" is set to Built-in (This is the Serverless pool).
- Run the following T-SQL to create a database and view:
SQL
CREATE DATABASE DiscoveryDB;
GO
USE DiscoveryDB;
GO
CREATE OR ALTER VIEW v_Sales_Gold AS
SELECT
*
FROM
OPENROWSET(
BULK 'https://<YOUR_STORAGE_ACCOUNT>.dfs.core.windows.net/gold/sales_by_category/',
FORMAT = 'DELTA'
) AS [result];Architectural Note: The OPENROWSET function with FORMAT = 'DELTA' is a critical capability. Synapse reads the Delta transaction log to identify the valid Parquet files, ensuring it reads a consistent snapshot. This enables the "Lakehouse" pattern where Spark writes and SQL reads simultaneously without locking issues.14
Step 5.1: Connection
- Open Power BI Desktop.
- Get Data -> Azure Synapse Analytics SQL.
- Server: Go to Synapse Workspace Overview in Azure Portal. Copy the Serverless SQL endpoint (e.g.,
syn-lab-ondemand.sql.azuresynapse.net). - Database:
DiscoveryDB. - Data Connectivity mode: DirectQuery.
- Why DirectQuery? Import mode copies data into Power BI's internal VertiPaq engine. DirectQuery leaves the data in the Lake and pushes T-SQL to Synapse every time you click a visual. This demonstrates true "Big Data" capability where the dataset might be too large for your laptop's RAM.
- Authentication: Sign in with your Microsoft Account (Azure AD).
Step 5.2: The Dashboard
- Select the
v_Sales_Goldview. - Create a Line Chart:
order_dateon X-axis,total_saleson Y-axis. - Create a Slicer:
ProductCategory. - Interact with the slicer. Notice the spinning wheel? That is Power BI sending a T-SQL query to Synapse, which is scanning the Delta files in ADLS Gen2, and returning the result.
Completing the lab is step one. Understanding why it works and how to manage it differentiates a Junior from a Senior Engineer.
The most common interview question: "How do you optimize costs in Azure?"
| Strategy | Description | Application in our Project |
|---|---|---|
| Reserved Instances | Pre-paying for 1-3 years of compute for a discount (up to 72%). | Applicable to the Databricks VMs if this were a long-running production job. |
| Spot Instances | Using unused Azure capacity for up to 90% discount, with risk of eviction. | Perfect for the Databricks job clusters in our pipeline, as the job can be retried if evicted. |
| Partition Pruning | Organizing data in folders like Year=2023/Month=01. |
In Synapse Serverless, querying WHERE Year=2023 scans only that folder, reducing the cost (charged per TB scanned). |
| Auto-Termination | Shutting down clusters when idle. | We set this to 15 minutes in Databricks to save credits. |
You are used to Jenkins or Airflow DAG deployments. In Azure, Azure DevOps or GitHub Actions is the standard.
- Infrastructure as Code (IaC): While we used the Portal, in production you would use Terraform or Bicep to deploy the Resource Group and ADF.
- ADF Deployment: ADF has a unique "ARM Template" publish mechanism. The
publishbranch contains the JSON definition of your pipeline. The CI/CD pipeline deploys this JSON to the Test/Prod Data Factory.
- "Access Denied" on ADLS:
- Symptom: ADF or Databricks fails to read/write.
- Fix: Checks ACLs. Even if you are "Owner" of the subscription, you might not have data plane access. You must add yourself or the Service Principal to the Storage Blob Data Contributor role. "Contributor" alone is not enough (it controls the resource, not the data).
- Databricks Cluster Won't Start:
- Symptom: "Quota Exceeded".
- Fix: Your free trial has a vCPU core limit (usually 4 or 8 cores). Ensure you aren't trying to spin up a huge cluster. Stick to
Standard_DS3_v2(4 cores).
- Power BI Credential Error:
- Symptom: Cannot connect to Synapse.
- Fix: Ensure your client IP address is allowed in the Synapse Networking firewall settings, or allow "Allow Azure services" if connecting from Power BI Service.
Your experience with ClickHouse implies a familiarity with high-performance, real-time analytics. How does Azure compare?
Azure Data Explorer (ADX) vs. ClickHouse:
- ADX (Kusto): This is Azure's native equivalent to ClickHouse. It is a log-analytics and telemetry engine optimized for append-heavy, real-time streaming data. It uses the Kusto Query Language (KQL), which is highly expressive for time-series data.
- Synapse vs. ClickHouse: Synapse Dedicated Pools use columnar storage similar to ClickHouse (MergeTree), but are optimized for complex joins and ANSI SQL compliance rather than pure ingestion speed of logs.
- Fabric Real-Time Analytics: Microsoft Fabric (the new evolution) includes a "Real-Time Intelligence" workload that is essentially ADX/Kusto integrated with the Lakehouse, allowing you to run ClickHouse-style workloads on the same data.
This report has outlined a path from the Hadoop ecosystem to the Azure Data Platform. By executing the "One Day" project, you have touched the core pillars of modern cloud data engineering:
- Ingestion: Azure Data Factory (replacing Oozie/Sqoop).
- Storage: ADLS Gen2 (replacing HDFS).
- Transformation: Databricks/Spark (replacing YARN/PySpark).
- Serving: Synapse Serverless (replacing Hive/Impala).
- Visualization: Power BI (replacing Tableau/Looker).
The resulting architecture—a Delta Lakehouse—is the current industry standard. It combines the low cost of object storage with the transactional integrity of databases and the processing power of Spark. For a hiring manager, presenting this project demonstrates not just familiarity with the tool names, but a competent understanding of how they integrate to solve business problems securely and cost-effectively.
- Be ready to explain: "Why did you choose Synapse Serverless over Dedicated?" (Answer: Cost efficiency for ad-hoc queries on the lake).
- Be ready to explain: "How did you handle security?" (Answer: Managed Identities and Service Principals with RBAC, avoiding access keys).
- Be ready to explain: "How does this scale?" (Answer: ADLS separates storage from compute; Databricks scales workers automatically; Serverless SQL scales per query).
November 27, 2025