Production-ready Big Data analytics platform optimized for 8GB RAM VPS with Azure Data Lake Storage Gen2. Features Hadoop HDFS 3.3.6, Spark 3.5.0, Iceberg 1.4.2, Hive 4.0.0, Airflow 2.8.0, Superset 3.1.0, Prometheus, and Grafana.

🚀 InsightEra Big Data Cluster

A production-ready Big Data analytics platform powered by Docker Compose, optimized for 8GB RAM VPS with Azure Data Lake Storage integration.

Features

| Category | Technology | Version | Purpose |
|---|---|---|---|
| Storage | Apache Hadoop HDFS | 3.3.6 | Distributed file storage |
| Storage | Azure Data Lake Gen2 | - | Cloud object storage (primary) |
| Storage | MinIO | latest | S3-compatible storage (optional) |
| Processing | Apache Spark | 3.5.0 | Data processing engine |
| Table Format | Apache Iceberg | 1.4.2 | ACID transactions, time travel |
| Metastore | Apache Hive | 4.0.0 | Schema management |
| Orchestration | Apache Airflow | 2.8.0 | Workflow orchestration |
| Visualization | Apache Superset | 3.1.0 | BI dashboards |
| Monitoring | Prometheus | v2.48.0 | Metrics collection |
| Monitoring | Grafana | 10.2.2 | Dashboards & alerting |
| Database | PostgreSQL | 15 | Metadata storage |
| Cache | Redis | 7 | Message broker |

📊 System Requirements

Minimum (8GB RAM VPS - Optimized)

| Resource | Requirement | Notes |
|---|---|---|
| RAM | 8 GB | Uses profile-based deployment |
| CPU | 2 cores | Single worker mode |
| Storage | 50 GB | Metadata only, data on Azure |
| Network | 100 Mbps | For Azure connectivity |

Recommended (Production)

| Resource | Requirement | Notes |
|---|---|---|
| RAM | 32 GB | Full cluster with monitoring |
| CPU | 8+ cores | Multiple workers |
| Storage | 500 GB | Local + Azure hybrid |
| Network | 1 Gbps | High-throughput data |

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     INSIGHTERA BIG DATA PLATFORM                                 β”‚
β”‚                   (Optimized for 8GB RAM / Azure Integration)                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                  β”‚
β”‚  PRESENTATION LAYER (Optional - use profiles to enable)                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
β”‚  β”‚   Apache Superset   β”‚              β”‚       Grafana       β”‚                   β”‚
β”‚  β”‚   (--profile viz)   β”‚              β”‚ (--profile monitor) β”‚                   β”‚
β”‚  β”‚     Port: 8088      β”‚              β”‚     Port: 3000      β”‚                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
β”‚             β”‚                                    β”‚                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                  β”‚
β”‚  ORCHESTRATION LAYER                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                         Apache Airflow 2.8.0                             β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚    β”‚
β”‚  β”‚  β”‚Webserver β”‚   β”‚Scheduler β”‚   β”‚  Worker  β”‚   Memory: 512MB each        β”‚    β”‚
β”‚  β”‚  β”‚  :8082   β”‚   β”‚          β”‚   β”‚ (Celery) β”‚                             β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                        β”‚                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                  β”‚
β”‚  PROCESSING LAYER                                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                    Apache Spark 3.5.0 + Iceberg 1.4.2                    β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚    β”‚
β”‚  β”‚  β”‚ Spark Master β”‚        β”‚Spark Worker 1β”‚    Memory: 512MB / 1.5GB      β”‚    β”‚
β”‚  β”‚  β”‚    :8080     β”‚        β”‚              β”‚    Worker 2: --profile scale  β”‚    β”‚
β”‚  β”‚  β”‚   (512MB)    β”‚        β”‚   (1.5GB)    β”‚                               β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚    β”‚
β”‚  β”‚                                                                          β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚    β”‚
β”‚  β”‚  β”‚               Iceberg REST Catalog (256MB)                        β”‚   β”‚    β”‚
β”‚  β”‚  β”‚                        Port: 8181                                 β”‚   β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                        β”‚                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                  β”‚
β”‚  METADATA LAYER                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚   Hive Metastore 4.0.0  β”‚        β”‚       PostgreSQL 15 (512MB)         β”‚     β”‚
β”‚  β”‚        :9083            │◄───────│   Stores: Airflow, Hive, Superset   β”‚     β”‚
β”‚  β”‚       (512MB)           β”‚        β”‚           Port: 5432                β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                  β”‚
β”‚  STORAGE LAYER                                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚                    ☁️  AZURE DATA LAKE STORAGE Gen2 (PRIMARY)              β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚
β”‚  β”‚  β”‚  Container: warehouse    β”‚  Iceberg tables, processed data          β”‚  β”‚  β”‚
β”‚  β”‚  β”‚  Container: raw-data     β”‚  Raw ingestion files                     β”‚  β”‚  β”‚
β”‚  β”‚  β”‚  Container: spark-logs   β”‚  Application logs                        β”‚  β”‚  β”‚
β”‚  β”‚  β”‚  Protocol: abfss://      β”‚  Azure Blob File System Secure           β”‚  β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                      LOCAL STORAGE (VPS 50GB)                             β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  Hadoop    β”‚  β”‚   Redis    β”‚  β”‚        Temp/Cache Data           β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  NameNode  β”‚  β”‚  (256MB)   β”‚  β”‚     (Spark shuffle, logs)        β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  (512MB)   β”‚  β”‚            β”‚  β”‚                                  β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              Network: insightera-network (172.28.0.0/16)

📚 Services Documentation

Core Services (Default Profile ~5.5GB RAM)

PostgreSQL 15

| Property | Value |
|---|---|
| Container | insightera-postgres |
| Port | 5432 |
| Memory Limit | 512MB |
| Purpose | Metadata store for Airflow, Hive, Superset |
| Databases | airflow, hive_metastore, superset |

Redis 7

| Property | Value |
|---|---|
| Container | insightera-redis |
| Port | 6379 |
| Memory Limit | 256MB |
| Purpose | Celery message broker for Airflow |
| Config | maxmemory 200mb, LRU eviction |

Hadoop HDFS 3.3.6

| Component | Container | Port | Memory |
|---|---|---|---|
| NameNode | insightera-namenode | 9870 (UI), 8020 (RPC) | 512MB |
| DataNode 1 | insightera-datanode-1 | 9864 | 512MB |
| DataNode 2 | insightera-datanode-2 | 9864 | 512MB (profile: scale) |

Heap Settings: HADOOP_HEAPSIZE_MIN=256m, HADOOP_HEAPSIZE_MAX=384m

Apache Spark 3.5.0

| Component | Container | Port | Memory |
|---|---|---|---|
| Master | insightera-spark-master | 8080 (UI), 7077 (RPC) | 512MB |
| Worker 1 | insightera-spark-worker-1 | - | 1536MB |
| Worker 2 | insightera-spark-worker-2 | - | 1536MB (profile: scale) |

Executor Config (Optimized for 8GB):

spark.driver.memory=512m
spark.executor.memory=768m
spark.executor.cores=1
spark.sql.shuffle.partitions=50
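As a sanity check on these numbers: Spark reserves off-heap overhead on top of the JVM heap, defaulting to max(384MB, 10% of `spark.executor.memory`), so a 768m executor occupies roughly 1152MB of its container. A back-of-envelope sketch:

```python
# Back-of-envelope check that a 768m executor plus Spark's default off-heap
# overhead (max of 384MB or 10% of executor memory) fits the 1536MB worker.
def executor_footprint_mb(executor_mem_mb: int,
                          overhead_factor: float = 0.10,
                          overhead_min_mb: int = 384) -> int:
    overhead = max(overhead_min_mb, int(executor_mem_mb * overhead_factor))
    return executor_mem_mb + overhead

print(executor_footprint_mb(768))  # 1152 -> fits inside the 1536MB limit
```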

Apache Iceberg 1.4.2

| Property | Value |
|---|---|
| Container | insightera-iceberg-rest |
| Port | 8181 |
| Memory Limit | 256MB |
| Catalog Type | REST API |
| Storage | Azure Data Lake Gen2 / HDFS |

Catalogs Available:

  • iceberg - REST catalog with Azure ADLS (default)
  • azure_catalog - Direct Azure Hadoop catalog
  • hive_catalog - Hive metastore catalog

Apache Hive 4.0.0

| Component | Container | Port | Memory |
|---|---|---|---|
| Metastore | insightera-hive-metastore | 9083 | 512MB |
| HiveServer2 | insightera-hive-server | 10000, 10002 | 512MB (profile: hive) |

Apache Airflow 2.8.0

| Component | Container | Port | Memory |
|---|---|---|---|
| Webserver | insightera-airflow-webserver | 8082 | 512MB |
| Scheduler | insightera-airflow-scheduler | - | 512MB |
| Worker | insightera-airflow-worker | - | 512MB |
| Triggerer | insightera-airflow-triggerer | - | 512MB (profile: airflow) |
| Flower | insightera-flower | 5555 | - (profile: flower) |

Executor: CeleryExecutor with Redis broker

Optional Services (Use Profiles)

Apache Superset 3.1.0 (--profile viz)

| Property | Value |
|---|---|
| Container | insightera-superset |
| Port | 8088 |
| Memory Limit | 512MB |
| Default Login | admin / admin |

Prometheus v2.48.0 (--profile monitoring)

| Property | Value |
|---|---|
| Container | insightera-prometheus |
| Port | 9090 |
| Memory Limit | 256MB |
| Retention | 15 days |

Grafana 10.2.2 (--profile monitoring)

| Property | Value |
|---|---|
| Container | insightera-grafana |
| Port | 3000 |
| Memory Limit | 256MB |
| Default Login | admin / admin |
| Pre-installed Plugins | clock-panel, simple-json-datasource |

MinIO (--profile minio)

| Property | Value |
|---|---|
| Container | insightera-minio |
| Ports | 9000 (API), 9001 (Console) |
| Memory Limit | 512MB |
| Purpose | S3-compatible storage (alternative to Azure) |

🚀 Quick Start

1. Clone and Configure

cd /path/to/insightera-cluster

# Copy environment template
cp .env.example .env

# Edit with your Azure credentials
vim .env

2. Configure Azure Data Lake Storage

Edit .env with your Azure credentials:

# Required for Azure ADLS Gen2
AZURE_STORAGE_ACCOUNT=your_storage_account_name
AZURE_STORAGE_KEY=your_storage_account_key
AZURE_STORAGE_CONTAINER=warehouse

3. Start the Cluster

# Make scripts executable
chmod +x scripts/*.sh

# Start with default profile (optimized for 8GB RAM)
./scripts/start.sh default

# Check health
./scripts/healthcheck.sh

4. Access Services

| Service | URL | Default Credentials |
|---|---|---|
| Hadoop NameNode | http://localhost:9870 | - |
| Spark Master | http://localhost:8080 | - |
| Airflow | http://localhost:8082 | admin / admin |
| Iceberg REST | http://localhost:8181 | - |
| Prometheus | http://localhost:9090 | - |
| Hive Server UI | http://localhost:10002 | - |

πŸŽ›οΈ Profiles & Memory Management

Profile System (Critical for 8GB RAM)

| Profile | RAM Usage | Services Included | Use Case |
|---|---|---|---|
| default | ~5.5GB | Core services | Recommended for 8GB VPS |
| minimal | ~3GB | PostgreSQL, Redis, NameNode, Spark Master | Testing only |
| monitoring | ~6.5GB | Core + Prometheus + Grafana | With monitoring |
| viz | ~6GB | Core + Superset | With visualization |
| minio | ~6GB | Core + MinIO | Use MinIO instead of Azure |
| scale | ~7.5GB | Core + extra workers | More processing power |
| full | ~8GB+ | All services | Requires 16GB+ RAM |

Starting with Profiles

# Recommended for 8GB VPS
./scripts/start.sh default

# Add visualization
./scripts/start.sh viz

# Add monitoring
./scripts/start.sh monitoring

# Use MinIO instead of Azure
./scripts/start.sh minio

# Full cluster (requires 16GB+ RAM)
./scripts/start.sh full
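Under the hood, a profile name maps to `docker compose --profile` flags. A hypothetical sketch of that mapping (the real logic lives in `scripts/start.sh`; the `full` composition below is assumed):

```python
# Hypothetical sketch of scripts/start.sh behavior: translate a profile
# name into a docker compose invocation. Not the repo's actual code.
PROFILE_FLAGS = {
    "default": [],                                     # core services only
    "viz": ["viz"],
    "monitoring": ["monitoring"],
    "minio": ["minio"],
    "scale": ["scale"],
    "full": ["viz", "monitoring", "minio", "scale"],   # assumed composition
}

def compose_command(profile: str) -> list[str]:
    cmd = ["docker", "compose"]
    for p in PROFILE_FLAGS.get(profile, []):
        cmd += ["--profile", p]
    return cmd + ["up", "-d"]

print(" ".join(compose_command("monitoring")))
# docker compose --profile monitoring up -d
```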

Memory Allocation Table (Default Profile)

| Service | Memory Limit | Memory Reserved | CPU Limit |
|---|---|---|---|
| PostgreSQL | 512MB | 256MB | 0.5 |
| Redis | 256MB | 128MB | 0.25 |
| Hadoop NameNode | 512MB | 256MB | 0.5 |
| Hadoop DataNode | 512MB | 256MB | 0.5 |
| Spark Master | 512MB | 256MB | 0.5 |
| Spark Worker | 1536MB | 1024MB | 1.0 |
| Hive Metastore | 512MB | 256MB | 0.5 |
| Iceberg REST | 256MB | 128MB | 0.25 |
| Airflow (each) | 512MB | 256MB | 0.5 |
| **Total** | ~5.5GB | ~3.5GB | - |
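For a quick cross-check, the hard limits (with three 512MB Airflow services) sum to about 6GB of caps; since services rarely peak simultaneously, observed usage sits nearer the ~5.5GB figure. In Python:

```python
# Sum the per-service memory limits from the table (default profile).
# The three Airflow services (webserver, scheduler, worker) get 512MB each.
LIMITS_MB = {
    "postgres": 512, "redis": 256, "namenode": 512, "datanode": 512,
    "spark-master": 512, "spark-worker": 1536, "hive-metastore": 512,
    "iceberg-rest": 256,
    "airflow-webserver": 512, "airflow-scheduler": 512, "airflow-worker": 512,
}

total_gb = sum(LIMITS_MB.values()) / 1024
print(f"Sum of hard limits: {total_gb:.1f} GB")  # 6.0 GB of caps
```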

Monitoring Memory Usage

# Real-time memory monitoring
docker stats --no-stream

# Check which services are running
docker compose ps

# View memory allocation
docker compose config | grep -A 5 "resources:"

☁️ Azure Data Lake Storage Setup

1. Create Azure Storage Account

# Using Azure CLI
az storage account create \
  --name insighterastorage \
  --resource-group your-resource-group \
  --location southeastasia \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true  # Required for ADLS Gen2

2. Create Containers

# Create required containers
az storage container create --name warehouse --account-name insighterastorage
az storage container create --name raw-data --account-name insighterastorage
az storage container create --name spark-logs --account-name insighterastorage

3. Get Access Keys

# Get storage account key
az storage account keys list \
  --account-name insighterastorage \
  --query '[0].value' -o tsv

4. Configure .env

# Azure Data Lake Storage Gen2
AZURE_STORAGE_ACCOUNT=insighterastorage
AZURE_STORAGE_KEY=your_storage_key_here
AZURE_STORAGE_CONTAINER=warehouse
AZURE_CATALOG_WAREHOUSE=abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/
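An illustrative helper (not part of the repo) for building `abfss://` URIs from these `.env` values, so ad-hoc scripts and Spark jobs agree on one path layout:

```python
# Build abfss:// URIs from the account/container names configured in .env.
def abfss_uri(container: str, account: str, path: str = "") -> str:
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

print(abfss_uri("warehouse", "insighterastorage", "iceberg/"))
# abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/
```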

5. Using Azure in Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AzureIcebergExample") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "rest") \
    .config("spark.sql.catalog.iceberg.uri", "http://iceberg-rest:8181") \
    .getOrCreate()

# Write to Azure via Iceberg
df.writeTo("iceberg.warehouse.my_table").create()

# Read from Azure via Iceberg
df = spark.table("iceberg.warehouse.my_table")

# Direct Azure access
df = spark.read.parquet("abfss://raw-data@insighterastorage.dfs.core.windows.net/")

πŸ“ Project Structure

insightera-cluster/
β”œβ”€β”€ docker-compose.yaml          # Main orchestration (memory-optimized)
β”œβ”€β”€ .env                          # Environment configuration
β”œβ”€β”€ .env.example                  # Template with Azure config
β”œβ”€β”€ .github/
β”‚   └── copilot-instructions.md  # AI coding guidelines
β”‚
β”œβ”€β”€ docker/                       # Custom Docker images
β”‚   β”œβ”€β”€ hadoop/                   # Hadoop 3.3.6 with Azure libs
β”‚   β”œβ”€β”€ spark/                    # Spark 3.5.0 + Iceberg 1.4.2
β”‚   β”œβ”€β”€ hive/                     # Hive 4.0.0 Metastore
β”‚   β”œβ”€β”€ superset/                 # Superset 3.1.0
β”‚   └── postgres/                 # Multi-database init
β”‚
β”œβ”€β”€ hadoop/config/                # Hadoop configuration
β”‚   β”œβ”€β”€ core-site.xml
β”‚   β”œβ”€β”€ hdfs-site.xml
β”‚   β”œβ”€β”€ yarn-site.xml
β”‚   └── mapred-site.xml
β”‚
β”œβ”€β”€ spark/                        # Spark configuration & apps
β”‚   β”œβ”€β”€ config/
β”‚   β”‚   β”œβ”€β”€ spark-defaults.conf   # Optimized for 8GB + Azure
β”‚   β”‚   └── spark-env.sh
β”‚   └── apps/                     # Your Spark applications
β”‚
β”œβ”€β”€ hive/config/                  # Hive configuration
β”‚   β”œβ”€β”€ hive-site.xml
β”‚   └── hive-env.sh
β”‚
β”œβ”€β”€ airflow/                      # Airflow DAGs & plugins
β”‚   β”œβ”€β”€ dags/
β”‚   β”‚   β”œβ”€β”€ spark_iceberg_etl_example.py
β”‚   β”‚   └── data_quality_checks.py
β”‚   β”œβ”€β”€ plugins/
β”‚   └── logs/
β”‚
β”œβ”€β”€ prometheus/                   # Prometheus configuration
β”‚   β”œβ”€β”€ prometheus.yml
β”‚   └── alerts/alerts.yml
β”‚
β”œβ”€β”€ grafana/                      # Grafana dashboards
β”‚   β”œβ”€β”€ provisioning/
β”‚   β”‚   β”œβ”€β”€ dashboards/
β”‚   β”‚   └── datasources/
β”‚   └── dashboards/
β”‚       └── cluster-overview.json
β”‚
└── scripts/                      # Utility scripts
    β”œβ”€β”€ start.sh                  # Profile-based startup
    β”œβ”€β”€ stop.sh
    β”œβ”€β”€ healthcheck.sh
    β”œβ”€β”€ deploy.sh
    └── backup.sh

πŸ› οΈ Usage Examples

Spark SQL with Iceberg (Azure Backend)

# Connect to Spark SQL
docker exec -it insightera-spark-master spark-sql

-- Create namespace and table
CREATE NAMESPACE IF NOT EXISTS iceberg.warehouse;

CREATE TABLE iceberg.warehouse.events (
    id STRING,
    event_time TIMESTAMP,
    event_type STRING,
    user_id STRING,
    data STRING
)
USING iceberg
PARTITIONED BY (days(event_time), event_type);

-- Insert data
INSERT INTO iceberg.warehouse.events VALUES
    ('1', current_timestamp(), 'click', 'user_001', '{"page": "/home"}'),
    ('2', current_timestamp(), 'view', 'user_002', '{"page": "/product"}');

-- Query with time travel (use a snapshot ID from the table's history)
SELECT * FROM iceberg.warehouse.events VERSION AS OF 1;

-- Schema evolution
ALTER TABLE iceberg.warehouse.events ADD COLUMN source STRING;

PySpark with Azure Data Lake

from pyspark.sql import SparkSession

# Session is pre-configured with Azure credentials
spark = SparkSession.builder \
    .appName("AzureDataLakeExample") \
    .getOrCreate()

# Read from Azure Data Lake directly (set `account` to your storage account)
account = "insighterastorage"
df_raw = spark.read.json(
    f"abfss://raw-data@{account}.dfs.core.windows.net/events/"
)

# Write to Iceberg table (stored in Azure)
df_raw.writeTo("iceberg.warehouse.events_processed").create()

# Read from Iceberg with time travel
df_v1 = spark.read \
    .option("snapshot-id", 1234567890) \
    .table("iceberg.warehouse.events_processed")

# Incremental read
df_changes = spark.read \
    .option("start-snapshot-id", 123) \
    .option("end-snapshot-id", 456) \
    .table("iceberg.warehouse.events_processed")

Airflow DAG Example

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id='azure_etl_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',  # 'schedule_interval' is deprecated since Airflow 2.4
    catchup=False,
) as dag:

    # Check Spark availability
    check_spark = BashOperator(
        task_id='check_spark',
        bash_command='curl -s http://spark-master:8080 > /dev/null'
    )

    # Run Spark job with Azure
    etl_job = SparkSubmitOperator(
        task_id='run_etl',
        application='/opt/spark/apps/etl_job.py',
        conn_id='spark_default',
        conf={
            'spark.sql.catalog.iceberg.uri': 'http://iceberg-rest:8181',
            'spark.driver.memory': '512m',
            'spark.executor.memory': '768m',
        }
    )

    check_spark >> etl_job

Hive Queries (Optional - use profile 'hive')

# Start HiveServer2 first
docker compose --profile hive up -d hive-server

# Connect to HiveServer2
docker exec -it insightera-hive-server beeline \
    -u "jdbc:hive2://localhost:10000"

-- Query Iceberg tables via Hive
SHOW DATABASES;
USE iceberg;
SHOW TABLES;
SELECT * FROM warehouse.events LIMIT 10;

📊 Monitoring

Using Profiles for Monitoring

# Start with monitoring
./scripts/start.sh monitoring

# Access dashboards
# Grafana: http://localhost:3000 (admin/admin)
# Prometheus: http://localhost:9090

Pre-configured Grafana Dashboards

| Dashboard | Description |
|---|---|
| Cluster Overview | CPU, memory, disk usage across all services |
| Container Metrics | Individual container resource consumption |
| Spark Metrics | Job execution, memory, shuffle stats |
| Airflow Status | DAG runs, task durations, failures |

Prometheus Alert Rules

# Pre-configured alerts in prometheus/alerts/alerts.yml
- High CPU Usage (>80% for 5min)
- High Memory Usage (>85%)
- Disk Space Warning (<20% free)
- Container Down
- HDFS Capacity Warning (<10% free)
- Spark Worker Unavailable

Memory Monitoring Commands

# Real-time memory stats
docker stats

# Memory usage summary
docker stats --no-stream --format \
    "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Check for OOM events
dmesg | grep -i "out of memory"

# Container resource limits
docker inspect insightera-spark-worker-1 | \
    jq '.[0].HostConfig.Memory'
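When scripting a cron-based memory alert, the tab-separated `docker stats --format` output above is easy to parse. A tiny helper (the sample line is illustrative):

```python
# Parse one line of `docker stats --no-stream --format
# "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"` into (name, percent).
def parse_stats_line(line: str) -> tuple[str, float]:
    name, _usage, perc = line.split("\t")
    return name, float(perc.rstrip("%"))

sample = "insightera-spark-worker-1\t1.2GiB / 1.5GiB\t80.00%"
print(parse_stats_line(sample))  # ('insightera-spark-worker-1', 80.0)
```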

🔧 Configuration Reference

Environment Variables

| Variable | Default | Description |
|---|---|---|
| AZURE_STORAGE_ACCOUNT | - | Azure Storage account name |
| AZURE_STORAGE_KEY | - | Azure Storage access key |
| AZURE_STORAGE_CONTAINER | warehouse | Default container name |
| SPARK_WORKER_CORES | 1 | Cores per Spark worker |
| SPARK_WORKER_MEMORY | 1g | Memory per Spark worker |
| SPARK_EXECUTOR_MEMORY | 768m | Executor memory (8GB-optimized) |
| POSTGRES_USER | insightera | PostgreSQL username |
| POSTGRES_PASSWORD | - | PostgreSQL password |
| AIRFLOW__CORE__FERNET_KEY | - | Airflow encryption key |
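A fail-fast check for the variables that have no default can save a confusing startup failure. An illustrative helper (not part of the repo):

```python
# Report which required .env variables are unset or empty.
import os

REQUIRED = ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY", "POSTGRES_PASSWORD")

def missing_vars(env=None):
    env = os.environ if env is None else env
    return [v for v in REQUIRED if not env.get(v)]

print(missing_vars({"AZURE_STORAGE_ACCOUNT": "insighterastorage"}))
# ['AZURE_STORAGE_KEY', 'POSTGRES_PASSWORD']
```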

Scaling Resources

# Scale Spark workers (requires --profile scale for worker-2)
docker compose --profile scale up -d

# Adjust memory in .env
SPARK_WORKER_MEMORY=2g  # If you have more RAM

# Apply changes
docker compose up -d

🚢 Deployment to Remote Server

Using Deploy Script

# Deploy to your VPS
./scripts/deploy.sh 157.10.252.183 ~/.ssh/your_key your_user

# What it does:
# 1. Syncs project files via rsync
# 2. Installs Docker if needed
# 3. Configures environment
# 4. Builds and starts services

Manual Deployment

# SSH to server
ssh -i ~/.ssh/your_key user@your-server

# Clone or copy files
git clone https://github.com/your-repo/insightera-cluster.git
cd insightera-cluster

# Configure environment
cp .env.example .env
vim .env  # Add Azure credentials

# Start cluster
chmod +x scripts/*.sh
./scripts/start.sh default
./scripts/healthcheck.sh

🔒 Security Checklist

Before production deployment:

  • Change all default passwords in .env
  • Generate new Fernet key for Airflow
  • Generate new secret key for Superset
  • Configure Azure Service Principal (instead of storage key)
  • Enable SSL/TLS for external access
  • Set up firewall rules
  • Configure proper RBAC in Azure
  • Enable audit logging

# Generate Fernet key for Airflow
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

# Generate random secret for Superset
openssl rand -hex 32

πŸ› Troubleshooting

Common Issues

Out of Memory (OOM) on 8GB VPS

# Check memory usage
docker stats --no-stream

# If memory > 7GB, stop optional services
docker compose stop superset grafana prometheus

# Use minimal profile
./scripts/stop.sh
./scripts/start.sh minimal

Services Won't Start

# Check logs for specific service
docker compose logs -f postgres
docker compose logs -f spark-master

# Check container status
docker compose ps

# Restart with clean state
docker compose down
docker compose up -d

Azure Connection Issues

# Verify Azure credentials
docker exec insightera-spark-master env | grep AZURE

# Test Azure connectivity from Spark
docker exec -it insightera-spark-master spark-shell
# In Spark shell:
spark.read.text("abfss://warehouse@youraccount.dfs.core.windows.net/test.txt")

# Common errors:
# - "No FileSystem for scheme: abfss" β†’ Missing Azure libraries
# - "403 Forbidden" β†’ Wrong storage key or SAS token
# - "Account not found" β†’ Wrong account name

HDFS Issues

# Check HDFS health
docker exec insightera-namenode hdfs dfsadmin -report

# Leave safe mode
docker exec insightera-namenode hdfs dfsadmin -safemode leave

# Check disk usage
docker exec insightera-namenode hdfs dfs -df -h

# Fix permission issues
docker exec insightera-namenode hdfs dfs -chmod -R 777 /user

Spark Job Failures

# Check Spark Master logs
docker logs insightera-spark-master

# Check Spark Worker logs
docker logs insightera-spark-worker-1

# Access Spark UI for detailed job info
open http://localhost:8080

# Common memory errors - reduce memory in job:
spark-submit --driver-memory 256m --executor-memory 512m your_job.py

Airflow Issues

# Check Airflow webserver logs
docker logs insightera-airflow-webserver

# Check scheduler logs
docker logs insightera-airflow-scheduler

# Reinitialize the metadata database ('db init' is deprecated since 2.7)
docker exec insightera-airflow-webserver airflow db migrate

# Create (or recreate) the admin user
docker exec insightera-airflow-webserver \
    airflow users create --username admin --password admin \
    --firstname Admin --lastname User --role Admin --email admin@local

Iceberg Catalog Issues

# Check Iceberg REST service
curl http://localhost:8181/v1/config

# Check catalog namespaces
curl http://localhost:8181/v1/namespaces

# Verify Iceberg tables
docker exec -it insightera-spark-master spark-sql -e "SHOW NAMESPACES IN iceberg"
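The REST catalog speaks JSON; per the Iceberg REST catalog spec, `GET /v1/namespaces` returns a `namespaces` array whose entries are lists of name parts. A small parser for scripted checks (the sample payload is illustrative):

```python
import json

# Parse the ListNamespaces response shape from the Iceberg REST spec:
# {"namespaces": [["warehouse"], ["analytics", "daily"]]}
def namespace_names(body: str) -> list[str]:
    return [".".join(parts) for parts in json.loads(body)["namespaces"]]

sample = '{"namespaces": [["warehouse"], ["analytics", "daily"]]}'
print(namespace_names(sample))  # ['warehouse', 'analytics.daily']
```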

Health Check Commands

# Run full health check
./scripts/healthcheck.sh

# Check individual services
curl -s http://localhost:9870  # Hadoop NameNode
curl -s http://localhost:8080  # Spark Master
curl -s http://localhost:8082/health  # Airflow
curl -s http://localhost:8181/v1/config  # Iceberg REST

# Check container health status
docker inspect --format='{{.State.Health.Status}}' insightera-postgres

Performance Tuning for 8GB RAM

# Reduce shuffle partitions
# In spark-defaults.conf or SparkSession:
spark.sql.shuffle.partitions=20  # Instead of default 200

# Enable memory-efficient settings
spark.memory.fraction=0.4
spark.memory.storageFraction=0.3

# Use disk-based shuffle
spark.shuffle.spill=true
spark.shuffle.spill.compress=true

πŸ“ License

This project is licensed under the MIT License.

🤝 Contributing

Contributions are welcome! Please read our contributing guidelines before submitting PRs.


Built with ❤️ for Big Data enthusiasts

Optimized for 8GB RAM VPS with Azure Data Lake Storage integration
