A production-ready Big Data analytics platform powered by Docker Compose, optimized for an 8GB RAM VPS with Azure Data Lake Storage integration.
- Features
- System Requirements
- Architecture
- Services Documentation
- Quick Start
- Profiles & Memory Management
- Azure Data Lake Storage Setup
- Usage Examples
- Monitoring
- Troubleshooting
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Storage | Apache Hadoop HDFS | 3.3.6 | Distributed file storage |
| Storage | Azure Data Lake Gen2 | - | Cloud object storage (primary) |
| Storage | MinIO | latest | S3-compatible storage (optional) |
| Processing | Apache Spark | 3.5.0 | Data processing engine |
| Table Format | Apache Iceberg | 1.4.2 | ACID transactions, time travel |
| Metastore | Apache Hive | 4.0.0 | Schema management |
| Orchestration | Apache Airflow | 2.8.0 | Workflow orchestration |
| Visualization | Apache Superset | 3.1.0 | BI dashboards |
| Monitoring | Prometheus | v2.48.0 | Metrics collection |
| Monitoring | Grafana | 10.2.2 | Dashboards & alerting |
| Database | PostgreSQL | 15 | Metadata storage |
| Cache | Redis | 7 | Message broker |
| Resource | Minimum | Notes |
|---|---|---|
| RAM | 8 GB | Uses profile-based deployment |
| CPU | 2 Cores | Single worker mode |
| Storage | 50 GB | Metadata only, data on Azure |
| Network | 100 Mbps | For Azure connectivity |
| Resource | Recommended | Notes |
|---|---|---|
| RAM | 32 GB | Full cluster with monitoring |
| CPU | 8+ Cores | Multiple workers |
| Storage | 500 GB | Local + Azure hybrid |
| Network | 1 Gbps | High-throughput data |
┌──────────────────────────────────────────────────────────────────────────────┐
│                         INSIGHTERA BIG DATA PLATFORM                         │
│                 (Optimized for 8GB RAM / Azure Integration)                  │
├──────────────────────────────────────────────────────────────────────────────┤
│ PRESENTATION LAYER (optional; enable via profiles)                           │
│   Apache Superset (--profile viz)        :8088                               │
│   Grafana (--profile monitoring)         :3000                               │
├──────────────────────────────────────────────────────────────────────────────┤
│ ORCHESTRATION LAYER: Apache Airflow 2.8.0                                    │
│   Webserver :8082 | Scheduler | Worker (Celery)   512MB each                 │
├──────────────────────────────────────────────────────────────────────────────┤
│ PROCESSING LAYER: Apache Spark 3.5.0 + Iceberg 1.4.2                         │
│   Spark Master :8080 (512MB) | Worker 1 (1.5GB) | Worker 2 (--profile scale) │
│   Iceberg REST Catalog :8181 (256MB)                                         │
├──────────────────────────────────────────────────────────────────────────────┤
│ METADATA LAYER                                                               │
│   Hive Metastore 4.0.0 :9083 (512MB) <--> PostgreSQL 15 :5432 (512MB)        │
│   PostgreSQL stores metadata for Airflow, Hive, and Superset                 │
├──────────────────────────────────────────────────────────────────────────────┤
│ STORAGE LAYER                                                                │
│   Azure Data Lake Storage Gen2 (PRIMARY, abfss://)                           │
│     warehouse  -> Iceberg tables, processed data                             │
│     raw-data   -> raw ingestion files                                        │
│     spark-logs -> application logs                                           │
│   Local storage (VPS 50GB)                                                   │
│     Hadoop NameNode (512MB) | Redis (256MB) | temp/cache (shuffle, logs)     │
└──────────────────────────────────────────────────────────────────────────────┘
Network: insightera-network (172.28.0.0/16)
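
To verify the network and see which containers have joined it, a quick check (not part of the startup scripts):

# Inspect the Docker bridge network
docker network inspect insightera-network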
| Property | Value |
|---|---|
| Container | insightera-postgres |
| Port | 5432 |
| Memory Limit | 512MB |
| Purpose | Metadata store for Airflow, Hive, Superset |
| Databases | airflow, hive_metastore, superset |
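
To confirm the three databases were created, you can list them from inside the container (this assumes the default POSTGRES_USER of insightera from the environment-variable table below):

# List databases inside the PostgreSQL container
docker exec -it insightera-postgres psql -U insightera -d postgres -c '\l'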
| Property | Value |
|---|---|
| Container | insightera-redis |
| Port | 6379 |
| Memory Limit | 256MB |
| Purpose | Celery message broker for Airflow |
| Config | maxmemory 200mb, LRU eviction |
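
To verify the eviction settings took effect, query the running instance:

# Show the configured memory cap and eviction policy
docker exec -it insightera-redis redis-cli config get maxmemory
docker exec -it insightera-redis redis-cli config get maxmemory-policy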
| Component | Container | Port | Memory |
|---|---|---|---|
| NameNode | insightera-namenode | 9870 (UI), 8020 (RPC) | 512MB |
| DataNode 1 | insightera-datanode-1 | 9864 | 512MB |
| DataNode 2 | insightera-datanode-2 | 9864 | 512MB (profile: scale) |
Heap Settings: HADOOP_HEAPSIZE_MIN=256m, HADOOP_HEAPSIZE_MAX=384m
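
A quick smoke test for the HDFS layer (the first command also appears in Troubleshooting below):

# Confirm the NameNode sees its DataNodes, then list the root directory
docker exec insightera-namenode hdfs dfsadmin -report
docker exec insightera-namenode hdfs dfs -ls /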
| Component | Container | Port | Memory |
|---|---|---|---|
| Master | insightera-spark-master | 8080 (UI), 7077 (RPC) | 512MB |
| Worker 1 | insightera-spark-worker-1 | - | 1536MB |
| Worker 2 | insightera-spark-worker-2 | - | 1536MB (profile: scale) |
Executor Config (Optimized for 8GB):
spark.driver.memory=512m
spark.executor.memory=768m
spark.executor.cores=1
spark.sql.shuffle.partitions=50
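
For example, a job submitted from inside the master container with these limits applied explicitly (the application path is illustrative; see spark/apps/ in the project layout):

docker exec -it insightera-spark-master spark-submit \
  --master spark://spark-master:7077 \
  --driver-memory 512m \
  --executor-memory 768m \
  --executor-cores 1 \
  --conf spark.sql.shuffle.partitions=50 \
  /opt/spark/apps/your_job.py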

| Property | Value |
|---|---|
| Container | insightera-iceberg-rest |
| Port | 8181 |
| Memory Limit | 256MB |
| Catalog Type | REST API |
| Storage | Azure Data Lake Gen2 / HDFS |
Catalogs Available:
- iceberg - REST catalog with Azure ADLS (default)
- azure_catalog - direct Azure Hadoop catalog
- hive_catalog - Hive metastore catalog
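
To confirm a catalog is reachable, list its namespaces from spark-sql (this assumes the catalogs above are registered in spark-defaults.conf):

docker exec -it insightera-spark-master spark-sql -e "SHOW NAMESPACES IN iceberg"
docker exec -it insightera-spark-master spark-sql -e "SHOW NAMESPACES IN hive_catalog"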
| Component | Container | Port | Memory |
|---|---|---|---|
| Metastore | insightera-hive-metastore | 9083 | 512MB |
| HiveServer2 | insightera-hive-server | 10000, 10002 | 512MB (profile: hive) |
| Component | Container | Port | Memory |
|---|---|---|---|
| Webserver | insightera-airflow-webserver | 8082 | 512MB |
| Scheduler | insightera-airflow-scheduler | - | 512MB |
| Worker | insightera-airflow-worker | - | 512MB |
| Triggerer | insightera-airflow-triggerer | - | 512MB (profile: airflow) |
| Flower | insightera-flower | 5555 | - (profile: flower) |
Executor: CeleryExecutor with Redis broker
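
A minimal sketch of that wiring as Airflow environment variables, assuming the Compose service names redis and postgres and the PostgreSQL credentials from .env (the authoritative values live in docker-compose.yaml):

AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://insightera:<password>@postgres:5432/airflow
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://insightera:<password>@postgres:5432/airflow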
| Property | Value |
|---|---|
| Container | insightera-superset |
| Port | 8088 |
| Memory Limit | 512MB |
| Default Login | admin / admin |
| Property | Value |
|---|---|
| Container | insightera-prometheus |
| Port | 9090 |
| Memory Limit | 256MB |
| Retention | 15 days |
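
Two quick liveness checks against the standard Prometheus HTTP API:

curl -s http://localhost:9090/-/healthy
curl -s 'http://localhost:9090/api/v1/query?query=up'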
| Property | Value |
|---|---|
| Container | insightera-grafana |
| Port | 3000 |
| Memory Limit | 256MB |
| Default Login | admin / admin |
| Pre-installed Plugins | clock-panel, simple-json-datasource |
| Property | Value |
|---|---|
| Container | insightera-minio |
| Ports | 9000 (API), 9001 (Console) |
| Memory Limit | 512MB |
| Purpose | S3-compatible storage (alternative to Azure) |
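
When the minio profile is active, MinIO's built-in health endpoint gives a fast liveness check:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9000/minio/health/live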
cd /path/to/insightera-cluster
# Copy environment template
cp .env.example .env
# Edit with your Azure credentials
vim .env

Edit .env with your Azure credentials:
# Required for Azure ADLS Gen2
AZURE_STORAGE_ACCOUNT=your_storage_account_name
AZURE_STORAGE_KEY=your_storage_account_key
AZURE_STORAGE_CONTAINER=warehouse

# Make scripts executable
chmod +x scripts/*.sh
# Start with default profile (optimized for 8GB RAM)
./scripts/start.sh default
# Check health
./scripts/healthcheck.sh

| Service | URL | Default Credentials |
|---|---|---|
| Hadoop NameNode | http://localhost:9870 | - |
| Spark Master | http://localhost:8080 | - |
| Airflow | http://localhost:8082 | admin / admin |
| Iceberg REST | http://localhost:8181 | - |
| Prometheus | http://localhost:9090 | - |
| Hive Server UI | http://localhost:10002 | - |
| Profile | RAM Usage | Services Included | Use Case |
|---|---|---|---|
| default | ~5.5GB | Core services | Recommended for 8GB VPS |
| minimal | ~3GB | PostgreSQL, Redis, NameNode, Spark Master | Testing only |
| monitoring | ~6.5GB | Core + Prometheus + Grafana | With monitoring |
| viz | ~6GB | Core + Superset | With visualization |
| minio | ~6GB | Core + MinIO | Use MinIO instead of Azure |
| scale | ~7.5GB | Core + extra workers | More processing power |
| full | ~8GB+ | All services | Requires 16GB+ RAM |
# Recommended for 8GB VPS
./scripts/start.sh default
# Add visualization
./scripts/start.sh viz
# Add monitoring
./scripts/start.sh monitoring
# Use MinIO instead of Azure
./scripts/start.sh minio
# Full cluster (requires 16GB+ RAM)
./scripts/start.sh full

| Service | Memory Limit | Memory Reserved | CPU Limit |
|---|---|---|---|
| PostgreSQL | 512MB | 256MB | 0.5 |
| Redis | 256MB | 128MB | 0.25 |
| Hadoop NameNode | 512MB | 256MB | 0.5 |
| Hadoop DataNode | 512MB | 256MB | 0.5 |
| Spark Master | 512MB | 256MB | 0.5 |
| Spark Worker | 1536MB | 1024MB | 1.0 |
| Hive Metastore | 512MB | 256MB | 0.5 |
| Iceberg REST | 256MB | 128MB | 0.25 |
| Airflow (each) | 512MB | 256MB | 0.5 |
| Total | ~5.5GB | ~3.5GB | - |
# Real-time memory monitoring
docker stats --no-stream
# Check which services are running
docker compose ps
# View memory allocation
docker compose config | grep -A 5 "resources:"

# Using Azure CLI
az storage account create \
--name insighterastorage \
--resource-group your-resource-group \
--location southeastasia \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true  # Required for ADLS Gen2

# Create required containers
az storage container create --name warehouse --account-name insighterastorage
az storage container create --name raw-data --account-name insighterastorage
az storage container create --name spark-logs --account-name insighterastorage

# Get storage account key
az storage account keys list \
--account-name insighterastorage \
--query '[0].value' -o tsv

# Azure Data Lake Storage Gen2
AZURE_STORAGE_ACCOUNT=insighterastorage
AZURE_STORAGE_KEY=your_storage_key_here
AZURE_STORAGE_CONTAINER=warehouse
AZURE_CATALOG_WAREHOUSE=abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/

from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("AzureIcebergExample") \
.config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.iceberg.type", "rest") \
.config("spark.sql.catalog.iceberg.uri", "http://iceberg-rest:8181") \
.getOrCreate()
# Write to Azure via Iceberg
df.writeTo("iceberg.warehouse.my_table").create()
# Read from Azure via Iceberg
df = spark.table("iceberg.warehouse.my_table")
# Direct Azure access
df = spark.read.parquet("abfss://raw-data@insighterastorage.dfs.core.windows.net/")

insightera-cluster/
├── docker-compose.yaml            # Main orchestration (memory-optimized)
├── .env                           # Environment configuration
├── .env.example                   # Template with Azure config
├── .github/
│   └── copilot-instructions.md    # AI coding guidelines
│
├── docker/                        # Custom Docker images
│   ├── hadoop/                    # Hadoop 3.3.6 with Azure libs
│   ├── spark/                     # Spark 3.5.0 + Iceberg 1.4.2
│   ├── hive/                      # Hive 4.0.0 Metastore
│   ├── superset/                  # Superset 3.1.0
│   └── postgres/                  # Multi-database init
│
├── hadoop/config/                 # Hadoop configuration
│   ├── core-site.xml
│   ├── hdfs-site.xml
│   ├── yarn-site.xml
│   └── mapred-site.xml
│
├── spark/                         # Spark configuration & apps
│   ├── config/
│   │   ├── spark-defaults.conf    # Optimized for 8GB + Azure
│   │   └── spark-env.sh
│   └── apps/                      # Your Spark applications
│
├── hive/config/                   # Hive configuration
│   ├── hive-site.xml
│   └── hive-env.sh
│
├── airflow/                       # Airflow DAGs & plugins
│   ├── dags/
│   │   ├── spark_iceberg_etl_example.py
│   │   └── data_quality_checks.py
│   ├── plugins/
│   └── logs/
│
├── prometheus/                    # Prometheus configuration
│   ├── prometheus.yml
│   └── alerts/alerts.yml
│
├── grafana/                       # Grafana dashboards
│   ├── provisioning/
│   │   ├── dashboards/
│   │   └── datasources/
│   └── dashboards/
│       └── cluster-overview.json
│
└── scripts/                       # Utility scripts
    ├── start.sh                   # Profile-based startup
    ├── stop.sh
    ├── healthcheck.sh
    ├── deploy.sh
    └── backup.sh
# Connect to Spark SQL
docker exec -it insightera-spark-master spark-sql
-- Create namespace and table
CREATE NAMESPACE IF NOT EXISTS iceberg.warehouse;
CREATE TABLE iceberg.warehouse.events (
id STRING,
event_time TIMESTAMP,
event_type STRING,
user_id STRING,
data STRING
)
USING iceberg
PARTITIONED BY (days(event_time), event_type);
-- Insert data
INSERT INTO iceberg.warehouse.events VALUES
('1', current_timestamp(), 'click', 'user_001', '{"page": "/home"}'),
('2', current_timestamp(), 'view', 'user_002', '{"page": "/product"}');
-- Query with time travel
SELECT * FROM iceberg.warehouse.events VERSION AS OF 1;
-- Schema evolution
ALTER TABLE iceberg.warehouse.events ADD COLUMN source STRING;

from pyspark.sql import SparkSession
# Session is pre-configured with Azure credentials
spark = SparkSession.builder \
.appName("AzureDataLakeExample") \
.getOrCreate()
# Read from Azure Data Lake directly
df_raw = spark.read.json(
"abfss://raw-data@{account}.dfs.core.windows.net/events/"
)
# Write to Iceberg table (stored in Azure)
df_raw.writeTo("iceberg.warehouse.events_processed").create()
# Read from Iceberg with time travel
df_v1 = spark.read \
.option("snapshot-id", 1234567890) \
.table("iceberg.warehouse.events_processed")
# Incremental read
df_changes = spark.read \
.option("start-snapshot-id", 123) \
.option("end-snapshot-id", 456) \
.table("iceberg.warehouse.events_processed")from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
with DAG(
    dag_id='azure_etl_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:

    # Check Spark availability
    check_spark = BashOperator(
        task_id='check_spark',
        bash_command='curl -s http://spark-master:8080 > /dev/null'
    )

    # Run Spark job with Azure
    etl_job = SparkSubmitOperator(
        task_id='run_etl',
        application='/opt/spark/apps/etl_job.py',
        conn_id='spark_default',
        conf={
            'spark.sql.catalog.iceberg.uri': 'http://iceberg-rest:8181',
            'spark.driver.memory': '512m',
            'spark.executor.memory': '768m',
        }
    )

    check_spark >> etl_job

# Start HiveServer2 first
docker compose --profile hive up -d hive-server
# Connect to HiveServer2
docker exec -it insightera-hive-server beeline \
-u "jdbc:hive2://localhost:10000"
-- Query Iceberg tables via Hive (run inside beeline)
SHOW DATABASES;
USE iceberg;
SHOW TABLES;
SELECT * FROM warehouse.events LIMIT 10;

# Start with monitoring
./scripts/start.sh monitoring
# Access dashboards
# Grafana: http://localhost:3000 (admin/admin)
# Prometheus: http://localhost:9090

| Dashboard | Description |
|---|---|
| Cluster Overview | CPU, Memory, Disk usage across all services |
| Container Metrics | Individual container resource consumption |
| Spark Metrics | Job execution, memory, shuffle stats |
| Airflow Status | DAG runs, task durations, failures |
# Pre-configured alerts in prometheus/alerts/alerts.yml
- High CPU Usage (>80% for 5min)
- High Memory Usage (>85%)
- Disk Space Warning (<20% free)
- Container Down
- HDFS Capacity Warning (<10% free)
- Spark Worker Unavailable

# Real-time memory stats
docker stats
# Memory usage summary
docker stats --no-stream --format \
"table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Check for OOM events
dmesg | grep -i "out of memory"
# Container resource limits
docker inspect insightera-spark-worker-1 | \
jq '.[0].HostConfig.Memory'

| Variable | Default | Description |
|---|---|---|
| AZURE_STORAGE_ACCOUNT | - | Azure Storage account name |
| AZURE_STORAGE_KEY | - | Azure Storage access key |
| AZURE_STORAGE_CONTAINER | warehouse | Default container name |
| SPARK_WORKER_CORES | 1 | Cores per Spark worker |
| SPARK_WORKER_MEMORY | 1g | Memory per Spark worker |
| SPARK_EXECUTOR_MEMORY | 768m | Executor memory (8GB optimized) |
| POSTGRES_USER | insightera | PostgreSQL username |
| POSTGRES_PASSWORD | - | PostgreSQL password |
| AIRFLOW__CORE__FERNET_KEY | - | Airflow encryption key |
# Scale Spark workers (requires --profile scale for worker-2)
docker compose --profile scale up -d
# Adjust memory in .env
SPARK_WORKER_MEMORY=2g # If you have more RAM
# Apply changes
docker compose up -d

# Deploy to your VPS
./scripts/deploy.sh 157.10.252.183 ~/.ssh/your_key your_user
# What it does:
# 1. Syncs project files via rsync
# 2. Installs Docker if needed
# 3. Configures environment
# 4. Builds and starts services

# SSH to server
ssh -i ~/.ssh/your_key user@your-server
# Clone or copy files
git clone https://github.com/your-repo/insightera-cluster.git
cd insightera-cluster
# Configure environment
cp .env.example .env
vim .env # Add Azure credentials
# Start cluster
chmod +x scripts/*.sh
./scripts/start.sh default
./scripts/healthcheck.sh

Before production deployment:
- Change all default passwords in .env
- Generate new Fernet key for Airflow
- Generate new secret key for Superset
- Configure Azure Service Principal (instead of storage key)
- Enable SSL/TLS for external access
- Set up firewall rules
- Configure proper RBAC in Azure
- Enable audit logging
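
For the Service Principal item, a sketch of the ABFS OAuth settings in spark-defaults.conf, using hadoop-azure's standard client-credentials keys (substitute your own storage account, application ID, secret, and tenant):

spark.hadoop.fs.azure.account.auth.type.insighterastorage.dfs.core.windows.net=OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.insighterastorage.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.insighterastorage.dfs.core.windows.net=<application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.insighterastorage.dfs.core.windows.net=<client-secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.insighterastorage.dfs.core.windows.net=https://login.microsoftonline.com/<tenant-id>/oauth2/token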
# Generate Fernet key for Airflow
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Generate random secret for Superset
openssl rand -hex 32

# Check memory usage
docker stats --no-stream
# If memory > 7GB, stop optional services
docker compose stop superset grafana prometheus
# Use minimal profile
./scripts/stop.sh
./scripts/start.sh minimal

# Check logs for specific service
docker compose logs -f postgres
docker compose logs -f spark-master
# Check container status
docker compose ps
# Restart with clean state
docker compose down
docker compose up -d

# Verify Azure credentials
docker exec insightera-spark-master env | grep AZURE
# Test Azure connectivity from Spark
docker exec -it insightera-spark-master spark-shell
# In Spark shell:
spark.read.text("abfss://warehouse@youraccount.dfs.core.windows.net/test.txt")
# Common errors:
# - "No FileSystem for scheme: abfss" β Missing Azure libraries
# - "403 Forbidden" β Wrong storage key or SAS token
# - "Account not found" β Wrong account name# Check HDFS health
docker exec insightera-namenode hdfs dfsadmin -report
# Leave safe mode
docker exec insightera-namenode hdfs dfsadmin -safemode leave
# Check disk usage
docker exec insightera-namenode hdfs dfs -df -h
# Fix permission issues
docker exec insightera-namenode hdfs dfs -chmod -R 777 /user

# Check Spark Master logs
docker logs insightera-spark-master
# Check Spark Worker logs
docker logs insightera-spark-worker-1
# Access Spark UI for detailed job info
open http://localhost:8080
# Common memory errors - reduce memory in job:
spark-submit --driver-memory 256m --executor-memory 512m your_job.py

# Check Airflow webserver logs
docker logs insightera-airflow-webserver
# Check scheduler logs
docker logs insightera-airflow-scheduler
# Reinitialize database
docker exec insightera-airflow-webserver airflow db init
# Recreate the admin user (if it already exists, remove it first with: airflow users delete -u admin)
docker exec insightera-airflow-webserver \
  airflow users create --username admin --password admin \
  --firstname Admin --lastname User --role Admin --email admin@local

# Check Iceberg REST service
curl http://localhost:8181/v1/config
# Check catalog namespaces
curl http://localhost:8181/v1/namespaces
# Verify Iceberg tables
docker exec -it insightera-spark-master spark-sql -e "SHOW NAMESPACES IN iceberg"

# Run full health check
./scripts/healthcheck.sh
# Check individual services
curl -s http://localhost:9870 # Hadoop NameNode
curl -s http://localhost:8080 # Spark Master
curl -s http://localhost:8082/health # Airflow
curl -s http://localhost:8181/v1/config # Iceberg REST
# Check container health status
docker inspect --format='{{.State.Health.Status}}' insightera-postgres

# Reduce shuffle partitions
# In spark-defaults.conf or SparkSession:
spark.sql.shuffle.partitions=20 # Instead of default 200
# Enable memory-efficient settings
spark.memory.fraction=0.4
spark.memory.storageFraction=0.3
# Compress spilled shuffle data (spilling itself is always enabled in Spark 3.x)
spark.shuffle.spill.compress=true

This project is licensed under the MIT License.
Contributions are welcome! Please read our contributing guidelines before submitting PRs.
Built with ❤️ for Big Data enthusiasts
Optimized for 8GB RAM VPS with Azure Data Lake Storage integration