vilhjalmur12/DataStack

Big Data Stack

This repository provides a modular, open-source Big Data stack that supports both local (development) and production deployments.

  • Development: Uses Docker Compose to spin up the stack locally.
  • Production: Uses Helm Charts for Kubernetes-based deployment.

📂 Repository Structure

big-data-stack/
│── docker-compose.yml        # Master Compose file for local development
│── .env                      # Environment variables
│── README.md                 # Documentation
│── scripts/                  # Utility scripts (start, stop, backup, etc.)
│── configs/                  # Shared global configuration files
│── services/                 # Individual services (Docker Compose based)
│── helm-charts/              # Helm charts for Kubernetes deployment
│── k8s-manifests/            # Raw Kubernetes manifests (optional)
│── notebooks/                # Jupyter notebooks (if applicable)
│── docs/                     # Documentation & architecture diagrams

🛠 Requirements

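To run this stack, you need Docker and Docker Compose for development, or a Kubernetes cluster with Helm for production, on hardware that meets the recommendations below.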

💻 System Requirements

Development (Docker Compose Setup)

| Component | Recommended |
|-----------|-------------|
| CPU       | 4+ cores |
| RAM       | 16GB+ |
| Storage   | 100GB+ SSD |
| OS        | Linux/macOS/Windows (WSL2 recommended for Windows) |
| Network   | Stable internet connection |

Production (Kubernetes Setup)

| Component          | Recommended |
|--------------------|-------------|
| CPU                | 8+ cores |
| RAM                | 32GB+ |
| Storage            | 500GB+ SSD (scalable) |
| OS                 | Linux (Ubuntu/Debian/RHEL) |
| Kubernetes Cluster | 3+ nodes |
| Network            | High-speed internal network |

🚀 Quick Start

🛠 Development (Docker Compose)

To spin up the entire stack locally (with Docker Compose V2, use docker compose in place of docker-compose):

docker-compose up -d

To spin up a specific service (e.g., Presto):

docker-compose -f services/presto/docker-compose.yml up -d

To stop the stack and remove its containers and networks:

docker-compose down
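
The per-service layout lends itself to a small wrapper for starting several services at once. The following is a minimal sketch, not a script shipped in this repository: the function names compose_file_for and start_services and the DRY_RUN variable are hypothetical, and it assumes every service keeps its Compose file at services/<service>/docker-compose.yml, as in the Presto example. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
# Hypothetical helper: start selected services from their own Compose files.
# Assumes the layout services/<service>/docker-compose.yml (only the Presto
# path is confirmed by this README; other directory names are assumptions).

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the Compose file path for a service name.
compose_file_for() {
  printf 'services/%s/docker-compose.yml\n' "$1"
}

# Start each named service in detached mode.
start_services() {
  for svc in "$@"; do
    file="$(compose_file_for "$svc")"
    if [ "$DRY_RUN" = "1" ]; then
      echo "docker-compose -f $file up -d"
    else
      docker-compose -f "$file" up -d
    fi
  done
}
```

For example, start_services presto prints the same command shown above for Presto; set DRY_RUN=0 once the printed commands look right.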

☸️ Production (Kubernetes with Helm)

Ensure you have a Kubernetes cluster and Helm installed. Deploy a specific service using Helm:

helm install presto helm-charts/presto/

Deploy the full stack using Helmfile (if a helmfile.yaml is configured):

helmfile apply

📦 Included Services

| Service | Description |
|---------|-------------|
| Presto & Trino | Distributed SQL query engines for querying data across multiple sources |
| ClickHouse | Columnar database for analytics |
| Delta Lake | Open-source storage layer that adds ACID transactions to data lakes |
| Apache Airflow | Workflow orchestration for ETL pipelines |
| Apache NiFi | Data ingestion and transformation |
| Grafana | Dashboards and visualization for monitoring |
| Prometheus | Metrics collection and alerting |
| ELK Stack (Elasticsearch, Logstash, Kibana) | Centralized logging |
| Apache Druid | Real-time analytics database |
| MySQL & PostgreSQL | Relational databases for metadata and transactional storage |

🔧 Configuration

Environment Variables (.env file)

This repository reads its settings from a .env file in the repository root, which Docker Compose loads automatically. Example:

ROOT_VOLUME=/mnt/bigdata/volumes
LOGS_PATH=/mnt/bigdata/logs
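
Utility scripts can reuse the same two variables. The sketch below is one possible approach, not the repository's own tooling: load_env, service_volume, and service_logs are hypothetical names, and the <ROOT_VOLUME>/<service> and <LOGS_PATH>/<service> directory layout is an assumption. It falls back to the example values above when a variable is unset.

```shell
# Hypothetical sketch: resolve per-service data and log directories from .env.
# The <ROOT_VOLUME>/<service> layout is an assumption, not a documented scheme.

# Load KEY=VALUE pairs from an env file if it exists.
# Only safe for files whose values are valid shell assignments.
load_env() {
  env_file="${1:-.env}"
  # Prefix bare file names with ./ so "." does not search PATH for them.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  if [ -f "$env_file" ]; then
    set -a            # export every variable assigned while sourcing
    . "$env_file"
    set +a
  fi
}

# Print the data directory for a service, defaulting to the example value.
service_volume() {
  printf '%s/%s\n' "${ROOT_VOLUME:-/mnt/bigdata/volumes}" "$1"
}

# Print the log directory for a service, defaulting to the example value.
service_logs() {
  printf '%s/%s\n' "${LOGS_PATH:-/mnt/bigdata/logs}" "$1"
}
```

A script would call load_env once and then use, for instance, service_volume clickhouse wherever it needs a path, so the locations are defined only in .env.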

Customizing Configurations

  • For Docker Compose → Edit services/<service>/config/
  • For Kubernetes → Edit helm-charts/<service>/values.yaml
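
For Kubernetes, an alternative to editing values.yaml in place is to keep overrides in a separate file and pass it to Helm with -f. The sketch below assumes a hypothetical overrides/<service>.yaml file (this directory is not part of the repository) and a hypothetical function name, helm_deploy_cmd. It only builds and prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: build a Helm command that layers an overrides file
# on top of a chart's own values.yaml. The overrides/ directory is an assumption.

helm_deploy_cmd() {
  svc="$1"
  chart="helm-charts/$svc/"
  overrides="overrides/$svc.yaml"
  # "upgrade --install" installs the release if missing, else upgrades it.
  cmd="helm upgrade --install $svc $chart"
  # Only add -f when the overrides file actually exists.
  if [ -f "$overrides" ]; then
    cmd="$cmd -f $overrides"
  fi
  echo "$cmd"
}
```

Values passed with -f take precedence over the chart's defaults, so the chart directory itself can stay unmodified.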

📊 Monitoring & Logging

  • Metrics: Grafana + Prometheus
  • Logs: ELK Stack (Elasticsearch, Logstash, Kibana)

📖 Documentation & Resources

  • docs/ folder contains architecture diagrams and setup guides.
  • Refer to Helm Charts documentation for custom deployments.
  • See scripts/ for automation scripts (backup, monitoring, etc.).

🤝 Contributions

We welcome contributions! Feel free to submit issues or pull requests to improve the stack.


📜 License

This project is open source and licensed under the Apache License 2.0.
