This repository provides a modular, open-source Big Data stack that supports both local (development) and production deployments.
- Development: Uses Docker Compose to spin up the stack locally.
- Production: Uses Helm Charts for Kubernetes-based deployment.
    big-data-stack/
    ├── docker-compose.yml   # Master Compose file for local development
    ├── .env                 # Environment variables
    ├── README.md            # Documentation
    ├── scripts/             # Utility scripts (start, stop, backup, etc.)
    ├── configs/             # Shared global configuration files
    ├── services/            # Individual services (Docker Compose based)
    ├── helm-charts/         # Helm charts for Kubernetes deployment
    ├── k8s-manifests/       # Raw Kubernetes manifests (optional)
    ├── notebooks/           # Jupyter notebooks (if applicable)
    └── docs/                # Documentation & architecture diagrams

To run this stack, ensure you have the following installed:
- Docker (latest version) - Install Docker
- Docker Compose - Install Docker Compose
- Kubernetes (for production deployment) - Install Kubernetes
- Helm (for Kubernetes deployments) - Install Helm
- kubectl (CLI for Kubernetes) - Install kubectl
- Git (for version control) - Install Git
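
The prerequisites above can be verified before starting anything. The following is a minimal sketch, not a script shipped with this repository: the function name `missing_tools` is hypothetical, and it only checks that each command is on `PATH`, not which version is installed.

```shell
#!/bin/sh
# Print the name of every tool passed as an argument that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the prerequisites list above.
missing=$(missing_tools docker docker-compose kubectl helm git)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found."
fi
```

For a local-only setup, `kubectl` and `helm` can be dropped from the argument list, since they are only needed for Kubernetes deployments.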
| Component | Recommended (local development) |
|---|---|
| CPU | 4+ cores |
| RAM | 16GB+ |
| Storage | 100GB+ SSD |
| OS | Linux/macOS/Windows (WSL2 recommended for Windows) |
| Network | Stable internet connection |
| Component | Recommended (production) |
|---|---|
| CPU | 8+ cores |
| RAM | 32GB+ |
| Storage | 500GB+ SSD (scalable) |
| OS | Linux (Ubuntu/Debian/RHEL) |
| Kubernetes Cluster | 3+ nodes |
| Network | High-speed internal network |
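
As a rough aid, the CPU and RAM recommendations in the tables above can be compared against the current machine. This is a sketch under stated assumptions: `meets_minimum` is a hypothetical helper, and the probes (`nproc`, `/proc/meminfo`) are Linux-only, so the check is silently skipped on systems that lack them.

```shell
#!/bin/sh
# Succeed if the actual core count and RAM (GB) meet the given minimums.
# usage: meets_minimum ACTUAL_CORES ACTUAL_RAM_GB MIN_CORES MIN_RAM_GB
meets_minimum() {
    [ "$1" -ge "$3" ] && [ "$2" -ge "$4" ]
}

# Linux-only probes (assumption): nproc for cores, /proc/meminfo for RAM.
if command -v nproc >/dev/null 2>&1 && [ -r /proc/meminfo ]; then
    cores=$(nproc)
    # MemTotal is reported in kB; convert to GB and round to the nearest
    # whole number, because a nominal 16 GB machine reports slightly less.
    ram_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo)
    if meets_minimum "$cores" "$ram_gb" 4 16; then
        echo "OK for local development: ${cores} cores, ${ram_gb} GB RAM"
    else
        echo "Below the recommended 4 cores / 16 GB: ${cores} cores, ${ram_gb} GB RAM"
    fi
fi
```

To check a production node instead, pass `8 32` as the last two arguments.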
To spin up the entire stack locally:

    docker-compose up -d

To spin up a specific service (e.g., Presto):

    docker-compose -f services/presto/docker-compose.yml up -d

To stop services:

    docker-compose down

Ensure you have a Kubernetes cluster and Helm installed. Deploy a specific service using Helm:

    helm install presto helm-charts/presto/

Deploy the full stack using Helmfile (if configured):

    helmfile apply

| Service | Description |
|---|---|
| Presto & Trino | Distributed SQL query engines for querying multiple data sources |
| ClickHouse | Columnar database for analytics |
| Delta Lake | Open-source storage layer that adds ACID transactions to data lakes |
| Apache Airflow | Workflow orchestration for ETL pipelines |
| Apache NiFi | Data ingestion and transformation |
| Grafana | Visualization and monitoring |
| Prometheus | System monitoring and alerting |
| ELK Stack (Elasticsearch, Logstash, Kibana) | Centralized logging |
| Apache Druid | Real-time analytics database |
| MySQL & PostgreSQL | Relational databases for metadata and transactional storage |
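
Starting a subset of these services can be scripted around the per-service Compose files. The sketch below assumes every service follows the `services/<service>/docker-compose.yml` layout shown for Presto; only the `presto` directory name is confirmed by this README, and both function names are hypothetical.

```shell
#!/bin/sh
# Build the path to a service's Compose file, following the
# services/<service>/docker-compose.yml layout used for Presto.
compose_file_for() {
    printf 'services/%s/docker-compose.yml\n' "$1"
}

# Hypothetical helper: start each named service if its Compose file exists.
start_services() {
    for svc in "$@"; do
        file=$(compose_file_for "$svc")
        if [ -f "$file" ]; then
            docker-compose -f "$file" up -d
        else
            # Report the miss on stderr and carry on with the next service.
            echo "skipping $svc: $file not found" >&2
        fi
    done
}
```

Run from the repository root, `start_services presto` would be equivalent to the single-service command shown earlier. Services without a matching directory are skipped with a message instead of aborting the loop.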
This repository uses an `.env` file for flexible configuration. Example:

    ROOT_VOLUME=/mnt/bigdata/volumes
    LOGS_PATH=/mnt/bigdata/logs

- For Docker Compose → Edit `services/<service>/config/`
- For Kubernetes → Edit `helm-charts/<service>/values.yaml`
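
Scripts that run outside Docker Compose may need the same variables. A minimal sketch of loading and validating the `.env` file follows. The function names `load_env` and `require_vars` are hypothetical, and the approach assumes the file holds only plain `KEY=VALUE` lines like the example above, since it is sourced as shell code.

```shell
#!/bin/sh
# Load KEY=VALUE pairs from an env file into the current shell and export them.
load_env() {
    [ -f "$1" ] || { echo "env file not found: $1" >&2; return 1; }
    # The `.` builtin searches PATH for names without a slash, so force
    # a relative path when a bare file name is given.
    case "$1" in
        */*) env_file=$1 ;;
        *)   env_file=./$1 ;;
    esac
    set -a          # auto-export every variable assigned while sourcing
    . "$env_file"
    set +a
}

# Fail with a message if any named variable is unset or empty.
require_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || { echo "missing variable: $name" >&2; return 1; }
    done
}
```

A typical call would be `load_env .env && require_vars ROOT_VOLUME LOGS_PATH`, which stops early if either path from the example is missing.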
- Metrics: Grafana + Prometheus
- Logs: ELK Stack (Elasticsearch, Logstash, Kibana)
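
Whether the monitoring components are reachable can be probed over HTTP. The sketch below is illustrative only: the hosts and ports are the upstream defaults for Prometheus, Grafana, and Elasticsearch, not values taken from this repository, and the function names are hypothetical. Adjust the URLs to match the ports your Compose files or Helm values actually expose.

```shell
#!/bin/sh
# Map a service name to a health URL. Hosts and ports are the upstream
# defaults and are assumptions here, not values taken from this repository.
health_url() {
    case "$1" in
        prometheus)    echo "http://localhost:9090/-/healthy" ;;
        grafana)       echo "http://localhost:3000/api/health" ;;
        elasticsearch) echo "http://localhost:9200/_cluster/health" ;;
        *)             return 1 ;;
    esac
}

# Report whether each service answers its health endpoint with HTTP success.
check_health() {
    for svc in "$@"; do
        url=$(health_url "$svc") || { echo "$svc: unknown service"; continue; }
        # -f: treat HTTP errors as failure; -s: silent; --max-time: do not hang
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            echo "$svc: up"
        else
            echo "$svc: down"
        fi
    done
}
```

Calling `check_health prometheus grafana elasticsearch` prints one status line per service.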
- The `docs/` folder contains architecture diagrams and setup guides.
- Refer to the Helm Charts documentation for custom deployments.
- See `scripts/` for automation scripts (backup, monitoring, etc.).
We welcome contributions! Feel free to submit issues or pull requests to improve the stack.
This project is open-source and licensed under the Apache 2.0 License.