kiyeonjeon21/data-stack-lab

Data Stack Lab

A self-contained, Docker-based Modern Data Stack sandbox for learning, experimenting, and prototyping data pipelines on a local machine.

Spin up any combination of 26+ open-source data tools — from batch ELT and real-time streaming to lakehouse analytics and AI/ML — with a single docker compose command. Every tool is isolated behind a Docker Compose profile, so you only run what you need.

What you can do

  • Build batch ELT pipelines with dbt, Dagster, and Metabase
  • Run real-time streaming with Kafka, ClickHouse, and Superset
  • Set up a lakehouse with Iceberg, Trino, and MinIO (Parquet on S3)
  • Track ML experiments with MLflow and monitor LLMs with Langfuse
  • Perform CDC from PostgreSQL to Kafka via Debezium
  • Compare query engines (Trino vs PrestoDB) on the same data
  • Explore data lineage with Marquez and catalog data with OpenMetadata
  • Run ad-hoc analysis in Jupyter notebooks connected to all data sources

Design principles

  • Modular: each tool is a standalone compose file — add or remove tools without affecting others
  • Lightweight: shared PostgreSQL, Redis, and MinIO infrastructure — no per-tool database bloat
  • Reproducible: all image versions pinned, memory limits set, healthchecks enforced
  • Secure by default: all ports bound to 127.0.0.1, no external exposure
  • Recipe-driven: 8 pre-built .env presets for common stack combinations

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        compose.yaml                         │
│                     (include + profiles)                    │
├─────────┬───────────────────────────────────────────────────┤
│  infra/ │  PostgreSQL · Redis · MinIO (always running)      │
├─────────┼───────────────────────────────────────────────────┤
│ stacks/ │  Per-tool compose.yaml (selective via profiles)   │
│         │  orchestration/ airflow, dagster                  │
│         │  transformation/ dbt, sqlmesh                     │
│         │  warehouse/ clickhouse, trino, prestodb           │
│         │  visualization/ metabase, superset, evidence      │
│         │  ingestion/ meltano, airbyte                      │
│         │  streaming/ kafka, redpanda                       │
│         │  processing/ spark, flink                         │
│         │  ai/ qdrant, mlflow, langfuse                     │
│         │  cdc/ debezium                                    │
│         │  versioning/ lakefs                               │
│         │  quality/ soda                                    │
│         │  catalog/ openmetadata                            │
│         │  lineage/ marquez                                 │
│         │  storage/ iceberg                                 │
│         │  notebook/ jupyter                                │
├─────────┼───────────────────────────────────────────────────┤
│ recipes/│  Combination presets (.env files)                 │
└─────────┴───────────────────────────────────────────────────┘
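
The include-plus-profiles wiring in the root compose.yaml can be sketched as follows (paths follow the tree above; the exact file contents may differ):

```yaml
# Root compose.yaml — illustrative sketch, not the exact file
include:
  - infra/compose.yaml                          # always-on: PostgreSQL, Redis, MinIO
  - stacks/orchestration/dagster/compose.yaml   # one include per tool under stacks/
  - stacks/transformation/dbt/compose.yaml
  # ...
```

Each included service declares `profiles: [<tool>]`, so `docker compose up` alone starts only infra, while `--profile <tool>` pulls in that tool's services.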

Quick Start

# Infrastructure only (PostgreSQL + Redis + MinIO)
docker compose up -d

# Use a recipe
docker compose --env-file recipes/basic-analytics.env up -d

# Selective profiles
docker compose --profile dagster --profile metabase up -d

# Everything
docker compose --profile "*" up -d

# Shut down
docker compose --profile "*" down

Scenarios (Verified)

1. Batch Analytics — dbt + PostgreSQL + Metabase

make basic-analytics
docker compose exec dbt sh -c "dbt seed && dbt run && dbt test"

ELT pipeline with e-commerce sample data (25 customers, 50 orders, 20 products).

| Layer | Models | Type |
|---|---|---|
| raw.* | customers, products, orders, order_items | seed (CSV → table) |
| stg.* | stg_customers, stg_orders, stg_order_items, stg_products | view |
| marts.* | fct_orders, dim_customers, mart_revenue_daily, mart_top_products | table |

  • All 29 dbt tests pass (unique, not_null, relationships, accepted_values, plus dbt-expectations)
  • Data quality tests via dbt-expectations: email regex, price range, row count, amount bounds
  • Metabase (:3030) SQL Lab and chart creation verified
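
A dbt-expectations check of this kind is declared in a model's schema file; a minimal sketch (model and regex are illustrative — the project's actual rules may differ):

```yaml
# models/staging/schema.yml — illustrative sketch
models:
  - name: stg_customers
    columns:
      - name: email
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[^@]+@[^@]+\\.[^@]+$"
```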

2. Streaming Pipeline — Kafka + ClickHouse + Superset

make streaming-pipeline

Real-time pipeline with fake clickstream events (5 events/sec).
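
The producer's event generation can be sketched with the standard library only; field names and values here are assumptions, and the real producer.py would publish each event to Kafka with a Kafka client rather than `print`:

```python
import json
import random
import time
import uuid

PAGES = ["/", "/products", "/cart", "/checkout"]
DEVICES = ["mobile", "desktop", "tablet"]

def make_event() -> dict:
    """Build one fake clickstream event (schema is illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "user_id": f"user-{random.randint(1, 25)}",
        "page": random.choice(PAGES),
        "device": random.choice(DEVICES),
    }

def run(rate_per_sec: float = 5.0, publish=print) -> None:
    """Emit events at a fixed rate; `publish` would be a Kafka send in practice."""
    while True:
        publish(json.dumps(make_event()))
        time.sleep(1.0 / rate_per_sec)
```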

Python Producer → Kafka (clickstream-events topic)
                    → ClickHouse Kafka Engine (real-time consumption)
                       → MaterializedView → MergeTree (persistent storage)
                          → clickstream_hourly (aggregation view)
  • Kafka UI (:8081) topic/message inspection verified
  • Superset (:8088) ClickHouse SQL Lab query verified
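
The ClickHouse side of this pattern typically consists of three objects: a Kafka-engine table, a MergeTree table, and a materialized view that moves rows between them. A sketch with assumed table and column names (the project's actual DDL lives in stacks/warehouse/clickhouse/init/01_create_tables.sql):

```sql
-- Illustrative sketch of the Kafka → MergeTree pattern; names are assumptions
CREATE TABLE clickstream_queue (          -- consumes the Kafka topic
    event_time DateTime, user_id String, page String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickstream-events',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONEachRow';

CREATE TABLE clickstream_events (         -- persistent storage
    event_time DateTime, user_id String, page String
) ENGINE = MergeTree ORDER BY event_time;

CREATE MATERIALIZED VIEW clickstream_mv TO clickstream_events AS
SELECT * FROM clickstream_queue;          -- moves rows as they arrive
```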

3. Lakehouse — Iceberg + Trino + MinIO + Superset

docker compose --profile iceberg --profile trino up -d

Stores PostgreSQL mart data into Iceberg tables via Trino CTAS. Parquet files are saved in the MinIO warehouse bucket.

-- Run in Trino
CREATE TABLE iceberg.analytics.customers AS
SELECT * FROM postgresql.marts.dim_customers;
  • Trino (:8090) → Iceberg REST catalog → MinIO (S3)
  • Superset (:8088) connection via trino://trino@ds-trino:8080/iceberg verified
  • Cross-catalog queries: Trino can JOIN PostgreSQL + Iceberg simultaneously
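
A cross-catalog query of this kind might look like the following, reusing the iceberg.analytics.customers table created above; join and column names are illustrative:

```sql
-- Illustrative: join an Iceberg table with a live PostgreSQL table in one query
SELECT c.customer_id, c.name, o.amount
FROM iceberg.analytics.customers AS c
JOIN postgresql.marts.fct_orders AS o
  ON c.customer_id = o.customer_id;
```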

4. Jupyter Notebooks — Interactive Analysis

docker compose --profile jupyter up -d
# http://localhost:8888 (token: datastack)

Includes 3 example notebooks, one per scenario:

| Notebook | Data Source | Contents |
|---|---|---|
| 01-batch-analytics.ipynb | PostgreSQL | dbt marts queries, revenue trends, top customers, product sales |
| 02-streaming-analytics.ipynb | ClickHouse | Clickstream analysis, device/country traffic, popular pages |
| 03-lakehouse-query.ipynb | Trino/Iceberg | Iceberg table queries, cross-catalog JOIN (PG + Iceberg) |

Custom Dockerfile with warehouse drivers: psycopg2, clickhouse-connect, trino, sqlalchemy

5. Dagster-dbt Orchestration

docker compose --profile dagster up -d
# http://localhost:3000

Manages dbt models as Dagster assets via dagster-dbt integration.

dagster-code (gRPC:4000) → dbt project mount
dagster-webserver (:3000) → asset materialization UI
dagster-daemon → schedule/sensor execution
  • @dbt_assets decorator for automatic dbt model discovery (12 assets: 4 seeds + 4 staging + 4 marts)
  • PostgreSQL backend (run/event/schedule storage)

6. OpenMetadata — Data Catalog

docker compose --profile openmetadata up -d
# http://localhost:8585 (admin / admin)

Centralized metadata store for data discovery, lineage, and governance.

  • Automatic DB migration via openmetadata-migrate init service
  • Elasticsearch backend for search/indexing
  • PostgreSQL metadata storage
  • Connectors available for PostgreSQL, ClickHouse, Trino, Kafka, and more

Make Commands

make help                # List available commands
make infra               # Start infrastructure only
make basic-analytics     # Dagster + dbt + Metabase
make streaming-pipeline  # Kafka + ClickHouse + Superset
make lakehouse           # Meltano + dbt + Iceberg + Trino + Superset
make ps                  # Running containers
make ports               # Port map
make logs SVC=<name>     # View logs
make psql                # Connect to PostgreSQL
make clean               # Full cleanup (including volumes)

Available Profiles

| Profile | Description | RAM |
|---|---|---|
| airflow | Apache Airflow (webserver + scheduler + worker) | ~2GB |
| dagster | Dagster (webserver + daemon + dbt code location) | ~1.5GB |
| dbt | dbt-core (postgres adapter) | ~256MB |
| clickhouse | ClickHouse OLAP | ~2GB |
| trino | Trino query engine | ~2GB |
| metabase | Metabase BI | ~1GB |
| superset | Apache Superset BI | ~1GB |
| meltano | Meltano EL | ~512MB |
| airbyte | Airbyte (placeholder) | ~4GB+ |
| kafka | Apache Kafka + Kafka UI + Producer | ~1.2GB |
| redpanda | Redpanda + Console | ~1.2GB |
| openmetadata | OpenMetadata + Elasticsearch | ~1.5GB |
| iceberg | Iceberg REST Catalog | ~512MB |
| jupyter | JupyterLab | ~2GB |
| qdrant | Qdrant vector database | ~2GB |
| mlflow | MLflow experiment tracking + model registry | ~2GB |
| langfuse | Langfuse LLM observability | ~1GB |
| spark | Apache Spark (master + worker) | ~3GB |
| flink | Apache Flink (jobmanager + taskmanager) | ~2GB |
| presto | PrestoDB query engine | ~2GB |
| debezium | Debezium CDC (Kafka Connect) | ~1GB |
| lakefs | lakeFS data versioning | ~512MB |
| sqlmesh | SQLMesh (dbt alternative) | ~512MB |
| soda | Soda Core data quality (CLI) | ~256MB |
| marquez | Marquez lineage (API + Web) | ~768MB |
| evidence | Evidence code-first BI | ~512MB |

Port Map

| Service | Port | Credentials |
|---|---|---|
| PostgreSQL | 5432 | admin / admin |
| Redis | 6379 | |
| MinIO Console | 9001 | minioadmin / minioadmin |
| Airflow | 8080 | admin / admin |
| Dagster | 3000 | |
| ClickHouse | 8123 | default / (empty) |
| Trino | 8090 | |
| PrestoDB | 8084 | |
| Metabase | 3030 | (setup wizard) |
| Superset | 8088 | admin / admin |
| Evidence | 3333 | |
| Meltano | 5050 | |
| Kafka UI | 8081 | |
| Redpanda Console | 8082 | |
| Qdrant | 6333 | |
| MLflow | 5005 | |
| Langfuse | 3002 | (setup wizard) |
| Spark Master UI | 8180 | |
| Flink Dashboard | 8083 | |
| Debezium Connect | 8085 | |
| lakeFS | 8001 | (setup wizard) |
| SQLMesh | 8800 | |
| OpenMetadata | 8585 | |
| Marquez Web | 3001 | |
| Iceberg REST | 8181 | |
| Jupyter | 8888 | token: datastack |

Project Structure

data-stack-lab/
├── compose.yaml                 # Root entry point (include + profiles)
├── .env                         # Shared environment variables
├── Makefile                     # Convenience commands
├── CLAUDE.md                    # Claude Code project instructions
│
├── infra/                       # Shared infrastructure (always running)
│   ├── compose.yaml             # PostgreSQL, Redis, MinIO, minio-init
│   └── init-databases.sh        # Auto-create multiple databases
│
├── stacks/
│   ├── orchestration/
│   │   ├── airflow/compose.yaml
│   │   └── dagster/
│   │       ├── compose.yaml + dagster.yaml + workspace.yaml
│   │       └── code/            # Dagster-dbt code location (gRPC server)
│   ├── transformation/
│   │   ├── dbt/
│   │   │   ├── compose.yaml + Dockerfile + profiles.yml
│   │   │   └── project/         # dbt project (seeds, models, macros, tests)
│   │   └── sqlmesh/compose.yaml
│   ├── warehouse/
│   │   ├── clickhouse/compose.yaml + init/01_create_tables.sql
│   │   ├── trino/compose.yaml + catalog/*.properties
│   │   └── prestodb/compose.yaml + config/ + catalog/
│   ├── visualization/
│   │   ├── metabase/compose.yaml
│   │   ├── superset/compose.yaml + Dockerfile + requirements.txt
│   │   └── evidence/compose.yaml + Dockerfile
│   ├── ingestion/
│   │   ├── airbyte/compose.yaml (placeholder)
│   │   └── meltano/compose.yaml
│   ├── streaming/
│   │   ├── kafka/compose.yaml + producer/ (Dockerfile, producer.py)
│   │   └── redpanda/compose.yaml
│   ├── ai/
│   │   ├── qdrant/compose.yaml (vector database)
│   │   ├── mlflow/compose.yaml (experiment tracking + model registry)
│   │   └── langfuse/compose.yaml (LLM observability)
│   ├── processing/
│   │   ├── spark/compose.yaml (master + worker)
│   │   └── flink/compose.yaml (jobmanager + taskmanager)
│   ├── cdc/
│   │   └── debezium/compose.yaml (Kafka Connect)
│   ├── versioning/
│   │   └── lakefs/compose.yaml
│   ├── quality/
│   │   └── soda/compose.yaml + Dockerfile
│   ├── lineage/
│   │   └── marquez/compose.yaml (API + Web)
│   ├── catalog/
│   │   └── openmetadata/compose.yaml  # server + elasticsearch + migrate init
│   ├── storage/
│   │   └── iceberg/compose.yaml
│   └── notebook/
│       └── jupyter/
│           ├── compose.yaml + Dockerfile + requirements.txt
│           └── notebooks/       # 3 example notebooks (PG, CH, Trino)
│
├── recipes/                     # Stack combination presets
│   ├── basic-analytics.env      # dagster, dbt, metabase
│   ├── streaming-pipeline.env   # kafka, clickhouse, superset
│   ├── full-lakehouse.env       # meltano, dbt, iceberg, trino, superset
│   ├── batch-processing.env     # spark, iceberg, trino, jupyter
│   ├── realtime-cdc.env         # debezium, kafka, flink, clickhouse
│   ├── data-governance.env      # openmetadata, marquez, lakefs, soda
│   ├── ai-ml-platform.env       # qdrant, mlflow, langfuse, jupyter
│   └── full-stack.env           # everything
│
└── .claude/                     # Claude Code configuration
    ├── settings.json            # Permissions, hooks, plugins
    ├── agents/                  # docker-infra, stack-builder, data-explorer
    ├── skills/                  # /stack-up, /stack-down, /stack-status, /stack-logs
    └── rules/                   # Docker Compose conventions

Conventions

  • Container names use ds- prefix
  • All ports bind to 127.0.0.1 (no external exposure)
  • DB storage uses named volumes (better I/O performance on macOS than bind mounts)
  • All services have deploy.resources.limits.memory
  • Healthcheck required — depends_on: condition: service_healthy
  • Docker image versions are pinned (no :latest tags)

Adding a New Stack

  1. Create stacks/<category>/<tool>/compose.yaml
  2. Add profiles: [<tool-name>] to services
  3. Add include path to root compose.yaml
  4. Add port variable to .env
  5. Update this document's Profile/Port tables
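
A new stack's compose file follows the conventions above; a minimal sketch for a hypothetical tool (image, port variable, and healthcheck endpoint are all illustrative):

```yaml
# stacks/<category>/<tool>/compose.yaml — illustrative sketch
services:
  mytool:
    image: example/mytool:1.2.3            # pinned version, never :latest
    container_name: ds-mytool              # ds- prefix convention
    profiles: [mytool]
    ports:
      - "127.0.0.1:${MYTOOL_PORT}:8080"    # localhost-only binding
    deploy:
      resources:
        limits:
          memory: 512M                     # memory limit required
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      retries: 5
    depends_on:
      postgres:
        condition: service_healthy         # wait for shared infra
```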

Claude Code Integration

This project includes Claude Code configuration for streamlined development:

  • Skills: /stack-up, /stack-down, /stack-status, /stack-logs
  • Agents: docker-infra (debugging), stack-builder (add tools), data-explorer (query data)
  • Rules: Docker Compose conventions auto-loaded when editing compose files
  • Plugins: data-engineering, code-review, github, pyright-lsp, and more

Requirements

  • Docker Desktop with 16GB+ RAM allocated (for running multiple stacks simultaneously)
  • Docker Compose v2.20+ (for include directive support)

Appendix: Modern Data Stack Landscape (2025/2026)

Tools evaluated during project development: the first table lists what is included, the second what was considered but left out.

Included in this project (26 profiles)

| Category | Tools |
|---|---|
| Orchestration | Airflow, Dagster |
| Transformation | dbt, SQLMesh |
| Warehouse / OLAP | ClickHouse, Trino, PrestoDB |
| Visualization / BI | Metabase, Superset, Evidence |
| Ingestion / EL | Meltano, Airbyte (placeholder) |
| Streaming | Kafka, Redpanda |
| Processing | Apache Spark, Apache Flink |
| AI / ML | Qdrant (vector DB), MLflow (experiment tracking), Langfuse (LLM observability) |
| CDC | Debezium |
| Data Versioning | lakeFS |
| Data Quality | Soda Core, dbt-expectations |
| Data Catalog | OpenMetadata |
| Data Lineage | Marquez |
| Data Lake | Apache Iceberg (REST catalog) |
| Notebook | JupyterLab |

Evaluated but not included

| Category | Tool | Stars | Why not included |
|---|---|---|---|
| Semantic Layer | Cube.dev | 19K+ | High value — recommended next addition. API-first metrics layer for BI + AI |
| Orchestration | Kestra | 18K+ | YAML-first paradigm, distinct from Airflow/Dagster. Good future addition |
| Orchestration | Prefect | 20K+ | Redundant with Airflow + Dagster |
| Orchestration | Temporal | 16K+ | Not data-specific (microservice orchestration) |
| Embedded OLAP | DuckDB | 28K+ | In-process engine, not a server. Best used inside Jupyter/Spark containers |
| Data Quality | Great Expectations | 10K+ | Library-based. Soda Core + dbt-expectations cover the same ground |
| dbt Observability | Elementary | 2K+ | dbt package — add as a dbt dependency, not a standalone service |
| Data Contracts | DataContract CLI | 1.5K+ | Lightweight CLI tool. Useful for CI/CD, not a running service |
| Feature Store | Feast | 5.6K+ | Niche ML use case. Add when needed |
| Reverse ETL | Multiwoven | 1.6K+ | Data activation. Add when needed |
| Vector DB | Milvus | 32K+ | Heavy (designed for billions of vectors). Qdrant is a better fit for a lab |
| Vector DB | Weaviate | 12K+ | Graph-hybrid adds complexity. Qdrant is simpler |
| Catalog | DataHub | 10K+ | Heavier than OpenMetadata (5+ containers). Redundant |
| Catalog | Nessie | 1K+ | Overlaps with lakeFS at a different layer (catalog vs storage) |
| Query Engine | Apache DataFusion | 8K+ | Library, not a standalone service. Embedded in other tools |
| OLAP | StarRocks | 10K+ | No ARM64 support; incompatible with Apple Silicon |
| OLAP | Apache Druid | 14K+ | 6+ containers, complex setup. ClickHouse covers OLAP |
| Semantic Layer | MetricFlow | — | Tightly coupled with dbt Cloud. Cube.dev is more versatile |
| Data Integration | Apache NiFi | — | GUI-based. Meltano/Airbyte cover ingestion |

Key 2025/2026 trends

  1. AI/ML Integration — Vector databases, LLM observability, and experiment tracking are becoming core data stack components
  2. Semantic Layer — Cube.dev and similar tools provide consistent metric definitions across BI, AI, and embedded analytics
  3. Data Contracts — Formal agreements between data producers and consumers (YAML-based specs)
  4. Declarative Orchestration — YAML-first tools (Kestra) gaining traction alongside code-first (Airflow/Dagster)
  5. Lakehouse Architecture — Iceberg + object storage (MinIO/S3) replacing traditional warehouses for large-scale analytics
  6. Data Quality as Code — Shift-left testing with dbt-expectations, Soda, and Elementary

About

Docker-based Modern Data Stack sandbox — 26+ tools (Airflow, dbt, Kafka, ClickHouse, Trino, Spark, Flink, Iceberg, MLflow, and more) with one-command recipes for batch ELT, streaming, lakehouse, and AI/ML pipelines
