kiyeonjeon21/data-stack-lab

Data Stack Lab

A self-contained, Docker-based Modern Data Stack sandbox for learning, experimenting, and prototyping data pipelines on a local machine.

Spin up any combination of 26+ open-source data tools — from batch ELT and real-time streaming to lakehouse analytics and AI/ML — with a single docker compose command. Every tool is isolated behind a Docker Compose profile, so you only run what you need.

What you can do

  • Build batch ELT pipelines with dbt, Dagster, and Metabase
  • Run real-time streaming with Kafka, ClickHouse, and Superset
  • Set up a lakehouse with Iceberg, Trino, and MinIO (Parquet on S3)
  • Track ML experiments with MLflow and monitor LLMs with Langfuse
  • Perform CDC from PostgreSQL to Kafka via Debezium
  • Compare query engines (Trino vs PrestoDB) on the same data
  • Explore data lineage with Marquez and catalog data with OpenMetadata
  • Run ad-hoc analysis in Jupyter notebooks connected to all data sources

Design principles

  • Modular: each tool is a standalone compose file — add or remove tools without affecting others
  • Lightweight: shared PostgreSQL, Redis, and MinIO infrastructure — no per-tool database bloat
  • Reproducible: all image versions pinned, memory limits set, healthchecks enforced
  • Secure by default: all ports bound to 127.0.0.1, no external exposure
  • Recipe-driven: 8 pre-built .env presets for common stack combinations

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        compose.yaml                         │
│                     (include + profiles)                    │
├─────────┬───────────────────────────────────────────────────┤
│  infra/ │  PostgreSQL · Redis · MinIO (always running)      │
├─────────┼───────────────────────────────────────────────────┤
│ stacks/ │  Per-tool compose.yaml (selective via profiles)   │
│         │  orchestration/ airflow, dagster                  │
│         │  transformation/ dbt, sqlmesh                     │
│         │  warehouse/ clickhouse, trino, prestodb           │
│         │  visualization/ metabase, superset, evidence      │
│         │  ingestion/ meltano, airbyte                      │
│         │  streaming/ kafka, redpanda                       │
│         │  processing/ spark, flink                         │
│         │  ai/ qdrant, mlflow, langfuse                     │
│         │  cdc/ debezium                                    │
│         │  versioning/ lakefs                               │
│         │  quality/ soda                                    │
│         │  catalog/ openmetadata                            │
│         │  lineage/ marquez                                 │
│         │  storage/ iceberg                                 │
│         │  notebook/ jupyter                                │
├─────────┼───────────────────────────────────────────────────┤
│ recipes/│  Combination presets (.env files)                 │
└─────────┴───────────────────────────────────────────────────┘
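
The include-plus-profiles wiring in the root compose.yaml can be sketched as follows (paths follow the tree above; the exact file contents may differ):

```yaml
# Root compose.yaml — illustrative sketch, not the exact file
include:
  - infra/compose.yaml                          # always-on: PostgreSQL, Redis, MinIO
  - stacks/orchestration/dagster/compose.yaml   # one include per tool under stacks/
  - stacks/transformation/dbt/compose.yaml
  # ...
```

Each included service declares `profiles: [<tool>]`, so `docker compose up` alone starts only infra, while `--profile <tool>` pulls in that tool's services.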

Quick Start

# Infrastructure only (PostgreSQL + Redis + MinIO)
docker compose up -d

# Use a recipe
docker compose --env-file recipes/basic-analytics.env up -d

# Selective profiles
docker compose --profile dagster --profile metabase up -d

# Everything
docker compose --profile "*" up -d

# Shut down
docker compose --profile "*" down

Scenarios (Verified)

1. Batch Analytics — dbt + PostgreSQL + Metabase

make basic-analytics
docker compose exec dbt sh -c "dbt seed && dbt run && dbt test"

ELT pipeline with e-commerce sample data (25 customers, 50 orders, 20 products).

| Layer | Models | Type |
|---|---|---|
| raw.* | customers, products, orders, order_items | seed (CSV → table) |
| stg.* | stg_customers, stg_orders, stg_order_items, stg_products | view |
| marts.* | fct_orders, dim_customers, mart_revenue_daily, mart_top_products | table |

  • All 29 dbt tests pass (unique, not_null, relationships, accepted_values, plus dbt-expectations)
  • Data quality tests via dbt-expectations: email regex, price range, row count, amount bounds
  • Metabase (:3030) SQL Lab and chart creation verified
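
A dbt-expectations check of this kind is declared in a model's schema file; a minimal sketch (model and regex are illustrative — the project's actual rules may differ):

```yaml
# models/staging/schema.yml — illustrative sketch
models:
  - name: stg_customers
    columns:
      - name: email
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[^@]+@[^@]+\\.[^@]+$"
```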

2. Streaming Pipeline — Kafka + ClickHouse + Superset

make streaming-pipeline

Real-time pipeline with fake clickstream events (5 events/sec).
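
The producer's event generation can be sketched with the standard library only; field names and values here are assumptions, and the real producer.py would publish each event to Kafka with a Kafka client rather than `print`:

```python
import json
import random
import time
import uuid

PAGES = ["/", "/products", "/cart", "/checkout"]
DEVICES = ["mobile", "desktop", "tablet"]

def make_event() -> dict:
    """Build one fake clickstream event (schema is illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "user_id": f"user-{random.randint(1, 25)}",
        "page": random.choice(PAGES),
        "device": random.choice(DEVICES),
    }

def run(rate_per_sec: float = 5.0, publish=print) -> None:
    """Emit events at a fixed rate; `publish` would be a Kafka send in practice."""
    while True:
        publish(json.dumps(make_event()))
        time.sleep(1.0 / rate_per_sec)
```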

Python Producer → Kafka (clickstream-events topic)
                    → ClickHouse Kafka Engine (real-time consumption)
                       → MaterializedView → MergeTree (persistent storage)
                          → clickstream_hourly (aggregation view)
  • Kafka UI (:8081) topic/message inspection verified
  • Superset (:8088) ClickHouse SQL Lab query verified
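
The ClickHouse side of this pattern typically consists of three objects: a Kafka-engine table, a MergeTree table, and a materialized view that moves rows between them. A sketch with assumed table and column names (the project's actual DDL lives in stacks/warehouse/clickhouse/init/01_create_tables.sql):

```sql
-- Illustrative sketch of the Kafka → MergeTree pattern; names are assumptions
CREATE TABLE clickstream_queue (          -- consumes the Kafka topic
    event_time DateTime, user_id String, page String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickstream-events',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONEachRow';

CREATE TABLE clickstream_events (         -- persistent storage
    event_time DateTime, user_id String, page String
) ENGINE = MergeTree ORDER BY event_time;

CREATE MATERIALIZED VIEW clickstream_mv TO clickstream_events AS
SELECT * FROM clickstream_queue;          -- moves rows as they arrive
```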

3. Lakehouse — Iceberg + Trino + MinIO + Superset

docker compose --profile iceberg --profile trino up -d

Stores PostgreSQL mart data into Iceberg tables via Trino CTAS. Parquet files are saved in the MinIO warehouse bucket.

-- Run in Trino
CREATE TABLE iceberg.analytics.customers AS
SELECT * FROM postgresql.marts.dim_customers;
  • Trino (:8090) → Iceberg REST catalog → MinIO (S3)
  • Superset (:8088) connection via trino://trino@ds-trino:8080/iceberg verified
  • Cross-catalog queries: Trino can JOIN PostgreSQL + Iceberg simultaneously
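
A cross-catalog query of this kind might look like the following, reusing the iceberg.analytics.customers table created above; join and column names are illustrative:

```sql
-- Illustrative: join an Iceberg table with a live PostgreSQL table in one query
SELECT c.customer_id, c.name, o.amount
FROM iceberg.analytics.customers AS c
JOIN postgresql.marts.fct_orders AS o
  ON c.customer_id = o.customer_id;
```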

4. Jupyter Notebooks — Interactive Analysis

docker compose --profile jupyter up -d
# http://localhost:8888 (token: datastack)

Includes 3 example notebooks, one per scenario:

| Notebook | Data Source | Contents |
|---|---|---|
| 01-batch-analytics.ipynb | PostgreSQL | dbt marts queries, revenue trends, top customers, product sales |
| 02-streaming-analytics.ipynb | ClickHouse | Clickstream analysis, device/country traffic, popular pages |
| 03-lakehouse-query.ipynb | Trino/Iceberg | Iceberg table queries, cross-catalog JOIN (PG + Iceberg) |

Custom Dockerfile with warehouse drivers: psycopg2, clickhouse-connect, trino, sqlalchemy

5. Dagster-dbt Orchestration

docker compose --profile dagster up -d
# http://localhost:3000

Manages dbt models as Dagster assets via dagster-dbt integration.

dagster-code (gRPC:4000) → dbt project mount
dagster-webserver (:3000) → asset materialization UI
dagster-daemon → schedule/sensor execution
  • @dbt_assets decorator for automatic dbt model discovery (12 assets: 4 seeds + 4 staging + 4 marts)
  • PostgreSQL backend (run/event/schedule storage)

6. OpenMetadata — Data Catalog

docker compose --profile openmetadata up -d
# http://localhost:8585 (admin / admin)

Centralized metadata store for data discovery, lineage, and governance.

  • Automatic DB migration via openmetadata-migrate init service
  • Elasticsearch backend for search/indexing
  • PostgreSQL metadata storage
  • Connectors available for PostgreSQL, ClickHouse, Trino, Kafka, and more

Make Commands

make help                # List available commands
make infra               # Start infrastructure only
make basic-analytics     # Dagster + dbt + Metabase
make streaming-pipeline  # Kafka + ClickHouse + Superset
make lakehouse           # Meltano + dbt + Iceberg + Trino + Superset
make ps                  # Running containers
make ports               # Port map
make logs SVC=<name>     # View logs
make psql                # Connect to PostgreSQL
make clean               # Full cleanup (including volumes)

Available Profiles

| Profile | Description | RAM |
|---|---|---|
| airflow | Apache Airflow (webserver + scheduler + worker) | ~2GB |
| dagster | Dagster (webserver + daemon + dbt code location) | ~1.5GB |
| dbt | dbt-core (postgres adapter) | ~256MB |
| clickhouse | ClickHouse OLAP | ~2GB |
| trino | Trino query engine | ~2GB |
| metabase | Metabase BI | ~1GB |
| superset | Apache Superset BI | ~1GB |
| meltano | Meltano EL | ~512MB |
| airbyte | Airbyte (placeholder) | ~4GB+ |
| kafka | Apache Kafka + Kafka UI + Producer | ~1.2GB |
| redpanda | Redpanda + Console | ~1.2GB |
| openmetadata | OpenMetadata + Elasticsearch | ~1.5GB |
| iceberg | Iceberg REST Catalog | ~512MB |
| jupyter | JupyterLab | ~2GB |
| qdrant | Qdrant vector database | ~2GB |
| mlflow | MLflow experiment tracking + model registry | ~2GB |
| langfuse | Langfuse LLM observability | ~1GB |
| spark | Apache Spark (master + worker) | ~3GB |
| flink | Apache Flink (jobmanager + taskmanager) | ~2GB |
| presto | PrestoDB query engine | ~2GB |
| debezium | Debezium CDC (Kafka Connect) | ~1GB |
| lakefs | lakeFS data versioning | ~512MB |
| sqlmesh | SQLMesh (dbt alternative) | ~512MB |
| soda | Soda Core data quality (CLI) | ~256MB |
| marquez | Marquez lineage (API + Web) | ~768MB |
| evidence | Evidence code-first BI | ~512MB |

Port Map

| Service | Port | Credentials |
|---|---|---|
| PostgreSQL | 5432 | admin / admin |
| Redis | 6379 | |
| MinIO Console | 9001 | minioadmin / minioadmin |
| Airflow | 8080 | admin / admin |
| Dagster | 3000 | |
| ClickHouse | 8123 | default / (empty) |
| Trino | 8090 | |
| PrestoDB | 8084 | |
| Metabase | 3030 | (setup wizard) |
| Superset | 8088 | admin / admin |
| Evidence | 3333 | |
| Meltano | 5050 | |
| Kafka UI | 8081 | |
| Redpanda Console | 8082 | |
| Qdrant | 6333 | |
| MLflow | 5005 | |
| Langfuse | 3002 | (setup wizard) |
| Spark Master UI | 8180 | |
| Flink Dashboard | 8083 | |
| Debezium Connect | 8085 | |
| lakeFS | 8001 | (setup wizard) |
| SQLMesh | 8800 | |
| OpenMetadata | 8585 | |
| Marquez Web | 3001 | |
| Iceberg REST | 8181 | |
| Jupyter | 8888 | token: datastack |

Project Structure

data-stack-lab/
├── compose.yaml                 # Root entry point (include + profiles)
├── .env                         # Shared environment variables
├── Makefile                     # Convenience commands
├── CLAUDE.md                    # Claude Code project instructions
│
├── infra/                       # Shared infrastructure (always running)
│   ├── compose.yaml             # PostgreSQL, Redis, MinIO, minio-init
│   └── init-databases.sh        # Auto-create multiple databases
│
├── stacks/
│   ├── orchestration/
│   │   ├── airflow/compose.yaml
│   │   └── dagster/
│   │       ├── compose.yaml + dagster.yaml + workspace.yaml
│   │       └── code/            # Dagster-dbt code location (gRPC server)
│   ├── transformation/
│   │   ├── dbt/
│   │   │   ├── compose.yaml + Dockerfile + profiles.yml
│   │   │   └── project/         # dbt project (seeds, models, macros, tests)
│   │   └── sqlmesh/compose.yaml
│   ├── warehouse/
│   │   ├── clickhouse/compose.yaml + init/01_create_tables.sql
│   │   ├── trino/compose.yaml + catalog/*.properties
│   │   └── prestodb/compose.yaml + config/ + catalog/
│   ├── visualization/
│   │   ├── metabase/compose.yaml
│   │   ├── superset/compose.yaml + Dockerfile + requirements.txt
│   │   └── evidence/compose.yaml + Dockerfile
│   ├── ingestion/
│   │   ├── airbyte/compose.yaml (placeholder)
│   │   └── meltano/compose.yaml
│   ├── streaming/
│   │   ├── kafka/compose.yaml + producer/ (Dockerfile, producer.py)
│   │   └── redpanda/compose.yaml
│   ├── ai/
│   │   ├── qdrant/compose.yaml (vector database)
│   │   ├── mlflow/compose.yaml (experiment tracking + model registry)
│   │   └── langfuse/compose.yaml (LLM observability)
│   ├── processing/
│   │   ├── spark/compose.yaml (master + worker)
│   │   └── flink/compose.yaml (jobmanager + taskmanager)
│   ├── cdc/
│   │   └── debezium/compose.yaml (Kafka Connect)
│   ├── versioning/
│   │   └── lakefs/compose.yaml
│   ├── quality/
│   │   └── soda/compose.yaml + Dockerfile
│   ├── lineage/
│   │   └── marquez/compose.yaml (API + Web)
│   ├── catalog/
│   │   └── openmetadata/compose.yaml  # server + elasticsearch + migrate init
│   ├── storage/
│   │   └── iceberg/compose.yaml
│   └── notebook/
│       └── jupyter/
│           ├── compose.yaml + Dockerfile + requirements.txt
│           └── notebooks/       # 3 example notebooks (PG, CH, Trino)
│
├── recipes/                     # Stack combination presets
│   ├── basic-analytics.env      # dagster, dbt, metabase
│   ├── streaming-pipeline.env   # kafka, clickhouse, superset
│   ├── full-lakehouse.env       # meltano, dbt, iceberg, trino, superset
│   ├── batch-processing.env     # spark, iceberg, trino, jupyter
│   ├── realtime-cdc.env         # debezium, kafka, flink, clickhouse
│   ├── data-governance.env      # openmetadata, marquez, lakefs, soda
│   ├── ai-ml-platform.env       # qdrant, mlflow, langfuse, jupyter
│   └── full-stack.env           # everything
│
└── .claude/                     # Claude Code configuration
    ├── settings.json            # Permissions, hooks, plugins
    ├── agents/                  # docker-infra, stack-builder, data-explorer
    ├── skills/                  # /stack-up, /stack-down, /stack-status, /stack-logs
    └── rules/                   # Docker Compose conventions

Conventions

  • Container names use ds- prefix
  • All ports bind to 127.0.0.1 (no external exposure)
  • DB storage uses named volumes (better I/O performance on macOS than bind mounts)
  • All services have deploy.resources.limits.memory
  • Healthcheck required — depends_on: condition: service_healthy
  • Docker image versions are pinned (no :latest tags)

Adding a New Stack

  1. Create stacks/<category>/<tool>/compose.yaml
  2. Add profiles: [<tool-name>] to services
  3. Add include path to root compose.yaml
  4. Add port variable to .env
  5. Update this document's Profile/Port tables
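
A new stack's compose file follows the conventions above; a minimal sketch for a hypothetical tool (image, port variable, and healthcheck endpoint are all illustrative):

```yaml
# stacks/<category>/<tool>/compose.yaml — illustrative sketch
services:
  mytool:
    image: example/mytool:1.2.3            # pinned version, never :latest
    container_name: ds-mytool              # ds- prefix convention
    profiles: [mytool]
    ports:
      - "127.0.0.1:${MYTOOL_PORT}:8080"    # localhost-only binding
    deploy:
      resources:
        limits:
          memory: 512M                     # memory limit required
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      retries: 5
    depends_on:
      postgres:
        condition: service_healthy         # wait for shared infra
```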

Claude Code Integration

This project includes Claude Code configuration for streamlined development:

  • Skills: /stack-up, /stack-down, /stack-status, /stack-logs
  • Agents: docker-infra (debugging), stack-builder (add tools), data-explorer (query data)
  • Rules: Docker Compose conventions auto-loaded when editing compose files
  • Plugins: data-engineering, code-review, github, pyright-lsp, and more

Requirements

  • Docker Desktop with 16GB+ RAM allocated (for running multiple stacks simultaneously)
  • Docker Compose v2.20+ (for include directive support)

Appendix: Modern Data Stack Landscape (2025/2026)

Tools evaluated during project development: the first table lists what is included, the second what was considered but left out.

Included in this project (26 profiles)

| Category | Tools |
|---|---|
| Orchestration | Airflow, Dagster |
| Transformation | dbt, SQLMesh |
| Warehouse / OLAP | ClickHouse, Trino, PrestoDB |
| Visualization / BI | Metabase, Superset, Evidence |
| Ingestion / EL | Meltano, Airbyte (placeholder) |
| Streaming | Kafka, Redpanda |
| Processing | Apache Spark, Apache Flink |
| AI / ML | Qdrant (vector DB), MLflow (experiment tracking), Langfuse (LLM observability) |
| CDC | Debezium |
| Data Versioning | lakeFS |
| Data Quality | Soda Core, dbt-expectations |
| Data Catalog | OpenMetadata |
| Data Lineage | Marquez |
| Data Lake | Apache Iceberg (REST catalog) |
| Notebook | JupyterLab |

Evaluated but not included

| Category | Tool | Stars | Why not included |
|---|---|---|---|
| Semantic Layer | Cube.dev | 19K+ | High value — recommended next addition. API-first metrics layer for BI + AI |
| Orchestration | Kestra | 18K+ | YAML-first paradigm, distinct from Airflow/Dagster. Good future addition |
| Orchestration | Prefect | 20K+ | Redundant with Airflow + Dagster |
| Orchestration | Temporal | 16K+ | Not data-specific (microservice orchestration) |
| Embedded OLAP | DuckDB | 28K+ | In-process engine, not a server. Best used inside Jupyter/Spark containers |
| Data Quality | Great Expectations | 10K+ | Library-based. Soda Core + dbt-expectations cover the same ground |
| dbt Observability | Elementary | 2K+ | dbt package — add as a dbt dependency, not a standalone service |
| Data Contracts | DataContract CLI | 1.5K+ | Lightweight CLI tool. Useful for CI/CD, not a running service |
| Feature Store | Feast | 5.6K+ | Niche ML use case. Add when needed |
| Reverse ETL | Multiwoven | 1.6K+ | Data activation. Add when needed |
| Vector DB | Milvus | 32K+ | Heavy (designed for billions of vectors). Qdrant is a better fit for a lab |
| Vector DB | Weaviate | 12K+ | Graph-hybrid adds complexity. Qdrant is simpler |
| Catalog | DataHub | 10K+ | Heavier than OpenMetadata (5+ containers). Redundant |
| Catalog | Nessie | 1K+ | Overlaps with lakeFS at a different layer (catalog vs storage) |
| Query Engine | Apache DataFusion | 8K+ | Library, not a standalone service. Embedded in other tools |
| OLAP | StarRocks | 10K+ | No ARM64 support; incompatible with Apple Silicon |
| OLAP | Apache Druid | 14K+ | 6+ containers, complex setup. ClickHouse covers OLAP |
| Semantic Layer | MetricFlow | — | Tightly coupled with dbt Cloud. Cube.dev is more versatile |
| Data Integration | Apache NiFi | — | GUI-based. Meltano/Airbyte cover ingestion |

Key 2025/2026 trends

  1. AI/ML Integration — Vector databases, LLM observability, and experiment tracking are becoming core data stack components
  2. Semantic Layer — Cube.dev and similar tools provide consistent metric definitions across BI, AI, and embedded analytics
  3. Data Contracts — Formal agreements between data producers and consumers (YAML-based specs)
  4. Declarative Orchestration — YAML-first tools (Kestra) gaining traction alongside code-first (Airflow/Dagster)
  5. Lakehouse Architecture — Iceberg + object storage (MinIO/S3) replacing traditional warehouses for large-scale analytics
  6. Data Quality as Code — Shift-left testing with dbt-expectations, Soda, and Elementary

About

Docker-based Modern Data Stack sandbox — 26+ tools (Airflow, dbt, Kafka, ClickHouse, Trino, Spark, Flink, Iceberg, MLflow, and more) with one-command recipes for batch ELT, streaming, lakehouse, and AI/ML pipelines
