A self-contained, Docker-based Modern Data Stack sandbox for learning, experimenting, and prototyping data pipelines on a local machine.
Spin up any combination of 26+ open-source data tools — from batch ELT and real-time streaming to lakehouse analytics and AI/ML — with a single docker compose command. Every tool is isolated behind a Docker Compose profile, so you only run what you need.
- Build batch ELT pipelines with dbt, Dagster, and Metabase
- Run real-time streaming with Kafka, ClickHouse, and Superset
- Set up a lakehouse with Iceberg, Trino, and MinIO (Parquet on S3)
- Track ML experiments with MLflow and monitor LLMs with Langfuse
- Perform CDC from PostgreSQL to Kafka via Debezium
- Compare query engines (Trino vs PrestoDB) on the same data
- Explore data lineage with Marquez and catalog data with OpenMetadata
- Run ad-hoc analysis in Jupyter notebooks connected to all data sources
- Modular: each tool is a standalone compose file — add or remove tools without affecting others
- Lightweight: shared PostgreSQL, Redis, and MinIO infrastructure — no per-tool database bloat
- Reproducible: all image versions pinned, memory limits set, healthchecks enforced
- Secure by default: all ports bound to `127.0.0.1`, no external exposure
- Recipe-driven: 8 pre-built `.env` presets for common stack combinations
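Each recipe is just an `.env` preset that pre-selects Compose profiles via the standard `COMPOSE_PROFILES` variable. A minimal sketch of what one might contain (illustrative; see `recipes/` for the shipped files):

```
# recipes/basic-analytics.env (illustrative sketch)
COMPOSE_PROFILES=dagster,dbt,metabase
```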
```
┌─────────────────────────────────────────────────────────────┐
│                        compose.yaml                         │
│                    (include + profiles)                     │
├─────────┬───────────────────────────────────────────────────┤
│ infra/  │ PostgreSQL · Redis · MinIO (always running)       │
├─────────┼───────────────────────────────────────────────────┤
│ stacks/ │ Per-tool compose.yaml (selective via profiles)    │
│         │   orchestration/  airflow, dagster                │
│         │   transformation/ dbt, sqlmesh                    │
│         │   warehouse/      clickhouse, trino, prestodb     │
│         │   visualization/  metabase, superset, evidence    │
│         │   ingestion/      meltano, airbyte                │
│         │   streaming/      kafka, redpanda                 │
│         │   processing/     spark, flink                    │
│         │   ai/             qdrant, mlflow, langfuse        │
│         │   cdc/            debezium                        │
│         │   versioning/     lakefs                          │
│         │   quality/        soda                            │
│         │   catalog/        openmetadata                    │
│         │   lineage/        marquez                         │
│         │   storage/        iceberg                         │
│         │   notebook/       jupyter                         │
├─────────┼───────────────────────────────────────────────────┤
│ recipes/│ Combination presets (.env files)                  │
└─────────┴───────────────────────────────────────────────────┘
```
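The root `compose.yaml` wires this layout together with Docker Compose's `include` directive, and each included stack guards its services with a profile. A minimal sketch (illustrative; the real file lists every stack):

```yaml
# compose.yaml (sketch)
include:
  - infra/compose.yaml                          # always-on PostgreSQL/Redis/MinIO
  - stacks/orchestration/dagster/compose.yaml   # starts only with --profile dagster
  - stacks/visualization/metabase/compose.yaml  # starts only with --profile metabase
```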
```bash
# Infrastructure only (PostgreSQL + Redis + MinIO)
docker compose up -d

# Use a recipe
docker compose --env-file recipes/basic-analytics.env up -d

# Selective profiles
docker compose --profile dagster --profile metabase up -d

# Everything
docker compose --profile "*" up -d

# Shut down
docker compose --profile "*" down
```

```bash
make basic-analytics
docker compose exec dbt sh -c "dbt seed && dbt run && dbt test"
```

ELT pipeline with e-commerce sample data (25 customers, 50 orders, 20 products).
| Layer | Models | Type |
|---|---|---|
| `raw.*` | customers, products, orders, order_items | seed (CSV → table) |
| `stg.*` | stg_customers, stg_orders, stg_order_items, stg_products | view |
| `marts.*` | fct_orders, dim_customers, mart_revenue_daily, mart_top_products | table |
- 29 dbt tests all PASS (unique, not_null, relationships, accepted_values + dbt-expectations)
- Data quality tests via `dbt-expectations`: email regex, price range, row count, amount bounds
- Metabase (:3030) SQL Lab + chart creation verified
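As a sketch, the dbt-expectations checks above might be declared like this in a model schema file (column names and thresholds here are illustrative; the actual tests live in the dbt project):

```yaml
# models/staging/schema.yml (illustrative sketch)
models:
  - name: stg_customers
    columns:
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[^@]+@[^@]+\\.[^@]+$"
  - name: stg_products
    columns:
      - name: price
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```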
```bash
make streaming-pipeline
```

Real-time pipeline with fake clickstream events (5 events/sec).

```
Python Producer → Kafka (clickstream-events topic)
    → ClickHouse Kafka Engine (real-time consumption)
    → MaterializedView → MergeTree (persistent storage)
    → clickstream_hourly (aggregation view)
```
- Kafka UI (:8081) topic/message inspection verified
- Superset (:8088) ClickHouse SQL Lab query verified
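The producer side of this pipeline boils down to a small loop. A sketch assuming `kafka-python` and the `clickstream-events` topic from the diagram (event field names are illustrative, not the shipped producer):

```python
import json
import random
import time
import uuid

TOPIC = "clickstream-events"

def make_event() -> dict:
    """Generate one fake clickstream event (field names are illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": random.choice(["page_view", "click", "add_to_cart"]),
        "page": random.choice(["/", "/products", "/checkout"]),
        "device": random.choice(["mobile", "desktop"]),
        "ts": time.time(),
    }

def run(bootstrap: str = "localhost:9092", events_per_sec: int = 5) -> None:
    """Stream fake events into Kafka at a fixed rate."""
    # Imported here so make_event() stays usable without kafka-python installed.
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        producer.send(TOPIC, make_event())
        time.sleep(1 / events_per_sec)

if __name__ == "__main__":
    print(make_event())  # swap for run() to stream into Kafka
```

Inside the Compose network the bootstrap address would be the Kafka container name rather than `localhost`.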
```bash
docker compose --profile iceberg --profile trino up -d
```

Stores PostgreSQL mart data in Iceberg tables via Trino CTAS. Parquet files are written to the MinIO warehouse bucket.

```sql
-- Run in Trino
CREATE TABLE iceberg.analytics.customers AS
SELECT * FROM postgresql.marts.dim_customers;
```

- Trino (:8090) → Iceberg REST catalog → MinIO (S3)
- Superset (:8088) connection via `trino://trino@ds-trino:8080/iceberg` verified
- Cross-catalog queries: Trino can JOIN PostgreSQL and Iceberg tables in a single query
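Behind this sits a Trino catalog definition pointing at the Iceberg REST catalog and MinIO. A sketch of what `stacks/warehouse/trino/catalog/iceberg.properties` might contain (hostnames assume the `ds-` container prefix; verify property names against your Trino version):

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://ds-iceberg-rest:8181
fs.native-s3.enabled=true
s3.endpoint=http://ds-minio:9000
s3.path-style-access=true
s3.region=us-east-1
```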
```bash
docker compose --profile jupyter up -d
# http://localhost:8888 (token: datastack)
```

Includes three example notebooks, one per scenario:
| Notebook | Data Source | Contents |
|---|---|---|
| `01-batch-analytics.ipynb` | PostgreSQL | dbt marts queries, revenue trends, top customers, product sales |
| `02-streaming-analytics.ipynb` | ClickHouse | Clickstream analysis, device/country traffic, popular pages |
| `03-lakehouse-query.ipynb` | Trino/Iceberg | Iceberg table queries, cross-catalog JOIN (PG + Iceberg) |
Custom Dockerfile with warehouse drivers: psycopg2, clickhouse-connect, trino, sqlalchemy
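A first cell in the batch-analytics notebook might look like the following sketch (the database name `warehouse` and the query are assumptions based on the defaults and mart names in this README):

```python
# Illustrative notebook helpers; pandas/sqlalchemy/psycopg2 ship in the custom image.
REVENUE_SQL = "SELECT * FROM marts.mart_revenue_daily ORDER BY 1 DESC LIMIT 7"

def pg_url(user="admin", password="admin", host="localhost", port=5432, db="warehouse"):
    """SQLAlchemy URL for the shared PostgreSQL (db name is an assumption)."""
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}"

def revenue_last_days(engine):
    """Load the daily-revenue mart into a DataFrame."""
    import pandas as pd  # imported lazily so the URL helper works standalone
    return pd.read_sql(REVENUE_SQL, engine)
```

In a cell: `from sqlalchemy import create_engine; revenue_last_days(create_engine(pg_url()))`. From inside the Jupyter container the host would be the PostgreSQL container name rather than `localhost`.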
```bash
docker compose --profile dagster up -d
# http://localhost:3000
```

Manages dbt models as Dagster assets via the dagster-dbt integration.

```
dagster-code (gRPC:4000)   → dbt project mount
dagster-webserver (:3000)  → asset materialization UI
dagster-daemon             → schedule/sensor execution
```

- `@dbt_assets` decorator for automatic dbt model discovery (12 assets: 4 seeds + 4 staging + 4 marts)
- PostgreSQL backend (run/event/schedule storage)
```bash
docker compose --profile openmetadata up -d
# http://localhost:8585 (admin / admin)
```

Centralized metadata store for data discovery, lineage, and governance.

- Automatic DB migration via the `openmetadata-migrate` init service
- Elasticsearch backend for search/indexing
- PostgreSQL metadata storage
- Connectors available for PostgreSQL, ClickHouse, Trino, Kafka, and more
```bash
make help                 # List available commands
make infra                # Start infrastructure only
make basic-analytics      # Dagster + dbt + Metabase
make streaming-pipeline   # Kafka + ClickHouse + Superset
make lakehouse            # Meltano + dbt + Iceberg + Trino + Superset
make ps                   # Running containers
make ports                # Port map
make logs SVC=<name>      # View logs
make psql                 # Connect to PostgreSQL
make clean                # Full cleanup (including volumes)
```

| Profile | Description | RAM |
|---|---|---|
| `airflow` | Apache Airflow (webserver + scheduler + worker) | ~2GB |
| `dagster` | Dagster (webserver + daemon + dbt code location) | ~1.5GB |
| `dbt` | dbt-core (postgres adapter) | ~256MB |
| `clickhouse` | ClickHouse OLAP | ~2GB |
| `trino` | Trino query engine | ~2GB |
| `metabase` | Metabase BI | ~1GB |
| `superset` | Apache Superset BI | ~1GB |
| `meltano` | Meltano EL | ~512MB |
| `airbyte` | Airbyte (placeholder) | ~4GB+ |
| `kafka` | Apache Kafka + Kafka UI + Producer | ~1.2GB |
| `redpanda` | Redpanda + Console | ~1.2GB |
| `openmetadata` | OpenMetadata + Elasticsearch | ~1.5GB |
| `iceberg` | Iceberg REST Catalog | ~512MB |
| `jupyter` | JupyterLab | ~2GB |
| `qdrant` | Qdrant vector database | ~2GB |
| `mlflow` | MLflow experiment tracking + model registry | ~2GB |
| `langfuse` | Langfuse LLM observability | ~1GB |
| `spark` | Apache Spark (master + worker) | ~3GB |
| `flink` | Apache Flink (jobmanager + taskmanager) | ~2GB |
| `presto` | PrestoDB query engine | ~2GB |
| `debezium` | Debezium CDC (Kafka Connect) | ~1GB |
| `lakefs` | lakeFS data versioning | ~512MB |
| `sqlmesh` | SQLMesh (dbt alternative) | ~512MB |
| `soda` | Soda Core data quality (CLI) | ~256MB |
| `marquez` | Marquez lineage (API + Web) | ~768MB |
| `evidence` | Evidence code-first BI | ~512MB |
| Service | Port | Credentials |
|---|---|---|
| PostgreSQL | 5432 | admin / admin |
| Redis | 6379 | — |
| MinIO Console | 9001 | minioadmin / minioadmin |
| Airflow | 8080 | admin / admin |
| Dagster | 3000 | — |
| ClickHouse | 8123 | default / (empty) |
| Trino | 8090 | — |
| PrestoDB | 8084 | — |
| Metabase | 3030 | (setup wizard) |
| Superset | 8088 | admin / admin |
| Evidence | 3333 | — |
| Meltano | 5050 | — |
| Kafka UI | 8081 | — |
| Redpanda Console | 8082 | — |
| Qdrant | 6333 | — |
| MLflow | 5005 | — |
| Langfuse | 3002 | (setup wizard) |
| Spark Master UI | 8180 | — |
| Flink Dashboard | 8083 | — |
| Debezium Connect | 8085 | — |
| lakeFS | 8001 | (setup wizard) |
| SQLMesh | 8800 | — |
| OpenMetadata | 8585 | — |
| Marquez Web | 3001 | — |
| Iceberg REST | 8181 | — |
| Jupyter | 8888 | token: datastack |
```
data-stack-lab/
├── compose.yaml                  # Root entry point (include + profiles)
├── .env                          # Shared environment variables
├── Makefile                      # Convenience commands
├── CLAUDE.md                     # Claude Code project instructions
│
├── infra/                        # Shared infrastructure (always running)
│   ├── compose.yaml              # PostgreSQL, Redis, MinIO, minio-init
│   └── init-databases.sh         # Auto-create multiple databases
│
├── stacks/
│   ├── orchestration/
│   │   ├── airflow/compose.yaml
│   │   └── dagster/
│   │       ├── compose.yaml + dagster.yaml + workspace.yaml
│   │       └── code/             # Dagster-dbt code location (gRPC server)
│   ├── transformation/
│   │   ├── dbt/
│   │   │   ├── compose.yaml + Dockerfile + profiles.yml
│   │   │   └── project/          # dbt project (seeds, models, macros, tests)
│   │   └── sqlmesh/compose.yaml
│   ├── warehouse/
│   │   ├── clickhouse/compose.yaml + init/01_create_tables.sql
│   │   ├── trino/compose.yaml + catalog/*.properties
│   │   └── prestodb/compose.yaml + config/ + catalog/
│   ├── visualization/
│   │   ├── metabase/compose.yaml
│   │   ├── superset/compose.yaml + Dockerfile + requirements.txt
│   │   └── evidence/compose.yaml + Dockerfile
│   ├── ingestion/
│   │   ├── airbyte/compose.yaml (placeholder)
│   │   └── meltano/compose.yaml
│   ├── streaming/
│   │   ├── kafka/compose.yaml + producer/ (Dockerfile, producer.py)
│   │   └── redpanda/compose.yaml
│   ├── ai/
│   │   ├── qdrant/compose.yaml (vector database)
│   │   ├── mlflow/compose.yaml (experiment tracking + model registry)
│   │   └── langfuse/compose.yaml (LLM observability)
│   ├── processing/
│   │   ├── spark/compose.yaml (master + worker)
│   │   └── flink/compose.yaml (jobmanager + taskmanager)
│   ├── cdc/
│   │   └── debezium/compose.yaml (Kafka Connect)
│   ├── versioning/
│   │   └── lakefs/compose.yaml
│   ├── quality/
│   │   └── soda/compose.yaml + Dockerfile
│   ├── lineage/
│   │   └── marquez/compose.yaml (API + Web)
│   ├── catalog/
│   │   └── openmetadata/compose.yaml  # server + elasticsearch + migrate init
│   ├── storage/
│   │   └── iceberg/compose.yaml
│   └── notebook/
│       └── jupyter/
│           ├── compose.yaml + Dockerfile + requirements.txt
│           └── notebooks/        # 3 example notebooks (PG, CH, Trino)
│
├── recipes/                      # Stack combination presets
│   ├── basic-analytics.env       # dagster, dbt, metabase
│   ├── streaming-pipeline.env    # kafka, clickhouse, superset
│   ├── full-lakehouse.env        # meltano, dbt, iceberg, trino, superset
│   ├── batch-processing.env      # spark, iceberg, trino, jupyter
│   ├── realtime-cdc.env          # debezium, kafka, flink, clickhouse
│   ├── data-governance.env       # openmetadata, marquez, lakefs, soda
│   ├── ai-ml-platform.env        # qdrant, mlflow, langfuse, jupyter
│   └── full-stack.env            # everything
│
└── .claude/                      # Claude Code configuration
    ├── settings.json             # Permissions, hooks, plugins
    ├── agents/                   # docker-infra, stack-builder, data-explorer
    ├── skills/                   # /stack-up, /stack-down, /stack-status, /stack-logs
    └── rules/                    # Docker Compose conventions
```
- Container names use the `ds-` prefix
- All ports bind to `127.0.0.1` (no external exposure)
- DB storage uses named volumes (macOS performance)
- All services have `deploy.resources.limits.memory`
- Healthchecks are required: `depends_on: condition: service_healthy`
- Docker image versions are pinned (no `:latest` tags)
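Put together, a convention-compliant service definition looks roughly like this (a sketch; image, ports, and the health endpoint are placeholders):

```yaml
services:
  example:
    image: example/tool:1.2.3                   # pinned, never :latest
    container_name: ds-example                  # ds- prefix
    profiles: [example]                         # opt-in activation
    ports:
      - "127.0.0.1:${EXAMPLE_PORT:-8099}:8080"  # loopback only
    deploy:
      resources:
        limits:
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 5
```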
- Create `stacks/<category>/<tool>/compose.yaml`
- Add `profiles: [<tool-name>]` to its services
- Add the `include` path to the root `compose.yaml`
- Add a port variable to `.env`
- Update this document's Profile/Port tables
This project includes Claude Code configuration for streamlined development:
- Skills: `/stack-up`, `/stack-down`, `/stack-status`, `/stack-logs`
- Agents: `docker-infra` (debugging), `stack-builder` (add tools), `data-explorer` (query data)
- Rules: Docker Compose conventions auto-loaded when editing compose files
- Plugins: data-engineering, code-review, github, pyright-lsp, and more
- Docker Desktop with 16GB+ RAM allocated (for running multiple stacks simultaneously)
- Docker Compose v2.20+ (for `include` directive support)
Tools evaluated during project development. Included tools are marked with ✅; candidates that didn't make the cut are listed below with the reason.
| Category | Tool | Status |
|---|---|---|
| Orchestration | Airflow, Dagster | ✅ |
| Transformation | dbt, SQLMesh | ✅ |
| Warehouse / OLAP | ClickHouse, Trino, PrestoDB | ✅ |
| Visualization / BI | Metabase, Superset, Evidence | ✅ |
| Ingestion / EL | Meltano, Airbyte (placeholder) | ✅ |
| Streaming | Kafka, Redpanda | ✅ |
| Processing | Apache Spark, Apache Flink | ✅ |
| AI / ML | Qdrant (vector DB), MLflow (experiment tracking), Langfuse (LLM observability) | ✅ |
| CDC | Debezium | ✅ |
| Data Versioning | lakeFS | ✅ |
| Data Quality | Soda Core, dbt-expectations | ✅ |
| Data Catalog | OpenMetadata | ✅ |
| Data Lineage | Marquez | ✅ |
| Data Lake | Apache Iceberg (REST catalog) | ✅ |
| Notebook | JupyterLab | ✅ |
| Category | Tool | Stars | Why not included |
|---|---|---|---|
| Semantic Layer | Cube.dev | 19K+ | High value — recommended next addition. API-first metrics layer for BI + AI |
| Orchestration | Kestra | 18K+ | YAML-first paradigm. Different from Airflow/Dagster. Good future addition |
| Orchestration | Prefect | 20K+ | Redundant with Airflow + Dagster |
| Orchestration | Temporal | 16K+ | Not data-specific (microservice orchestration) |
| Embedded OLAP | DuckDB | 28K+ | In-process engine, not a server. Best used inside Jupyter/Spark containers |
| Data Quality | Great Expectations | 10K+ | Library-based. Soda Core + dbt-expectations cover the same ground |
| dbt Observability | Elementary | 2K+ | dbt package — add as dbt dependency, not a standalone service |
| Data Contracts | DataContract CLI | 1.5K+ | Lightweight CLI tool. Useful for CI/CD, not a running service |
| Feature Store | Feast | 5.6K+ | Niche ML use case. Add when needed |
| Reverse ETL | Multiwoven | 1.6K+ | Data activation. Add when needed |
| Vector DB | Milvus | 32K+ | Heavy (designed for billions of vectors). Qdrant is better for lab |
| Vector DB | Weaviate | 12K+ | Graph-hybrid adds complexity. Qdrant is simpler |
| Catalog | DataHub | 10K+ | Heavier than OpenMetadata (5+ containers). Redundant |
| Catalog | Nessie | 1K+ | Overlaps with lakeFS at different layer (catalog vs storage) |
| Query Engine | Apache DataFusion | 8K+ | Library, not standalone service. Embedded in other tools |
| OLAP | StarRocks | 10K+ | No ARM64 support. Incompatible with Apple Silicon |
| OLAP | Apache Druid | 14K+ | 6+ containers, complex setup. ClickHouse covers OLAP |
| Semantic Layer | MetricFlow | — | Tightly coupled with dbt Cloud. Cube.dev is more versatile |
| Data Integration | Apache NiFi | — | GUI-based. Meltano/Airbyte cover ingestion |
- AI/ML Integration — Vector databases, LLM observability, and experiment tracking are becoming core data stack components
- Semantic Layer — Cube.dev and similar tools provide consistent metric definitions across BI, AI, and embedded analytics
- Data Contracts — Formal agreements between data producers and consumers (YAML-based specs)
- Declarative Orchestration — YAML-first tools (Kestra) gaining traction alongside code-first (Airflow/Dagster)
- Lakehouse Architecture — Iceberg + object storage (MinIO/S3) replacing traditional warehouses for large-scale analytics
- Data Quality as Code — Shift-left testing with dbt-expectations, Soda, and Elementary