A comprehensive data infrastructure platform for the SURIMI project, built on the EDITO platform of the European Commission.
SURIMI DataLab provides:
- 📊 CSV & Shapefile ingestion – Automated detection and processing of tabular and geospatial data files
- 🗺️ Geospatial support – Shapefile (`.shp`) ingestion with geometry stored as WKT, queryable via Trino
- 🏷️ Metadata management – DataHub for data discovery, lineage, and ownership
- 🔍 SQL analytics – Trino for unified queries across all data sources
- 📈 BI dashboards – Superset for data visualization
- 📓 Notebooks – JupyterHub for exploratory analysis
- 🔄 Orchestration – Airflow for workflow automation
```bash
# 1. Review .env for ports and credentials (optional)

# 2. Build Superset with the Trino driver (first run only)
docker-compose build superset

# 3. Start all services (MinIO bootstrap runs automatically via minio-init)
docker-compose --profile debug up -d

# 4. Access the platform
# MinIO:      http://localhost:9001 (minioadmin/minioadmin)
# Trino:      http://localhost:8080
# Superset:   http://localhost:8088 (login via Keycloak SSO)
# DataHub:    http://localhost:9002
# Airflow:    http://localhost:8083
# JupyterHub: http://localhost:8000 (login via Keycloak SSO)
# Keycloak:   http://localhost:8180 (admin/keycloak)
```

See QUICKSTART.md for detailed setup instructions.
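To confirm the core services are reachable after startup, a quick check like the one below can help. This is a minimal sketch (not part of the project tooling) using only the Python standard library; the endpoints assume the default ports listed in this README.

```python
# Quick reachability check for the local stack (hypothetical helper, not shipped with the repo).
from urllib.request import urlopen
from urllib.error import URLError

ENDPOINTS = {
    "MinIO": "http://localhost:9000/minio/health/live",  # MinIO liveness probe
    "Trino": "http://localhost:8080/v1/info",            # Trino server info
    "Superset": "http://localhost:8088/health",          # Superset health endpoint
}

for name, url in ENDPOINTS.items():
    try:
        with urlopen(url, timeout=5) as resp:
            print(f"{name:<10} OK   (HTTP {resp.status})")
    except URLError as exc:
        print(f"{name:<10} DOWN ({exc})")
```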
SURIMI DataLab implements a modern data lakehouse architecture with a single-bucket design:
```
CSV Files / Shapefiles (Raw Data)
        ↓
Upload to: bucket/raw/
        ↓
Airflow DAG: File Monitoring & Validation
        ↓
Move to: bucket/staging/ (temporary)
        ↓
Convert to Parquet → bucket/hive/schema/table/ (permanent)
  - CSV: columns preserved as-is
  - Shapefile: attributes + geometry (WKT string)
        ↓
Archive originals → bucket/processed/
        ↓
Hive Metastore Catalog
        ↓
Trino SQL Interface
  ├─ Superset Dashboards
  ├─ JupyterHub Notebooks
  └─ DataHub Metadata (with geospatial tags & CRS info)
```

Parquet + Hive + Trino + Postgres (current deployment):

```
Airflow DAG -> Parquet in MinIO (bucket/hive/<schema>/<table>/)
            -> Hive Metastore (metadata in Postgres)
            -> Trino Hive connector (SQL queries)
            -> Superset / JupyterHub / DataHub
```
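Once a table has been registered in the Hive catalog, it can be queried from Python with the trino client (pre-installed in the JupyterHub image). A minimal sketch; the user name is arbitrary for the default setup, and the schema/table names reuse the examples from the shapefile section below.

```python
# Minimal sketch: query a Hive table through Trino from Python.
# Host/port match the local docker-compose defaults; schema/table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="localhost",       # use "trino" when running inside the compose network
    port=8080,
    user="datalab",         # any identifier works with the default (no-auth) config
    catalog="hive",
    schema="DataLakeFile",  # schema derived from the upload folder name
)

cur = conn.cursor()
cur.execute("SELECT * FROM rivers LIMIT 10")
for row in cur.fetchall():
    print(row)
```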
Storage Layout:
- `bucket/raw/` – Upload location (temporary)
- `bucket/staging/` – CSV processing (temporary)
- `bucket/processed/` – CSV + README archive (permanent)
- `bucket/hive/` – Parquet tables organized by schema (created automatically by the pipeline)
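To inspect this layout from a script or notebook, a listing like the following works. This is an illustrative sketch: the endpoint and credentials are the local defaults shown above, and the bucket name `data` is an assumption based on the paths used later in this README.

```python
# List the top-level prefixes of the data lake bucket (illustrative sketch).
# Endpoint, credentials, and bucket name are assumptions based on the local defaults.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

resp = s3.list_objects_v2(Bucket="data", Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # expect raw/, staging/, processed/, hive/
```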
| Service | Port | Purpose |
|---|---|---|
| MinIO | 9000, 9001 | S3-compatible object storage |
| Trino | 8080 | SQL query engine |
| Superset | 8088 | BI & visualization |
| DataHub | 3000 | Metadata management |
| Airflow | 8083 | Workflow orchestration |
| JupyterHub | 8000 | Notebooks & analysis |
| Keycloak | 8180 | Identity provider — SSO for all services (--profile debug) |
| Hive Metastore | 9083 | Table catalog |
| PostgreSQL | 5432 | Metadata database |
| Elasticsearch | 9200 | Search & indexing |
| Neo4j | 7474, 7687 | Graph database for lineage |
JupyterHub spawns an isolated Docker container (built from jupyterhub/notebook.Dockerfile) for each user. The image is based on jupyter/datascience-notebook with extra packages pre-installed (trino, boto3, s3fs, pyarrow, sqlalchemy).
```
/home/jovyan/work/            ← rw — per-user named volume, persists across sessions
/home/jovyan/work/shared-ro/  ← ro — shared reference notebooks (all users)
/home/jovyan/work/shared-rw/  ← rw — collaborative notebooks (all users) [optional]
```
Personal work is stored in a Docker named volume (jupyterhub-user-<username>). Deleting a user container does not delete their notebooks — the volume is kept until explicitly removed.
Before each server starts, the hub automatically provisions a dedicated MinIO user and injects credentials directly as environment variables:
| Variable | Description |
|---|---|
| `MINIO_ACCESS_KEY` | Per-user access key |
| `MINIO_SECRET_KEY` | HMAC-derived secret key |
| `MINIO_ENDPOINT` | MinIO host (`minio:9000`) |
| `MINIO_BUCKET` | Bucket name |
Use them in notebooks:
```python
import os, boto3

s3 = boto3.client(
    "s3",
    endpoint_url=f"http://{os.environ['MINIO_ENDPOINT']}",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
```

See jupyterhub/README.md for full configuration details.
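Since s3fs and pyarrow are also pre-installed, the same injected credentials can load a Parquet table straight into pandas. A sketch; the schema/table path below is a hypothetical example.

```python
# Read a Parquet table from the hive/ prefix with pandas + s3fs (illustrative sketch).
# The schema/table path is a placeholder; credentials come from the injected env vars.
import os
import pandas as pd

df = pd.read_parquet(
    f"s3://{os.environ['MINIO_BUCKET']}/hive/DataLakeFile/rivers/",
    storage_options={
        "key": os.environ["MINIO_ACCESS_KEY"],
        "secret": os.environ["MINIO_SECRET_KEY"],
        "client_kwargs": {"endpoint_url": f"http://{os.environ['MINIO_ENDPOINT']}"},
    },
)
print(df.head())
```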
📚 Complete Documentation Index - Full documentation roadmap
- QUICKSTART.md – Complete setup guide (get running in 30 minutes)
- Operations Guide – Day-to-day operations and commands
- Ingestion Modes – MERGE/REPLACE/APPEND modes, primary keys, configuration
- Schema Naming – Intelligent schema naming from folder structure
- Shapefile Ingestion – Geospatial data support (`.shp` → Parquet → Hive → DataHub)
- Single Bucket Architecture – Single-bucket architecture guide
- Architecture Guide – Technical deep dive with code examples
- CLAUDE.md – Development guide for Claude Code AI assistant
- Superset Setup – Analytics dashboards setup
- DataHub Setup – Metadata management setup
- DataHub Integration – DataHub integration guide
- Deployment Checklist – Production deployment checklist
- Contributing Guide – Development workflow and guidelines
- Automatic detection of new CSV and Shapefile files in MinIO
- Shapefile support – `.shp` bundles (`.shp`, `.dbf`, `.shx`, `.prj`, etc.) ingested as Hive tables with geometry stored as WKT
- README-driven schema validation or auto-detection
- MERGE/UPSERT mode with primary key deduplication (see the sketch after this list)
- Intelligent schema naming from folder structure
- Multilingual support (Greek, German, French, English)
- Audit trail with README archival
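The MERGE/UPSERT idea boils down to keeping the newest row per primary key when a new delivery overlaps existing data. A conceptual pandas sketch (not the DAG's actual implementation), with a hypothetical `id` primary key:

```python
# Conceptual sketch of MERGE/UPSERT with primary-key deduplication (not the pipeline's code).
# Existing rows and a new delivery are combined; the latest row per primary key wins.
import pandas as pd

existing = pd.DataFrame({"id": [1, 2], "name": ["old-a", "old-b"]})
incoming = pd.DataFrame({"id": [2, 3], "name": ["new-b", "new-c"]})

merged = (
    pd.concat([existing, incoming])
      .drop_duplicates(subset=["id"], keep="last")  # incoming rows override existing ones
      .reset_index(drop=True)
)
print(merged)  # rows: 1/old-a, 2/new-b, 3/new-c
```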
- Centralized data catalog in DataHub
- Automatic lineage tracking
- Data ownership and tagging
- Data quality metrics
- SQL queries via Trino
- Pre-built dashboards in Superset
- JupyterHub notebooks for exploration
- Python/Pandas integration
- Airflow DAGs for automated pipelines
- Data quality checks
- Error handling and alerting
- Workflow scheduling
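As a rough picture of how such a pipeline is wired together, a DAG chaining detection, conversion, and registration tasks could look like the sketch below. This is a generic illustration, not the repository's `comprehensive_csv_ingestion` DAG.

```python
# Generic Airflow DAG sketch: hourly ingestion pipeline (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def detect_new_files(**context):
    """Placeholder: scan bucket/raw/ for new CSV or shapefile uploads."""


def convert_to_parquet(**context):
    """Placeholder: convert staged files to Parquet under bucket/hive/."""


def register_in_hive(**context):
    """Placeholder: create or refresh the external table in the Hive Metastore."""


with DAG(
    dag_id="example_ingestion_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",   # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    detect = PythonOperator(task_id="detect_new_files", python_callable=detect_new_files)
    convert = PythonOperator(task_id="convert_to_parquet", python_callable=convert_to_parquet)
    register = PythonOperator(task_id="register_in_hive", python_callable=register_in_hive)

    detect >> convert >> register
```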
The Airflow DAG automatically detects and processes ESRI Shapefiles uploaded to the raw/ folder in MinIO, following the same pipeline as CSV files.
| Extension | Role | Required |
|---|---|---|
| `.shp` | Geometry data | Yes |
| `.dbf` | Attribute table | Yes |
| `.shx` | Shape index | Yes |
| `.prj` | Projection / CRS info | Recommended |
| `.cpg` | Character encoding | Optional |
| `.sbn`, `.sbx` | Spatial index | Optional |
All files sharing the same base name in the same folder are treated as one shapefile bundle.
- Upload the complete shapefile bundle to MinIO under `data/raw/<folder>/`:

  ```
  data/raw/DataLakeFile/rivers.shp
  data/raw/DataLakeFile/rivers.dbf
  data/raw/DataLakeFile/rivers.shx
  data/raw/DataLakeFile/rivers.prj
  ```

- Wait for the Airflow DAG to run (hourly), or trigger it manually:

  ```
  docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
  ```

- The DAG will:
  - Auto-detect the schema from the `.dbf` attribute table
  - Convert geometry to a WKT string (stored as `VARCHAR` in Hive)
  - Upload Parquet to `data/hive/<schema>/<table>/`
  - Create a Hive external table queryable via Trino
  - Emit metadata to DataHub with `geospatial` and `shapefile` tags, CRS, and geometry type
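The geometry-to-WKT conversion step can be pictured roughly as follows. This is a sketch assuming GeoPandas and PyArrow; the actual DAG implementation may differ.

```python
# Sketch of shapefile -> Parquet with geometry as WKT (assumes geopandas + pyarrow; not the DAG's code).
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("rivers.shp")                    # reads the whole bundle (.shp/.dbf/.shx/.prj)
crs = gdf.crs.to_string() if gdf.crs else "unknown"  # CRS taken from the .prj file, if present

# Plain DataFrame: attribute columns kept as-is, geometry flattened to a WKT string
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry"] = gdf.geometry.to_wkt()

df.to_parquet("rivers.parquet", index=False)         # then uploaded to data/hive/<schema>/rivers/
print("CRS:", crs, "| geometry type:", gdf.geom_type.iloc[0])
```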
```sql
-- List all tables in the schema derived from the folder name
SHOW TABLES FROM hive.DataLakeFile;

-- Query geometry and attributes
SELECT geometry, name, area
FROM hive.DataLakeFile.rivers
LIMIT 10;

-- Filter by geometry content (WKT string)
SELECT *
FROM hive.DataLakeFile.rivers
WHERE geometry LIKE 'MULTIPOLYGON%';
```

The folder structure determines the Hive schema, identical to CSV files:
- `raw/DataLakeFile/rivers.shp` → schema `DataLakeFile`, table `rivers`
- `raw/regions/europe/borders.shp` → schema `regions`, table `borders`
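The naming rule can be expressed in a few lines. A hypothetical helper that mirrors the mapping above; the DAG's real "intelligent" naming logic may handle more cases, such as sanitising names.

```python
# Hypothetical illustration of the folder-to-schema naming rule shown above.
from pathlib import PurePosixPath


def hive_target(object_key: str) -> tuple[str, str]:
    """Map an object key under raw/ to (schema, table)."""
    path = PurePosixPath(object_key)
    schema = path.parts[1]  # first folder below raw/ becomes the schema
    table = path.stem       # file name (without extension) becomes the table
    return schema, table


print(hive_target("raw/DataLakeFile/rivers.shp"))     # ('DataLakeFile', 'rivers')
print(hive_target("raw/regions/europe/borders.shp"))  # ('regions', 'borders')
```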
Shapefiles appear in DataHub with:
- Tags: `geospatial`, `shapefile`, `surimi`, `parquet`
- Custom properties: `crs` (coordinate reference system), `geometry_type`, `source_format: shapefile`
SURIMI is a European Commission project building on the EDITO platform, focused on:
- Healthcare data interoperability
- Caregiver and patient management
- Multilingual data handling
- Privacy-preserving analytics
See the troubleshooting section in QUICKSTART.md for common issues.
SURIMI DataLab – Data Infrastructure for the SURIMI Project
Built with ❤️ on EDITO