A comprehensive data infrastructure platform for the SURIMI project, built on the EDITO platform of the European Commission.
SURIMI DataLab provides:
- 📊 CSV & Shapefile ingestion – Automated detection and processing of tabular and geospatial data files
- 🗺️ Geospatial support – Shapefile (`.shp`) ingestion with geometry stored as WKT, queryable via Trino
- 🏷️ Metadata management – DataHub for data discovery, lineage, and ownership
- 🔍 SQL analytics – Trino for unified queries across all data sources
- 📈 BI dashboards – Superset for data visualization
- 📓 Notebooks – JupyterHub for exploratory analysis
- 🔄 Orchestration – Airflow for workflow automation
```bash
# 1. Review .env for ports and credentials (optional)

# 2. Build Superset with the Trino driver (first run only)
docker-compose build superset

# 3. Start all services (MinIO bootstrap runs automatically via minio-init)
docker-compose --profile debug up -d

# 4. Access the platform
# MinIO:      http://localhost:9001 (minioadmin/minioadmin)
# Trino:      http://localhost:8080
# Superset:   http://localhost:8088 (login via Keycloak SSO)
# DataHub:    http://localhost:9002
# Airflow:    http://localhost:8083
# JupyterHub: http://localhost:8000 (login via Keycloak SSO)
# Keycloak:   http://localhost:8180 (admin/keycloak)
```

See QUICKSTART.md for detailed setup instructions.
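To confirm the core services are reachable after startup, a quick check like the one below can help. This is a minimal sketch (not part of the project tooling) using only the Python standard library; the endpoints assume the default ports listed in this README.

```python
# Quick reachability check for the local stack (hypothetical helper, not shipped with the repo).
from urllib.request import urlopen
from urllib.error import URLError

ENDPOINTS = {
    "MinIO": "http://localhost:9000/minio/health/live",  # MinIO liveness probe
    "Trino": "http://localhost:8080/v1/info",            # Trino server info
    "Superset": "http://localhost:8088/health",          # Superset health endpoint
}

for name, url in ENDPOINTS.items():
    try:
        with urlopen(url, timeout=5) as resp:
            print(f"{name:<10} OK   (HTTP {resp.status})")
    except URLError as exc:
        print(f"{name:<10} DOWN ({exc})")
```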
SURIMI DataLab implements a modern data lakehouse architecture with a single-bucket design:
```
CSV Files / Shapefiles (Raw Data)
        ↓
Upload to: bucket/raw/
        ↓
Airflow DAG: File Monitoring & Validation
        ↓
Move to: bucket/staging/ (temporary)
        ↓
Convert to Parquet → bucket/hive/schema/table/ (permanent)
  - CSV: columns preserved as-is
  - Shapefile: attributes + geometry (WKT string)
        ↓
Archive originals → bucket/processed/
        ↓
Hive Metastore Catalog
        ↓
Trino SQL Interface
  ├─ Superset Dashboards
  ├─ JupyterHub Notebooks
  └─ DataHub Metadata (with geospatial tags & CRS info)
```

Parquet + Hive + Trino + Postgres (current deployment):

```
Airflow DAG -> Parquet in MinIO (bucket/hive/<schema>/<table>/)
            -> Hive Metastore (metadata in Postgres)
            -> Trino Hive connector (SQL queries)
            -> Superset / JupyterHub / DataHub
```
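Once a table has been registered in the Hive catalog, it can be queried from Python with the trino client (pre-installed in the JupyterHub image). A minimal sketch; the user name is arbitrary for the default setup, and the schema/table names reuse the examples from the shapefile section below.

```python
# Minimal sketch: query a Hive table through Trino from Python.
# Host/port match the local docker-compose defaults; schema/table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="localhost",       # use "trino" when running inside the compose network
    port=8080,
    user="datalab",         # any identifier works with the default (no-auth) config
    catalog="hive",
    schema="DataLakeFile",  # schema derived from the upload folder name
)

cur = conn.cursor()
cur.execute("SELECT * FROM rivers LIMIT 10")
for row in cur.fetchall():
    print(row)
```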
Storage Layout:
- `bucket/raw/` – Upload location (temporary)
- `bucket/staging/` – CSV processing (temporary)
- `bucket/processed/` – CSV + README archive (permanent)
- `bucket/hive/` – Parquet tables organized by schema (created automatically by the pipeline)
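To inspect this layout from a script or notebook, a listing like the following works. This is an illustrative sketch: the endpoint and credentials are the local defaults shown above, and the bucket name `data` is an assumption based on the paths used later in this README.

```python
# List the top-level prefixes of the data lake bucket (illustrative sketch).
# Endpoint, credentials, and bucket name are assumptions based on the local defaults.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

resp = s3.list_objects_v2(Bucket="data", Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # expect raw/, staging/, processed/, hive/
```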
| Service | Port | Purpose |
|---|---|---|
| MinIO | 9000, 9001 | S3-compatible object storage |
| Trino | 8080 | SQL query engine |
| Superset | 8088 | BI & visualization |
| DataHub | 3000 | Metadata management |
| Airflow | 8083 | Workflow orchestration |
| JupyterHub | 8000 | Notebooks & analysis |
| Keycloak | 8180 | Identity provider — SSO for all services (--profile debug) |
| Hive Metastore | 9083 | Table catalog |
| PostgreSQL | 5432 | Metadata database |
| Elasticsearch | 9200 | Search & indexing |
| Neo4j | 7474, 7687 | Graph database for lineage |
JupyterHub spawns an isolated Docker container (built from jupyterhub/notebook.Dockerfile) for each user. The image is based on jupyter/datascience-notebook with extra packages pre-installed (trino, boto3, s3fs, pyarrow, sqlalchemy).
```
/home/jovyan/work/            ← rw — per-user named volume, persists across sessions
/home/jovyan/work/shared-ro/  ← ro — shared reference notebooks (all users)
/home/jovyan/work/shared-rw/  ← rw — collaborative notebooks (all users) [optional]
```
Personal work is stored in a Docker named volume (jupyterhub-user-<username>). Deleting a user container does not delete their notebooks — the volume is kept until explicitly removed.
Before each server starts, the hub automatically provisions a dedicated MinIO user and injects credentials directly as environment variables:
| Variable | Description |
|---|---|
| `MINIO_ACCESS_KEY` | Per-user access key |
| `MINIO_SECRET_KEY` | HMAC-derived secret key |
| `MINIO_ENDPOINT` | MinIO host (`minio:9000`) |
| `MINIO_BUCKET` | Bucket name |
Use them in notebooks:
```python
import os, boto3

s3 = boto3.client(
    "s3",
    endpoint_url=f"http://{os.environ['MINIO_ENDPOINT']}",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
```

See jupyterhub/README.md for full configuration details.
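Since s3fs and pyarrow are also pre-installed, the same injected credentials can load a Parquet table straight into pandas. A sketch; the schema/table path below is a hypothetical example.

```python
# Read a Parquet table from the hive/ prefix with pandas + s3fs (illustrative sketch).
# The schema/table path is a placeholder; credentials come from the injected env vars.
import os
import pandas as pd

df = pd.read_parquet(
    f"s3://{os.environ['MINIO_BUCKET']}/hive/DataLakeFile/rivers/",
    storage_options={
        "key": os.environ["MINIO_ACCESS_KEY"],
        "secret": os.environ["MINIO_SECRET_KEY"],
        "client_kwargs": {"endpoint_url": f"http://{os.environ['MINIO_ENDPOINT']}"},
    },
)
print(df.head())
```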
📚 Complete Documentation Index - Full documentation roadmap
- QUICKSTART.md – Complete setup guide (get running in 30 minutes)
- Operations Guide – Day-to-day operations and commands
- Ingestion Modes – MERGE/REPLACE/APPEND modes, primary keys, configuration
- Schema Naming – Intelligent schema naming from folder structure
- Shapefile Ingestion – Geospatial data support (`.shp` → Parquet → Hive → DataHub)
- Single Bucket Architecture – Single-bucket architecture guide
- Architecture Guide – Technical deep dive with code examples
- CLAUDE.md – Development guide for Claude Code AI assistant
- Superset Setup – Analytics dashboards setup
- DataHub Setup – Metadata management setup
- DataHub Integration – DataHub integration guide
- Deployment Checklist – Production deployment checklist
- Contributing Guide – Development workflow and guidelines
- Automatic detection of new CSV and Shapefile files in MinIO
- Shapefile support – `.shp` bundles (`.shp`, `.dbf`, `.shx`, `.prj`, etc.) ingested as Hive tables with geometry stored as WKT
- README-driven schema validation or auto-detection
- MERGE/UPSERT mode with primary key deduplication (see the sketch after this list)
- Intelligent schema naming from folder structure
- Multilingual support (Greek, German, French, English)
- Audit trail with README archival
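The MERGE/UPSERT idea boils down to keeping the newest row per primary key when a new delivery overlaps existing data. A conceptual pandas sketch (not the DAG's actual implementation), with a hypothetical `id` primary key:

```python
# Conceptual sketch of MERGE/UPSERT with primary-key deduplication (not the pipeline's code).
# Existing rows and a new delivery are combined; the latest row per primary key wins.
import pandas as pd

existing = pd.DataFrame({"id": [1, 2], "name": ["old-a", "old-b"]})
incoming = pd.DataFrame({"id": [2, 3], "name": ["new-b", "new-c"]})

merged = (
    pd.concat([existing, incoming])
      .drop_duplicates(subset=["id"], keep="last")  # incoming rows override existing ones
      .reset_index(drop=True)
)
print(merged)  # rows: 1/old-a, 2/new-b, 3/new-c
```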
- Centralized data catalog in DataHub
- Automatic lineage tracking
- Data ownership and tagging
- Data quality metrics
- SQL queries via Trino
- Pre-built dashboards in Superset
- JupyterHub notebooks for exploration
- Python/Pandas integration
- Airflow DAGs for automated pipelines
- Data quality checks
- Error handling and alerting
- Workflow scheduling
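As a rough picture of how such a pipeline is wired together, a DAG chaining detection, conversion, and registration tasks could look like the sketch below. This is a generic illustration, not the repository's `comprehensive_csv_ingestion` DAG.

```python
# Generic Airflow DAG sketch: hourly ingestion pipeline (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def detect_new_files(**context):
    """Placeholder: scan bucket/raw/ for new CSV or shapefile uploads."""


def convert_to_parquet(**context):
    """Placeholder: convert staged files to Parquet under bucket/hive/."""


def register_in_hive(**context):
    """Placeholder: create or refresh the external table in the Hive Metastore."""


with DAG(
    dag_id="example_ingestion_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",   # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    detect = PythonOperator(task_id="detect_new_files", python_callable=detect_new_files)
    convert = PythonOperator(task_id="convert_to_parquet", python_callable=convert_to_parquet)
    register = PythonOperator(task_id="register_in_hive", python_callable=register_in_hive)

    detect >> convert >> register
```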
The Airflow DAG automatically detects and processes ESRI Shapefiles uploaded to the raw/ folder in MinIO, following the same pipeline as CSV files.
| Extension | Role | Required |
|---|---|---|
| `.shp` | Geometry data | Yes |
| `.dbf` | Attribute table | Yes |
| `.shx` | Shape index | Yes |
| `.prj` | Projection / CRS info | Recommended |
| `.cpg` | Character encoding | Optional |
| `.sbn`, `.sbx` | Spatial index | Optional |
All files sharing the same base name in the same folder are treated as one shapefile bundle.
- Upload the complete shapefile bundle to MinIO under `data/raw/<folder>/`:

  ```
  data/raw/DataLakeFile/rivers.shp
  data/raw/DataLakeFile/rivers.dbf
  data/raw/DataLakeFile/rivers.shx
  data/raw/DataLakeFile/rivers.prj
  ```

- Wait for the Airflow DAG to run (hourly), or trigger it manually:

  ```
  docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
  ```

- The DAG will:
  - Auto-detect the schema from the `.dbf` attribute table
  - Convert geometry to a WKT string (stored as `VARCHAR` in Hive)
  - Upload Parquet to `data/hive/<schema>/<table>/`
  - Create a Hive external table queryable via Trino
  - Emit metadata to DataHub with `geospatial` and `shapefile` tags, CRS, and geometry type
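The geometry-to-WKT conversion step can be pictured roughly as follows. This is a sketch assuming GeoPandas and PyArrow; the actual DAG implementation may differ.

```python
# Sketch of shapefile -> Parquet with geometry as WKT (assumes geopandas + pyarrow; not the DAG's code).
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("rivers.shp")                    # reads the whole bundle (.shp/.dbf/.shx/.prj)
crs = gdf.crs.to_string() if gdf.crs else "unknown"  # CRS taken from the .prj file, if present

# Plain DataFrame: attribute columns kept as-is, geometry flattened to a WKT string
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry"] = gdf.geometry.to_wkt()

df.to_parquet("rivers.parquet", index=False)         # then uploaded to data/hive/<schema>/rivers/
print("CRS:", crs, "| geometry type:", gdf.geom_type.iloc[0])
```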
```sql
-- List all tables in the schema derived from the folder name
SHOW TABLES FROM hive.DataLakeFile;

-- Query geometry and attributes
SELECT geometry, name, area
FROM hive.DataLakeFile.rivers
LIMIT 10;

-- Filter by geometry content (WKT string)
SELECT *
FROM hive.DataLakeFile.rivers
WHERE geometry LIKE 'MULTIPOLYGON%';
```

The folder structure determines the Hive schema, identical to CSV files:
- `raw/DataLakeFile/rivers.shp` → schema `DataLakeFile`, table `rivers`
- `raw/regions/europe/borders.shp` → schema `regions`, table `borders`
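The naming rule can be expressed in a few lines. A hypothetical helper that mirrors the mapping above; the DAG's real "intelligent" naming logic may handle more cases, such as sanitising names.

```python
# Hypothetical illustration of the folder-to-schema naming rule shown above.
from pathlib import PurePosixPath


def hive_target(object_key: str) -> tuple[str, str]:
    """Map an object key under raw/ to (schema, table)."""
    path = PurePosixPath(object_key)
    schema = path.parts[1]  # first folder below raw/ becomes the schema
    table = path.stem       # file name (without extension) becomes the table
    return schema, table


print(hive_target("raw/DataLakeFile/rivers.shp"))     # ('DataLakeFile', 'rivers')
print(hive_target("raw/regions/europe/borders.shp"))  # ('regions', 'borders')
```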
Shapefiles appear in DataHub with:
- Tags: `geospatial`, `shapefile`, `surimi`, `parquet`
- Custom properties: `crs` (coordinate reference system), `geometry_type`, `source_format: shapefile`
SURIMI is a European Commission project building on the EDITO platform, focused on:
- Healthcare data interoperability
- Caregiver and patient management
- Multilingual data handling
- Privacy-preserving analytics
See the troubleshooting section in QUICKSTART.md for common issues.
SURIMI DataLab – Data Infrastructure for the SURIMI Project
Built with ❤️ on EDITO