SURIMI DataLab

A comprehensive data infrastructure platform for the SURIMI project, built on the EDITO platform of the European Commission.

SURIMI DataLab provides:

  • 📊 CSV & Shapefile ingestion – Automated detection and processing of tabular and geospatial data files
  • 🗺️ Geospatial support – Shapefile (.shp) ingestion with geometry stored as WKT, queryable via Trino
  • 🏷️ Metadata management – DataHub for data discovery, lineage, and ownership
  • 🔍 SQL analytics – Trino for unified queries across all data sources
  • 📈 BI dashboards – Superset for data visualization
  • 📓 Notebooks – JupyterHub for exploratory analysis
  • 🔄 Orchestration – Airflow for workflow automation

Quick Start

# 1. Review .env for ports and credentials (optional)

# 2. Build Superset with the Trino driver (first run only)
docker-compose build superset

# 3. Start all services (MinIO bootstrap runs automatically via minio-init)
docker-compose --profile debug up -d

# 4. Access the platform
# MinIO:      http://localhost:9001  (minioadmin/minioadmin)
# Trino:      http://localhost:8080
# Superset:   http://localhost:8088  (login via Keycloak SSO)
# DataHub:    http://localhost:9002
# Airflow:    http://localhost:8083
# JupyterHub: http://localhost:8000  (login via Keycloak SSO)
# Keycloak:   http://localhost:8180  (admin/keycloak)

See QUICKSTART.md for detailed setup instructions.
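
Once the stack is up, a quick connectivity check against MinIO can be run from the host with boto3. This is a minimal sketch, assuming the S3 API is mapped to localhost:9000 and the default minioadmin/minioadmin credentials are unchanged:

import boto3

# Point boto3 at the MinIO S3 API (the console runs on 9001, the API on 9000).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# A successful call confirms MinIO is reachable and the credentials work.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])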

Architecture

SURIMI DataLab implements a modern data lakehouse architecture with a single-bucket design:

CSV Files / Shapefiles (Raw Data)
    ↓
Upload to: bucket/raw/
    ↓
Airflow DAG: File Monitoring & Validation
    ↓
Move to: bucket/staging/ (temporary)
    ↓
Convert to Parquet → bucket/hive/schema/table/ (permanent)
  - CSV: columns preserved as-is
  - Shapefile: attributes + geometry (WKT string)
    ↓
Archive originals → bucket/processed/
    ↓
Hive Metastore Catalog
    ↓
Trino SQL Interface
    ├─ Superset Dashboards
    ├─ JupyterHub Notebooks
    └─ DataHub Metadata (with geospatial tags & CRS info)

Parquet + Hive + Trino + Postgres (current deployment):

Airflow DAG -> Parquet in MinIO (bucket/hive/<schema>/<table>/)
             -> Hive Metastore (metadata in Postgres)
             -> Trino Hive connector (SQL queries)
             -> Superset / JupyterHub / DataHub
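
To make the conversion step concrete, the CSV-to-Parquet hop can be pictured as a small pandas round-trip over s3fs. This is an illustrative sketch only, not the project's actual DAG code; the bucket name data, the object paths, and the minioadmin credentials are assumptions:

import pandas as pd

# Credentials and endpoint for MinIO, passed through to s3fs (assumed defaults).
storage_options = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": {"endpoint_url": "http://minio:9000"},
}

# Read a staged CSV ...
df = pd.read_csv(
    "s3://data/staging/DataLakeFile/example.csv",
    storage_options=storage_options,
)

# ... and write it back as Parquet under the Hive warehouse layout.
df.to_parquet(
    "s3://data/hive/datalakefile/example/example.parquet",
    index=False,
    storage_options=storage_options,
)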

Storage Layout:

  • bucket/raw/ - Upload location (temporary)
  • bucket/staging/ - CSV processing (temporary)
  • bucket/processed/ - CSV + README archive (permanent)
  • bucket/hive/ - Parquet tables organized by schema (created automatically by the pipeline)

Services

Service          Port          Purpose
MinIO            9000, 9001    S3-compatible object storage
Trino            8080          SQL query engine
Superset         8088          BI & visualization
DataHub          3000          Metadata management
Airflow          8083          Workflow orchestration
JupyterHub       8000          Notebooks & analysis
Keycloak         8180          Identity provider – SSO for all services (--profile debug)
Hive Metastore   9083          Table catalog
PostgreSQL       5432          Metadata database
Elasticsearch    9200          Search & indexing
Neo4j            7474, 7687    Graph database for lineage

JupyterHub — Multi-User Notebooks

JupyterHub spawns an isolated Docker container (built from jupyterhub/notebook.Dockerfile) for each user. The image is based on jupyter/datascience-notebook with extra packages pre-installed (trino, boto3, s3fs, pyarrow, sqlalchemy).

Volume layout per user

/home/jovyan/work/           ← rw  — per-user named volume, persists across sessions
/home/jovyan/work/shared-ro/ ← ro  — shared reference notebooks (all users)
/home/jovyan/work/shared-rw/ ← rw  — collaborative notebooks (all users) [optional]

Personal work is stored in a Docker named volume (jupyterhub-user-<username>). Deleting a user container does not delete their notebooks — the volume is kept until explicitly removed.

MinIO credentials

Before each server starts, the hub automatically provisions a dedicated MinIO user and injects credentials directly as environment variables:

Variable           Description
MINIO_ACCESS_KEY   Per-user access key
MINIO_SECRET_KEY   HMAC-derived secret key
MINIO_ENDPOINT     MinIO host (minio:9000)
MINIO_BUCKET       Bucket name

Use them in notebooks:

import os, boto3
s3 = boto3.client(
    "s3",
    endpoint_url=f"http://{os.environ['MINIO_ENDPOINT']}",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
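
Since the trino client is pre-installed, the same notebooks can also query the Hive tables created by the pipeline. A minimal sketch, assuming the coordinator is reachable as trino on port 8080 inside the compose network and that no additional authentication is enforced for notebook users:

import trino

# Connect to the Trino coordinator on the compose network (host and user are assumptions).
conn = trino.dbapi.connect(host="trino", port=8080, user="jovyan", catalog="hive")
cur = conn.cursor()

# List the schemas created by the ingestion pipeline.
cur.execute("SHOW SCHEMAS FROM hive")
print(cur.fetchall())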

See jupyterhub/README.md for full configuration details.

Documentation

📚 Complete Documentation Index - Full documentation roadmap

  • Getting Started
  • Data Ingestion (Airflow DAG)
  • Technical Guides
  • Component Setup
  • Deployment

Features

Data Ingestion

  • Automatic detection of new CSV and Shapefile files in MinIO
  • Shapefile support – .shp bundles (.shp, .dbf, .shx, .prj, etc.) ingested as Hive tables with geometry stored as WKT
  • README-driven schema validation or auto-detection
  • MERGE/UPSERT mode with primary key deduplication
  • Intelligent schema naming from folder structure
  • Multilingual support (Greek, German, French, English)
  • Audit trail with README archival

Metadata Management

  • Centralized data catalog in DataHub
  • Automatic lineage tracking
  • Data ownership and tagging
  • Data quality metrics

Analytics

  • SQL queries via Trino
  • Pre-built dashboards in Superset
  • JupyterHub notebooks for exploration
  • Python/Pandas integration

Orchestration

  • Airflow DAGs for automated pipelines
  • Data quality checks
  • Error handling and alerting
  • Workflow scheduling

Shapefile Ingestion

The Airflow DAG automatically detects and processes ESRI Shapefiles uploaded to the raw/ folder in MinIO, following the same pipeline as CSV files.

Supported file components

Extension    Role                    Required
.shp         Geometry data           Yes
.dbf         Attribute table         Yes
.shx         Shape index             Yes
.prj         Projection / CRS info   Recommended
.cpg         Character encoding      Optional
.sbn, .sbx   Spatial index           Optional

All files sharing the same base name in the same folder are treated as one shapefile bundle.

How to ingest a shapefile

  1. Upload the complete shapefile bundle to MinIO under data/raw/<folder>/ (see the scripted upload sketch after this list):
    data/raw/DataLakeFile/rivers.shp
    data/raw/DataLakeFile/rivers.dbf
    data/raw/DataLakeFile/rivers.shx
    data/raw/DataLakeFile/rivers.prj
    
  2. Wait for the Airflow DAG to run (hourly), or trigger it manually:
    docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
  3. The DAG will:
    • Auto-detect schema from the .dbf attribute table
    • Convert geometry to WKT string (stored as VARCHAR in Hive)
    • Upload Parquet to data/hive/<schema>/<table>/
    • Create a Hive external table queryable via Trino
    • Emit metadata to DataHub with geospatial, shapefile tags, CRS, and geometry type
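
Step 1 above can also be scripted with boto3 from the host. A sketch, assuming the bucket is named data, the S3 API is mapped to localhost:9000, and the default minioadmin credentials are in use:

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Upload every member of the bundle so the DAG sees a complete shapefile.
for ext in (".shp", ".dbf", ".shx", ".prj"):
    s3.upload_file(f"rivers{ext}", "data", f"raw/DataLakeFile/rivers{ext}")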

Querying shapefile data in Trino

-- List all tables in schema derived from folder name
SHOW TABLES FROM hive.DataLakeFile;

-- Query geometry and attributes
SELECT geometry, name, area
FROM hive.DataLakeFile.rivers
LIMIT 10;

-- Filter by geometry content (WKT string)
SELECT *
FROM hive.DataLakeFile.rivers
WHERE geometry LIKE 'MULTIPOLYGON%';
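
Because the geometry column is a plain WKT string, it can be turned back into geometry objects inside a notebook, for example with shapely. Note that shapely is not among the pre-installed notebook packages, so it has to be added first (e.g. pip install shapely); the connection parameters below repeat the assumptions from the notebook example:

import trino
from shapely import wkt  # not pre-installed in the notebook image

conn = trino.dbapi.connect(host="trino", port=8080, user="jovyan", catalog="hive")
cur = conn.cursor()
cur.execute("SELECT geometry, name FROM hive.DataLakeFile.rivers LIMIT 10")

for geometry_wkt, name in cur.fetchall():
    geom = wkt.loads(geometry_wkt)  # parse the WKT string into a shapely geometry
    print(name, geom.geom_type, geom.bounds)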

Schema naming

The folder structure determines the Hive schema, identical to CSV files:

  • raw/DataLakeFile/rivers.shp → schema DataLakeFile, table rivers
  • raw/regions/europe/borders.shp → schema regions, table borders
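
As a toy illustration of the rule (not the DAG's actual code), the first folder under raw/ becomes the schema and the file stem becomes the table:

from pathlib import PurePosixPath

def schema_and_table(key: str) -> tuple[str, str]:
    """Derive (schema, table) from an object key such as 'raw/DataLakeFile/rivers.shp'."""
    parts = PurePosixPath(key).parts            # e.g. ('raw', 'DataLakeFile', 'rivers.shp')
    return parts[1], PurePosixPath(parts[-1]).stem

print(schema_and_table("raw/DataLakeFile/rivers.shp"))     # ('DataLakeFile', 'rivers')
print(schema_and_table("raw/regions/europe/borders.shp"))  # ('regions', 'borders')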

DataHub metadata

Shapefiles appear in DataHub with:

  • Tags: geospatial, shapefile, surimi, parquet
  • Custom properties: crs (coordinate reference system), geometry_type, source_format: shapefile

SURIMI Project

SURIMI is a European Commission project building on the EDITO platform, focused on:

  • Healthcare data interoperability
  • Caregiver and patient management
  • Multilingual data handling
  • Privacy-preserving analytics

Getting Help

See the troubleshooting section in QUICKSTART.md for common issues.


SURIMI DataLab – Data Infrastructure for the SURIMI Project
Built with ❤️ on EDITO
