
5GZORRO Datalake

Introduction

This repository contains code and other files to implement the 5GZORRO datalake.

The main datalake API functionality is provided in the directory python-flask-server, much of which was generated by Swagger Codegen.

The API itself is specified in datalake_swagger.yaml.

This code is proof-of-concept.

Prerequisites

System Requirements

The datalake server itself can run on a single VM (or bare metal) with the following resources.

  • 2 vCPUs
  • 4 GB RAM
  • 10 GB storage

The datalake server was developed with python3.6.

Dependencies

The datalake server requires the following to be running first: MinIO, a Kubernetes cluster with Argo and Argo Events installed, and PostgreSQL.

Installation

To set up minio:

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
mkdir -p /minio/data
export MINIO_VOLUMES="/minio/data"
export MINIO_ACCESS_KEY=user
export MINIO_SECRET_KEY=password
./minio server /minio/data

Kubernetes should use the Docker container runtime (rather than containerd) for Argo to work properly.

For Kubernetes, it is possible to run a local minikube cluster. To install minikube see: https://minikube.sigs.k8s.io/docs/start/.

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

To set up Argo and standard argo-events:

kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.12.0-rc3/manifests/install.yaml
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default
kubectl create namespace argo-events
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/manifests/install.yaml
kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/examples/eventbus/native.yaml

In Argo, it is necessary to define the dl-argo-events namespace.

kubectl create namespace dl-argo-events
cd datalake/config
kubectl apply -f ./install.yaml
kubectl apply -n dl-argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/examples/eventbus/native.yaml

To see the Argo GUI, run argo server at the command line, and then connect via a web browser to http://localhost:2746.

In Kubernetes, it is necessary to define the datalake namespace.

kubectl create namespace datalake

Run the following script to periodically clean up old datalake Argo jobs:

cd /datalake/experiments
nohup ./loop_argo_del.sh >/dev/null 2>&1 &

To set up postgres, see instructions at https://www.postgresqltutorial.com/install-postgresql-linux/ and https://www.postgresql.org/download/linux/ubuntu/.

Allow access from outside servers by following the instructions in https://stackoverflow.com/questions/38466190/cant-connect-to-postgresql-on-port-5432.

Then perform the following:

sudo -i -u postgres
psql
\l
CREATE DATABASE datalake;
\c datalake
DROP TABLE IF EXISTS datalake_metrics;
CREATE TABLE datalake_metrics(
    seq_id SERIAL PRIMARY KEY,
    resourceID VARCHAR,
    referenceID VARCHAR,
    transactionID VARCHAR,
    productID VARCHAR,
    instanceID VARCHAR,
    metricName VARCHAR,
    metricValue VARCHAR,
    timestamp VARCHAR,
    storageLocation VARCHAR
);

create user datalake_user with encrypted password 'datalake_pw';
grant all privileges on database datalake to datalake_user;
grant usage on schema public to datalake_user;
grant all privileges on table datalake_metrics to datalake_user;
grant all privileges on sequence datalake_metrics_seq_id_seq to datalake_user;

Before bringing up the datalake python-flask-server:

  • The ingest pipeline must be compiled and dockerized with a name of ingest.
  • The metrics_index pipeline must be compiled and dockerized with a name of metrics_index.
  • The catalog service must be compiled and dockerized with name dl_catalog_server.

The ingest, metrics_index, and dl_catalog_server containers are pulled from the 5gzorro/datalake repository. To enable access to them, supply the following secrets to Kubernetes.

cd datalake/config
kubectl apply -f ./docker-secret.yaml

This is a POC implementation. Authentication is not implemented.

TODO: Proper permissions have to be set up to use the argo-events (argo-events-resource-admin-role).

Usage

In the python-flask-server directory, fill in the proper values in env and follow the instructions in the README file.

Maintainers

Kalman Meth - meth@il.ibm.com

License

This 5GZORRO component is published under the Apache 2.0 license.
