ishneet42/datacenter-stream-engine

Distributed Stream Processing Platform

A production-grade stream processing system, inspired by Apache Flink and written in Python, that implements exactly-once semantics, fault tolerance, and high-throughput data processing.


Teammates

Uditanshu Tomar (uditanshu.tomar@colorado.edu), Ishneet Chadha (ishneet.chadha@colorado.edu)

How to Run

Prerequisites

  • Docker & Docker Compose (for local deployment)
  • Python 3.9+ (for development)
  • Google Cloud SDK (only for GCP deployment)
  • kubectl (only for GCP deployment)

Option 1: Quick Start with Docker Compose (Recommended)

The easiest way to run the entire platform locally.

  1. Navigate to the deployment directory:

    cd deployment
  2. Start all services:

    docker-compose up -d
  3. Wait for services to be ready (~30 seconds):

    # Check if services are running
    docker-compose ps
  4. Access the Web Dashboard: Open http://localhost:5000 in your browser.

  5. Verify Cluster Health:

    curl http://localhost:8081/cluster/metrics
  6. Stop the cluster:

    docker-compose down
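
Step 3 above relies on a fixed wait of about 30 seconds. As an alternative, the sketch below polls the metrics endpoint from step 5 until it answers. It is a minimal illustration, not part of the repository: the helper names `http_ok` and `wait_until_ready` are hypothetical, and treating any 2xx response from `/cluster/metrics` as "ready" is an assumption.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are all
        # OSError subclasses; any of them means "not ready yet".
        return False


def wait_until_ready(probe, timeout=60.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` until it returns True or `timeout` seconds elapse.

    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical usage against the local cluster:
#   ready = wait_until_ready(
#       lambda: http_ok("http://localhost:8081/cluster/metrics"))
```

Passing the probe in as a callable keeps the retry loop independent of the endpoint, so the same loop could wait on the dashboard at http://localhost:5000 as well.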

Option 2: Development Setup (Run Components Individually)

For development and debugging, you can run components individually.

  1. Set up the development environment:

    ./scripts/setup_dev.sh
  2. Activate virtual environment:

    source venv/bin/activate
  3. Start dependencies (PostgreSQL, Kafka, Zookeeper):

    cd deployment
    docker-compose up -d postgres zookeeper kafka
    cd ..   # return to the repository root so the Python modules below resolve
  4. Start JobManager:

    python -m jobmanager.api
    # JobManager API will be available at http://localhost:8081
  5. Start TaskManager (in a separate terminal):

    source venv/bin/activate
    python -m taskmanager.task_executor
  6. Start Web GUI (in another terminal):

    source venv/bin/activate
    cd gui
    python app.py
    # GUI will be available at http://localhost:5000

Option 3: Run on Google Cloud Platform (GKE)

Deploy the platform to a Google Kubernetes Engine cluster.

  1. Configure GCP Project:

    export GCP_PROJECT_ID="your-project-id"
    gcloud config set project $GCP_PROJECT_ID
  2. Run Deployment Script: This script sets up GKE, builds the images, and deploys all services.

    ./deploy_to_gcp.sh
  3. Access Services:

    # Get External IP of the GUI
    kubectl get svc -n stream-processing gui
    
    # Access JobManager API
    kubectl get svc -n stream-processing jobmanager

Running Jobs

1. Run the Demo (Web GUI)

  1. Start the platform using Docker Compose (see Option 1 above).
  2. Open the Dashboard at http://localhost:5000.
  3. Click "Start Demo" in the "Control Panel".
  4. Watch real-time metrics update as the DemoWeatherProcessing job runs.
  5. See data flowing in the "Live Data Stream" panel.

2. Submit a Custom Job (CLI)

Example: Word Count

# 1. Generate the job file
python examples/word_count.py
# This creates word_count_job.pkl

# 2. Submit to the cluster
curl -X POST http://localhost:8081/jobs/submit \
  -F "job_file=@word_count_job.pkl"

# 3. Note the job_id from the response
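
For scripted submissions, the `curl -F` upload above can be reproduced with the Python standard library. This is a rough sketch under stated assumptions: the function names `build_submit_request` and `submit_job` are hypothetical, and the response is assumed to be JSON (the README only says it contains a `job_id`).

```python
import json
import os
import urllib.request
import uuid


def build_submit_request(base_url, filename, payload):
    """Build a multipart/form-data POST equivalent to
    curl -X POST <base_url>/jobs/submit -F "job_file=@<filename>"."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="job_file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/jobs/submit",
        data=head + payload + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def submit_job(path, base_url="http://localhost:8081"):
    """Upload the job file at `path` and return the decoded response.

    Assumption: the JobManager replies with a JSON document.
    """
    with open(path, "rb") as fh:
        request = build_submit_request(
            base_url, os.path.basename(path), fh.read())
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical usage: print(submit_job("word_count_job.pkl"))
```

Building the request separately from sending it makes the multipart encoding easy to inspect without a running cluster.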

Monitor the Job:

# Check job status
curl http://localhost:8081/jobs/{job_id}/status

# Get job metrics
curl http://localhost:8081/jobs/{job_id}/metrics

# List all jobs
curl http://localhost:8081/jobs

3. Run Example Jobs

The examples/ directory contains several example jobs:

# Word Count - Simple text processing
python examples/word_count.py

# Simple Pipeline - Map and filter operations
python examples/simple_pipeline.py

# Windowed Aggregation - Time-based aggregations
python examples/windowed_aggregation.py

# Stateful Deduplication - Remove duplicate records
python examples/stateful_deduplication.py

# Stream Join - Join two data streams
python examples/stream_join.py

# Data Generators - Generate test data
python examples/data_generator_iot.py
python examples/data_generator_ecommerce.py
python examples/data_generator_financial.py

Each example generates a .pkl file that can be submitted to the cluster.
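
To give a feel for what a time-based aggregation computes, here is a plain-Python sketch of a tumbling (fixed, non-overlapping) window sum. It does not use the platform's job API, which this README does not describe; the function name and the `(timestamp_ms, key, value)` event shape are assumptions made for illustration only.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_ms):
    """Sum values per (window_start, key) over fixed-size windows.

    `events` is an iterable of (timestamp_ms, key, value) tuples; each
    event falls into exactly one window, identified by its start time.
    """
    sums = defaultdict(float)
    for timestamp_ms, key, value in events:
        window_start = (timestamp_ms // window_ms) * window_ms
        sums[(window_start, key)] += value
    return dict(sums)


readings = [(0, "a", 1), (999, "a", 2), (1000, "a", 5), (1500, "b", 1)]
print(tumbling_window_sums(readings, window_ms=1000))
```

A streaming engine would emit each window's result once the window closes instead of collecting everything first; the grouping rule is the same.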


Architecture

  • JobManager (Master): Coordinates execution, manages resources, and handles checkpoints.
  • TaskManager (Worker): Executes tasks in parallel slots.
  • Kafka: Handles data ingestion and inter-operator communication.
  • gRPC: Used for internal control plane communication.
  • RocksDB: Embedded state backend for stateful operations.
  • GCS/S3: Distributed storage for fault-tolerance checkpoints.

Features

  • Exactly-Once Processing: Distributed snapshots (Chandy-Lamport).
  • Fault Tolerance: Automatic failure recovery.
  • High Throughput: Operator chaining & flow control.
  • Stateful Operations: Windowing, Aggregations, Joins.
  • Observability: Prometheus metrics & Grafana dashboards.
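
The distributed-snapshot idea behind exactly-once processing can be illustrated with a toy operator that aligns checkpoint barriers across its input channels, the approach Flink-style systems use. This is a simplified sketch, not the platform's implementation: the class `AligningSumOperator` and the `BARRIER` marker are hypothetical names, and real systems also persist the snapshot and acknowledge it to a coordinator.

```python
BARRIER = object()  # marker injected into every channel at checkpoint time


class AligningSumOperator:
    """Sums integers from several channels; snapshots its state only
    after a barrier has arrived on every input channel."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.total = 0        # operator state
        self.blocked = set()  # channels whose barrier has already arrived
        self.buffered = []    # items held back from blocked channels
        self.snapshots = []   # state captured at each completed checkpoint

    def receive(self, channel, item):
        if channel in self.blocked:
            # Anything after this channel's barrier belongs to the next
            # checkpoint epoch, so hold it until alignment completes.
            self.buffered.append((channel, item))
        elif item is BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_channels:
                self._complete_checkpoint()
        else:
            self.total += item

    def _complete_checkpoint(self):
        # State now reflects exactly the pre-barrier items of all channels.
        self.snapshots.append(self.total)
        self.blocked.clear()
        pending, self.buffered = self.buffered, []
        for channel, item in pending:
            self.receive(channel, item)


op = AligningSumOperator(num_channels=2)
for channel, item in [(0, 1), (1, 2), (0, BARRIER), (0, 10), (1, 3), (1, BARRIER)]:
    op.receive(channel, item)
print(op.snapshots, op.total)
```

In this run the value 10 arrives on channel 0 after its barrier, so it is excluded from the snapshot and applied only once the barrier from channel 1 has arrived.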

Project Structure

stream-processing-platform/
├── jobmanager/              # Control Plane (Scheduler, API)
├── taskmanager/             # Data Plane (Execution, State)
├── common/                  # Shared Utils (Proto, Config)
├── gui/                     # Web Dashboard
├── examples/                # Example Jobs
├── deployment/              # Docker & K8s Configs
└── scripts/                 # Deployment Scripts

Configuration

Key environment variables in deployment/docker-compose.yml:

  • TASK_SLOTS: Number of concurrent tasks per TaskManager (Default: 4).
  • CHECKPOINT_INTERVAL: Interval between checkpoints in ms (Default: 10000).
  • STATE_BACKEND: rocksdb or memory.
  • GCS_CHECKPOINT_PATH: GCS bucket for checkpoints.
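
One way a component could read and validate these variables is sketched below. The `PlatformConfig` dataclass and `load_config` function are hypothetical, not taken from the repository. The defaults for `TASK_SLOTS` and `CHECKPOINT_INTERVAL` come from the list above; the `rocksdb` default for `STATE_BACKEND` is an assumption, since the README does not state one.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PlatformConfig:
    task_slots: int
    checkpoint_interval_ms: int
    state_backend: str
    gcs_checkpoint_path: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> PlatformConfig:
    """Read the documented variables from `env`, applying defaults."""
    # Assumption: "rocksdb" is the default backend (not stated in the README).
    backend = env.get("STATE_BACKEND", "rocksdb")
    if backend not in ("rocksdb", "memory"):
        raise ValueError(f"STATE_BACKEND must be rocksdb or memory, got {backend!r}")

    task_slots = int(env.get("TASK_SLOTS", "4"))
    interval_ms = int(env.get("CHECKPOINT_INTERVAL", "10000"))
    if task_slots < 1 or interval_ms < 1:
        raise ValueError("TASK_SLOTS and CHECKPOINT_INTERVAL must be positive")

    return PlatformConfig(
        task_slots=task_slots,
        checkpoint_interval_ms=interval_ms,
        state_backend=backend,
        gcs_checkpoint_path=env.get("GCS_CHECKPOINT_PATH"),
    )


print(load_config({"TASK_SLOTS": "8", "STATE_BACKEND": "memory"}))
```

Validating at startup turns a typo in docker-compose.yml into an immediate, readable error instead of a failure at the first checkpoint.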

Monitoring

When running with Docker Compose, the monitoring services start automatically alongside the platform.

Troubleshooting

Services not starting

# Check service logs (run docker-compose commands from the deployment/ directory)
docker-compose logs jobmanager
docker-compose logs taskmanager
docker-compose logs kafka

# Check if ports are already in use
netstat -an | grep -E "5000|8081|9092|5432"
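
Where `netstat` is unavailable or its output differs by platform, a similar port check can be done from Python. This is a minimal sketch with hypothetical helper names. Ports 5000 (GUI) and 8081 (JobManager) come from this README; 9092 and 5432 are assumed to be the Kafka and PostgreSQL ports from the grep pattern above.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=(5000, 8081, 9092, 5432)):
    """Return the subset of `ports` that already have a listener."""
    return [port for port in ports if port_in_use(port)]


# Hypothetical usage before starting the cluster:
#   print("ports already taken:", busy_ports())
```

A non-empty result before `docker-compose up` suggests another process holds a port the platform needs.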

Job submission fails

# Verify JobManager is running
curl http://localhost:8081/health

# Check if Kafka is accessible
docker-compose exec kafka kafka-topics --list --bootstrap-server localhost:9092

Development mode issues

# Regenerate gRPC stubs
bash scripts/generate_proto.sh

# Reinstall dependencies
pip install -r jobmanager/requirements.txt
pip install -r taskmanager/requirements.txt

Built with: Python, FastAPI, gRPC, Kafka, RocksDB, Docker, Kubernetes.
