A production-grade stream processing system inspired by Apache Flink, implementing exactly-once semantics, fault tolerance, and high-throughput data processing with Python.
Uditanshu Tomar (uditanshu.tomar@colorado.edu), Ishneet Chadha (ishneet.chadha@colorado.edu)
- Docker & Docker Compose (for local deployment)
- Python 3.9+ (for development)
- Google Cloud SDK (only for GCP deployment)
- kubectl (only for GCP deployment)
The easiest way to run the entire platform locally.
-
Navigate to deployment directory:
cd deployment -
Start all services:
docker-compose up -d
-
Wait for services to be ready (~30 seconds):
# Check if services are running docker-compose ps -
Access the Web Dashboard: Open http://localhost:5000 in your browser.
-
Verify Cluster Health:
curl http://localhost:8081/cluster/metrics
-
Stop the cluster:
docker-compose down
For development and debugging, you can run components individually.
-
Setup development environment:
./scripts/setup_dev.sh
-
Activate virtual environment:
source venv/bin/activate -
Start dependencies (PostgreSQL, Kafka, Zookeeper):
cd deployment docker-compose up -d postgres zookeeper kafka -
Start JobManager:
python -m jobmanager.api # JobManager API will be available at http://localhost:8081 -
Start TaskManager (in a separate terminal):
source venv/bin/activate python -m taskmanager.task_executor -
Start Web GUI (in another terminal):
source venv/bin/activate cd gui python app.py # GUI will be available at http://localhost:5000
Deploy the platform to a Google Kubernetes Engine cluster.
-
Configure GCP Project:
export GCP_PROJECT_ID="your-project-id" gcloud config set project $GCP_PROJECT_ID
-
Run Deployment Script: This script will setup GKE, build images, and deploy all services.
./deploy_to_gcp.sh
-
Access Services:
# Get External IP of the GUI kubectl get svc -n stream-processing gui # Access JobManager API kubectl get svc -n stream-processing jobmanager
- Start the platform using Docker Compose (see Option 1 above).
- Open the Dashboard at http://localhost:5000.
- Click "Start Demo" in the "Control Panel".
- Watch real-time metrics update as the
DemoWeatherProcessingjob runs. - See data flowing in the "Live Data Stream" panel.
Example: Word Count
# 1. Generate the job file
python examples/word_count.py
# This creates word_count_job.pkl
# 2. Submit to the cluster
curl -X POST http://localhost:8081/jobs/submit \
-F "job_file=@word_count_job.pkl"
# 3. Note the job_id from the responseMonitor the Job:
# Check job status
curl http://localhost:8081/jobs/{job_id}/status
# Get job metrics
curl http://localhost:8081/jobs/{job_id}/metrics
# List all jobs
curl http://localhost:8081/jobsThe examples/ directory contains several example jobs:
# Word Count - Simple text processing
python examples/word_count.py
# Simple Pipeline - Map and filter operations
python examples/simple_pipeline.py
# Windowed Aggregation - Time-based aggregations
python examples/windowed_aggregation.py
# Stateful Deduplication - Remove duplicate records
python examples/stateful_deduplication.py
# Stream Join - Join two data streams
python examples/stream_join.py
# Data Generators - Generate test data
python examples/data_generator_iot.py
python examples/data_generator_ecommerce.py
python examples/data_generator_financial.pyEach example generates a .pkl file that can be submitted to the cluster.
- JobManager (Master): Coordinates execution, manages resources, and handles checkpoints.
- TaskManager (Worker): Executes tasks in parallel slots.
- Kafka: Handles data ingestion and inter-operator communication.
- gRPC: Used for internal control plane communication.
- RocksDB: Embedded state backend for stateful operations.
- GCS/S3: Distributed storage for fault-tolerance checkpoints.
- Exactly-Once Processing: Distributed snapshots (Chandy-Lamport).
- Fault Tolerance: Automatic failure recovery.
- High Throughput: Operator chaining & flow control.
- Stateful Operations: Windowing, Aggregations, Joins.
- Observability: Prometheus metrics & Grafana dashboards.
stream-processing-platform/
├── jobmanager/ # Control Plane (Scheduler, API)
├── taskmanager/ # Data Plane (Execution, State)
├── common/ # Shared Utils (Proto, Config)
├── gui/ # Web Dashboard
├── examples/ # Example Jobs
├── deployment/ # Docker & K8s Configs
└── scripts/ # Deployment Scripts
Key environment variables in deployment/docker-compose.yml:
TASK_SLOTS: Number of concurrent tasks per TaskManager (Default: 4).CHECKPOINT_INTERVAL: Frequency of checkpoints in ms (Default: 10000).STATE_BACKEND:rocksdbormemory.GCS_CHECKPOINT_PATH: GCS bucket for checkpoints.
When running with Docker Compose, monitoring services are automatically available:
- Grafana Dashboard: http://localhost:3000
- Username:
admin - Password:
admin
- Username:
- Prometheus Metrics: http://localhost:9090
# Check service logs
docker-compose logs jobmanager
docker-compose logs taskmanager
docker-compose logs kafka
# Check if ports are already in use
netstat -an | grep -E "5000|8081|9092|5432"# Verify JobManager is running
curl http://localhost:8081/health
# Check if Kafka is accessible
docker-compose exec kafka kafka-topics --list --bootstrap-server localhost:9092# Regenerate gRPC stubs
bash scripts/generate_proto.sh
# Reinstall dependencies
pip install -r jobmanager/requirements.txt
pip install -r taskmanager/requirements.txtBuilt with: Python, FastAPI, gRPC, Kafka, RocksDB, Docker, Kubernetes.