JaynilJaiswal/veloStream

VeloStream: High-Throughput Decoupled ETL Pipeline

Go Python Apache Kafka ClickHouse Apache Airflow Grafana Docker

VeloStream is a high-performance, fully orchestrated data ingestion and processing pipeline. It demonstrates handling massive data spikes by decoupling a blazing-fast data producer from a slower data warehouse consumer using a distributed message broker.

🚀 Key Achievements

  • Peak Ingestion Rate: Achieved ~398,000 messages/sec throughput utilizing asynchronous batching in Go.
  • Massive Scale: Successfully processed and validated 44,000,000+ NYC Taxi Trip records.
  • Resilient Architecture: Leveraged Kafka to absorb a peak consumer lag of 31.3 million records, protecting the downstream OLAP database from write throttling or crashes during data spikes.
  • Infrastructure as Code (IaC): 100% automated provisioning of Kafka topics, ClickHouse tables, and Grafana dashboards.
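
The headline throughput comes from amortizing network round-trips: the Go producer (via segmentio/kafka-go) groups many messages into each Kafka write. As a minimal sketch of the underlying size-or-timeout flush policy (shown in Python for brevity; the `Batcher` class and its parameters are illustrative, not the actual producer code):

```python
import time

class Batcher:
    """Accumulate messages and flush when the batch is full or a timeout elapses."""

    def __init__(self, flush_fn, max_batch=1000, max_wait_s=0.1):
        self.flush_fn = flush_fn      # called with a list of messages
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._buf = []
        self._deadline = None

    def add(self, msg):
        if not self._buf:
            # Start the flush timer when the first message of a batch arrives.
            self._deadline = time.monotonic() + self.max_wait_s
        self._buf.append(msg)
        if len(self._buf) >= self.max_batch or time.monotonic() >= self._deadline:
            self.flush()

    def flush(self):
        if self._buf:
            self.flush_fn(self._buf)
            self._buf = []
            self._deadline = None

batches = []
b = Batcher(batches.append, max_batch=3, max_wait_s=10.0)
for i in range(7):
    b.add(i)
b.flush()  # drain the remainder
print([len(batch) for batch in batches])  # → [3, 3, 1]
```

The real producer additionally runs many such writers concurrently across goroutines, which is where the ~398k msg/sec figure comes from.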

🏗️ Architecture Overview

  1. Producer (Go): Parses large Parquet datasets and streams data concurrently into Kafka using optimized batching (segmentio/kafka-go).
  2. Message Broker (Apache Kafka - KRaft): Acts as the shock-absorber, holding millions of records in queue during burst ingestion.
  3. Consumer (Python): Subscribes to the Kafka topic, deserializes payloads, and executes bulk inserts into ClickHouse.
  4. Data Warehouse (ClickHouse): Stores the final dataset in a highly compressed MergeTree table for rapid analytical querying.
  5. Orchestration (Apache Airflow): Manages the entire lifecycle—provisioning infrastructure, managing concurrent background processes, monitoring queue drain states (Consumer Lag), and validating final data integrity.
  6. Observability (Prometheus & Grafana): Provides real-time metrics on throughput and queue health via automated IaC dashboarding.
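
Step 5's queue-drain monitoring reduces to polling consumer lag until the backlog reaches zero before the consumer is shut down. A minimal sketch of that loop (`get_lag` is a stand-in for a real Kafka consumer-group offset query, not the pipeline's actual code):

```python
import time

def wait_for_drain(get_lag, poll_interval_s=0.0, timeout_s=60.0):
    """Poll consumer lag until the backlog hits zero; True on drain, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_lag() <= 0:
            return True
        time.sleep(poll_interval_s)
    return False

# Simulated backlog that drains by 10 records per poll.
backlog = [30]
def fake_lag():
    backlog[0] = max(0, backlog[0] - 10)
    return backlog[0]

print(wait_for_drain(fake_lag))  # → True
```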

📂 Project Structure

```
veloStream/
├── airflow/dags/          # Airflow DAGs for ETL orchestration and queue monitoring
├── infrastructure/        # Docker Compose, Kafka KRaft configs, and ClickHouse setup
│   └── observability/     # IaC provisioned Grafana dashboards and Prometheus configs
├── src/
│   ├── producer/          # High-throughput Go data generator
│   └── consumer/          # Python ETL script for ClickHouse ingestion
└── assets/                # Benchmark screenshots and architecture diagrams
```

🛠️ Quick Start

1. Prerequisites

  • Docker & Docker Compose
  • Go 1.21+
  • Python 3.10+
  • Apache Airflow (Local Standalone)

2. Boot the Infrastructure

Spin up Kafka, ClickHouse, Prometheus, and Grafana.

```bash
cd infrastructure
docker compose up -d
```

3. Launch the Orchestrator

Start Airflow to manage the pipeline execution.

```bash
airflow standalone
```

4. Run the Benchmark

  1. Navigate to the Airflow UI at http://localhost:8080.
  2. Unpause and trigger the velostream_benchmark_run DAG.
  3. Open Grafana at http://localhost:3000 to watch the real-time throughput metrics.
  4. Airflow automatically provisions the database tables, starts the consumer, launches the Go producer, monitors Kafka consumer lag until it reaches 0, safely shuts down the consumer, and validates the final 44,000,000-record count in ClickHouse.
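
The final record-count validation can be performed over ClickHouse's built-in HTTP interface (port 8123), which returns scalar query results as plain text. A sketch, assuming a table named `trips` (the actual table name is whatever the Airflow provisioning step creates):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def count_query_url(host="localhost", port=8123, table="trips"):
    """Build a ClickHouse HTTP-interface URL for a row-count query."""
    query = f"SELECT count() FROM {table}"
    return f"http://{host}:{port}/?{urlencode({'query': query})}"

def validate_count(expected, url):
    # The HTTP interface returns the scalar result as plain text.
    with urlopen(url) as resp:
        return int(resp.read().strip()) == expected

print(count_query_url())
# → http://localhost:8123/?query=SELECT+count%28%29+FROM+trips
```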

📊 Observability & Metrics

![Benchmark results](benchmark_results.png)

The project includes a pre-configured, zero-touch Grafana dashboard (http://localhost:3000) that tracks the "Golden Signals" of the streaming pipeline:

  • Producer Throughput: Validates the Go application's ability to saturate network I/O.
  • Consumer Lag: Visualizes Kafka's role as a buffer, showing the exact backlog the Python consumer is working through.
  • Total Records: Tracks the absolute ingestion volume in real-time.
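
Consumer lag is simply the log-end offset minus the consumer group's committed offset, summed over partitions. With illustrative numbers (not measured offsets), a backlog like the 31.3M peak looks like:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag: log-end offset minus committed offset, summed over partitions."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0)
               for p in end_offsets)

# Three partitions: the producer has written far ahead of the consumer.
end = {0: 15_000_000, 1: 15_000_000, 2: 14_000_000}
committed = {0: 5_000_000, 1: 4_700_000, 2: 3_000_000}
print(consumer_lag(end, committed))  # → 31300000
```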

🧠 Technical Deep Dive: Why Kafka?

Without Kafka, connecting the fast Go producer directly to the analytical database would cause severe write throttling, high latency, or dropped inserts. Because the Python consumer is constrained by the Global Interpreter Lock (GIL) and the network round-trips of database inserts, it cannot keep pace with Go's raw throughput.

By introducing Kafka, the architecture safely decouples the two services: Go pushes data as fast as the CPU allows and exits in seconds, while the Python consumer drains the ~31-million-record backlog at its maximum sustainable speed.
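
The decoupling can be reduced to a queue between a fast producer and a slow consumer: the producer finishes and exits while the consumer keeps draining the backlog at its own pace. A stdlib-only sketch of that shape (the `Queue` stands in for the Kafka topic; nothing here is the pipeline's actual code):

```python
import queue
import threading
import time

buf = queue.Queue()          # stands in for the Kafka topic
consumed = []

def producer(n):
    # Fast: enqueue everything at once and return.
    for i in range(n):
        buf.put(i)
    buf.put(None)            # sentinel: production finished

def consumer():
    # Slow: drain at a sustainable pace.
    while (item := buf.get()) is not None:
        consumed.append(item)
        time.sleep(0.001)    # simulated per-insert latency

t = threading.Thread(target=consumer)
t.start()
producer(50)                 # returns almost immediately
t.join()                     # consumer keeps draining the backlog
print(len(consumed))         # → 50
```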
