JaynilJaiswal/veloStream

VeloStream: High-Throughput Decoupled ETL Pipeline

Go Python Apache Kafka ClickHouse Apache Airflow Grafana Docker

VeloStream is a high-performance, fully orchestrated data ingestion and processing pipeline. It demonstrates handling massive data spikes by decoupling a blazing-fast data producer from a slower data warehouse consumer using a distributed message broker.

🚀 Key Achievements

  • Peak Ingestion Rate: Achieved ~398,000 messages/sec throughput utilizing asynchronous batching in Go.
  • Massive Scale: Successfully processed and validated 44,000,000+ NYC Taxi Trip records.
  • Resilient Architecture: Leveraged Kafka to absorb a peak consumer lag of 31.3 million records, protecting the downstream OLAP database from write throttling or crashes during data spikes.
  • Infrastructure as Code (IaC): 100% automated provisioning of Kafka topics, ClickHouse tables, and Grafana dashboards.
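
The headline throughput comes from amortizing network round-trips: the Go producer (via segmentio/kafka-go) groups many messages into each Kafka write. As a minimal sketch of the underlying size-or-timeout flush policy (shown in Python for brevity; the `Batcher` class and its parameters are illustrative, not the actual producer code):

```python
import time

class Batcher:
    """Accumulate messages and flush when the batch is full or a timeout elapses."""

    def __init__(self, flush_fn, max_batch=1000, max_wait_s=0.1):
        self.flush_fn = flush_fn      # called with a list of messages
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._buf = []
        self._deadline = None

    def add(self, msg):
        if not self._buf:
            # Start the flush timer when the first message of a batch arrives.
            self._deadline = time.monotonic() + self.max_wait_s
        self._buf.append(msg)
        if len(self._buf) >= self.max_batch or time.monotonic() >= self._deadline:
            self.flush()

    def flush(self):
        if self._buf:
            self.flush_fn(self._buf)
            self._buf = []
            self._deadline = None

batches = []
b = Batcher(batches.append, max_batch=3, max_wait_s=10.0)
for i in range(7):
    b.add(i)
b.flush()  # drain the remainder
print([len(batch) for batch in batches])  # → [3, 3, 1]
```

The real producer additionally runs many such writers concurrently across goroutines, which is where the ~398k msg/sec figure comes from.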

🏗️ Architecture Overview

  1. Producer (Go): Parses large Parquet datasets and streams data concurrently into Kafka using optimized batching (segmentio/kafka-go).
  2. Message Broker (Apache Kafka - KRaft): Acts as the shock-absorber, holding millions of records in queue during burst ingestion.
  3. Consumer (Python): Subscribes to the Kafka topic, deserializes payloads, and executes bulk inserts into ClickHouse.
  4. Data Warehouse (ClickHouse): Stores the final dataset in a highly compressed MergeTree table for rapid analytical querying.
  5. Orchestration (Apache Airflow): Manages the entire lifecycle—provisioning infrastructure, managing concurrent background processes, monitoring queue drain states (Consumer Lag), and validating final data integrity.
  6. Observability (Prometheus & Grafana): Provides real-time metrics on throughput and queue health via automated IaC dashboarding.
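
Step 5's queue-drain monitoring reduces to polling consumer lag until the backlog reaches zero before the consumer is shut down. A minimal sketch of that loop (`get_lag` is a stand-in for a real Kafka consumer-group offset query, not the pipeline's actual code):

```python
import time

def wait_for_drain(get_lag, poll_interval_s=0.0, timeout_s=60.0):
    """Poll consumer lag until the backlog hits zero; True on drain, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_lag() <= 0:
            return True
        time.sleep(poll_interval_s)
    return False

# Simulated backlog that drains by 10 records per poll.
backlog = [30]
def fake_lag():
    backlog[0] = max(0, backlog[0] - 10)
    return backlog[0]

print(wait_for_drain(fake_lag))  # → True
```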

📂 Project Structure

```
veloStream/
├── airflow/dags/          # Airflow DAGs for ETL orchestration and queue monitoring
├── infrastructure/        # Docker Compose, Kafka KRaft configs, and ClickHouse setup
│   └── observability/     # IaC provisioned Grafana dashboards and Prometheus configs
├── src/
│   ├── producer/          # High-throughput Go data generator
│   └── consumer/          # Python ETL script for ClickHouse ingestion
└── assets/                # Benchmark screenshots and architecture diagrams
```

🛠️ Quick Start

1. Prerequisites

  • Docker & Docker Compose
  • Go 1.21+
  • Python 3.10+
  • Apache Airflow (Local Standalone)

2. Boot the Infrastructure

Spin up Kafka, ClickHouse, Prometheus, and Grafana.

```bash
cd infrastructure
docker compose up -d
```

3. Launch the Orchestrator

Start Airflow to manage the pipeline execution.

```bash
airflow standalone
```

4. Run the Benchmark

  1. Navigate to the Airflow UI at http://localhost:8080.
  2. Unpause and trigger the velostream_benchmark_run DAG.
  3. Open Grafana at http://localhost:3000 to watch the real-time throughput metrics.
  4. Airflow automatically provisions the database tables, starts the consumer, launches the Go producer, monitors Kafka consumer lag until it reaches 0, safely shuts down the consumer, and validates the final 44,000,000-record count in ClickHouse.
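
The final record-count validation can be performed over ClickHouse's built-in HTTP interface (port 8123), which returns scalar query results as plain text. A sketch, assuming a table named `trips` (the actual table name is whatever the Airflow provisioning step creates):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def count_query_url(host="localhost", port=8123, table="trips"):
    """Build a ClickHouse HTTP-interface URL for a row-count query."""
    query = f"SELECT count() FROM {table}"
    return f"http://{host}:{port}/?{urlencode({'query': query})}"

def validate_count(expected, url):
    # The HTTP interface returns the scalar result as plain text.
    with urlopen(url) as resp:
        return int(resp.read().strip()) == expected

print(count_query_url())
# → http://localhost:8123/?query=SELECT+count%28%29+FROM+trips
```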

📊 Observability & Metrics

![Benchmark results](benchmark_results.png)

The project includes a pre-configured, zero-touch Grafana dashboard (http://localhost:3000) that tracks the "Golden Signals" of the streaming pipeline:

  • Producer Throughput: Validates the Go application's ability to saturate network I/O.
  • Consumer Lag: Visualizes Kafka's role as a buffer, showing the exact backlog the Python consumer is working through.
  • Total Records: Tracks the absolute ingestion volume in real-time.
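
Consumer lag is simply the log-end offset minus the consumer group's committed offset, summed over partitions. With illustrative numbers (not measured offsets), a backlog like the 31.3M peak looks like:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag: log-end offset minus committed offset, summed over partitions."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0)
               for p in end_offsets)

# Three partitions: the producer has written far ahead of the consumer.
end = {0: 15_000_000, 1: 15_000_000, 2: 14_000_000}
committed = {0: 5_000_000, 1: 4_700_000, 2: 3_000_000}
print(consumer_lag(end, committed))  # → 31300000
```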

🧠 Technical Deep Dive: Why Kafka?

Without Kafka, connecting the fast Go producer directly to the analytical database would cause severe write throttling, high latency, or dropped inserts. Because the Python consumer is constrained by the Global Interpreter Lock (GIL) and the network round-trips of database inserts, it cannot keep pace with Go's raw throughput.

By introducing Kafka, the architecture safely decouples the two services: Go pushes data as fast as the CPU allows and exits in seconds, while the Python consumer drains the ~31-million-record backlog at its maximum sustainable speed.
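
The decoupling can be reduced to a queue between a fast producer and a slow consumer: the producer finishes and exits while the consumer keeps draining the backlog at its own pace. A stdlib-only sketch of that shape (the `Queue` stands in for the Kafka topic; nothing here is the pipeline's actual code):

```python
import queue
import threading
import time

buf = queue.Queue()          # stands in for the Kafka topic
consumed = []

def producer(n):
    # Fast: enqueue everything at once and return.
    for i in range(n):
        buf.put(i)
    buf.put(None)            # sentinel: production finished

def consumer():
    # Slow: drain at a sustainable pace.
    while (item := buf.get()) is not None:
        consumed.append(item)
        time.sleep(0.001)    # simulated per-insert latency

t = threading.Thread(target=consumer)
t.start()
producer(50)                 # returns almost immediately
t.join()                     # consumer keeps draining the backlog
print(len(consumed))         # → 50
```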
