A scalable streaming architecture integrating asynchronous ML inference with stateful Flink processing and Paimon storage for real-time observability.
Project Description:
A streaming data platform designed to capture, process and visualize the pulse of real-time social discourse and trends across the decentralized social platform (Bluesky).
Ultimately, this project seeks to operationalize unstructured social data, transforming raw high-velocity streams into structured insights to identify social trends, bot networks, anomalous content signals and others without compromising the ingestion speed. A high-throughput streaming pipeline powered by Apache Flink and Kafka that ingests live Bluesky Jetstream events (posts) into a scalable Apache Paimon data lakehouse, leveraging asynchronous I/O to decouple heavy inference from stream processing, reducing analytical latency and flink processing complexity for robust ecosystem monitoring. Also, a fun project to sharpen the system architecting and data engineering skill personally.
- Executive Summary
- System Overview & Architecture
- Data Modeling & Storage Strategy
- Orchestration & Data Quality
- Key Engineering Challenges
- How to Run
- Lesson Learned & Roadmap
- Author
- Appendices
- Attribution
In an era where social discourse on decentralized platforms like Bluesky moves faster than traditional analytics can track, it is crucial to protect the society from being misleaded by fake and toxic articles and news, a unified real-time pane of glass to distinguish organic trends from artificial noise is required. While for data intelligence and Trust & Safety teams, it is crucial to detect toxicity surges, bot networks, and viral anomalies the moment they emerge, not hours later. With market sentiment and brand reputation now inextricably linked to real-time social signals, this platform bridges that gap, ingesting the "firehose" of Bluesky Jetstream to provide sub-second insights into content velocity, sentiment shifts, and platform health.
The Goal: Operationalize the unstructured chaos of high-velocity social streams by instantly correlating raw events with deep ML inference (Toxicity, NER, Bot Detection-WIP) to visualize the "heartbeat" of the decentralized web.
The Solution: A scalable Streaming Lakehouse pipeline that ingests websocket data and published through Kafka, decouples heavy ML inference using an Async Nginx Gateway, aggregates stateful metrics with Apache Flink, and commits data to an Apache Paimon lakehouse for unified real-time and historical analysis.
Key Metric: Reduces actionable insight latency from hours (batch processing) to sub-seconds, while preventing stream backpressure during high-load ML inference bursts via asynchronous non-blocking I/O.
Key Architectural Decision:
- Streaming Lakehouse: "Kappa 2.0" with cheap storage (S3 compatible-Minio) and open table format (Apache Paimon) with high-frequency streaming updates/deletes/appends capabilities , treating all data as a continuous stream. By using Apache Flink as the unified engine, eliminating complexity of maintaining separate batch and speed layers.
- Async I/O Pattern: To mitigate slow ML inference from blocking high-throughput stream ingestion and stuck checkpoint barriers with an intermediate scalable worker decouples Kafka stream from the ML Gateway, allowing non-blocking HTTP requests to the model server.
| Component | Technology | Why it was chosen |
|---|---|---|
| Ingestion | Bluesky Jetstream | Decentralized, free and open source for experimental purpose. Provides a lightweight, firehose-style websocket API that delivers JSON events in real-time, simulate implementation of a real streaming data platform. |
| Data Bus | Apache Kafka | Decouples the volatile ingestion layer from the heavy compute layer; guarantees at-least-once delivery and acts as the persistent buffer for backpressure handling. |
| ML Inference | Nginx + Python | Using an Nginx load balancer to distribute load requests across multiple model containers (Toxicity, NER), ensuring single-model latency doesn't stall the pipeline. |
| Processing | Apache Flink | A single unified engine for its stateful processing capabilities (Windowing) and robust checkpointing mechanism, ensuring exactly-once processing semantics during aggregations. |
| Data Lakehouse | Paimon + MinIO | Enables "Streaming Data Warehouse" capabilities; allows Flink to write streaming updates directly to S3 (MinIO) that are immediately queryable, improve latency over static Iceberg snapshots. |
| Serving | Trino | A federated query engine that allows standard SQL analysis over the Paimon Lakehouse files stored in MinIO without requiring data movement to a separate warehouse. |
| Observability | LGTM + OpenTel Stack | OpenTelemetry, Loki, Grafana, Tempo, Prometheus: Provides a "Single Pane of Glass" to correlate logs, metrics, and distributed traces, essential for debugging latency across the async ML layer. |
The Data Flow:
-
Ingestion Layer: A specialized
bluesky-ingestionservice connects to the Jetstream Websocket, normalizing raw events into JSON and pushing them to the primary Kafka topic (bluesky.stream.raw). -
Async Enrichment Layer: The
enrichment-workerconsumes raw events and performs non-blocking asynchronous HTTP calls to thenginx-ml-gateway. This gateway routes text to specific models (Toxic-BERT, NER), augments the payload with inference scores, and produces to a new topic (bluesky.stream.enriched). -
Stream Aggregation: Apache Flink reads enriched stream from Kafka, applying sliding window aggregations. Open Telemetry acts as the instrumentation tool to collect defined business and operational metrics.
-
Data Lakehouse (Storage): Raw and aggregated results are streamed directly into Apache Paimon tables on MinIO S3, organizing data into partitions optimized for queries.
-
Serving & Analytics: Results scrapped from OpenTelemetry by Prometheus are queried for dashboards in Grafana, and Trino serves as the interactive query layer, enabling data analysts/scientist to run SQL queries against Paimon tables.
Schema Design: One Big Table (OBT)
- Instead of traditional highly-normalized Star Scheme (which requires expensive joins), I implemented a Wide-Column (OBT) design. The
enriched_poststable contains all dimensions (Author, Text, Sentiment, Toxicity, Entities) in a single row, optimizing it for the columnar scan performance of Trino.
Storage
- Streaming Lakehouse (Apache Paimon on S3)
enriched_posts(Append-Only log):- Engine:
Parimon Append-Only - Purpose: Stores the raw, immutable history of every single analyzed post.
- Optimization: Configured with
bucket=4to distribute data across files for parallel reading. - Format:
parquet(Columnar) for highly efficient compression of text fields likeentities_json
- Engine:
- Operational Metrics Store (Prometheus)
- Serves as the "speed layer" for immediate alerting and system observability.
WIP
| Challenge | Engineering Solution Implemented |
|---|---|
| Checkpoint Starvation | Kafka Loopback Pattern: Initially, running synchronous ML inference directly inside Flink operators blocked the processing threads. This prevented Checkpoint Barriers from flowing downstream, causing constant checkpoint timeouts and eventual job collapse. Initially, Flink's Async I/O API (Latest Flink 2.2.0) is considered, but Paimon-Flink does not support for Flink 2.2.0 yet by time of writing. So I resolved this by decoupling the architecture: a separate Python worker consumes raw data, performs the inference, and "loops back" the result to an enriched Kafka topic. Flink now only consumes the pre-processed stream, ensuring low-latency state access and stable checkpointing. |
| ML Inference Latency | Async Gateway Pattern: High-velocity streams often overwhelm CPU-bound ML models (BERT/NER). I implemented an Async Nginx Gateway (nginx-ml-gateway) that load-balances requests across model replicas. By using asynchronous I/O in the enrichment worker, I decoupled the stream throughput from model inference time, preventing backpressure propagation to Kafka. |
| Small File Problem | LSM Tree Compaction: Streaming raw events directly to S3 typically generates millions of tiny files, killing query performance. I utilized Apache Paimon’s LSM (Log-Structured Merge) tree architecture with auto-compaction (compaction.min.file-num=10). This automatically merges streaming deltas into optimized Parquet files on MinIO without pausing ingestion. Another option to schedule "rewrite" job to perform compaction over small files if Iceberg were used instead. |
| Model Cold Starts | Build-Time Model Baking: Downloading large NLP models at runtime caused unacceptable container startup latency (30s+). I optimized the deployment by using uv for ultra-fast dependency caching and pre-downloading models directly into the Docker image layer during the build phase. This deliberately sacrifices CI/CD build speed to reduce production scaling time and eliminate network dependencies during auto-scale events. |
This project is fully containerized. Prerequisites: Docker Engine & 16GB+ RAM.
- Clone the repo.
- Configure the dependencies file, and place them based on the project structure.
- Spin up the stack.
make build
make up- Access and login into UI (Grafana Dashboard): http://localhost:3000 (default credentials)
Lesson:
| Lesson | Description |
|---|---|
| Synchronous Inference vs. Decoupled Loopback | Embedding blocking ML computations directly in Flink operators halts checkpoint barriers, causing job instability and timeouts. The "Loopback Pattern" (Kafka → Worker → Kafka → Flink) isolates heavy compute from stateful processing, ensuring Flink maintains sub-second latency and consistent state snapshots regardless of model load. Other options are Flink's Async I/O (Avaiable ver 2.2.0+), Unaligned Checkpoints configuration, Embedding of compressed model (with ONNX) in Flink Processing and others. |
| Standard Parquet Sink vs. Paimon LSM | Direct streaming writes to S3 typically result in the "small file problem," crippling downstream Trino query performance. Apache Paimon’s Log-Structured Merge (LSM) tree architecture was critical to enable high-throughput streaming appends with automatic background compaction, bridging the gap between real-time writing and fast OLAP reading. |
| Runtime Model Download vs. Build-Time Baking | Downloading large NLP models during container startup creates unpredictable latency (30s+) and network dependency during auto-scaling events. "Baking" models into the Docker image (using uv and docker with caching) deliberately sacrifices CI build speed to guarantee deterministic, instant startup in production—a vital trade-off for resilient MLOps. |
Roadmap:
Phase 2: Reliability & Hardening
- To complete alerting manager and Loki integration for full observability, monitoring and alerting.
- To implement CI/CD pipelines for automated testing and chaos testing.
- To implement schema registry into the stack for serialization of kafka messages and future schema evolution without breaking the pipelines.
- To implement DLQ within Flink topology to capture and isolate abnormal events without stalling the main pipeline.
- To integrate Great Expectations into ingestion worker for runtime data quality checks.
Phase 3: Infrastructure & Scale
- To use Terraform for IaaC.
- To migrate from Docker Compose to Kubernetes
- To implement S3 Sink Connector for infinite retention of raw Jetstream logs (Kafka) into MinI, decoupling retention from broker disk limits.
Phase 4: Productionization and Optimization
- To implement a self-service configuration layer for dynamically inject keywords/patterns into running Flink job without redeployment.
- To enhance Flink with Complex Event Processing.
- To enable Bot Detection, Embeddings model for advanced features.
- To implement Graph analytics for user interation network and community clustering.
- To ingest events such as "likes", "reposts", "follow" etc. for enhanced analytics.
I am always looking to elevate my work, and I view every project as a stepping stone. I genuinely welcome and seek out constructive criticism and improvement suggestions on my approach, code architecture, or documentation. Let's make this code better together—feel free to connect with your thoughts!
This project is open source and available under the MIT License.
Dashboard 1: Bluesky Analytics - Content Intelligence
- This dashboard focuses on "Social Listening" and content analysis. It visualizes the metrics such as top mentioned entities, viral potential and others of data flowing through the system.
Dashboard 2: Bluesky Ingestion Ops
- This dashboard tracks the "front door" of the data pipeline, monitoring how fast and effectively data is being received from external sources via WebSockets.
Dashboard 3: ML Enrichment Ops
- This dashboard monitors the Machine Learning microservice that enriches the data. It confirms this is the bottleneck identified in the Kafka dashboard. (Limited by the setup of machine (Rely on low-end CPU) for ML processing.)
Dashboard 4:Kafka Cluster Ops
Dashboard 5: Flink Ops
- This monitors the stream processing engine (Apache Flink). It shows the health of the jobs processing the data streams.
Dashboard 6: MinIO Object Storage Ops
- This dashboard monitors the health and capacity of the object storage layer (MinIO). It highlights a potentially critical capacity and performance issue.
Dashboard 7: Nginx Gateway Ops
- This dashboard monitors the web server/reverse proxy (NGINX) that likely sits at the very edge of the network.
Architecture Diagram is generated with the help by Google Gemini.