Bluesky StreamPulse : Async ML & Real-Time Analytics with Streaming Lakehouse

A scalable streaming architecture integrating asynchronous ML inference with stateful Flink processing and Paimon storage for real-time observability.

Project Description:
A streaming data platform designed to capture, process and visualize the pulse of real-time social discourse and trends across the decentralized social platform (Bluesky).

Ultimately, this project seeks to operationalize unstructured social data, transforming raw high-velocity streams into structured insights to identify social trends, bot networks, anomalous content signals and others without compromising the ingestion speed. A high-throughput streaming pipeline powered by Apache Flink and Kafka that ingests live Bluesky Jetstream events (posts) into a scalable Apache Paimon data lakehouse, leveraging asynchronous I/O to decouple heavy inference from stream processing, reducing analytical latency and flink processing complexity for robust ecosystem monitoring. Also, a fun project to sharpen the system architecting and data engineering skill personally.

In an era where social discourse on decentralized platforms like Bluesky moves faster than traditional analytics can track, it is crucial to protect the society from being misleaded by fake and toxic articles and news, a unified real-time pane of glass to distinguish organic trends from artificial noise is required. While for data intelligence and Trust & Safety teams, it is crucial to detect toxicity surges, bot networks, and viral anomalies the moment they emerge, not hours later. With market sentiment and brand reputation now inextricably linked to real-time social signals, this platform bridges that gap, ingesting the "firehose" of Bluesky Jetstream to provide sub-second insights into content velocity, sentiment shifts, and platform health.

The Goal: Operationalize the unstructured chaos of high-velocity social streams by instantly correlating raw events with deep ML inference (Toxicity, NER, Bot Detection-WIP) to visualize the "heartbeat" of the decentralized web.

The Solution: A scalable Streaming Lakehouse pipeline that ingests websocket data and published through Kafka, decouples heavy ML inference using an Async Nginx Gateway, aggregates stateful metrics with Apache Flink, and commits data to an Apache Paimon lakehouse for unified real-time and historical analysis.

Key Metric: Reduces actionable insight latency from hours (batch processing) to sub-seconds, while preventing stream backpressure during high-load ML inference bursts via asynchronous non-blocking I/O.

System Overview & Architecture

Key Architectural Decision:

Streaming Lakehouse: "Kappa 2.0" with cheap storage (S3 compatible-Minio) and open table format (Apache Paimon) with high-frequency streaming updates/deletes/appends capabilities , treating all data as a continuous stream. By using Apache Flink as the unified engine, eliminating complexity of maintaining separate batch and speed layers.
Async I/O Pattern: To mitigate slow ML inference from blocking high-throughput stream ingestion and stuck checkpoint barriers with an intermediate scalable worker decouples Kafka stream from the ML Gateway, allowing non-blocking HTTP requests to the model server.

Component	Technology	Why it was chosen
Ingestion	Bluesky Jetstream	Decentralized, free and open source for experimental purpose. Provides a lightweight, firehose-style websocket API that delivers JSON events in real-time, simulate implementation of a real streaming data platform.
Data Bus	Apache Kafka	Decouples the volatile ingestion layer from the heavy compute layer; guarantees at-least-once delivery and acts as the persistent buffer for backpressure handling.
ML Inference	Nginx + Python	Using an Nginx load balancer to distribute load requests across multiple model containers (Toxicity, NER), ensuring single-model latency doesn't stall the pipeline.
Processing	Apache Flink	A single unified engine for its stateful processing capabilities (Windowing) and robust checkpointing mechanism, ensuring exactly-once processing semantics during aggregations.
Data Lakehouse	Paimon + MinIO	Enables "Streaming Data Warehouse" capabilities; allows Flink to write streaming updates directly to S3 (MinIO) that are immediately queryable, improve latency over static Iceberg snapshots.
Serving	Trino	A federated query engine that allows standard SQL analysis over the Paimon Lakehouse files stored in MinIO without requiring data movement to a separate warehouse.
Observability	LGTM + OpenTel Stack	OpenTelemetry, Loki, Grafana, Tempo, Prometheus: Provides a "Single Pane of Glass" to correlate logs, metrics, and distributed traces, essential for debugging latency across the async ML layer.

The Data Flow:

Ingestion Layer: A specialized bluesky-ingestion service connects to the Jetstream Websocket, normalizing raw events into JSON and pushing them to the primary Kafka topic (bluesky.stream.raw).
Async Enrichment Layer: The enrichment-worker consumes raw events and performs non-blocking asynchronous HTTP calls to the nginx-ml-gateway. This gateway routes text to specific models (Toxic-BERT, NER), augments the payload with inference scores, and produces to a new topic (bluesky.stream.enriched).
Stream Aggregation: Apache Flink reads enriched stream from Kafka, applying sliding window aggregations. Open Telemetry acts as the instrumentation tool to collect defined business and operational metrics.
Data Lakehouse (Storage): Raw and aggregated results are streamed directly into Apache Paimon tables on MinIO S3, organizing data into partitions optimized for queries.
Serving & Analytics: Results scrapped from OpenTelemetry by Prometheus are queried for dashboards in Grafana, and Trino serves as the interactive query layer, enabling data analysts/scientist to run SQL queries against Paimon tables.

Data Modeling & Storage Strategy

Schema Design: One Big Table (OBT)

Instead of traditional highly-normalized Star Scheme (which requires expensive joins), I implemented a Wide-Column (OBT) design. The enriched_posts table contains all dimensions (Author, Text, Sentiment, Toxicity, Entities) in a single row, optimizing it for the columnar scan performance of Trino.

Storage

Streaming Lakehouse (Apache Paimon on S3)

enriched_posts (Append-Only log):
- Engine: Parimon Append-Only
- Purpose: Stores the raw, immutable history of every single analyzed post.
- Optimization: Configured with bucket=4 to distribute data across files for parallel reading.
- Format: parquet (Columnar) for highly efficient compression of text fields like entities_json

Operational Metrics Store (Prometheus)
- Serves as the "speed layer" for immediate alerting and system observability.

Orchestration & Data Quality

WIP

Key Engineering Challenges

Challenge	Engineering Solution Implemented
Checkpoint Starvation	Kafka Loopback Pattern: Initially, running synchronous ML inference directly inside Flink operators blocked the processing threads. This prevented Checkpoint Barriers from flowing downstream, causing constant checkpoint timeouts and eventual job collapse. Initially, Flink's Async I/O API (Latest Flink 2.2.0) is considered, but Paimon-Flink does not support for Flink 2.2.0 yet by time of writing. So I resolved this by decoupling the architecture: a separate Python worker consumes raw data, performs the inference, and "loops back" the result to an enriched Kafka topic. Flink now only consumes the pre-processed stream, ensuring low-latency state access and stable checkpointing.
ML Inference Latency	Async Gateway Pattern: High-velocity streams often overwhelm CPU-bound ML models (BERT/NER). I implemented an Async Nginx Gateway (`nginx-ml-gateway`) that load-balances requests across model replicas. By using asynchronous I/O in the enrichment worker, I decoupled the stream throughput from model inference time, preventing backpressure propagation to Kafka.
Small File Problem	LSM Tree Compaction: Streaming raw events directly to S3 typically generates millions of tiny files, killing query performance. I utilized Apache Paimon’s LSM (Log-Structured Merge) tree architecture with auto-compaction (compaction.min.file-num=10). This automatically merges streaming deltas into optimized Parquet files on MinIO without pausing ingestion. Another option to schedule "rewrite" job to perform compaction over small files if Iceberg were used instead.
Model Cold Starts	Build-Time Model Baking: Downloading large NLP models at runtime caused unacceptable container startup latency (30s+). I optimized the deployment by using uv for ultra-fast dependency caching and pre-downloading models directly into the Docker image layer during the build phase. This deliberately sacrifices CI/CD build speed to reduce production scaling time and eliminate network dependencies during auto-scale events.

How to Run

This project is fully containerized. Prerequisites: Docker Engine & 16GB+ RAM.

Clone the repo.
Configure the dependencies file, and place them based on the project structure.
- Create .env: Refer to the demo example.
- Create secrets directory: To store the secret keys for Docker. Refer to demo example.
Spin up the stack.

make build
make up

Access and login into UI (Grafana Dashboard): http://localhost:3000 (default credentials)

Lesson Learned & Roadmap

Lesson:

Lesson	Description
Synchronous Inference vs. Decoupled Loopback	Embedding blocking ML computations directly in Flink operators halts checkpoint barriers, causing job instability and timeouts. The "Loopback Pattern" (Kafka → Worker → Kafka → Flink) isolates heavy compute from stateful processing, ensuring Flink maintains sub-second latency and consistent state snapshots regardless of model load. Other options are Flink's Async I/O (Avaiable ver 2.2.0+), Unaligned Checkpoints configuration, Embedding of compressed model (with ONNX) in Flink Processing and others.
Standard Parquet Sink vs. Paimon LSM	Direct streaming writes to S3 typically result in the "small file problem," crippling downstream Trino query performance. Apache Paimon’s Log-Structured Merge (LSM) tree architecture was critical to enable high-throughput streaming appends with automatic background compaction, bridging the gap between real-time writing and fast OLAP reading.
Runtime Model Download vs. Build-Time Baking	Downloading large NLP models during container startup creates unpredictable latency (30s+) and network dependency during auto-scaling events. "Baking" models into the Docker image (using uv and docker with caching) deliberately sacrifices CI build speed to guarantee deterministic, instant startup in production—a vital trade-off for resilient MLOps.

Roadmap:

Phase 2: Reliability & Hardening

To complete alerting manager and Loki integration for full observability, monitoring and alerting.
To implement CI/CD pipelines for automated testing and chaos testing.
To implement schema registry into the stack for serialization of kafka messages and future schema evolution without breaking the pipelines.
To implement DLQ within Flink topology to capture and isolate abnormal events without stalling the main pipeline.
To integrate Great Expectations into ingestion worker for runtime data quality checks.

Phase 3: Infrastructure & Scale

To use Terraform for IaaC.
To migrate from Docker Compose to Kubernetes
To implement S3 Sink Connector for infinite retention of raw Jetstream logs (Kafka) into MinI, decoupling retention from broker disk limits.

Phase 4: Productionization and Optimization

To implement a self-service configuration layer for dynamically inject keywords/patterns into running Flink job without redeployment.
To enhance Flink with Complex Event Processing.
To enable Bot Detection, Embeddings model for advanced features.
To implement Graph analytics for user interation network and community clustering.
To ingest events such as "likes", "reposts", "follow" etc. for enhanced analytics.

Author

Chen Lam

I am always looking to elevate my work, and I view every project as a stepping stone. I genuinely welcome and seek out constructive criticism and improvement suggestions on my approach, code architecture, or documentation. Let's make this code better together—feel free to connect with your thoughts!

License

This project is open source and available under the MIT License.

Appendices : Overview of the Project Output

Dashboard 1: Bluesky Analytics - Content Intelligence

This dashboard focuses on "Social Listening" and content analysis. It visualizes the metrics such as top mentioned entities, viral potential and others of data flowing through the system.

Dashboard 2: Bluesky Ingestion Ops

This dashboard tracks the "front door" of the data pipeline, monitoring how fast and effectively data is being received from external sources via WebSockets.

Dashboard 3: ML Enrichment Ops

This dashboard monitors the Machine Learning microservice that enriches the data. It confirms this is the bottleneck identified in the Kafka dashboard. (Limited by the setup of machine (Rely on low-end CPU) for ML processing.)

Dashboard 4:Kafka Cluster Ops

This dashboard manages the central message bus (Apache Kafka).

Dashboard 5: Flink Ops

This monitors the stream processing engine (Apache Flink). It shows the health of the jobs processing the data streams.

Dashboard 6: MinIO Object Storage Ops

This dashboard monitors the health and capacity of the object storage layer (MinIO). It highlights a potentially critical capacity and performance issue.

Dashboard 7: Nginx Gateway Ops

This dashboard monitors the web server/reverse proxy (NGINX) that likely sits at the very edge of the network.

Attribution

Architecture Diagram is generated with the help by Google Gemini.

Name		Name	Last commit message	Last commit date
Latest commit History 600 Commits
.github/workflows		.github/workflows
infrastructures		infrastructures
references		references
reports/figures		reports/figures
services		services
sql		sql
.dockerignore		.dockerignore
.gitignore		.gitignore
Makefile		Makefile
README-example_.env.md		README-example_.env.md
README-project-structure.md		README-project-structure.md
README-secrets.md		README-secrets.md
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bluesky StreamPulse : Async ML & Real-Time Analytics with Streaming Lakehouse

A scalable streaming architecture integrating asynchronous ML inference with stateful Flink processing and Paimon storage for real-time observability.

Table of Contents