GitHub - codeshardlabs/user-feed-cdc: Implement user feed using debezium based change data capture

User Feed using Debezium Change Data Capture

Problem Statement

This project implements a scalable user feed system that processes user activities in real-time while ensuring at least-once semantics and high availability. The system handles user interactions like follows, posts, comments and likes through a multi-stage data pipeline.

Workflow

Overview

This project implements a real-time data pipeline using Debezium for Change Data Capture (CDC) from PostgreSQL. At its core, Debezium monitors PostgreSQL's Write Ahead Log (WAL) for any data modifications, which requires configuring PostgreSQL with logical WAL level. When changes occur in the source PostgreSQL tables, Debezium captures these changes from the WAL and streams them to dedicated Kafka topics. A Kafka consumer service continuously polls these topics for new messages, processing them using a SchemaAdapterStrategyFactory that leverages Factory and Strategy patterns to transform the data into Cassandra's schema format. The transformed data is then batch inserted into Cassandra, chosen specifically for its distributed architecture and high write throughput capabilities. Cassandra's Log-Structured Merge (LSM) tree data structure optimizes write performance by first storing data in memory before flushing to disk, making it ideal for data ingestion scenarios.

Technical Components

Source System (PostgreSQL)

Primary database storing user activities and interactions
Configured with logical replication (wal_level=logical) to enable Change Data Capture
Handles transactional writes for user activities through REST API endpoints
Tables are configured with primary keys and appropriate indexes

Change Data Capture (Debezium)

Monitors PostgreSQL Write-Ahead Logs (WAL) for DML changes
Ensures reliable capture of all data modifications
Publishes changes to dedicated Kafka topics

Message Queue (Apache Kafka)

Provides durable storage of change events
Maintains strict ordering within partitions
Enables parallel processing through multiple partitions
Topics configured with appropriate retention and replication

Polling System (Kafka Consumer)

Polls kafka topic for new message.
When we receive new message/event, it modifies the data according to the reqd. format.
Saves the data in cassandra DB.

Cache Layer (Redis)

In-memory caching of frequently accessed feeds
Provides sub-millisecond read latency for hot data
Falls back to Cassandra for cache misses

Sink System (Apache Cassandra)

Log-Structured Merge (LSM) tree based storage
Optimized for high-throughput writes
Partitioned by user_id for efficient reads
Stores feed items in time-sorted order
Supports efficient pagination queries

Data Flow

User performs activity (follow/post/comment/like) via REST API
Activity is recorded in PostgreSQL tables
Debezium captures change events from PostgreSQL WAL
Events are published to corresponding Kafka topics
Kafka consumer polls the kafka topic for new event messages and if there is one, it modifies and ingests data to Cassandra DB.
Redis caches frequently accessed feed segments
Feed API serves requests from cache with Cassandra fallback

Possible Activity types to be listed on user feed

Follow User ("FOLLOWERS")
User Create New Post ("SHARDS")
User Comment on Post ("COMMENTS")
User Likes a post ("LIKES")

Local Development

Start all the services

#!/bin/bash
docker-compose down -v
docker-compose pull
docker-compose up -d 
docker-compose ps

Local server

Create new virtual environment

python3 -m venv venv

Activate that environment

source venv/bin/activate

Run the development server

uvicorn main:app --reload

Test the setup using a shell script

#!/bin/bash
chmod +x test_setup.sh
./test-setup.sh

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
debezium-connectors		debezium-connectors
public		public
services		services
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
cache.py		cache.py
cassandra-init.cql		cassandra-init.cql
config.py		config.py
connection_state.py		connection_state.py
docker-compose.yml		docker-compose.yml
enums.py		enums.py
env.py		env.py
event_processor.py		event_processor.py
logger.py		logger.py
main.py		main.py
postgres-init.sql		postgres-init.sql
requirements.txt		requirements.txt
strategy.py		strategy.py
test_setup.sh		test_setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

User Feed using Debezium Change Data Capture

Problem Statement

Workflow

Overview

Technical Components

Data Flow

Possible Activity types to be listed on user feed

Local Development

Start all the services

Local server

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

codeshardlabs/user-feed-cdc

Folders and files

Latest commit

History

Repository files navigation

User Feed using Debezium Change Data Capture

Problem Statement

Workflow

Overview

Technical Components

Data Flow

Possible Activity types to be listed on user feed

Local Development

Start all the services

Local server

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages