This project implements a scalable user feed system that processes user activities in real-time while ensuring at least-once semantics and high availability. The system handles user interactions like follows, posts, comments and likes through a multi-stage data pipeline.
This project implements a real-time data pipeline using Debezium for Change Data Capture (CDC) from PostgreSQL. At its core, Debezium monitors PostgreSQL's Write Ahead Log (WAL) for any data modifications, which requires configuring PostgreSQL with logical WAL level. When changes occur in the source PostgreSQL tables, Debezium captures these changes from the WAL and streams them to dedicated Kafka topics. A Kafka consumer service continuously polls these topics for new messages, processing them using a SchemaAdapterStrategyFactory that leverages Factory and Strategy patterns to transform the data into Cassandra's schema format. The transformed data is then batch inserted into Cassandra, chosen specifically for its distributed architecture and high write throughput capabilities. Cassandra's Log-Structured Merge (LSM) tree data structure optimizes write performance by first storing data in memory before flushing to disk, making it ideal for data ingestion scenarios.
- Source System (PostgreSQL)
- Primary database storing user activities and interactions
- Configured with logical replication (wal_level=logical) to enable Change Data Capture
- Handles transactional writes for user activities through REST API endpoints
- Tables are configured with primary keys and appropriate indexes
- Change Data Capture (Debezium)
- Monitors PostgreSQL Write-Ahead Logs (WAL) for DML changes
- Ensures reliable capture of all data modifications
- Publishes changes to dedicated Kafka topics
- Message Queue (Apache Kafka)
- Provides durable storage of change events
- Maintains strict ordering within partitions
- Enables parallel processing through multiple partitions
- Topics configured with appropriate retention and replication
- Polling System (Kafka Consumer)
- Polls kafka topic for new message.
- When we receive new message/event, it modifies the data according to the reqd. format.
- Saves the data in cassandra DB.
- Cache Layer (Redis)
- In-memory caching of frequently accessed feeds
- Provides sub-millisecond read latency for hot data
- Falls back to Cassandra for cache misses
- Sink System (Apache Cassandra)
- Log-Structured Merge (LSM) tree based storage
- Optimized for high-throughput writes
- Partitioned by user_id for efficient reads
- Stores feed items in time-sorted order
- Supports efficient pagination queries
- User performs activity (follow/post/comment/like) via REST API
- Activity is recorded in PostgreSQL tables
- Debezium captures change events from PostgreSQL WAL
- Events are published to corresponding Kafka topics
- Kafka consumer polls the kafka topic for new event messages and if there is one, it modifies and ingests data to Cassandra DB.
- Redis caches frequently accessed feed segments
- Feed API serves requests from cache with Cassandra fallback
- Follow User ("FOLLOWERS")
- User Create New Post ("SHARDS")
- User Comment on Post ("COMMENTS")
- User Likes a post ("LIKES")
#!/bin/bash
docker-compose down -v
docker-compose pull
docker-compose up -d
docker-compose ps - Create new virtual environment
python3 -m venv venv- Activate that environment
source venv/bin/activate- Run the development server
uvicorn main:app --reload- Test the setup using a shell script
#!/bin/bash
chmod +x test_setup.sh
./test-setup.sh