kunalpjain/cdc-cache

CDC-Cache

CDC-Cache is a research prototype of result-cache invalidation for distributed OLAP queries, triggered by change data capture (CDC). It connects PostgreSQL logical replication, Debezium, Kafka, Redis, Trino, and a few small Go services to evaluate table-level cache invalidation on TPC-H at scale factor 10 (SF10).

Architecture

The stack contains:

  • PostgreSQL 16 with TPC-H SF10 tables and logical replication enabled
  • Debezium Server 2.6 streaming PostgreSQL WAL events to Kafka
  • Apache Kafka 3.7 in single-node KRaft mode
  • Trino 450 querying PostgreSQL
  • Redis 7 storing result-cache entries, table-version counters, and streams
  • Proxy, a Go HTTP cache-aside query proxy
  • Bridge, a Go Kafka consumer that increments Redis table-version counters
  • Shadow, a Go validator that checks sampled cache hits against PostgreSQL
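
To illustrate what the Shadow validator's check might look like, here is a minimal Python sketch (the real service is written in Go). The function names, the sampling rule, and the order-insensitive comparison are assumptions made for illustration, not the repository's implementation.

```python
import random
from collections import Counter


def should_sample(rng: random.Random, sample_rate: float) -> bool:
    """Decide whether one cache hit is validated (hypothetical sampling rule)."""
    return rng.random() < sample_rate


def results_match(cached_rows, fresh_rows) -> bool:
    """Compare two result sets as multisets, so row order does not matter."""
    return Counter(map(tuple, cached_rows)) == Counter(map(tuple, fresh_rows))
```

A multiset comparison treats reordered rows as equal, which suits aggregate queries without ORDER BY; a validator that must also check ordering would compare the row lists directly.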

Source writes flow through:

PostgreSQL -> Debezium -> Kafka -> Bridge -> Redis version counters

Reads flow through:

client -> Proxy -> Redis cache or Trino -> PostgreSQL
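
One way a cache-aside proxy can use table-version counters is to fold the current versions into the cache key, so a version bump makes old entries unreachable. The Python sketch below shows that scheme with in-memory dictionaries standing in for Redis. The key layout, the explicit `tables` argument, and the `execute` callback are hypothetical; the actual Proxy may check versions differently.

```python
import hashlib

versions = {"orders": 0, "lineitem": 0}  # stand-in for Redis version counters
cache = {}                               # stand-in for the Redis result cache


def cache_key(sql: str, tables: list[str]) -> str:
    """Hash the query text together with the current version of each table it reads."""
    stamp = ",".join(f"{t}={versions.get(t, 0)}" for t in sorted(tables))
    return hashlib.sha256(f"{sql}|{stamp}".encode()).hexdigest()


def run_query(sql: str, tables: list[str], execute):
    """Return (rows, hit). On a miss, call execute(sql) and store the result."""
    key = cache_key(sql, tables)
    if key in cache:
        return cache[key], True
    rows = execute(sql)  # in the real stack this would go to Trino
    cache[key] = rows
    return rows, False
```

With this layout no entry is ever deleted on a write; stale entries simply stop being looked up and can be left to expire.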

Repository Layout

postgres/      TPC-H schema, indexes, data loader, stack verification
debezium/      Debezium Server configuration
trino/         Trino coordinator and PostgreSQL catalog configuration
proxy/         Go HTTP result-cache proxy
bridge/        Go Kafka-to-Redis invalidation bridge
shadow/        Go cache-hit validator
replay/        Trace generation and live replay harness
analysis/      CDC-race labeling, latency joins, statistics, plots
tests/         Integration and unit tests
figures300/    Summary CSVs and multiple-comparison output for 300s sweeps

Requirements

  • Docker Desktop or Docker Engine with Compose
  • Go 1.22+
  • Python 3.12+
  • gcc, make, git, and sed for TPC-H dbgen

The Python dependencies are listed in pyproject.toml.

Quick Start

Start the stack:

make up

Load TPC-H SF10 into PostgreSQL:

make load

Verify the Trino/PostgreSQL stack:

make verify-stack

Build Go services:

make build

Run integration tests from the repository root after the stack is up and SF10 is loaded:

python3 -m pytest tests/integration

Running a Sweep

Build primary-key pools from the loaded database:

python3 replay/pk_pool.py --n 100000

Generate traces:

python3 replay/trace_gen.py --seed 1 --pattern poisson --duration 300 --rate 10
python3 replay/trace_gen.py --seed 1 --pattern mmpp --duration 300 --rate 10
python3 replay/trace_gen.py --seed 1 --pattern zipf --duration 300 --rate 10
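
For intuition about what `--seed`, `--rate`, and `--duration` control in the Poisson case, the sketch below generates seeded Poisson arrival times from exponential inter-arrival gaps. It is a generic illustration and makes no claim about how `trace_gen.py` is implemented; the function name is hypothetical.

```python
import random


def poisson_arrivals(seed: int, rate: float, duration: float) -> list[float]:
    """Arrival times in [0, duration) for a Poisson process with `rate` events per second."""
    rng = random.Random(seed)  # fixed seed makes the trace reproducible
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential gap with mean 1/rate
        if t >= duration:
            return arrivals
        arrivals.append(t)
```

At `rate=10` and `duration=300` this yields about 3000 arrivals, and the same seed always reproduces the same sequence.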

Replay a trace:

python3 replay/replay.py \
  --trace traces/trace_seed1_poisson_300s.parquet \
  --run-id sweep300_seed1_poisson

Aggregate a 5-seed x 3-pattern sweep:

python3 analysis/sweep_analysis.py \
  --runs-dir runs \
  --out-dir figures300 \
  --run-prefix sweep300_seed
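
Before aggregating, it can help to confirm that all 15 runs exist. The sketch below enumerates the expected run IDs using the naming shown in the replay example (`sweep300_seed1_poisson`). It assumes seeds 1 through 5 and one directory per run ID under the runs directory; both are assumptions rather than documented behavior.

```python
from itertools import product
from pathlib import Path

PATTERNS = ("poisson", "mmpp", "zipf")


def expected_run_ids(prefix: str = "sweep300_seed", seeds=range(1, 6)) -> list[str]:
    """All run IDs for a seeds x patterns sweep, e.g. 'sweep300_seed1_poisson'."""
    return [f"{prefix}{seed}_{pattern}" for seed, pattern in product(seeds, PATTERNS)]


def missing_runs(runs_dir: str) -> list[str]:
    """Expected run IDs that have no directory under runs_dir (assumed layout)."""
    root = Path(runs_dir)
    return [run_id for run_id in expected_run_ids() if not (root / run_id).is_dir()]
```

Calling `missing_runs("runs")` from the repository root would list any seed and pattern combinations that still need to be replayed.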

Data and Outputs

Large generated artifacts are intentionally not committed:

  • raw TPC-H .tbl data
  • Docker volumes
  • generated traces
  • raw replay run directories
  • sampled primary-key pools

The committed figures300/summary_table.csv and figures300/multcomp_results.txt summarize the 300-second SF10 sweep used by the paper.

License

This project is licensed under the Apache License 2.0. See LICENSE.
