CDC-Cache is a research prototype for CDC-triggered result-cache invalidation for distributed OLAP queries. It connects PostgreSQL logical replication, Debezium, Kafka, Redis, Trino, and small Go services to evaluate table-level cache invalidation on TPC-H SF10.
The stack contains:
- PostgreSQL 16 with TPC-H SF10 tables and logical replication enabled
- Debezium Server 2.6 streaming PostgreSQL WAL events to Kafka
- Apache Kafka 3.7 in single-node KRaft mode
- Trino 450 querying PostgreSQL
- Redis 7 storing result-cache entries, table-version counters, and streams
- Proxy, a Go HTTP cache-aside query proxy
- Bridge, a Go Kafka consumer that increments Redis table-version counters
- Shadow, a Go validator that checks sampled cache hits against PostgreSQL
Source writes flow through:
PostgreSQL -> Debezium -> Kafka -> Bridge -> Redis version counters
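The last hop of the write path can be illustrated with a small sketch. It is written in Python for brevity (the actual Bridge is a Go service) and uses a dictionary in place of Redis. It assumes Debezium's default `<prefix>.<schema>.<table>` topic naming; the `tpch` prefix and the `ver:<table>` key layout are hypothetical and not taken from this repository.

```python
from collections import defaultdict

# Stand-in for Redis; a real bridge would issue INCR against Redis instead.
table_versions = defaultdict(int)

def table_from_topic(topic: str) -> str:
    """Extract the table name from a '<prefix>.<schema>.<table>' topic."""
    parts = topic.split(".")
    if len(parts) < 3:
        raise ValueError(f"unexpected topic name: {topic!r}")
    return parts[-1]

def on_change_event(topic: str) -> int:
    """Bump the version counter of the table behind one change event."""
    key = f"ver:{table_from_topic(topic)}"  # hypothetical key layout
    table_versions[key] += 1
    return table_versions[key]

# Two changes to lineitem and one to orders.
on_change_event("tpch.public.lineitem")
on_change_event("tpch.public.lineitem")
on_change_event("tpch.public.orders")
print(dict(table_versions))  # {'ver:lineitem': 2, 'ver:orders': 1}
```

Because the counter is per table, any committed change to a table affects every cached result that reads from it, which is what table-level invalidation means here.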
Reads flow through:
client -> Proxy -> Redis cache or Trino -> PostgreSQL
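The read path can be sketched in the same spirit. The example below shows one possible cache-aside scheme, in which the cache key embeds the current version of every table a query touches, so a version bump makes older entries unreachable. Whether Proxy embeds versions in the key or compares them on lookup is not stated here, so treat this as an assumption; all names are hypothetical and dictionaries stand in for Redis and Trino.

```python
import hashlib

versions = {"lineitem": 0, "orders": 0}  # stand-in for Redis table-version counters
cache = {}                               # stand-in for Redis result-cache entries
backend_calls = []                       # queries that actually reached the backend

def run_on_backend(sql):
    """Placeholder for executing the query through Trino."""
    backend_calls.append(sql)
    return [("result of", sql)]

def cache_key(sql, tables):
    """Hash the SQL text together with the current version of each table."""
    stamp = ",".join(f"{t}@{versions[t]}" for t in sorted(tables))
    return hashlib.sha256(f"{sql}|{stamp}".encode()).hexdigest()

def query(sql, tables):
    """Cache-aside lookup: serve from the cache, else run and store."""
    key = cache_key(sql, tables)
    if key in cache:
        return cache[key], "hit"
    rows = run_on_backend(sql)
    cache[key] = rows
    return rows, "miss"

sql = "SELECT count(*) FROM lineitem"
first = query(sql, ["lineitem"])[1]
second = query(sql, ["lineitem"])[1]
versions["lineitem"] += 1  # what the bridge would do after a CDC event
third = query(sql, ["lineitem"])[1]
print(first, second, third)  # miss hit miss
```

In this scheme stale entries are never deleted explicitly; they simply stop matching any key and would rely on expiry or eviction to be reclaimed.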
Repository layout:
postgres/     TPC-H schema, indexes, data loader, stack verification
debezium/     Debezium Server configuration
trino/        Trino coordinator and PostgreSQL catalog configuration
proxy/        Go HTTP result-cache proxy
bridge/       Go Kafka-to-Redis invalidation bridge
shadow/       Go cache-hit validator
replay/       Trace generation and live replay harness
analysis/     CDC-race labeling, latency joins, statistics, plots
tests/        Integration and unit tests
figures300/   Summary CSVs and multiple-comparison output for 300s sweeps
Requirements:
- Docker Desktop or Docker Engine with Compose
- Go 1.22+
- Python 3.12+
- gcc, make, git, and sed for TPC-H dbgen
The Python dependencies are listed in pyproject.toml.
Start the stack:

    make up

Load TPC-H SF10 into PostgreSQL:

    make load

Verify the Trino/PostgreSQL stack:

    make verify-stack

Build Go services:

    make build

Run integration tests from the repository root after the stack is up and SF10 is loaded:

    python3 -m pytest tests/integration

Build primary-key pools from the loaded database:

    python3 replay/pk_pool.py --n 100000

Generate traces:

    python3 replay/trace_gen.py --seed 1 --pattern poisson --duration 300 --rate 10
    python3 replay/trace_gen.py --seed 1 --pattern mmpp --duration 300 --rate 10
    python3 replay/trace_gen.py --seed 1 --pattern zipf --duration 300 --rate 10

Replay a trace:

    python3 replay/replay.py \
      --trace traces/trace_seed1_poisson_300s.parquet \
      --run-id sweep300_seed1_poisson

Aggregate a 5-seed x 3-pattern sweep:

    python3 analysis/sweep_analysis.py \
      --runs-dir runs \
      --out-dir figures300 \
      --run-prefix sweep300_seed

Large generated artifacts are intentionally not committed:
- raw TPC-H .tbl data
- Docker volumes
- generated traces
- raw replay run directories
- sampled primary-key pools
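Since generated traces are not committed, they have to be regenerated from a seed. As a rough illustration of why a fixed seed is enough to reproduce a trace, the sketch below draws Poisson arrival times from exponential inter-arrival gaps. It assumes that `--rate` means events per second and `--duration` means seconds; it is not the logic of replay/trace_gen.py, and the function name is hypothetical.

```python
import random

def poisson_arrivals(seed, rate, duration):
    """Return sorted arrival times in (0, duration) for a Poisson process."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the output
    t = 0.0
    times = []
    while True:
        t += rng.expovariate(rate)  # exponential gap with mean 1 / rate
        if t >= duration:
            return times
        times.append(t)

a = poisson_arrivals(seed=1, rate=10, duration=300)
b = poisson_arrivals(seed=1, rate=10, duration=300)
print(len(a), a == b)  # event count is close to rate * duration; same seed, same trace
```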
The committed figures300/summary_table.csv and
figures300/multcomp_results.txt summarize the 300-second SF10 sweep used by
the paper.
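A 5-seed x 3-pattern sweep corresponds to 15 trace and replay pairs. The sketch below enumerates the commands for such a sweep. It assumes seeds 1 through 5 and extrapolates the trace file and run-id naming from the single example above; both are assumptions, and the helper name is hypothetical.

```python
from itertools import product

SEEDS = range(1, 6)                     # assumption: seeds 1 through 5
PATTERNS = ["poisson", "mmpp", "zipf"]  # patterns shown in the trace commands

def sweep_commands(duration=300, rate=10):
    """Build the trace-generation and replay command for every (seed, pattern)."""
    cmds = []
    for seed, pattern in product(SEEDS, PATTERNS):
        # Naming inferred from the one documented example; verify before use.
        trace = f"traces/trace_seed{seed}_{pattern}_{duration}s.parquet"
        run_id = f"sweep{duration}_seed{seed}_{pattern}"
        cmds.append(
            f"python3 replay/trace_gen.py --seed {seed} --pattern {pattern} "
            f"--duration {duration} --rate {rate}"
        )
        cmds.append(f"python3 replay/replay.py --trace {trace} --run-id {run_id}")
    return cmds

commands = sweep_commands()
print(len(commands))  # 30
print(commands[1])
```

The printed commands can be reviewed and then run one at a time, since each replay needs the stack up and SF10 loaded.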
This project is licensed under the Apache License 2.0. See LICENSE.