Feature request or enhancement
Which use case/requirement will be addressed by the proposed feature?
Add an optional, pluggable cache to the ReselectColumnsPostProcessor so that re-selected column
values can be reused across change events instead of issuing a database query for every event.
Background
The reselect post-processor re-queries column values when an event carries the unavailable value
placeholder (e.g. unchanged PostgreSQL TOAST columns, Oracle LOBs) or a null the user wants
refreshed. Today every qualifying event triggers a SELECT against the source database.
For workloads with large out-of-line columns that change rarely but sit on frequently-updated rows,
this produces a high volume of repeated, identical re-selection queries for the same (table, key, column) tuple. On managed databases (e.g. Aurora endpoints) this adds measurable load and latency to
the streaming pipeline, even though the re-selected value has not changed between events.
Proposed behaviour
- Cache exposed behind a strategy interface (ReselectColumnCache) so alternative backends (e.g.
an embedded key/value store such as RocksDB, or a user-custom implementation) can be plugged in
without changing the post-processor. A default in-memory implementation is provided and selected via
reselect.cache.type.
- The cache is organised hierarchically as table -> row -> column. The post-processor resolves a
row-scoped handle once per event (forRow(tableId, keyValues)), so the table and row-key identity
are serialized a single time and reused for every column of that row, which matters for tables with
multiple LOB/TOAST columns.
- On a qualifying event, look up each required column in the cache; only the misses are queried,
in a single SELECT. Hits are served from the cache.
- Invalidate-on-modify: whenever an event carries a real (non-placeholder) value for a column,
its cache entry is evicted. This guarantees a later placeholder event for the same row never serves
a value that has since changed — so cache correctness does not depend on the TTL.
- The default in-memory cache is bounded by size (LRU on the number of cached rows, via the existing
io.debezium.util.BoundedConcurrentHashMap) and by a per-value TTL that only bounds memory retention
for cold rows.
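The rules above can be sketched in a few lines of Java. This is only an illustration of the intended call pattern, not the proposed implementation: the method signatures on ReselectColumnCache and RowCache are assumed, SimpleReselectColumnCache is a hypothetical unbounded stand-in (no LRU, no TTL), UNAVAILABLE is a stand-in sentinel for the unavailable value placeholder, and selectMissing stands in for the single SELECT. Refreshing null values is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical strategy interface; names follow the proposal, signatures are assumed. */
interface ReselectColumnCache {
    RowCache forRow(String tableId, List<Object> keyValues);

    /** Row-scoped handle, so the row identity is resolved once per event. */
    interface RowCache {
        boolean contains(String column);
        Object get(String column);
        void put(String column, Object value);
        void invalidate(String column);
    }
}

/** Unbounded in-memory stand-in, used only to show the call pattern. */
class SimpleReselectColumnCache implements ReselectColumnCache {
    // (table, key values) -> (column -> value); List keys compare by content
    private final Map<List<Object>, Map<String, Object>> rows = new HashMap<>();

    @Override
    public RowCache forRow(String tableId, List<Object> keyValues) {
        Map<String, Object> columns =
                rows.computeIfAbsent(Arrays.asList(tableId, keyValues), k -> new HashMap<>());
        return new RowCache() {
            public boolean contains(String column) { return columns.containsKey(column); }
            public Object get(String column) { return columns.get(column); }
            public void put(String column, Object value) { columns.put(column, value); }
            public void invalidate(String column) { columns.remove(column); }
        };
    }
}

class ReselectFlow {
    /** Stand-in for the unavailable value placeholder carried by an event. */
    static final Object UNAVAILABLE = new Object();

    /** Applies the lookup, batch-miss and invalidate-on-modify rules to one event. */
    static Map<String, Object> process(ReselectColumnCache cache, String tableId, List<Object> key,
                                       Map<String, Object> after,
                                       Function<List<String>, Map<String, Object>> selectMissing) {
        ReselectColumnCache.RowCache row = cache.forRow(tableId, key); // once per event
        Map<String, Object> result = new LinkedHashMap<>(after);
        List<String> misses = new ArrayList<>();
        for (Map.Entry<String, Object> e : after.entrySet()) {
            if (e.getValue() != UNAVAILABLE) {
                row.invalidate(e.getKey()); // real value: invalidate-on-modify
            } else if (row.contains(e.getKey())) {
                result.put(e.getKey(), row.get(e.getKey())); // hit: served from the cache
            } else {
                misses.add(e.getKey()); // miss: queried below
            }
        }
        if (!misses.isEmpty()) {
            Map<String, Object> fetched = selectMissing.apply(misses); // one query for all misses
            for (String column : misses) {
                row.put(column, fetched.get(column));
                result.put(column, fetched.get(column));
            }
        }
        return result;
    }
}
```

With this shape, two consecutive placeholder events for the same row trigger one query, and an intervening event that carries a real value forces the next placeholder event back to the database.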
Configuration (all optional, cache disabled by default)
| Property | Default | Description |
|---|---|---|
| reselect.cache.enabled | false | Enable the re-selection cache. |
| reselect.cache.type | io.debezium.processors.reselect.cache.MemoryReselectColumnCache | The ReselectColumnCache implementation class. |
| reselect.cache.max.size | 10000 | Maximum number of cached rows (LRU eviction). Default in-memory cache only. |
| reselect.cache.ttl.ms | 600000 | Entry time-to-live in milliseconds. Default in-memory cache only. |
When reselect.cache.enabled=false (the default) the post-processor behaves exactly as before — no
behavioural change for existing users.
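For illustration, this is roughly how the options might look when a connector is configured programmatically. It is a sketch under assumptions: "reselector" is just an example processor name, and the proposed cache options are assumed to sit under the same processor prefix as the existing reselect.* options. The max.size and ttl.ms values are arbitrary examples, not recommendations.

```java
import java.util.Properties;

class ReselectCacheConfigExample {
    static Properties cacheConfig() {
        Properties props = new Properties();
        // Existing post-processor wiring; "reselector" is an arbitrary processor name.
        props.setProperty("post.processors", "reselector");
        props.setProperty("reselector.type",
                "io.debezium.processors.reselect.ReselectColumnsPostProcessor");
        // Proposed options (assumed to share the processor prefix).
        props.setProperty("reselector.reselect.cache.enabled", "true");
        props.setProperty("reselector.reselect.cache.max.size", "50000");
        props.setProperty("reselector.reselect.cache.ttl.ms", "300000");
        return props;
    }
}
```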
Why this is safe
Re-selection normally exists to fetch the latest committed state, so naive caching would risk
emitting stale data. The invalidate-on-modify rule closes that gap: any event that actually changes a
column refreshes/evicts its cached value before the next placeholder event can read it. The TTL is
therefore a memory bound, not the correctness mechanism, which is why a relatively long default
(10 minutes) is acceptable.
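To make the division of labour concrete, here is a minimal sketch of a size- and TTL-bounded map in which explicit invalidation provides correctness while LRU and TTL only limit memory. It deliberately uses java.util.LinkedHashMap in access order as a stand-in for io.debezium.util.BoundedConcurrentHashMap, is not thread-safe, and the class name BoundedTtlCache and the injected clock are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** LRU- and TTL-bounded map; LinkedHashMap stands in for BoundedConcurrentHashMap. */
class BoundedTtlCache<K, V> {
    private static final class Timed<T> {
        final T value;
        final long expiresAtMs;

        Timed(T value, long expiresAtMs) {
            this.value = value;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final long ttlMs;
    private final LongSupplier clock;
    private final LinkedHashMap<K, Timed<V>> map;

    BoundedTtlCache(int maxSize, long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
        // accessOrder=true keeps the least-recently-used entry first.
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxSize; // size bound: drop the LRU entry
            }
        };
    }

    void put(K key, V value) {
        map.put(key, new Timed<>(value, clock.getAsLong() + ttlMs));
    }

    /** Returns the cached value, or null if absent or expired. */
    V get(K key) {
        Timed<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMs) {
            map.remove(key); // TTL only reclaims memory for cold entries
            return null;
        }
        return entry.value;
    }

    /** Correctness path: called when an event carries a real value. */
    void invalidate(K key) {
        map.remove(key);
    }

    int size() {
        return map.size();
    }
}
```

A production clock would be System::currentTimeMillis; passing it in as a LongSupplier just makes expiry easy to test.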
Scope / non-goals
- No new third-party dependency: the default cache reuses io.debezium.util.BoundedConcurrentHashMap,
which is already used across the connector-common module.
- No distributed/shared cache by default — the in-memory implementation is per-task and in-process.
Shared/persistent backends can be added as separate ReselectColumnCache strategies.
Implementation ideas (optional)
- Strategy interface io.debezium.processors.reselect.cache.ReselectColumnCache with a row-scoped
RowCache handle, plus a default MemoryReselectColumnCache, in debezium-connector-common.
- The in-memory implementation uses a hierarchical row -> (column -> value) structure so the row
identity is serialized once per event and shared across all of that row's columns.
- Cached values are the raw JDBC values; the existing getConvertedValue(...) conversion is
applied uniformly to both cache hits and freshly queried values, so caching does not interfere with
value conversion.
- byte[] key values are rendered by content (not identity) so binary primary keys produce stable,
collision-free cache keys.
- Happy to adjust naming, defaults, or split the change as the maintainers prefer.
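As a small illustration of the binary-key point above, this is one way key values could be rendered by content. It is a sketch only: the RowKeys class, the hex encoding and the length-prefix scheme are assumptions made for the example, not necessarily what the implementation does.

```java
import java.util.List;

class RowKeys {
    /** Renders a row identity; byte[] is hex-encoded by content, not printed by identity. */
    static String render(String tableId, List<Object> keyValues) {
        StringBuilder sb = new StringBuilder(tableId);
        for (Object value : keyValues) {
            sb.append('|');
            if (value instanceof byte[]) {
                sb.append("0x");
                for (byte b : (byte[]) value) {
                    sb.append(String.format("%02x", b));
                }
            } else {
                String text = String.valueOf(value);
                // Length prefix keeps values containing '|' from colliding with other keys.
                sb.append(text.length()).append(':').append(text);
            }
        }
        return sb.toString();
    }
}
```

Two distinct byte[] instances holding {0x01, 0xff} both render as "t|0x01ff" for table "t", whereas their default toString() output would differ per instance.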