
KAFKA-10025: guard RocksDBMetricsRecorder value provider reads against store close #22717

Open
Aangbaeck wants to merge 1 commit into apache:trunk from Aangbaeck:KAFKA-10025-rocksdb-metrics-recorder-uaf


@Aangbaeck


What

RocksDBMetricsRecorder reads native RocksDB value providers (RocksDB and Statistics)
in three places:

  • record(): statistics.getAndResetTickerCount(...) / getHistogramData(...)
  • the property gauges (gaugeToComputeSumOfProperties): db.getAggregatedLongProperty(...)
  • the block-cache gauges (gaugeToComputeBlockCacheMetrics): db.getLongProperty(...)

These run with no mutual exclusion against removeValueProviders(...). RocksDBStore.close()
calls removeValueProviders(name) and then closes (frees) the native RocksDB and
Statistics. Because the reads and the removal are not mutually exclusive, a metrics read
that is in flight when a store is closed (during a rebalance / task migration) can
dereference a native handle that close() is concurrently freeing, causing a native
use-after-free / SIGSEGV.

Two observed crash frames, same root cause:

  • record() path → Statistics::getAndResetTickerCount — this is KAFKA-10025 (open since 2020).
  • gauge path → rocksdb::DBImpl::GetAggregatedIntProperty — observed in production under a
    from-zero state rebuild, where warmup/probing rebalances close stores continuously while a
    metrics reporter / JMX scrape evaluates the (INFO-level) RocksDB property gauges.

Note that the gauge metrics are registered at RecordingLevel.INFO, so they are active and
scraped at metrics.recording.level=INFO (the default); only the record() (statistics) path
is gated to DEBUG. The crash is therefore reachable without enabling DEBUG metrics.

Why the current code is unsafe

storeToValueProviders is a ConcurrentHashMap, which makes the map operations
thread-safe, but does not prevent a reader that has already obtained a
DbAndCacheAndStatistics from calling a native method on its db/statistics after (or
while) RocksDBStore.close() frees them. There is no happens-before between "recorder reads
the provider" and "store closes the provider".
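The gap can be illustrated with a minimal, deterministic sketch. Everything here is hypothetical: FakeNativeHandle stands in for the native RocksDB/Statistics objects and throws where a real handle would crash the JVM, and the two threads are replayed as sequential steps in the problematic order. It is not the recorder's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a native handle. Using it after close() is the
// analogue of a use-after-free; here it throws instead of crashing the JVM.
final class FakeNativeHandle {
    private volatile boolean closed = false;

    long readProperty() {
        if (closed) {
            throw new IllegalStateException("use after close");
        }
        return 42L;
    }

    void close() {
        closed = true;
    }
}

final class UnsafeRecorderSketch {
    // Thread-safe map operations, but no protection for the objects it holds.
    private final Map<String, FakeNativeHandle> providers = new ConcurrentHashMap<>();

    // Replays the racy interleaving step by step and reports whether the
    // reader ended up touching a handle that had already been closed.
    static boolean staleReadIsPossible() {
        UnsafeRecorderSketch recorder = new UnsafeRecorderSketch();
        FakeNativeHandle handle = new FakeNativeHandle();
        recorder.providers.put("segment-0", handle);

        // Reader, step 1: obtains the provider from the map.
        FakeNativeHandle seenByReader = recorder.providers.get("segment-0");

        // Closer: removes the provider, then frees the handle.
        recorder.providers.remove("segment-0");
        handle.close();

        // Reader, step 2: calls into the handle it already holds.
        try {
            seenByReader.readProperty();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

The map removal succeeds and is fully thread-safe, yet the reader still reaches the closed handle, because nothing orders its second step before the close.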

Fix

Introduce a single lock (valueProvidersLock) in RocksDBMetricsRecorder and hold it
around every read of the value providers (record(), both gauge lambdas) and around the
map mutations (addValueProviders, removeValueProviders). Since RocksDBStore.close()
already calls removeValueProviders(...) before it frees the native db/statistics,
removeValueProviders(...) acquiring the lock guarantees:

  • any in-flight read completes before the segment is removed and the native handles freed, and
  • any read that starts after removal no longer sees the segment,

so no read can ever dereference a freed handle.
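The locking scheme described above could look roughly like the following sketch. It is not the patch itself: the class name, the generic handle type H, the sumOver helper, and the use of a plain monitor object for valueProvidersLock are assumptions made for illustration (the PR only states that a single lock is held around reads and map mutations). A plain HashMap is used because every access is already under the lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch of "one lock around every read and every mutation".
final class GuardedProvidersSketch<H> {
    // Assumed to be a plain monitor; the real patch may use another lock type.
    private final Object valueProvidersLock = new Object();
    private final Map<String, H> providers = new HashMap<>();

    void addValueProviders(String name, H handle) {
        synchronized (valueProvidersLock) {
            providers.put(name, handle);
        }
    }

    // Blocks until any in-flight read has left its synchronized block, so the
    // caller may free the handle as soon as this method returns.
    void removeValueProviders(String name) {
        synchronized (valueProvidersLock) {
            providers.remove(name);
        }
    }

    // Stand-in for record() and the gauge lambdas: the handles are only
    // touched while the lock is held.
    long sumOver(ToLongFunction<H> read) {
        synchronized (valueProvidersLock) {
            long sum = 0L;
            for (H handle : providers.values()) {
                sum += read.applyAsLong(handle);
            }
            return sum;
        }
    }
}
```

The safety argument relies on the caller's ordering as much as on the lock: the handle must be freed only after removeValueProviders(...) has returned, which is the order the description attributes to RocksDBStore.close().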

No lock-ordering risk: RocksDBMetricsRecordingTrigger holds no lock while calling
record(), and the guarded reads never call back into RocksDBStore. RocksDBStore.close()
takes the store monitor then this lock; opens take the same order — consistent, no cycle.

Testing

  • New test RocksDBMetricsRecorderGaugesTest#shouldNotRemoveValueProvidersWhileGaugeIsReadingThem:
    blocks a gauge evaluation inside getAggregatedLongProperty (holding the lock) and asserts
    removeValueProviders(...) cannot return until the read completes — i.e. the use-after-free
    window is closed. Verified it fails without the fix (AssertionError: …the use-after-free window is open) and passes with it.
  • Existing RocksDBMetricsRecorderTest / RocksDBMetricsRecorderGaugesTest, checkstyle and
    spotbugs all pass.
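The shape of the new test can be approximated with latches. This is a hedged sketch rather than the test in the PR: the class and method names are hypothetical, a bare monitor replaces the recorder, and the 200 ms window is an arbitrary choice for "removal has not returned yet".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical harness: parks a "read" inside the lock and checks that a
// concurrent "removal" cannot complete until the read is released.
final class RemovalBlocksOnReadSketch {
    static boolean removalWaitsForInFlightRead() {
        final Object lock = new Object();
        final CountDownLatch readStarted = new CountDownLatch(1);
        final CountDownLatch releaseRead = new CountDownLatch(1);
        final CountDownLatch removalDone = new CountDownLatch(1);

        Thread reader = new Thread(() -> {
            synchronized (lock) {
                readStarted.countDown();
                try {
                    releaseRead.await(); // simulates a slow native property read
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        Thread remover = new Thread(() -> {
            synchronized (lock) {
                // the map removal would happen here
            }
            removalDone.countDown();
        });

        try {
            reader.start();
            readStarted.await();
            remover.start();
            // While the read is parked inside the lock, removal must not finish.
            boolean returnedEarly = removalDone.await(200, TimeUnit.MILLISECONDS);
            releaseRead.countDown();
            boolean returnedAfterRelease = removalDone.await(5, TimeUnit.SECONDS);
            reader.join();
            remover.join();
            return !returnedEarly && returnedAfterRelease;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Deleting the synchronized block in the remover lets removalDone fire immediately, which is the failure mode the description reports when the test runs without the fix.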

End-to-end confirmation (outside this PR, in a standalone Docker harness): a real Kafka
Streams app on the released kafka-streams 8.2.1-ce, at metrics.recording.level=INFO, with
a JMX-style metrics scrape (reading the INFO-level RocksDB property gauges) plus forced
rebalances, crashes with a SIGSEGV in rocksdb::DBImpl::GetAggregatedIntProperty+0x83 on a
scrape thread after ~27M gauge reads. Running the same binary with only
RocksDBMetricsRecorder replaced by the patched class (classpath shadow, load-verified)
survived 600s / ~336M gauge reads / 25 rebalances with zero crashes. The exact native crash
was also reproduced in a pure rocksdbjni harness by racing getAggregatedLongProperty
against a concurrent DB close+reopen.

…t store close

RocksDBMetricsRecorder reads native RocksDB value providers (RocksDB and
Statistics) in record() (getAndResetTickerCount / getHistogramData), in the
property gauges (RocksDB.getAggregatedLongProperty) and in the block-cache
gauges (RocksDB.getLongProperty). These reads had no mutual exclusion against
removeValueProviders().

RocksDBStore.close() calls removeValueProviders() and then closes (frees) the
native RocksDB and Statistics. Because the reads and the removal were not
mutually exclusive, a metrics read that is in flight when a store is closed
(e.g. during a rebalance / task migration) can dereference a native handle that
close() is concurrently freeing, causing a native use-after-free / SIGSEGV.
storeToValueProviders being a ConcurrentHashMap only makes the map operations
safe; it does not prevent a reader that already holds a DbAndCacheAndStatistics
from calling into a db/statistics that close() frees.

Two observed crash frames, same root cause:
 - record() path  -> Statistics::getAndResetTickerCount (this ticket)
 - gauge path     -> rocksdb::DBImpl::GetAggregatedIntProperty (property gauges,
   registered at RecordingLevel.INFO, so reachable via a metrics reporter/JMX
   scrape even at metrics.recording.level=INFO)

Fix: hold a single lock around every read of the value providers (record() and
both gauge lambdas) and around the map mutations (addValueProviders /
removeValueProviders). Since RocksDBStore.close() calls removeValueProviders()
before freeing the native handles, acquiring the lock there waits for any
in-flight read to finish and prevents any later read from seeing the segment,
so no read can dereference a freed handle. The recording trigger holds no lock
while calling record(), and the guarded reads never call back into
RocksDBStore, so no lock-ordering cycle is introduced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions bot added the triage (PRs from the community) and streams labels on Jul 1, 2026