Lock-free fast path for buffer Pin/Unpin under concurrent readers#59
Open
krleonid wants to merge 2 commits into
Open
Lock-free fast path for buffer Pin/Unpin under concurrent readers#59krleonid wants to merge 2 commits into
krleonid wants to merge 2 commits into
Conversation
When a block is already loaded and has active readers (readers > 0), the current code still acquires the per-block mutex for both Pin and Unpin operations. Under concurrent workloads where multiple connections scan the same wide table, this mutex becomes a severe bottleneck — all threads serialize on the same lock for every segment access. This change adds an optimistic fast path: Pin: If state == BLOCK_LOADED and readers > 0, atomically increment readers via compare-and-swap without acquiring the mutex. This is safe because readers > 0 guarantees no concurrent unload (CanUnload checks readers > 0 before allowing eviction). Unpin: Atomically decrement readers. If the result is > 0, return immediately without the mutex — no state transition can happen. Only when readers hits 0 do we fall back to the mutex path to handle eviction queue insertion or unloading. Benchmark results (SELECT * FROM 250-column table WHERE itemId IN <1000 ids>, 95K rows, 9284 segments, all data in buffer pool): Single connection: ~95ms → ~92ms (no regression) 10 concurrent: 562ms → 193ms (66% faster) 20 concurrent: 1177ms → 497ms (58% faster) The improvement scales with concurrency because the fast path eliminates mutex contention entirely for the common case (hot blocks with multiple concurrent readers). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetDataSize() is called after op.End() in EndOperator, so its cost is not captured in CPU_TIME. For wide tables with deeply nested types this can be significant. Log a warning to detect when it contributes to the gap between LATENCY and CPU_TIME. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When multiple connections concurrently scan the same table (common in our workload), every segment access requires Pin/Unpin of its block handle. The current implementation acquires a per-block mutex for both operations, even when the block is already loaded and has active readers — a state where no transition can occur.
For wide tables (250 columns, ~9K segments), this creates severe mutex contention. A query that takes 95ms single-threaded degrades to 1.2s with 20 concurrent connections, all serializing on the same block locks.
Production evidence (from
mdb-engine-shadowlogs):CPU_TIME: 0.227svsLATENCY: 1.216s— ~1s unaccounted gapBLOCKED_THREAD_TIME: 0,TOTAL_BYTES_READ: 0— all in memory, no I/O waitSolution
Add an optimistic lock-free fast path using atomic operations:
Pin fast path: If
state == BLOCK_LOADEDandreaders > 0, atomically increment readers via CAS. Safe becausereaders > 0prevents any concurrent unload (CanUnload()checks this).Unpin fast path: Atomically decrement readers. If result > 0, return immediately — no state transition possible. Only fall back to mutex when readers hits 0 (eviction queue / unload logic needed).
Changes
block_handle.hpp: AddTryIncrementReadersIfPositive()(CAS loop) andDecrementReadersAtomic()(fetch_sub)standard_buffer_manager.cpp: Add fast path before mutex acquisition inPin()andUnpin()Benchmark Results
Test:
SELECT * FROM 250-column table WHERE itemId IN (1000 ids)— 95K rows, 9284 segments, all data in buffer pool (128GB memory, no eviction).The optimization is zero-cost at low concurrency and eliminates the scalability cliff at high concurrency.
Safety Argument
The fast path is safe due to the following invariant:
readers > 0→ block cannot be unloaded (enforced byCanUnload()). Incrementing from positive is always safe — the block remains loaded throughout.acquireon successful CAS (Pin) ensures subsequent reads of the buffer pointer see the loaded data.releaseon fetch_sub (Unpin) ensures all writes to the buffer are visible before the reader count decreases.Test plan
test/sql/storage/tests)🤖 Generated with Claude Code