Lock-free fast path for buffer Pin/Unpin under concurrent readers by krleonid · Pull Request #59 · krleonid/duckdb

krleonid · 2026-05-26T05:56:28Z

Problem

When multiple connections concurrently scan the same table (common in our workload), every segment access requires Pin/Unpin of its block handle. The current implementation acquires a per-block mutex for both operations, even when the block is already loaded and has active readers — a state where no transition can occur.

For wide tables (250 columns, ~9K segments), this creates severe mutex contention. A query that takes 95ms single-threaded degrades to 1.2s with 20 concurrent connections, all serializing on the same block locks.

Production evidence (from mdb-engine-shadow logs):

CPU_TIME: 0.227s vs LATENCY: 1.216s — ~1s unaccounted gap
BLOCKED_THREAD_TIME: 0, TOTAL_BYTES_READ: 0 — all in memory, no I/O wait
The gap comes from threads stalling on mutex acquisition under cgroup CPU scheduling

Solution

Add an optimistic lock-free fast path using atomic operations:

Pin fast path: If state == BLOCK_LOADED and readers > 0, atomically increment readers via CAS. Safe because readers > 0 prevents any concurrent unload (CanUnload() checks this).

Unpin fast path: Atomically decrement readers. If result > 0, return immediately — no state transition possible. Only fall back to mutex when readers hits 0 (eviction queue / unload logic needed).
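The two atomic primitives described above could look roughly like the following. This is a minimal sketch, assuming the reader count is a `std::atomic<int32_t>`; the `ReaderCount` struct is a hypothetical stand-in for the block handle, and the method bodies are an illustration rather than the code from this PR.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the reader counter that lives on a block handle.
struct ReaderCount {
    std::atomic<int32_t> readers{0};

    // Increment readers only if it is currently positive.
    // Returns true if the increment happened, false if readers was <= 0.
    bool TryIncrementReadersIfPositive() {
        int32_t current = readers.load(std::memory_order_relaxed);
        while (current > 0) {
            // On failure, compare_exchange_weak reloads `current`,
            // so the loop condition re-checks that it is still positive.
            if (readers.compare_exchange_weak(current, current + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // Decrement readers and return the value after the decrement.
    int32_t DecrementReadersAtomic() {
        return readers.fetch_sub(1, std::memory_order_release) - 1;
    }
};
```

A caller would fall back to the mutex path whenever `TryIncrementReadersIfPositive()` returns false or `DecrementReadersAtomic()` returns 0.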

Changes

block_handle.hpp: Add TryIncrementReadersIfPositive() (CAS loop) and DecrementReadersAtomic() (fetch_sub)
standard_buffer_manager.cpp: Add fast path before mutex acquisition in Pin() and Unpin()
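To illustrate where the fast path sits relative to the mutex, here is a sketch on a toy handle. `ToyHandle`, `ToyState`, `ToyPin`, and `ToyUnpin` are hypothetical names invented for this example; the real `Pin()` and `Unpin()` in standard_buffer_manager.cpp do considerably more (loading, eviction queue handling), which is reduced to comments here. The re-check of `readers` under the lock in `ToyUnpin` is an assumption about how the slow path would guard against a pin that arrives between the decrement and the lock acquisition.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

enum class ToyState : uint8_t { UNLOADED, LOADED };

// Hypothetical, heavily simplified block handle.
struct ToyHandle {
    std::mutex lock;
    std::atomic<ToyState> state{ToyState::UNLOADED};
    std::atomic<int32_t> readers{0};
    int slow_pins = 0;   // guarded by lock; counts pins that took the mutex
    int slow_unpins = 0; // guarded by lock; counts unpins that took the mutex
};

void ToyPin(ToyHandle &h) {
    // Fast path: block is loaded and already has at least one reader.
    if (h.state.load(std::memory_order_acquire) == ToyState::LOADED) {
        int32_t cur = h.readers.load(std::memory_order_relaxed);
        while (cur > 0) {
            if (h.readers.compare_exchange_weak(cur, cur + 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                return; // pinned without touching the mutex
            }
        }
    }
    // Slow path: take the mutex, load the block if needed, then pin.
    std::lock_guard<std::mutex> guard(h.lock);
    if (h.state.load(std::memory_order_relaxed) != ToyState::LOADED) {
        // A real buffer manager would read the block from storage here.
        h.state.store(ToyState::LOADED, std::memory_order_release);
    }
    h.readers.fetch_add(1, std::memory_order_relaxed);
    ++h.slow_pins;
}

void ToyUnpin(ToyHandle &h) {
    // Fast path: other readers remain, so no state transition is possible.
    if (h.readers.fetch_sub(1, std::memory_order_release) - 1 > 0) {
        return;
    }
    // Slow path: the count reached 0.
    std::lock_guard<std::mutex> guard(h.lock);
    ++h.slow_unpins;
    // Re-check under the lock: another thread may have pinned in between.
    if (h.readers.load(std::memory_order_acquire) == 0) {
        // Eviction queue insertion or unloading would happen here.
    }
}
```

In this sketch only the first pin and the last unpin of a hot block touch the mutex; every pin and unpin in between stays on the atomic counter.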

Benchmark Results

Test: SELECT * FROM 250-column table WHERE itemId IN (1000 ids) — 95K rows, 9284 segments, all data in buffer pool (128GB memory, no eviction).

Connections	Before (avg)	After (avg)	Improvement
1	95ms	92ms	~same
3	138ms	139ms	~same
10	562ms	193ms	66% faster
20	1177ms	497ms	58% faster

The optimization shows no measurable cost at low concurrency (1 and 3 connections) and cuts average latency by 58-66% at 10 and 20 connections, although latency still grows with the number of connections.
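The percentages in the table appear to be relative reductions in average latency. A small check under that assumption, using the before and after values from the table (the helper names are hypothetical):

```cpp
#include <cmath>

// Percentage by which average latency dropped from before_ms to after_ms.
inline double LatencyReductionPercent(double before_ms, double after_ms) {
    return (before_ms - after_ms) / before_ms * 100.0;
}

// How many times faster the "after" run is compared to the "before" run.
inline double SpeedupFactor(double before_ms, double after_ms) {
    return before_ms / after_ms;
}
```

With the table's numbers, `LatencyReductionPercent(562, 193)` rounds to 66 and `LatencyReductionPercent(1177, 497)` rounds to 58, matching the Improvement column; expressed as speedup factors these are about 2.9x and 2.4x.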

Safety Argument

The fast path is safe because of the following invariants:

Pin: readers > 0 → block cannot be unloaded (enforced by CanUnload()). Incrementing from positive is always safe — the block remains loaded throughout.
Unpin: Decrementing from N>1 to N-1>0 → no state transition occurs. Only the transition to 0 triggers eviction queue / unload logic, which uses the mutex path.
Memory ordering: acquire on successful CAS (Pin) ensures subsequent reads of the buffer pointer see the loaded data. release on fetch_sub (Unpin) ensures all writes to the buffer are visible before the reader count decreases.

Test plan

Benchmark: single connection — no regression
Benchmark: 3/10/20 concurrent connections — significant improvement
Full release build compiles and links cleanly
Run DuckDB test suite (test/sql/storage/ tests)
Stress test: concurrent Pin/Unpin with eviction (readers hitting 0 under load)
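The stress test item could start from something like the sketch below. It is not the PR's test: it exercises only a bare atomic counter plus a mutex, with no real buffers or eviction, and `StressResult` and `RunPinUnpinStress` are hypothetical names. It checks three properties that should hold regardless of interleaving: every pin is accounted for, the counter returns to 0, and every drop to 0 is matched by a pin that went through the mutex.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct StressResult {
    int64_t fast_pins = 0;        // pins that succeeded via CAS
    int64_t slow_pins = 0;        // pins that took the mutex
    int64_t zero_transitions = 0; // unpins that brought readers to 0
    int32_t final_readers = 0;
};

inline StressResult RunPinUnpinStress(int thread_count, int iterations) {
    std::atomic<int32_t> readers{0};
    std::mutex lock;
    std::atomic<int64_t> fast_pins{0}, slow_pins{0}, zero_transitions{0};

    auto worker = [&]() {
        for (int i = 0; i < iterations; ++i) {
            // Pin: try the CAS fast path first.
            int32_t cur = readers.load(std::memory_order_relaxed);
            bool pinned = false;
            while (cur > 0) {
                if (readers.compare_exchange_weak(cur, cur + 1,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed)) {
                    pinned = true;
                    break;
                }
            }
            if (pinned) {
                fast_pins.fetch_add(1, std::memory_order_relaxed);
            } else {
                // Slow path: pin under the mutex.
                std::lock_guard<std::mutex> guard(lock);
                readers.fetch_add(1, std::memory_order_acquire);
                slow_pins.fetch_add(1, std::memory_order_relaxed);
            }
            // Unpin: only the transition to 0 takes the mutex.
            if (readers.fetch_sub(1, std::memory_order_release) - 1 == 0) {
                std::lock_guard<std::mutex> guard(lock);
                zero_transitions.fetch_add(1, std::memory_order_relaxed);
            }
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back(worker);
    }
    for (auto &t : threads) {
        t.join();
    }

    StressResult result;
    result.fast_pins = fast_pins.load();
    result.slow_pins = slow_pins.load();
    result.zero_transitions = zero_transitions.load();
    result.final_readers = readers.load();
    return result;
}
```

A real stress test would additionally need a memory limit small enough to force eviction, so that the unload logic actually runs when readers hits 0.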

🤖 Generated with Claude Code

When a block is already loaded and has active readers (readers > 0), the current code still acquires the per-block mutex for both Pin and Unpin operations. Under concurrent workloads where multiple connections scan the same wide table, this mutex becomes a severe bottleneck: all threads serialize on the same lock for every segment access.

This change adds an optimistic fast path:

Pin: If state == BLOCK_LOADED and readers > 0, atomically increment readers via compare-and-swap without acquiring the mutex. This is safe because readers > 0 guarantees no concurrent unload (CanUnload checks readers > 0 before allowing eviction).
Unpin: Atomically decrement readers. If the result is > 0, return immediately without the mutex, since no state transition can happen. Only when readers hits 0 do we fall back to the mutex path to handle eviction queue insertion or unloading.

Benchmark results (SELECT * FROM 250-column table WHERE itemId IN <1000 ids>, 95K rows, 9284 segments, all data in buffer pool):

Single connection: ~95ms → ~92ms (no regression)
10 concurrent: 562ms → 193ms (66% faster)
20 concurrent: 1177ms → 497ms (58% faster)

The improvement scales with concurrency because the fast path eliminates mutex contention entirely for the common case (hot blocks with multiple concurrent readers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

GetDataSize() is called after op.End() in EndOperator, so its cost is not captured in CPU_TIME. For wide tables with deeply nested types this can be significant. Log a warning to detect when it contributes to the gap between LATENCY and CPU_TIME.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

krleonid and others added 2 commits May 26, 2026 08:55

krleonid changed the base branch from main to v1.5-stable-artjom-ab077c5 May 26, 2026 07:28

krleonid changed the base branch from v1.5-stable-artjom-ab077c5 to main May 26, 2026 07:30

krleonid wants to merge 2 commits into main from feature/lock-free-pin-unpin-fast-path
