Lock-free fast path for buffer Pin/Unpin under concurrent readers #59

Open
krleonid wants to merge 2 commits into main from feature/lock-free-pin-unpin-fast-path
@krleonid
Owner

Problem

When multiple connections concurrently scan the same table (common in our workload), every segment access requires Pin/Unpin of its block handle. The current implementation acquires a per-block mutex for both operations, even when the block is already loaded and has active readers — a state where no transition can occur.

For wide tables (250 columns, ~9K segments), this creates severe mutex contention. A query that takes 95ms single-threaded degrades to 1.2s with 20 concurrent connections, all serializing on the same block locks.

Production evidence (from mdb-engine-shadow logs):

  • CPU_TIME: 0.227s vs LATENCY: 1.216s — ~1s unaccounted gap
  • BLOCKED_THREAD_TIME: 0, TOTAL_BYTES_READ: 0 — all in memory, no I/O wait
  • The gap comes from threads stalling on mutex acquisition under cgroup CPU scheduling

Solution

Add an optimistic lock-free fast path using atomic operations:

Pin fast path: If state == BLOCK_LOADED and readers > 0, atomically increment readers via CAS. Safe because readers > 0 prevents any concurrent unload (CanUnload() checks this).

Unpin fast path: Atomically decrement readers. If result > 0, return immediately — no state transition possible. Only fall back to mutex when readers hits 0 (eviction queue / unload logic needed).
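
The two atomic helpers described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `ReaderCount` struct and the `int32_t` counter type are assumptions made for the example, and only the two method names come from the Changes list.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the reader counter stored on a block handle.
struct ReaderCount {
    std::atomic<int32_t> readers{0};

    // Increment only if the count is already positive. Returns false when the
    // count is 0, in which case the caller has to take the mutex path.
    bool TryIncrementReadersIfPositive() {
        int32_t current = readers.load(std::memory_order_relaxed);
        while (current > 0) {
            // On failure, compare_exchange_weak stores the observed value in
            // `current`, so the loop re-checks positivity on fresh data.
            if (readers.compare_exchange_weak(current, current + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // Decrement and return the new count (fetch_sub returns the old value).
    int32_t DecrementReadersAtomic() {
        return readers.fetch_sub(1, std::memory_order_release) - 1;
    }
};
```

A plain `fetch_add` would not work for Pin, because it would also increment from 0 and so race with an unload; the CAS loop is what makes the increment conditional on `readers > 0`.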

Changes

  • block_handle.hpp: Add TryIncrementReadersIfPositive() (CAS loop) and DecrementReadersAtomic() (fetch_sub)
  • standard_buffer_manager.cpp: Add fast path before mutex acquisition in Pin() and Unpin()
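
To show how the fast path might sit in front of the mutex, here is a toy model under stated assumptions. `ToyBlock`, `Pin`, `Unpin`, `BLOCK_UNLOADED`, and `slow_path_calls` are hypothetical names for this sketch; the real buffer manager loads data from storage and uses an eviction queue, whereas this toy "loads" by flipping a flag and unloads as soon as the count reaches 0.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

enum class BlockState : uint8_t { BLOCK_UNLOADED, BLOCK_LOADED };

// Hypothetical, simplified block handle.
struct ToyBlock {
    std::mutex lock;
    std::atomic<BlockState> state{BlockState::BLOCK_UNLOADED};
    std::atomic<int32_t> readers{0};
    int slow_path_calls = 0; // guarded by `lock`; instrumentation only
};

void Pin(ToyBlock &b) {
    // Fast path: loaded and already pinned by someone, so no transition can occur.
    if (b.state.load(std::memory_order_acquire) == BlockState::BLOCK_LOADED) {
        int32_t cur = b.readers.load(std::memory_order_relaxed);
        while (cur > 0) {
            if (b.readers.compare_exchange_weak(cur, cur + 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                return;
            }
        }
    }
    // Slow path: the block may need loading, so serialize on the mutex.
    std::lock_guard<std::mutex> guard(b.lock);
    ++b.slow_path_calls;
    b.state.store(BlockState::BLOCK_LOADED, std::memory_order_release); // toy "load"
    b.readers.fetch_add(1, std::memory_order_acq_rel);
}

void Unpin(ToyBlock &b) {
    // Fast path: other readers remain, so nothing else has to happen.
    if (b.readers.fetch_sub(1, std::memory_order_release) - 1 > 0) {
        return;
    }
    // Slow path: the count hit 0. Re-check under the mutex, because another
    // thread may have pinned the block again in the meantime.
    std::lock_guard<std::mutex> guard(b.lock);
    ++b.slow_path_calls;
    if (b.readers.load(std::memory_order_acquire) == 0) {
        b.state.store(BlockState::BLOCK_UNLOADED, std::memory_order_release); // toy "unload"
    }
}
```

In this sketch, pinning twice and unpinning twice takes the mutex only for the first Pin and the last Unpin; the two calls in between stay on the lock-free path.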

Benchmark Results

Test: SELECT * FROM 250-column table WHERE itemId IN (1000 ids) — 95K rows, 9284 segments, all data in buffer pool (128GB memory, no eviction).

Connections   Before (avg)   After (avg)   Improvement
1             95ms           92ms          ~same
3             138ms          139ms         ~same
10            562ms          193ms         66% faster
20            1177ms         497ms         58% faster

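The fast path shows no measurable cost at 1 and 3 connections and cuts average latency by 66% and 58% at 10 and 20 connections. Latency still grows with concurrency (92ms at 1 connection vs. 497ms at 20), so the scalability cliff is greatly reduced rather than removed.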

Safety Argument

The fast path is safe due to the following invariants:

  1. Pin: readers > 0 → block cannot be unloaded (enforced by CanUnload()). Incrementing from positive is always safe — the block remains loaded throughout.
  2. Unpin: Decrementing from N>1 to N-1>0 → no state transition occurs. Only the transition to 0 triggers eviction queue / unload logic, which uses the mutex path.
  3. Memory ordering: acquire on successful CAS (Pin) ensures subsequent reads of the buffer pointer see the loaded data. release on fetch_sub (Unpin) orders this reader's accesses to the buffer before the decrement, so a thread that later observes the lower count with acquire semantics sees those accesses as complete.
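
The acquire side of the memory-ordering argument can be illustrated with a small two-thread sketch. Everything here is hypothetical (`PublishedBlock`, `LoadAndPin`, `PinAndRead`, `RunOrderingDemo`), and it assumes the thread that loads the block publishes the first positive reader count with release semantics; the PR text does not spell out that part.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical block: `payload` is plain, non-atomic data standing in for
// the buffer contents; `readers` is the atomic pin count.
struct PublishedBlock {
    int payload = 0;
    std::atomic<int> readers{0};
};

// Loader side: fill the payload, then publish it by making the count
// positive with a release store (assumed behaviour of the load path).
void LoadAndPin(PublishedBlock &b, int value) {
    b.payload = value;
    b.readers.store(1, std::memory_order_release);
}

// Reader side: retry the conditional increment until it succeeds, then read
// the payload. The acquire CAS synchronizes with the release store above,
// so the read of `payload` is race-free and sees the loaded value.
int PinAndRead(PublishedBlock &b) {
    int cur = b.readers.load(std::memory_order_relaxed);
    for (;;) {
        if (cur <= 0) {
            std::this_thread::yield();
            cur = b.readers.load(std::memory_order_relaxed);
            continue;
        }
        if (b.readers.compare_exchange_weak(cur, cur + 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            break;
        }
    }
    return b.payload;
}

// Runs one loader and one reader concurrently and returns what the reader saw.
int RunOrderingDemo() {
    PublishedBlock b;
    int seen = -1;
    std::thread reader([&] { seen = PinAndRead(b); });
    std::thread loader([&] { LoadAndPin(b, 42); });
    loader.join();
    reader.join();
    return seen;
}
```

With a relaxed CAS instead of acquire, the read of `payload` would be a data race in the C++ memory model, which is the hazard item 3 is guarding against.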

Test plan

  • Benchmark: single connection — no regression
  • Benchmark: 3/10/20 concurrent connections — significant improvement
  • Full release build compiles and links cleanly
  • Run DuckDB test suite (test/sql/storage/ tests)
  • Stress test: concurrent Pin/Unpin with eviction (readers hitting 0 under load)
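
The stress-test item could start from something like the sketch below. It is an assumption-laden toy, not a test against the real buffer manager: `StressBlock`, `StressPin`, `StressUnpin`, and `RunPinUnpinStress` are hypothetical, loading is a flag flip, and the block is unloaded immediately when the count reaches 0 so that the 0 transition is exercised constantly.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical toy block used only for this stress sketch.
struct StressBlock {
    std::mutex lock;
    std::atomic<bool> loaded{false};
    std::atomic<int32_t> readers{0};
};

void StressPin(StressBlock &b) {
    if (b.loaded.load(std::memory_order_acquire)) {
        int32_t cur = b.readers.load(std::memory_order_relaxed);
        while (cur > 0) { // lock-free path: only increments a positive count
            if (b.readers.compare_exchange_weak(cur, cur + 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                return;
            }
        }
    }
    std::lock_guard<std::mutex> guard(b.lock);
    b.loaded.store(true, std::memory_order_release); // toy "load"
    b.readers.fetch_add(1, std::memory_order_acq_rel);
}

void StressUnpin(StressBlock &b) {
    if (b.readers.fetch_sub(1, std::memory_order_release) - 1 > 0) {
        return; // other readers remain
    }
    std::lock_guard<std::mutex> guard(b.lock);
    if (b.readers.load(std::memory_order_acquire) == 0) {
        b.loaded.store(false, std::memory_order_release); // toy "evict"
    }
}

// Returns how many times a thread held a pin yet observed the block unloaded.
int64_t RunPinUnpinStress(int thread_count, int iterations) {
    StressBlock block;
    std::atomic<int64_t> violations{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; t++) {
        threads.emplace_back([&block, &violations, iterations] {
            for (int i = 0; i < iterations; i++) {
                StressPin(block);
                if (!block.loaded.load(std::memory_order_acquire)) {
                    violations.fetch_add(1, std::memory_order_relaxed);
                }
                StressUnpin(block);
            }
        });
    }
    for (auto &th : threads) {
        th.join();
    }
    return violations.load();
}
```

A call such as `RunPinUnpinStress(4, 20000)` should report zero violations if the invariants in the Safety Argument hold for the toy; running the same loop under ThreadSanitizer would be a natural next step for the real implementation.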

🤖 Generated with Claude Code

krleonid and others added 2 commits May 26, 2026 08:55
When a block is already loaded and has active readers (readers > 0),
the current code still acquires the per-block mutex for both Pin and
Unpin operations. Under concurrent workloads where multiple connections
scan the same wide table, this mutex becomes a severe bottleneck —
all threads serialize on the same lock for every segment access.

This change adds an optimistic fast path:

Pin: If state == BLOCK_LOADED and readers > 0, atomically increment
readers via compare-and-swap without acquiring the mutex. This is safe
because readers > 0 guarantees no concurrent unload (CanUnload checks
readers > 0 before allowing eviction).

Unpin: Atomically decrement readers. If the result is > 0, return
immediately without the mutex — no state transition can happen. Only
when readers hits 0 do we fall back to the mutex path to handle
eviction queue insertion or unloading.

Benchmark results (SELECT * FROM 250-column table WHERE itemId IN <1000 ids>,
95K rows, 9284 segments, all data in buffer pool):

  Single connection:  ~95ms → ~92ms (no regression)
  10 concurrent:     562ms → 193ms (66% faster)
  20 concurrent:    1177ms → 497ms (58% faster)

The improvement scales with concurrency because the fast path eliminates
mutex contention entirely for the common case (hot blocks with multiple
concurrent readers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetDataSize() is called after op.End() in EndOperator, so its cost is
not captured in CPU_TIME. For wide tables with deeply nested types this
can be significant. Log a warning to detect when it contributes to the
gap between LATENCY and CPU_TIME.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@krleonid krleonid changed the base branch from main to v1.5-stable-artjom-ab077c5 May 26, 2026 07:28
@krleonid krleonid changed the base branch from v1.5-stable-artjom-ab077c5 to main May 26, 2026 07:30