Reduce server p95 write+read latency by ~12% (avg p95 2.65 → 2.33 ms)#110

Merged

mweiden merged 6 commits into main from matt-weiden/server-latency-perf on Apr 26, 2026
Conversation

@mweiden (Owner) commented Apr 26, 2026

Summary

Driven by an autoresearch (/ratchet) loop on scripts/perf_compare.sh --cass-only over a thread sweep of {1,2,4,8,16,32,64} at OPS=5000, RF=3, write/read CL=QUORUM. Metric: the average of all 95th-percentile latency lines (READ + WRITE) emitted by examples/perf_client.rs across the seven thread counts.

  • Baseline: 2.6546 ms avg p95
  • After: 2.3307 ms avg p95
  • Improvement: −0.324 ms, −12.2%

Two of the four kept changes contributed nearly all of the gain (WAL mutex −7.6%, tracing filter −4%); the other two are smaller wins on the same hot path. Five other hypotheses were tested and discarded after regressions or no signal — see "Tried and rejected" at the bottom.

Kept changes

1. WAL: tokio::Mutex&lt;WalState&gt; → std::sync::Mutex (src/wal.rs)

Trial result: 2.6546 → 2.4536 ms (−7.6%, single largest win)

Wal::append is on the critical path of every write. Its critical section just does data.extend_from_slice(...) + data.push(b'\n') — fully synchronous, never awaits. With the previous tokio::sync::Mutex, every concurrent appender paid an async-mutex acquire (atomic CAS + potential wait queue + cooperative-yield logic) for what is microsecond-scale work. Under 64 concurrent client threads this serialized appends through async machinery designed for long critical sections.

flush_lock (which IS held across an await on storage.append) stays as tokio::Mutex. Only state switched. All WAL integration tests (wal_recovery_test, wal_error_test, wal_stress_test) pass.
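
For reference, a minimal sketch of the resulting lock layout — a hedged reconstruction, not the actual src/wal.rs: WalState is reduced to its buffer, the flush path is elided, and only the two lock types and the shape of append come from the PR:

```rust
use std::sync::Mutex;

/// Buffered, not-yet-flushed WAL bytes (simplified).
struct WalState {
    data: Vec<u8>,
}

struct Wal {
    // Hot path: the critical section is fully synchronous, so a std mutex
    // (one atomic pair when uncontended, no waker machinery) suffices.
    state: Mutex<WalState>,
    // Held across storage.append().await, so it must remain an async mutex.
    flush_lock: tokio::sync::Mutex<()>,
}

impl Wal {
    fn append(&self, entry: &[u8]) {
        let mut state = self.state.lock().unwrap();
        state.data.extend_from_slice(entry);
        state.data.push(b'\n');
        // Guard drops here, never across an .await point: the invariant
        // that makes std::sync::Mutex safe to use inside async code.
    }
}
```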

2. MemTable: per-shard tokio::sync::RwLock → std::sync::RwLock (src/memtable.rs)

Trial result: 2.4536 → 2.4293 ms (~−1%, marginal but consistent)

Same pattern as the WAL: MemTable is sharded by murmur3 of the key, but each shard's lock was a tokio::sync::RwLock despite all critical sections being synchronous (HashMap::insert, .get(...).cloned(), etc.). All call sites in lib.rs already drop the guard before any await. memtable_test and downstream query_persistence_test / lsm_flush_test pass.
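
A condensed sketch of the sharding pattern under the new lock type (illustrative: the shard count and the stand-in hash are assumptions; the real code shards by murmur3 of the key):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

const NUM_SHARDS: usize = 16; // illustrative; not the real shard count

struct MemTable {
    shards: Vec<RwLock<HashMap<String, Vec<u8>>>>,
}

impl MemTable {
    fn new() -> Self {
        Self {
            shards: (0..NUM_SHARDS).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    // Stand-in hash; the real code uses murmur3.
    fn shard_for(&self, key: &str) -> &RwLock<HashMap<String, Vec<u8>>> {
        let h = key
            .bytes()
            .fold(0usize, |acc, b| acc.wrapping_mul(31).wrapping_add(b as usize));
        &self.shards[h % NUM_SHARDS]
    }

    fn insert(&self, key: String, value: Vec<u8>) {
        // Synchronous critical section; the guard is dropped before any
        // .await at the call site, which is what permits std::sync::RwLock.
        self.shard_for(&key).write().unwrap().insert(key, value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.shard_for(key).read().unwrap().get(key).cloned()
    }
}
```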

3. apply_hints: read-lock fast path (src/cluster.rs)

Trial result: 2.4293 → 2.4275 ms (within noise on its own)

Cluster::healthy_nodes calls apply_hints(&node) for every healthy replica on every coordinator request. The previous implementation unconditionally took a write lock on self.hints and called map.remove(node), even though in steady state (no peer failures) the hints map is empty and the remove is a no-op. With QUORUM/RF=3 and 64 concurrent client threads, every request serialized three write-lock acquires through a single RwLock.

The fix is a one-line short-circuit:

```rust
if !self.hints.read().await.contains_key(node) {
    return;
}
```

Read locks don't serialize with each other, so concurrent coordinator requests can all confirm "no hints" in parallel. Correctness is preserved: if a hint arrives between the read check and the would-be write, it'll be picked up by the next request — hint delivery is best-effort by design. hinted_handoff_test and cluster_execute_test pass.
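
For context, a sketch of the full fast/slow path this implies; the three-line read check above is the actual diff, while the struct layout, the hint value type, and the replay loop below are assumptions for illustration:

```rust
use std::collections::HashMap;
use tokio::sync::RwLock;

struct Cluster {
    // node address -> queued hinted-handoff mutations (value type simplified)
    hints: RwLock<HashMap<String, Vec<Vec<u8>>>>,
}

impl Cluster {
    async fn apply_hints(&self, node: &str) {
        // Fast path: read locks are shared, so concurrent coordinator
        // requests all confirm "no hints" in parallel in the empty
        // steady state instead of serializing on a write lock.
        if !self.hints.read().await.contains_key(node) {
            return;
        }
        // Slow path: only reached when hints exist for this node. A hint
        // enqueued between the read check and this write lock is simply
        // delivered by a later request; delivery is best-effort by design.
        if let Some(pending) = self.hints.write().await.remove(node) {
            for _hint in pending {
                // replay the mutation against `node` (elided)
            }
        }
    }
}
```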

4. Tracing: default filter to warn when CASS_DISABLE_TRACING=1 (src/telemetry.rs)

Trial result: 2.4275 → 2.3307 ms (−4.0%, clean signal — both trials inside ±0.03 ms)

When tracing is disabled (the perf-test mode), init_tracing was still installing a fmt subscriber with EnvFilter::new("info"). The codebase has 30+ #[instrument] annotations on hot paths (every cluster entry point, every gRPC handler, every Paxos step, …), and #[instrument] defaults to INFO level — meaning all of those spans were getting allocated, having their fields recorded via field::display(&sql) (a per-request formatted-string allocation), and dropped, on every request, even though no exporter was consuming them.

The change: in the tracing_disabled() branch, use EnvFilter::new("warn") as the default. INFO-level spans now fail the level check before Span::new allocates. RUST_LOG is still honored if explicitly set, so debugging is unaffected. The OTel-enabled branch is untouched.
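
A sketch of what that branch plausibly looks like, assuming tracing-subscriber with the env-filter feature enabled (tracing_disabled here is a hypothetical stand-in for the PR's CASS_DISABLE_TRACING check; the OTel branch is elided):

```rust
use tracing_subscriber::{fmt, EnvFilter};

// Hypothetical helper mirroring the PR's env-var check.
fn tracing_disabled() -> bool {
    std::env::var("CASS_DISABLE_TRACING").map(|v| v == "1").unwrap_or(false)
}

fn init_tracing() {
    if tracing_disabled() {
        // Default to `warn`: INFO-level #[instrument] spans now fail the
        // level check before Span::new allocates or records any fields.
        // An explicitly set RUST_LOG still wins, so debugging is unaffected.
        let filter = EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("warn"));
        fmt().with_env_filter(filter).init();
    } else {
        // OTel-enabled branch untouched (elided).
    }
}
```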

This was the second-biggest win and the cleanest signal — both trials landed at 2.353 / 2.308 ms, a tight cluster well below the previous best.

Harness / setup

Two commits added to make the perf testing reproducible:

  • scripts/_ratchet_score.sh — wrapper that builds the docker image (so source changes take effect — perf_compare.sh itself only does docker compose up -d without --build), runs perf_compare.sh --cass-only, and prints score: <avg p95 ms> on stdout.
  • .gitignore — added .ratchet/ (autoresearch state) and perf-results/*.png (regenerated artifact). The .png was previously tracked, which polluted every perf run with a "modified" file.

Tried and rejected

For posterity, since these failures are also informative:

| iter | hypothesis | outcome |
| --- | --- | --- |
| 3 | Cap worker threads via #[tokio::main(worker_threads = 2)] to reduce 5-node oversubscription | Regressed to 2.555 ms. 2 workers × 5 nodes is underprovisioned for tonic+gRPC concurrency. |
| 5 | tcp_nodelay(true) on Server::builder() and Endpoint | Regressed to 2.534 ms, though trial spread (2.45–2.62) was wide. HTTP/2 multiplexing may already absorb Nagle. |
| 7 | Short-circuit extract_remote_context_from_metadata / inject_context when tracing disabled | No signal (2.346 vs 2.331). The default propagator (when none is set) is already a fast no-op closure. |
| 8 | SCHEMA_CACHE tokio::RwLock → std::sync::RwLock | No signal (2.349 vs 2.331). Read-only steady state is already cheap on tokio::RwLock. |
| 9 | Enable thin-LTO in [profile.release] | Regressed to 2.390 ms. Tight trials (2.38, 2.40) suggest a real regression: the workload is network/coordination-bound, not CPU-bound, so cross-crate inlining doesn't help and may pessimize icache. |

The pattern that worked: removing async/observability machinery from hot paths that don't need it. The pattern that didn't: micro-optimizing things the workload was already not bottlenecked on.

Test plan

  • cargo check --bins
  • WAL tests: cargo test --test wal_recovery_test --test wal_error_test --test wal_stress_test
  • MemTable tests: cargo test --test memtable_test --test query_persistence_test --test lsm_flush_test
  • Cluster/hint tests: cargo test --test hinted_handoff_test --test cluster_execute_test
  • 10 full scripts/perf_compare.sh --cass-only runs (5 nodes, RF=3, QUORUM, OPS=5000, threads ∈ {1,2,4,8,16,32,64}) — best 2.331 ms, no crashes
  • CI green
  • Reviewer spot-check that .ratchet/ exclusion in .gitignore is desired (the harness state isn't reusable across machines so probably yes)

@mweiden merged commit 09b9d06 into main on Apr 26, 2026
1 check passed
@mweiden deleted the matt-weiden/server-latency-perf branch on April 26, 2026 at 22:05