Reduce server p95 write+read latency by ~12% (avg p95 2.65 → 2.33 ms)#110

Merged

mweiden merged 6 commits into main from matt-weiden/server-latency-perf on Apr 26, 2026
Conversation

@mweiden (Owner) commented Apr 26, 2026

Summary

Driven by an autoresearch (/ratchet) loop on scripts/perf_compare.sh --cass-only over a thread sweep of {1,2,4,8,16,32,64} at OPS=5000, RF=3, write/read CL=QUORUM. Metric: the average of all 95th-percentile latency lines (READ + WRITE) emitted by examples/perf_client.rs across the seven thread counts.

  • Baseline: 2.6546 ms avg p95
  • After: 2.3307 ms avg p95
  • Improvement: −0.324 ms, −12.2%

Two of the four kept changes contributed nearly all of the gain (WAL mutex −7.6%, tracing filter −4%); the other two are smaller wins on the same hot path. Five other hypotheses were tested and discarded after regressions or no signal — see "Tried and rejected" at the bottom.

Kept changes

1. WAL: tokio::Mutex&lt;WalState&gt; → std::sync::Mutex (src/wal.rs)

Trial result: 2.6546 → 2.4536 ms (−7.6%, single largest win)

Wal::append is on the critical path of every write. Its critical section just does data.extend_from_slice(...) + data.push(b'\n') — fully synchronous, never awaits. With the previous tokio::sync::Mutex, every concurrent appender paid an async-mutex acquire (atomic CAS + potential wait queue + cooperative-yield logic) for what is microsecond-scale work. Under 64 concurrent client threads this serialized appends through async machinery designed for long critical sections.

flush_lock (which IS held across an await on storage.append) stays as tokio::Mutex. Only state switched. All WAL integration tests (wal_recovery_test, wal_error_test, wal_stress_test) pass.
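
For reference, a minimal sketch of the resulting lock layout — a hedged reconstruction, not the actual src/wal.rs: WalState is reduced to its buffer, the flush path is elided, and only the two lock types and the shape of append come from the PR:

```rust
use std::sync::Mutex;

/// Buffered, not-yet-flushed WAL bytes (simplified).
struct WalState {
    data: Vec<u8>,
}

struct Wal {
    // Hot path: the critical section is fully synchronous, so a std mutex
    // (one atomic pair when uncontended, no waker machinery) suffices.
    state: Mutex<WalState>,
    // Held across storage.append().await, so it must remain an async mutex.
    flush_lock: tokio::sync::Mutex<()>,
}

impl Wal {
    fn append(&self, entry: &[u8]) {
        let mut state = self.state.lock().unwrap();
        state.data.extend_from_slice(entry);
        state.data.push(b'\n');
        // Guard drops here, never across an .await point: the invariant
        // that makes std::sync::Mutex safe to use inside async code.
    }
}
```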

2. MemTable: per-shard tokio::sync::RwLock → std::sync::RwLock (src/memtable.rs)

Trial result: 2.4536 → 2.4293 ms (~−1%, marginal but consistent)

Same pattern as the WAL: MemTable is sharded by murmur3 of the key, but each shard's lock was a tokio::sync::RwLock despite all critical sections being synchronous (HashMap::insert, .get(...).cloned(), etc.). All call sites in lib.rs already drop the guard before any await. memtable_test and downstream query_persistence_test / lsm_flush_test pass.
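
A condensed sketch of the sharding pattern under the new lock type (illustrative: the shard count and the stand-in hash are assumptions; the real code shards by murmur3 of the key):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

const NUM_SHARDS: usize = 16; // illustrative; not the real shard count

struct MemTable {
    shards: Vec<RwLock<HashMap<String, Vec<u8>>>>,
}

impl MemTable {
    fn new() -> Self {
        Self {
            shards: (0..NUM_SHARDS).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    // Stand-in hash; the real code uses murmur3.
    fn shard_for(&self, key: &str) -> &RwLock<HashMap<String, Vec<u8>>> {
        let h = key
            .bytes()
            .fold(0usize, |acc, b| acc.wrapping_mul(31).wrapping_add(b as usize));
        &self.shards[h % NUM_SHARDS]
    }

    fn insert(&self, key: String, value: Vec<u8>) {
        // Synchronous critical section; the guard is dropped before any
        // .await at the call site, which is what permits std::sync::RwLock.
        self.shard_for(&key).write().unwrap().insert(key, value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.shard_for(key).read().unwrap().get(key).cloned()
    }
}
```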

3. apply_hints: read-lock fast path (src/cluster.rs)

Trial result: 2.4293 → 2.4275 ms (within noise on its own)

Cluster::healthy_nodes calls apply_hints(&node) for every healthy replica on every coordinator request. The previous implementation unconditionally took a write lock on self.hints and called map.remove(node), even though in steady state (no peer failures) the hints map is empty and the remove is a no-op. With QUORUM/RF=3 and 64 concurrent client threads, every request serialized three write-lock acquires through a single RwLock.

The fix is a one-line short-circuit:

```rust
if !self.hints.read().await.contains_key(node) {
    return;
}
```

Read locks don't serialize with each other, so concurrent coordinator requests can all confirm "no hints" in parallel. Correctness is preserved: if a hint arrives between the read check and the would-be write, it'll be picked up by the next request — hint delivery is best-effort by design. hinted_handoff_test and cluster_execute_test pass.
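
For context, a sketch of the full fast/slow path this implies; the three-line read check above is the actual diff, while the struct layout, the hint value type, and the replay loop below are assumptions for illustration:

```rust
use std::collections::HashMap;
use tokio::sync::RwLock;

struct Cluster {
    // node address -> queued hinted-handoff mutations (value type simplified)
    hints: RwLock<HashMap<String, Vec<Vec<u8>>>>,
}

impl Cluster {
    async fn apply_hints(&self, node: &str) {
        // Fast path: read locks are shared, so concurrent coordinator
        // requests all confirm "no hints" in parallel in the empty
        // steady state instead of serializing on a write lock.
        if !self.hints.read().await.contains_key(node) {
            return;
        }
        // Slow path: only reached when hints exist for this node. A hint
        // enqueued between the read check and this write lock is simply
        // delivered by a later request; delivery is best-effort by design.
        if let Some(pending) = self.hints.write().await.remove(node) {
            for _hint in pending {
                // replay the mutation against `node` (elided)
            }
        }
    }
}
```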

4. Tracing: default filter to warn when CASS_DISABLE_TRACING=1 (src/telemetry.rs)

Trial result: 2.4275 → 2.3307 ms (−4.0%, clean signal — both trials inside ±0.03 ms)

When tracing is disabled (the perf-test mode), init_tracing was still installing a fmt subscriber with EnvFilter::new("info"). The codebase has 30+ #[instrument] annotations on hot paths (every cluster entry point, every gRPC handler, every Paxos step, …), and #[instrument] defaults to INFO level — meaning all of those spans were getting allocated, having their fields recorded via field::display(&sql) (a per-request formatted-string allocation), and dropped, on every request, even though no exporter was consuming them.

The change: in the tracing_disabled() branch, use EnvFilter::new("warn") as the default. INFO-level spans now fail the level check before Span::new allocates. RUST_LOG is still honored if explicitly set, so debugging is unaffected. The OTel-enabled branch is untouched.
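
A sketch of what that branch plausibly looks like, assuming tracing-subscriber with the env-filter feature enabled (tracing_disabled here is a hypothetical stand-in for the PR's CASS_DISABLE_TRACING check; the OTel branch is elided):

```rust
use tracing_subscriber::{fmt, EnvFilter};

// Hypothetical helper mirroring the PR's env-var check.
fn tracing_disabled() -> bool {
    std::env::var("CASS_DISABLE_TRACING").map(|v| v == "1").unwrap_or(false)
}

fn init_tracing() {
    if tracing_disabled() {
        // Default to `warn`: INFO-level #[instrument] spans now fail the
        // level check before Span::new allocates or records any fields.
        // An explicitly set RUST_LOG still wins, so debugging is unaffected.
        let filter = EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("warn"));
        fmt().with_env_filter(filter).init();
    } else {
        // OTel-enabled branch untouched (elided).
    }
}
```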

This was the second-biggest win and the cleanest signal — both trials landed at 2.353 / 2.308 ms, a tight cluster well below the previous best.

Harness / setup

Two commits added to make the perf testing reproducible:

  • scripts/_ratchet_score.sh — wrapper that builds the docker image (so source changes take effect — perf_compare.sh itself only does docker compose up -d without --build), runs perf_compare.sh --cass-only, and prints score: <avg p95 ms> on stdout.
  • .gitignore — added .ratchet/ (autoresearch state) and perf-results/*.png (regenerated artifact). The .png was previously tracked, which polluted every perf run with a "modified" file.

Tried and rejected

For posterity, since these failures are also informative:

| iter | hypothesis | outcome |
| --- | --- | --- |
| 3 | Cap worker threads via #[tokio::main(worker_threads = 2)] to reduce 5-node oversubscription | Regressed to 2.555 ms. 2 workers × 5 nodes is underprovisioned for tonic+gRPC concurrency. |
| 5 | tcp_nodelay(true) on Server::builder() and Endpoint | Regressed to 2.534 ms, though trial spread (2.45–2.62) was wide. HTTP/2 multiplexing may already absorb Nagle. |
| 7 | Short-circuit extract_remote_context_from_metadata / inject_context when tracing disabled | No signal (2.346 vs 2.331). The default propagator (when none is set) is already a fast no-op closure. |
| 8 | SCHEMA_CACHE tokio::RwLock → std::sync::RwLock | No signal (2.349 vs 2.331). Read-only steady state is already cheap on tokio::RwLock. |
| 9 | Enable thin-LTO in [profile.release] | Regressed to 2.390 ms. Tight trials (2.38, 2.40) suggest a real regression: the workload is network/coordination-bound, not CPU-bound, so cross-crate inlining doesn't help and may pessimize icache. |

The pattern that worked: removing async/observability machinery from hot paths that don't need it. The pattern that didn't: micro-optimizing things the workload was already not bottlenecked on.

Test plan

  • cargo check --bins
  • WAL tests: cargo test --test wal_recovery_test --test wal_error_test --test wal_stress_test
  • MemTable tests: cargo test --test memtable_test --test query_persistence_test --test lsm_flush_test
  • Cluster/hint tests: cargo test --test hinted_handoff_test --test cluster_execute_test
  • 10 full scripts/perf_compare.sh --cass-only runs (5 nodes, RF=3, QUORUM, OPS=5000, threads ∈ {1,2,4,8,16,32,64}) — best 2.331 ms, no crashes
  • CI green
  • Reviewer spot-check that .ratchet/ exclusion in .gitignore is desired (the harness state isn't reusable across machines so probably yes)

@mweiden merged commit 09b9d06 into main on Apr 26, 2026
1 check passed
@mweiden deleted the matt-weiden/server-latency-perf branch on April 26, 2026 at 22:05