# Reduce server p95 write+read latency by ~12% (avg p95 2.65 → 2.33 ms) (#110)
## Summary
Driven by an autoresearch (`/ratchet`) loop on `scripts/perf_compare.sh --cass-only` over thread sweeps {1, 2, 4, 8, 16, 32, 64} at OPS=5000, RF=3, write/read CL=QUORUM. Metric: the average of all 95th-percentile latency lines (READ + WRITE) emitted by `examples/perf_client.rs` across the seven thread counts. Two of the four kept changes contributed nearly all of the gain (WAL mutex −7.6%, tracing filter −4%); the other two are smaller wins on the same hot path. Five other hypotheses were tested and discarded after regressions or no signal — see "Tried and rejected" at the bottom.
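For reference, the score is just a flat average over every p95 line the client prints. Below is a minimal sketch of that aggregation, assuming a hypothetical line format where the millisecond value is the last token (the real `examples/perf_client.rs` output format is not reproduced here):

```rust
/// Hypothetical score computation for the ratchet loop: average every READ
/// and WRITE p95 latency line across all thread-count runs. The "last token
/// is the millisecond value" parsing is an assumption for illustration.
fn ratchet_score(perf_client_output: &str) -> Option<f64> {
    let p95s: Vec<f64> = perf_client_output
        .lines()
        .filter(|line| line.contains("p95"))
        .filter_map(|line| line.split_whitespace().last()?.parse().ok())
        .collect();
    if p95s.is_empty() {
        None
    } else {
        Some(p95s.iter().sum::<f64>() / p95s.len() as f64)
    }
}
```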
## Kept changes
### 1. WAL: `tokio::Mutex<WalState>` → `std::sync::Mutex` (`src/wal.rs`)

Trial result: 2.6546 → 2.4536 ms (−7.6%, single largest win)

`Wal::append` is on the critical path of every write. Its critical section just does `data.extend_from_slice(...)` + `data.push(b'\n')` — fully synchronous, never awaits. With the previous `tokio::sync::Mutex`, every concurrent appender paid an async-mutex acquire (atomic CAS + potential wait queue + cooperative-yield logic) for what is microsecond-scale work. Under 64 concurrent client threads this serialized appends through async machinery designed for long critical sections. `flush_lock` (which IS held across an `await` on `storage.append`) stays a `tokio::Mutex`; only `state` switched. All WAL integration tests (`wal_recovery_test`, `wal_error_test`, `wal_stress_test`) pass.
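A minimal sketch of the resulting split (not the literal `src/wal.rs`; the struct layout beyond the names mentioned above is assumed):

```rust
use std::sync::Mutex;
use tokio::sync::Mutex as AsyncMutex;

// Sketch: the append buffer sits behind a std mutex because its critical
// section is fully synchronous, while flushing keeps an async mutex since
// it is held across an .await on storage.append.
struct WalState {
    data: Vec<u8>,
}

struct Wal {
    state: Mutex<WalState>,     // was tokio::sync::Mutex<WalState>
    flush_lock: AsyncMutex<()>, // unchanged: guards the awaited flush
}

impl Wal {
    fn append(&self, record: &[u8]) {
        // Microsecond-scale work: no await, so no async acquire needed.
        let mut state = self.state.lock().expect("WAL mutex poisoned");
        state.data.extend_from_slice(record);
        state.data.push(b'\n');
    }
}
```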
### 2. MemTable: per-shard `tokio::sync::RwLock` → `std::sync::RwLock` (`src/memtable.rs`)

Trial result: 2.4536 → 2.4293 ms (~−1%, marginal but consistent)

Same pattern as the WAL: `MemTable` is sharded by murmur3 of the key, but each shard's lock was a `tokio::sync::RwLock` despite all critical sections being synchronous (`HashMap::insert`, `.get(...).cloned()`, etc.). All call sites in `lib.rs` already drop the guard before any await. `memtable_test` and the downstream `query_persistence_test` / `lsm_flush_test` pass.
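The same shape, sketched for the sharded table (shard count, key/value types, and the stand-in hasher are illustrative; the real code hashes with murmur3):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

// Sketch: each shard is a plain HashMap behind a std RwLock, because no
// critical section ever awaits. Types and the hasher are placeholders.
struct MemTable {
    shards: Vec<RwLock<HashMap<String, Vec<u8>>>>,
}

impl MemTable {
    fn shard_for(&self, key: &str) -> &RwLock<HashMap<String, Vec<u8>>> {
        // The real code shards by murmur3 of the key; std's DefaultHasher
        // stands in here to keep the sketch dependency-free.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: String, value: Vec<u8>) {
        self.shard_for(&key).write().expect("lock poisoned").insert(key, value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.shard_for(key).read().expect("lock poisoned").get(key).cloned()
    }
}
```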
### 3. `apply_hints`: read-lock fast path (`src/cluster.rs`)

Trial result: 2.4293 → 2.4275 ms (within noise on its own)

`Cluster::healthy_nodes` calls `apply_hints(&node)` for every healthy replica on every coordinator request. The previous implementation unconditionally took a write lock on `self.hints` and called `map.remove(node)` — even though in steady state (no peer failures) the hints map is empty and the remove is a no-op. With QUORUM/RF=3 and 64 concurrent client threads, every request was serializing 3 write-lock acquires through one mutex. The fix is a one-line short-circuit:
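(The original snippet isn't reproduced here; the sketch below reconstructs its shape. The lock flavor, key/value types, and the `deliver` helper are placeholders for whatever `src/cluster.rs` actually uses — only the read-before-write short-circuit is the point.)

```rust
use std::collections::HashMap;
use tokio::sync::RwLock;

// Sketch only: types and helper names are assumptions, not the real code.
struct Cluster {
    hints: RwLock<HashMap<String, Vec<Vec<u8>>>>,
}

impl Cluster {
    async fn apply_hints(&self, node: &str) {
        // One-line short-circuit: in steady state the hints map is empty,
        // so a shared read lock confirms "nothing to replay" without
        // contending on the write lock every coordinator request used to take.
        if !self.hints.read().await.contains_key(node) {
            return;
        }
        // Slow path as before: take the write lock and drain pending hints.
        if let Some(pending) = self.hints.write().await.remove(node) {
            self.deliver(node, pending).await;
        }
    }

    async fn deliver(&self, _node: &str, _hints: Vec<Vec<u8>>) {
        // Placeholder for the existing best-effort hint replay.
    }
}
```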
Read locks don't serialize with each other, so concurrent coordinator requests can all confirm "no hints" in parallel. Correctness is preserved: if a hint arrives between the read check and the would-be write, it'll be picked up by the next request — hint delivery is best-effort by design.
`hinted_handoff_test` and `cluster_execute_test` pass.

### 4. Tracing: default filter to `warn` when `CASS_DISABLE_TRACING=1` (`src/telemetry.rs`)

Trial result: 2.4275 → 2.3307 ms (−4.0%, clean signal — both trials inside ±0.03 ms)
When tracing is disabled (the perf-test mode), `init_tracing` was still installing a `fmt` subscriber with `EnvFilter::new("info")`. The codebase has 30+ `#[instrument]` annotations on hot paths (every cluster entry point, every gRPC handler, every Paxos step, …), and `#[instrument]` defaults to INFO level — meaning all of those spans were getting allocated, having their fields recorded via `field::display(&sql)` (a per-request formatted-string allocation), and dropped, on every request, even though no exporter was consuming them.

The change: in the `tracing_disabled()` branch, use `EnvFilter::new("warn")` as the default. INFO-level spans now fail the level check before `Span::new` allocates. `RUST_LOG` is still honored if explicitly set, so debugging is unaffected. The OTel-enabled branch is untouched. This was the second-biggest win and the cleanest signal — both trials landed at 2.353 / 2.308 ms, a tight cluster well below the previous best.
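A minimal sketch of the filter selection (the function and env-var names come from the description above; the exact `src/telemetry.rs` wiring and the OTel branch are assumed/omitted):

```rust
use tracing_subscriber::EnvFilter;

fn tracing_disabled() -> bool {
    // Perf-test mode: CASS_DISABLE_TRACING=1 means no exporter is consuming spans.
    std::env::var("CASS_DISABLE_TRACING").as_deref() == Ok("1")
}

fn init_tracing() {
    if tracing_disabled() {
        // Default to "warn" so INFO-level #[instrument] spans fail the level
        // check before Span::new allocates. An explicit RUST_LOG still
        // overrides this, so debugging is unaffected.
        let filter = EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("warn"));
        tracing_subscriber::fmt().with_env_filter(filter).init();
        return;
    }
    // OTel-enabled branch unchanged (omitted in this sketch).
}
```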
## Harness / setup
Two commits added to make the perf testing reproducible:
- `scripts/_ratchet_score.sh` — wrapper that builds the docker image (so source changes take effect — `perf_compare.sh` itself only does `docker compose up -d` without `--build`), runs `perf_compare.sh --cass-only`, and prints `score: <avg p95 ms>` on stdout.
- `.gitignore` — added `.ratchet/` (autoresearch state) and `perf-results/*.png` (regenerated artifact). The `.png` was previously tracked, which polluted every perf run with a "modified" file.

## Tried and rejected
For posterity, since these failures are also informative:
- `tokio::main(worker_threads = 2)` to reduce 5-node oversubscription
- `tcp_nodelay(true)` on `Server::builder()` and `Endpoint`
- Skipping `extract_remote_context_from_metadata` / `inject_context` when tracing is disabled
- `SCHEMA_CACHE` `tokio::RwLock` → `std::sync::RwLock` (reverted to `tokio::RwLock`)
- `[profile.release]` tuning

The pattern that worked: removing async/observability machinery from hot paths that don't need it. The pattern that didn't: micro-optimizing things the workload was already not bottlenecked on.
## Test plan
- `cargo check --bins`
- `cargo test --test wal_recovery_test --test wal_error_test --test wal_stress_test`
- `cargo test --test memtable_test --test query_persistence_test --test lsm_flush_test`
- `cargo test --test hinted_handoff_test --test cluster_execute_test`
- `scripts/perf_compare.sh --cass-only` runs (5 nodes, RF=3, QUORUM, OPS=5000, threads ∈ {1,2,4,8,16,32,64}) — best 2.331 ms, no crashes
- Whether the `.ratchet/` exclusion in `.gitignore` is desired (the harness state isn't reusable across machines, so probably yes)