Server perf: +7.4% throughput via lock conversions, LTO, parse reuse#111
Merged
Server perf: +7.4% throughput via lock conversions, LTO, parse reuse#111
Conversation
β¦ skip duplicate parse on local replica (retry)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Throughput optimization run targeting the average operation rate across
perf_compare.sh(1/2/4/8/16/32/64-thread READ+WRITE phases). Final result: 8270 β 8886 ops/s (+7.4%), p95 latency 2.55 β 2.34 ms as a side effect.Conducted via
/ratchet:ratchet(10 iterations, 2 trials each, kept on improvement). Four changes were kept; five hypotheses were tested and reverted. Commits below are in keep order; each one is independently reviewable and was validated in isolation against the prior best.Changes
1.
2c6b0d6βscripts/_ratchet_score.sh: emit Latency + Operations scoresTooling change. The prior version of the harness emitted only an averaged p95 latency. This version also emits average ops/s across all (R+W) Γ (1/2/4/8/16/32/64) phases, so future ratchet runs can configure either metric without re-editing the script. Throughput parsing is anchored on
^Op rateto avoid matchingPartition rate/Row rate.Why it matters: lets the same harness drive throughput-maximizing or latency-minimizing runs by changing only the ratchet config.
2.
04f7f09βDockerfile: bump Rust 1.85 β 1.89Build fix. PR #110's
wal.rsDrop impl introduced let-chain syntax (if let Some(task) = β¦ && let Err(err) = task.join()), which stabilized in Rust 1.88. The Dockerfile was still pinned at 1.85, sodocker compose buildfailed with E0658 before the perf run could start.Why it matters: unblocks the existing
wal.rscode in container builds. Local development was unaffected because the dev toolchain is 1.89.3.
8debed1βCluster:healthmap βstd::sync::RwLockThe cluster's per-replica liveness map was a
tokio::sync::RwLock<HashMap<String, Instant>>. Every coordinator request callsis_alive(replica)once per replica (RF=3 β 3 reads per request), and the gossip task writes to it once per second. The critical sections are tiny β a singleHashMap::get+Instant::elapsedfor reads, a singleinsertfor writes β and never hold the lock across an.await.Why it matters: swapping to
std::sync::RwLockremoves the async-lock state machine and the implicit yield point per acquisition. Same pattern as the recentMemTableandWALconversions.Measured: +3.2% (8270 β 8534 ops/s).
4.
5209e9bβCluster:hintsmap βstd::sync::RwLockSame shape as health.
apply_hints(replica)runs once per replica per coordinator request and hits a read fast-path (!hints.contains_key(node) β return). In steady state with no failures, every request takes this read lock and immediately drops it. Writes only occur when a replica fails or a hint is delivered β rare.Why it matters: removes the async-lock overhead from another per-replica per-request hot path. Critical sections do not hold across
.await.Measured: +1.6% on top of the prior change (8534 β 8672 ops/s).
5.
ae48687βCargo.toml: enablelto = "thin"for release buildsCargo's release profile defaults to
lto = false. Thin LTO enables cross-crate inlining and dead-code elimination, which is particularly impactful here because the hot path (Cluster::executeβSqlEngine::parse_queryβsqlparser) crosses multiple crate boundaries on every request.Build cost is meaningful in absolute terms (one-time link is slower) but is fully cushioned by Docker's
target/cache mount in the multi-stage Dockerfile, so iteration time was not impacted.Why it matters: the simplest single change in the run with the broadest reach β every code path benefits, including the SQL parser and the gRPC stack.
Measured: +1.9% (8672 β 8838 ops/s).
6.
3594598βCluster: skip duplicate SQL parse on the local replicaCluster::executeparses the inbound SQL once at line 509 to determine routing (partition keys β replicas). Thenexecute_write_with_consistencyre-parsed the same SQL by callingSqlEngine::execute_with_ts(sql, ...)for the local-replica branch. The engine already exposedexecute_with_parsed(&ParsedQuery, ...)β it was just not wired through.This commit threads the existing
ParsedQueryfrom the routing step intoexecute_write_with_consistencyand usesexecute_with_parsedfor the local branch. Remote replicas still receive the SQL string and parse it on receipt (no protocol change).Why it matters: eliminates one full sqlparser invocation per write request on the coordinator. The same trick was attempted for the read path (
run_read_with_quorumβexecute_on_node) but regressed marginally because the per-peerArc<ParsedQuery>clone overhead outweighed the savings β kept the change scoped to writes only.Measured: +0.5% (8838 β 8886 ops/s); both trials within 0.3% of each other (8899/8872) β clean signal.
Things tried and reverted
panic_untilβstd::sync::RwLockβ only 1 read per local request; effect lost in run-to-run noise.SCHEMA_CACHEβstd::sync::RwLockβ read per request vialookup_schema; trial avg regressed but unclear whether real or noise.Server::tcp_nodelay(true)β tonic's default may already cover this; no measurable effect.Arc<ParsedQuery>threading β tight cluster regression (-1.2%); the per-peer Arc-clone cost negated the savings.Methodology note
Trial-to-trial variance in this benchmark is roughly 5β10% with occasional outliers up to ~18%. With
--trials 2, single environmental outliers can defeat sound changes (iter 5 of the run was a clear example: trial 1 = 7081, trial 2 = 8670; the change was good, retried with the same code in iter 8 and kept it). Future ratchet runs against this harness would benefit from--trials 3.Test plan
cargo test(lib + integration; flakycluster_remote_ops_testintegration tests pre-exist onmaindue to port-startup race β not introduced here)docker compose buildsucceedsscripts/perf_compare.sh --cass-onlyruns end-to-end and producesperf-results/cass_t*.logOp rateandLatency 95th percentilelines vs. baseline numbers onmainπ€ Generated with Claude Code