Skip to content

Server perf: +7.4% throughput via lock conversions, LTO, parse reuse#111

Merged
mweiden merged 6 commits intomainfrom
matt-weiden/server-throughput-perf
Apr 27, 2026
Merged

Server perf: +7.4% throughput via lock conversions, LTO, parse reuse#111
mweiden merged 6 commits intomainfrom
matt-weiden/server-throughput-perf

Conversation

@mweiden
Copy link
Copy Markdown
Owner

@mweiden mweiden commented Apr 27, 2026

Summary

Throughput optimization run targeting the average operation rate across perf_compare.sh (1/2/4/8/16/32/64-thread READ+WRITE phases). Final result: 8270 β†’ 8886 ops/s (+7.4%), p95 latency 2.55 β†’ 2.34 ms as a side effect.

Conducted via /ratchet:ratchet (10 iterations, 2 trials each, kept on improvement). Four changes were kept; five hypotheses were tested and reverted. Commits below are in keep order; each one is independently reviewable and was validated in isolation against the prior best.

Changes

1. 2c6b0d6 β€” scripts/_ratchet_score.sh: emit Latency + Operations scores

Tooling change. The prior version of the harness emitted only an averaged p95 latency. This version also emits average ops/s across all (R+W) Γ— (1/2/4/8/16/32/64) phases, so future ratchet runs can configure either metric without re-editing the script. Throughput parsing is anchored on ^Op rate to avoid matching Partition rate/Row rate.

Why it matters: lets the same harness drive throughput-maximizing or latency-minimizing runs by changing only the ratchet config.

2. 04f7f09 β€” Dockerfile: bump Rust 1.85 β†’ 1.89

Build fix. PR #110's wal.rs Drop impl introduced let-chain syntax (if let Some(task) = … && let Err(err) = task.join()), which stabilized in Rust 1.88. The Dockerfile was still pinned at 1.85, so docker compose build failed with E0658 before the perf run could start.

Why it matters: unblocks the existing wal.rs code in container builds. Local development was unaffected because the dev toolchain is 1.89.

3. 8debed1 β€” Cluster: health map β†’ std::sync::RwLock

The cluster's per-replica liveness map was a tokio::sync::RwLock<HashMap<String, Instant>>. Every coordinator request calls is_alive(replica) once per replica (RF=3 β‡’ 3 reads per request), and the gossip task writes to it once per second. The critical sections are tiny β€” a single HashMap::get + Instant::elapsed for reads, a single insert for writes β€” and never hold the lock across an .await.

Why it matters: swapping to std::sync::RwLock removes the async-lock state machine and the implicit yield point per acquisition. Same pattern as the recent MemTable and WAL conversions.

Measured: +3.2% (8270 β†’ 8534 ops/s).

4. 5209e9b β€” Cluster: hints map β†’ std::sync::RwLock

Same shape as health. apply_hints(replica) runs once per replica per coordinator request and hits a read fast-path (!hints.contains_key(node) β†’ return). In steady state with no failures, every request takes this read lock and immediately drops it. Writes only occur when a replica fails or a hint is delivered β€” rare.

Why it matters: removes the async-lock overhead from another per-replica per-request hot path. Critical sections do not hold across .await.

Measured: +1.6% on top of the prior change (8534 β†’ 8672 ops/s).

5. ae48687 β€” Cargo.toml: enable lto = "thin" for release builds

Cargo's release profile defaults to lto = false. Thin LTO enables cross-crate inlining and dead-code elimination, which is particularly impactful here because the hot path (Cluster::execute β†’ SqlEngine::parse_query β†’ sqlparser) crosses multiple crate boundaries on every request.

Build cost is meaningful in absolute terms (one-time link is slower) but is fully cushioned by Docker's target/ cache mount in the multi-stage Dockerfile, so iteration time was not impacted.

Why it matters: the simplest single change in the run with the broadest reach β€” every code path benefits, including the SQL parser and the gRPC stack.

Measured: +1.9% (8672 β†’ 8838 ops/s).

6. 3594598 β€” Cluster: skip duplicate SQL parse on the local replica

Cluster::execute parses the inbound SQL once at line 509 to determine routing (partition keys β†’ replicas). Then execute_write_with_consistency re-parsed the same SQL by calling SqlEngine::execute_with_ts(sql, ...) for the local-replica branch. The engine already exposed execute_with_parsed(&ParsedQuery, ...) β€” it was just not wired through.

This commit threads the existing ParsedQuery from the routing step into execute_write_with_consistency and uses execute_with_parsed for the local branch. Remote replicas still receive the SQL string and parse it on receipt (no protocol change).

Why it matters: eliminates one full sqlparser invocation per write request on the coordinator. The same trick was attempted for the read path (run_read_with_quorum β†’ execute_on_node) but regressed marginally because the per-peer Arc<ParsedQuery> clone overhead outweighed the savings β€” kept the change scoped to writes only.

Measured: +0.5% (8838 β†’ 8886 ops/s); both trials within 0.3% of each other (8899/8872) β€” clean signal.

Things tried and reverted

  • panic_until β†’ std::sync::RwLock β€” only 1 read per local request; effect lost in run-to-run noise.
  • SCHEMA_CACHE β†’ std::sync::RwLock β€” read per request via lookup_schema; trial avg regressed but unclear whether real or noise.
  • Server::tcp_nodelay(true) β€” tonic's default may already cover this; no measurable effect.
  • Read-path Arc<ParsedQuery> threading β€” tight cluster regression (-1.2%); the per-peer Arc-clone cost negated the savings.

Methodology note

Trial-to-trial variance in this benchmark is roughly 5–10% with occasional outliers up to ~18%. With --trials 2, single environmental outliers can defeat sound changes (iter 5 of the run was a clear example: trial 1 = 7081, trial 2 = 8670; the change was good, retried with the same code in iter 8 and kept it). Future ratchet runs against this harness would benefit from --trials 3.

Test plan

  • cargo test (lib + integration; flaky cluster_remote_ops_test integration tests pre-exist on main due to port-startup race β€” not introduced here)
  • docker compose build succeeds
  • scripts/perf_compare.sh --cass-only runs end-to-end and produces perf-results/cass_t*.log
  • Spot-check the Op rate and Latency 95th percentile lines vs. baseline numbers on main

πŸ€– Generated with Claude Code

@mweiden mweiden merged commit 67f9460 into main Apr 27, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant