Server perf: +7.4% throughput via lock conversions, LTO, parse reuse by mweiden · Pull Request #111 · mweiden/cass

mweiden · 2026-04-27T05:06:41Z

Summary

Throughput optimization run targeting the average operation rate across perf_compare.sh (1/2/4/8/16/32/64-thread READ+WRITE phases). Final result: 8270 → 8886 ops/s (+7.4%), p95 latency 2.55 → 2.34 ms as a side effect.

Conducted via /ratchet:ratchet (10 iterations, 2 trials each, kept on improvement). Four changes were kept; five hypotheses were tested and reverted. Commits below are in keep order; each one is independently reviewable and was validated in isolation against the prior best.

Changes

1. `2c6b0d6` — `scripts/_ratchet_score.sh`: emit Latency + Operations scores

Tooling change. The prior version of the harness emitted only an averaged p95 latency. This version also emits average ops/s across all (R+W) × (1/2/4/8/16/32/64) phases, so future ratchet runs can configure either metric without re-editing the script. Throughput parsing is anchored on ^Op rate to avoid matching Partition rate/Row rate.

Why it matters: lets the same harness drive throughput-maximizing or latency-minimizing runs by changing only the ratchet config.

2. `04f7f09` — `Dockerfile`: bump Rust 1.85 → 1.89

Build fix. PR #110's wal.rs Drop impl introduced let-chain syntax (if let Some(task) = … && let Err(err) = task.join()), which stabilized in Rust 1.88. The Dockerfile was still pinned at 1.85, so docker compose build failed with E0658 before the perf run could start.

Why it matters: unblocks the existing wal.rs code in container builds. Local development was unaffected because the dev toolchain is 1.89.

3. `8debed1` — `Cluster`: `health` map → `std::sync::RwLock`

The cluster's per-replica liveness map was a tokio::sync::RwLock<HashMap<String, Instant>>. Every coordinator request calls is_alive(replica) once per replica (RF=3 ⇒ 3 reads per request), and the gossip task writes to it once per second. The critical sections are tiny — a single HashMap::get + Instant::elapsed for reads, a single insert for writes — and never hold the lock across an .await.

Why it matters: swapping to std::sync::RwLock removes the async-lock state machine and the implicit yield point per acquisition. Same pattern as the recent MemTable and WAL conversions.

Measured: +3.2% (8270 → 8534 ops/s).

4. `5209e9b` — `Cluster`: `hints` map → `std::sync::RwLock`

Same shape as health. apply_hints(replica) runs once per replica per coordinator request and hits a read fast-path (!hints.contains_key(node) → return). In steady state with no failures, every request takes this read lock and immediately drops it. Writes only occur when a replica fails or a hint is delivered — rare.

Why it matters: removes the async-lock overhead from another per-replica per-request hot path. Critical sections do not hold across .await.

Measured: +1.6% on top of the prior change (8534 → 8672 ops/s).

5. `ae48687` — `Cargo.toml`: enable `lto = "thin"` for release builds

Cargo's release profile defaults to lto = false. Thin LTO enables cross-crate inlining and dead-code elimination, which is particularly impactful here because the hot path (Cluster::execute → SqlEngine::parse_query → sqlparser) crosses multiple crate boundaries on every request.

Build cost is meaningful in absolute terms (one-time link is slower) but is fully cushioned by Docker's target/ cache mount in the multi-stage Dockerfile, so iteration time was not impacted.

Why it matters: the simplest single change in the run with the broadest reach — every code path benefits, including the SQL parser and the gRPC stack.

Measured: +1.9% (8672 → 8838 ops/s).

6. `3594598` — `Cluster`: skip duplicate SQL parse on the local replica

Cluster::execute parses the inbound SQL once at line 509 to determine routing (partition keys → replicas). Then execute_write_with_consistency re-parsed the same SQL by calling SqlEngine::execute_with_ts(sql, ...) for the local-replica branch. The engine already exposed execute_with_parsed(&ParsedQuery, ...) — it was just not wired through.

This commit threads the existing ParsedQuery from the routing step into execute_write_with_consistency and uses execute_with_parsed for the local branch. Remote replicas still receive the SQL string and parse it on receipt (no protocol change).

Why it matters: eliminates one full sqlparser invocation per write request on the coordinator. The same trick was attempted for the read path (run_read_with_quorum → execute_on_node) but regressed marginally because the per-peer Arc<ParsedQuery> clone overhead outweighed the savings — kept the change scoped to writes only.

Measured: +0.5% (8838 → 8886 ops/s); both trials within 0.3% of each other (8899/8872) — clean signal.

Things tried and reverted

panic_until → std::sync::RwLock — only 1 read per local request; effect lost in run-to-run noise.
SCHEMA_CACHE → std::sync::RwLock — read per request via lookup_schema; trial avg regressed but unclear whether real or noise.
Server::tcp_nodelay(true) — tonic's default may already cover this; no measurable effect.
Read-path Arc<ParsedQuery> threading — tight cluster regression (-1.2%); the per-peer Arc-clone cost negated the savings.

Methodology note

Trial-to-trial variance in this benchmark is roughly 5–10% with occasional outliers up to ~18%. With --trials 2, single environmental outliers can defeat sound changes (iter 5 of the run was a clear example: trial 1 = 7081, trial 2 = 8670; the change was good, retried with the same code in iter 8 and kept it). Future ratchet runs against this harness would benefit from --trials 3.

Test plan

cargo test (lib + integration; flaky cluster_remote_ops_test integration tests pre-exist on main due to port-startup race — not introduced here)
docker compose build succeeds
scripts/perf_compare.sh --cass-only runs end-to-end and produces perf-results/cass_t*.log
Spot-check the Op rate and Latency 95th percentile lines vs. baseline numbers on main

🤖 Generated with Claude Code

… skip duplicate parse on local replica (retry)

mweiden added 6 commits April 26, 2026 15:21

Update _ratchet_score.sh to emit Latency Score and Operations Score

2c6b0d6

Dockerfile: bump Rust 1.85 -> 1.89 to support let-chains in wal.rs

04f7f09

Cluster: replace tokio::RwLock<HashMap> health with std::sync::RwLock

8debed1

Cluster: replace tokio::RwLock<HashMap> hints with std::sync::RwLock

5209e9b

Cargo.toml: enable thin LTO in release profile for cross-crate inlining

ae48687

Cluster: thread ParsedQuery through execute_write_with_consistency to…

3594598

… skip duplicate parse on local replica (retry)

mweiden merged commit 67f9460 into main Apr 27, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Server perf: +7.4% throughput via lock conversions, LTO, parse reuse#111

Server perf: +7.4% throughput via lock conversions, LTO, parse reuse#111
mweiden merged 6 commits intomainfrom
matt-weiden/server-throughput-perf

mweiden commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mweiden commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

1. 2c6b0d6 — scripts/_ratchet_score.sh: emit Latency + Operations scores

2. 04f7f09 — Dockerfile: bump Rust 1.85 → 1.89

3. 8debed1 — Cluster: health map → std::sync::RwLock

4. 5209e9b — Cluster: hints map → std::sync::RwLock

5. ae48687 — Cargo.toml: enable lto = "thin" for release builds

6. 3594598 — Cluster: skip duplicate SQL parse on the local replica

Things tried and reverted

Methodology note

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mweiden commented Apr 27, 2026 •

edited

Loading

1. `2c6b0d6` — `scripts/_ratchet_score.sh`: emit Latency + Operations scores

2. `04f7f09` — `Dockerfile`: bump Rust 1.85 → 1.89

3. `8debed1` — `Cluster`: `health` map → `std::sync::RwLock`

4. `5209e9b` — `Cluster`: `hints` map → `std::sync::RwLock`

5. `ae48687` — `Cargo.toml`: enable `lto = "thin"` for release builds

6. `3594598` — `Cluster`: skip duplicate SQL parse on the local replica