ScratchBird-Benchmarks is a Docker-first benchmark harness for establishing
repeatable upstream baselines before ScratchBird itself is added as a target.
The project has two jobs:
- Measure current upstream engine behavior under the same harness.
- Prepare the exact comparison model ScratchBird will use later:
  - ScratchBird native vs upstream engines
  - ScratchBird emulation mode vs the original engine it emulates
This repository is not just a raw speed leaderboard. Its most important goal is to answer questions like:
- Which access path did the engine choose?
- Did it stay on the expected index family or fall back to a scan?
- Is the plan comparable to the peer engine's plan?
- Is performance better, equivalent, or worse once the plan is normalized?
The current benchmarkable upstream targets are:
- FirebirdSQL
- MySQL
- PostgreSQL
The following ScratchBird targets are already declared in the target registry for the normalized index-comparison lane, but they remain disabled until a benchmarkable ScratchBird service exists:
- scratchbird-native
- scratchbird-firebird
- scratchbird-mysql
- scratchbird-postgresql
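As a rough illustration of that registry intent, the sketch below models enabled upstream targets next to disabled ScratchBird targets. The module layout, field names, and Target class are hypothetical, not the repository's actual registry format.

```python
# Hypothetical sketch of a target registry: the structure and field names
# are illustrative only, not this repository's actual registry format.
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    name: str            # registry key, e.g. "scratchbird-native"
    engine_family: str   # engine whose behavior this target represents
    enabled: bool        # disabled targets are declared but never scheduled

TARGETS = [
    Target("firebird", "firebird", enabled=True),
    Target("mysql", "mysql", enabled=True),
    Target("postgresql", "postgresql", enabled=True),
    # Declared for the index-comparison lane, but disabled until a
    # benchmarkable ScratchBird service exists.
    Target("scratchbird-native", "scratchbird", enabled=False),
    Target("scratchbird-firebird", "firebird", enabled=False),
    Target("scratchbird-mysql", "mysql", enabled=False),
    Target("scratchbird-postgresql", "postgresql", enabled=False),
]

def benchmarkable_targets():
    """Only enabled targets are eligible for a matrix run."""
    return [t for t in TARGETS if t.enabled]
```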
The harness is organized around two comparison classes.
The first class compares upstream engines against each other using the same benchmark harness and output format.
Use this to answer:
- how each engine behaves on the same stress or ACID lane
- which upstream engine is the best native baseline for a given feature area
- how stable the benchmark harness itself is
The second class compares feature-equivalent index behavior instead of comparing engines as black boxes.
Current phase-1 scope is conservative:
- B-tree point lookup
- B-tree range scan
- B-tree composite predicate with ordered output
This is the lane that will later support:
- upstream engine vs ScratchBird emulation mode
- ScratchBird native vs upstream engines
- pairwise verdicts such as better, equivalent, and fallback
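To make the phase-1 scope above concrete, here is a hypothetical sketch of the three B-tree scenario shapes as parameterized SQL. The table name, columns, scenario keys, and plan-family labels are invented for illustration and are not the harness's actual scenario definitions.

```python
# Hypothetical phase-1 B-tree scenario shapes. Table/column names and
# plan-family labels are invented; the real scenario pack may differ.
PHASE1_SCENARIOS = {
    "btree_point_lookup": {
        "sql": "SELECT * FROM orders WHERE order_id = %(id)s",
        "expected_plan_family": "btree_point",
    },
    "btree_range_scan": {
        "sql": "SELECT * FROM orders "
               "WHERE created_at BETWEEN %(lo)s AND %(hi)s",
        "expected_plan_family": "btree_range",
    },
    "btree_composite_ordered": {
        "sql": "SELECT * FROM orders "
               "WHERE customer_id = %(cid)s AND status = %(st)s "
               "ORDER BY created_at",
        "expected_plan_family": "btree_composite_ordered",
    },
}
```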
The project is designed so users do not need full upstream source trees in order to run the main benchmark lanes.
Required for normal use:
- Docker Engine or Docker Desktop
- Python 3
- benchmark dependencies from requirements.txt
Optional:
- local upstream source clones for regression-only lanes
Source clones are only needed for upstream regression suites. They are not required for:
- stress
- acid
- engine-differential
- index-comparison
- index-comparison - Normalized plan and performance comparison by index family. This is the most important future ScratchBird comparison lane.
- stress - Synthetic OLTP and mixed-workload pressure with joins, aggregations, bulk operations, and large result sets.
- acid - Atomicity, consistency, durability, and baseline isolation checks.
- engine-differential - Engine-biased scenario pack that highlights where each engine family tends to excel or diverge.
- regression - Optional upstream regression integration when local clones are available.
The performance, tpc-c, and tpc-h lanes also exist in the repository and are wired into the matrix tooling, but they should currently be treated as scaffolds or placeholders rather than final decision-grade benchmark programs.
The index-comparison suite measures:
- execution status
- normalized plan family
- plan capture success
- expectation status
- average latency
- p95 latency
- p99 latency
- throughput in queries per second
- per-scenario quality score
Why:
- This suite is about plan correctness first and speed second.
- It tells you whether the engine chose the expected access path.
- It provides the pairwise comparison model ScratchBird will use later.
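A minimal sketch of how the latency and throughput figures can be derived from raw per-query durations, assuming a plain list of wall-clock seconds; the harness's own aggregation code may differ:

```python
# Minimal sketch: derive average/p95/p99 latency and throughput from raw
# per-query latencies. The harness's real aggregation may differ.
def percentile(sorted_samples, pct):
    """Nearest-rank percentile over an already sorted sample list."""
    idx = max(0, int(round(pct / 100.0 * len(sorted_samples))) - 1)
    return sorted_samples[idx]

def summarize_latencies(latencies_s):
    samples = sorted(latencies_s)
    total = sum(samples)
    return {
        "avg_latency_s": total / len(samples),
        "p95_latency_s": percentile(samples, 95),
        "p99_latency_s": percentile(samples, 99),
        # Queries per second over the summed execution time.
        "throughput_qps": len(samples) / total if total > 0 else 0.0,
    }

print(summarize_latencies([0.012, 0.011, 0.013, 0.050, 0.012]))
```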
The stress suite measures:
- data-load row counts
- data-load duration
- data-load rows per second
- per-query duration
- rows returned
- rows affected
- pass/fail/error status
Why:
- This suite exposes workload stability, not just microbenchmark speed.
- It shows whether engines remain functional under large joins, aggregations, and bulk operations.
- It gives a practical mixed-workload baseline for later ScratchBird work.
The acid suite measures:
- test pass/fail/error/skip status
- expected vs actual verification result
- duration per test
- category rollups for atomicity, consistency, isolation, and durability
Why:
- This suite is a correctness gate.
- It ensures future performance work is not built on broken transactional behavior.
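A minimal sketch of a category rollup over per-test results, assuming each test record carries a category and a status field; the suite's real artifact schema may differ:

```python
# Minimal sketch of ACID category rollups. The field names ("category",
# "status") are assumptions, not the suite's actual artifact schema.
from collections import Counter, defaultdict

def rollup_by_category(test_results):
    """Count pass/fail/error/skip per ACID category."""
    rollup = defaultdict(Counter)
    for test in test_results:
        rollup[test["category"]][test["status"]] += 1
    return {cat: dict(counts) for cat, counts in rollup.items()}

results = [
    {"name": "atomic_rollback", "category": "atomicity", "status": "pass"},
    {"name": "fk_enforcement", "category": "consistency", "status": "pass"},
    {"name": "dirty_read_block", "category": "isolation", "status": "fail"},
    {"name": "crash_recovery", "category": "durability", "status": "skip"},
]
print(rollup_by_category(results))
```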
The engine-differential suite measures:
- scenario runtime
- execution success vs engine-specific error
- scenario-level behavior on engine-biased SQL patterns
Why:
- This suite highlights planner and engine-shape differences.
- It is useful for understanding why one engine behaves differently from another.
- It is informative, but it should not be treated as a strict correctness gate in the same way as acid or index-comparison.
The regression suite measures:
- upstream regression totals and result summaries
Why:
- This lane helps compare ScratchBird compatibility work against the original engine's own regression expectations.
- It requires local upstream source/test trees and is therefore optional.
Pairwise comparison is directional. A candidate target is compared against a baseline target for the same normalized scenario.
- better - The candidate stayed on an equal-or-better normalized plan and improved performance outside the configured noise band.
- equivalent - The candidate matched the expected plan quality and stayed within the noise band.
- worse - The candidate ran but lost plan quality or performance versus baseline.
- fallback - The candidate fell back to a worse access strategy such as a scan when the expected indexed path should have been used.
- unsupported - The scenario or plan capture is not supported for that target.
- invalid - The result is unusable because the scenario did not produce a valid comparison artifact.
Execution status and comparative verdict are separate concepts. A run can
execute successfully and still receive a worse or fallback verdict.
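The sketch below shows one way such a directional verdict could be derived from a normalized plan family plus latency against a noise band. The plan ranking, threshold, and function shape are assumptions for illustration, not the harness's actual comparison code.

```python
# Illustrative pairwise verdict logic. The noise band, plan ranking, and
# function shape are assumptions, not the repository's real implementation.
PLAN_RANK = {"btree_point": 3, "btree_range": 2, "scan": 0}  # higher is better

def pairwise_verdict(candidate, baseline, expected_family, noise_band=0.10):
    """Compare a candidate run against a baseline run for one scenario."""
    if candidate.get("plan_family") is None:
        return "unsupported"
    if candidate.get("latency_s") is None or baseline.get("latency_s") is None:
        return "invalid"

    cand_rank = PLAN_RANK.get(candidate["plan_family"], 0)
    expected_rank = PLAN_RANK.get(expected_family, 0)
    if cand_rank < expected_rank:
        return "fallback"   # e.g. a scan where an indexed path was expected

    ratio = candidate["latency_s"] / baseline["latency_s"]
    if ratio < 1.0 - noise_band:
        return "better"
    if ratio > 1.0 + noise_band:
        return "worse"
    return "equivalent"

print(pairwise_verdict(
    {"plan_family": "btree_point", "latency_s": 0.009},
    {"plan_family": "btree_point", "latency_s": 0.010},
    expected_family="btree_point",
))
```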
For a matrix run under results/matrix-<run-id>/, the primary artifacts are:
- matrix-summary.json - overall run integrity and per-suite execution status
- .matrix-runs.tsv - one row per engine/suite invocation
- matrix-comparison-unified.csv - consolidated comparison table across engines and suites
- <engine>/<suite>/*.json - raw suite artifacts
- comparison-<suite>/benchmark_comparison_*.txt - human-readable suite comparison output
- comparison-index-comparison/index-comparison-pairwise-*.json - pairwise normalized verdict output for index-comparison
The unified CSV is the main decision artifact because it lets you compare:
- run health
- correctness counts
- suite durations
- suite-specific summary metrics
- raw artifact provenance
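A minimal sketch of loading that unified CSV for ad-hoc filtering, assuming only that it is a plain CSV with engine and suite columns; the real column names may differ:

```python
# Minimal sketch: load the unified comparison CSV for ad-hoc inspection.
# The column names used below ("engine", "suite") are assumptions.
import csv
from pathlib import Path

def load_unified(run_dir):
    path = Path(run_dir) / "matrix-comparison-unified.csv"
    with path.open(newline="") as fh:
        return list(csv.DictReader(fh))

rows = load_unified("results/matrix-<run-id>")  # substitute a real run id
index_rows = [r for r in rows if r.get("suite") == "index-comparison"]
for row in index_rows:
    print(row.get("engine"), row)
```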
Set up the local Python environment:
cd /home/dcalford/CliWork/ScratchBird-Benchmarks
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the Docker-first authoritative baseline:
./scripts/run-benchmark-matrix.sh \
--engines=firebird,mysql,postgresql \
--suites=stress,acid,engine-differential,index-comparison \
--report --compare

Run a single engine and suite:
./scripts/start-engine.sh postgresql start
./scripts/run-benchmark.sh postgresql index-comparison --report
./scripts/start-engine.sh postgresql stop

Run regression only if local source trees are available:
./scripts/run-benchmark-matrix.sh \
--engines=firebird,mysql,postgresql \
--suites=regression \
--report --compare

If you want decision-grade results right now, prioritize:
- acid
- stress
- index-comparison
- engine-differential
Use performance, tpc-c, and tpc-h only as work-in-progress lanes until
their scenario packs and reporting contracts are expanded.
This repository gives ScratchBird a stable baseline before ScratchBird enters the matrix.
That matters because later comparisons should answer:
- Is ScratchBird correct against the original engine?
- Does ScratchBird choose the same or a better normalized plan?
- Does ScratchBird native behavior compete with the best relevant upstream engine?
Without this baseline, later ScratchBird results would be hard to interpret.