How to measure and reason about this project's performance honestly. The committed benchmark
numbers (`results/`) are a reproducible baseline, not a production-latency claim.

- Benchmark only the `bench` preset, which inherits the Release configuration (`-O2`/`-O3`,
  `NDEBUG`) and disables tests. Debug numbers are meaningless for latency.
- Sanitizer builds (ASan/UBSan, see `make asan`) are for correctness, not timing: they add large,
  uneven overhead.
- CPU frequency scaling / turbo: clocks vary with thermal and power state. For stable numbers, pin
  the governor to `performance` (`cpupower frequency-set -g performance`) and be aware turbo can
  still move results run to run.
- Core sharing / scheduling: a shared machine adds jitter. Pinning to an isolated core
  (`taskset -c <cpu>`, `isolcpus=`) reduces variance; the committed numbers do not do this.
- Scheduler migration: if a benchmark thread moves across cores, cache warmth and run-to-run
  variance can change independently of application logic. M43 records migration evidence where
  Linux exposes it and labels hosts that cannot provide it.
- Cache and allocator effects: the order book uses `std::map`/`std::list` and heap allocation;
  cache locality and allocator behavior dominate small-op latency. Custom allocators / flat
  structures (backlog in `MILESTONES.md`) would change the picture.
- Wall-clock vs logical time: timing uses `std::chrono::steady_clock` at the benchmark layer only.
  The engine itself is logical-time and deterministic, so results are not affected by clock
  resolution beyond the measurement boundary.
- Report `p50`/`p95`/`p99`, not just a mean, when latency distribution matters. The current harness
  reports mean ns/op over many iterations as a first-order baseline; percentile reporting is a
  documented follow-up.
- Always record hardware, OS, compiler, build type, and artifact provenance alongside the numbers.
  For migrated artifacts, `Source digest` is the stable identity and `Git commit (informational)`
  is not a stale-artifact signal by itself.
- `make perf-stat` runs `scripts/perf_stat.sh` on Linux and records cycles, instructions,
  branch/cache events, context switches, and page faults when the host exposes those counters.
- `make perf-record` runs `scripts/perf_record.sh` on Linux and records a `perf report --stdio`
  software sampling report by default; it is a hot-symbol profile only when the recorded sample
  count clears the threshold reported in the artifact.
- See `docs/perf_analysis.md` for the M29 profiling workflow, artifacts, and caveats.

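
The mean-versus-percentile point in the list above can be made concrete with a small sketch. It is not the project's harness: `LatencySummary`, `nearest_rank`, `summarize`, and `time_each` are hypothetical names, and the nearest-rank percentile definition is an assumption (other definitions interpolate between samples). It times each call with `std::chrono::steady_clock` and reports the mean next to `p50`/`p95`/`p99`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical summary type; not part of the project's harness.
struct LatencySummary {
    double mean_ns;
    std::int64_t p50_ns;
    std::int64_t p95_ns;
    std::int64_t p99_ns;
};

// Nearest-rank percentile: the smallest sample such that at least pct percent
// of all samples are <= it. Requires a non-empty, ascending-sorted vector.
inline std::int64_t nearest_rank(const std::vector<std::int64_t>& sorted, unsigned pct) {
    assert(!sorted.empty() && pct >= 1 && pct <= 100);
    const std::size_t n = sorted.size();
    const std::size_t rank = (static_cast<std::size_t>(pct) * n + 99) / 100;  // ceil(pct * n / 100)
    return sorted[rank - 1];
}

// Takes the samples by value because it sorts its own copy.
inline LatencySummary summarize(std::vector<std::int64_t> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const double sum = std::accumulate(samples.begin(), samples.end(), 0.0);
    return {sum / static_cast<double>(samples.size()),
            nearest_rank(samples, 50), nearest_rank(samples, 95), nearest_rank(samples, 99)};
}

// Times every call of op separately and returns one nanosecond sample per call.
template <typename Op>
std::vector<std::int64_t> time_each(Op&& op, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> samples;
    samples.reserve(iterations);  // allocate before the timed region
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        op();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    return samples;
}
```

Timing each call individually adds two clock reads per sample, which can be comparable to the cost of a very small operation. A mean over one timed loop avoids that overhead but hides the tail, which is the trade-off the list describes.
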
The current perf artifacts are partial hardware PMU evidence from a bare-metal Apple MacBook Air
(M2, aarch64) running Fedora Asahi Remix, not the earlier Docker Desktop runs. `perf stat` reads
real `cycles`/`instructions`/`branches`/`branch-misses` off the Apple Avalanche/Blizzard PMUs; only
`cache-references`/`cache-misses` come back `<not supported>` because the Apple Silicon PMU driver
does not expose them. Issue #90's remaining ask is therefore the cache-counter set specifically,
which needs a PMU microarchitecture that exposes those events (x86_64 Intel/AMD, or an ARM server
core); being bare metal is necessary but not sufficient.

M43 owns CPU affinity and NUMA/locality evidence. Run:

    make numa-study

This builds the benchmark preset and runs `scripts/numa_affinity_study.sh`. The script records an
unpinned benchmark run and a `taskset`-pinned run, then attempts `perf stat` software counters for
`context-switches` and `cpu-migrations`. It also records `lscpu` output and `numactl --hardware`
when available.

The artifact self-classifies its evidence:

- `full-linux-numa`: NUMA-capable Linux host with `taskset`, `numactl` topology, successful
  node-local and remote-memory binding attempts, and captured unpinned and pinned scheduler
  counters.
- `linux-constrained`: Linux host where at least one required topology or scheduler signal is
  unavailable. Commit only when intentionally documenting the constraint.
- `unsupported-host`: non-Linux host; no CPU-affinity, scheduler-migration, or NUMA evidence.

Use `QSL_NUMA_ALLOW_CONSTRAINED=1` only when the committed result is intentionally constrained.
Use `QSL_NUMA_CPU=<cpu>` to pin a specific CPU; otherwise the script picks the first CPU allowed by
the current cpuset.
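
For readers who want the same pinning behavior inside a C++ benchmark rather than through `taskset`, the following is a minimal sketch, assuming Linux with glibc-style `sched_getaffinity`/`sched_setaffinity`. It is not what the script does internally: `first_allowed_cpu` and `pin_current_thread` are hypothetical helper names, and the sketch treats the thread's current affinity mask as a stand-in for "allowed by the current cpuset".

```cpp
#include <cassert>
#if defined(__linux__)
#include <sched.h>  // cpu_set_t, CPU_* macros, sched_getaffinity, sched_setaffinity
#endif

// Returns the lowest-numbered CPU in the calling thread's affinity mask,
// or -1 if the mask cannot be read or the host is not Linux.
inline int first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
#endif
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success and
// false on failure or on a non-Linux host.
inline bool pin_current_thread(int cpu) {
#if defined(__linux__)
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0;
#else
    (void)cpu;
    return false;
#endif
}
```

A benchmark would call `pin_current_thread(first_allowed_cpu())` before its timed region and record whether pinning succeeded, so an unpinned run is never mistaken for a pinned one.
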

Unsupported or constrained hosts are valid outcomes. macOS, Docker Desktop, restricted CI,
single-NUMA-node Linux machines, and hosts that can pin a CPU but cannot bind local/remote NUMA
memory should be labeled as unsupported or constrained rather than used to imply full NUMA or
production-latency evidence. The committed `numa_affinity_study.txt` is now from the bare-metal
Apple M2 host, which is a single-NUMA-node machine. It is therefore `linux-constrained` for NUMA
purposes (real CPU pinning, but no cross-node local/remote binding to measure), not because of
virtualization.
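
The self-classification rules described in this section can be summarized as a small decision function. This is only a sketch of the documented labels, not the logic in `scripts/numa_affinity_study.sh`: the `HostSignals` fields and the function names are hypothetical, and the real script may check its signals differently.

```cpp
// Hypothetical signal set; the field names are illustrative, not the script's variables.
struct HostSignals {
    bool is_linux;
    bool has_taskset;
    bool has_numactl_topology;
    bool local_bind_ok;      // node-local memory binding attempt succeeded
    bool remote_bind_ok;     // remote-memory binding attempt succeeded
    bool unpinned_counters;  // scheduler counters captured for the unpinned run
    bool pinned_counters;    // scheduler counters captured for the pinned run
};

enum class EvidenceClass { FullLinuxNuma, LinuxConstrained, UnsupportedHost };

// Non-Linux hosts are unsupported; Linux hosts are "full" only when every
// required signal is present, and constrained otherwise.
constexpr EvidenceClass classify(const HostSignals& s) {
    return !s.is_linux
               ? EvidenceClass::UnsupportedHost
               : (s.has_taskset && s.has_numactl_topology && s.local_bind_ok &&
                  s.remote_bind_ok && s.unpinned_counters && s.pinned_counters)
                     ? EvidenceClass::FullLinuxNuma
                     : EvidenceClass::LinuxConstrained;
}

// Maps a class to the label text used in the artifact.
constexpr const char* label(EvidenceClass c) {
    return c == EvidenceClass::FullLinuxNuma      ? "full-linux-numa"
           : c == EvidenceClass::LinuxConstrained ? "linux-constrained"
                                                  : "unsupported-host";
}
```

Under these assumptions a single-NUMA-node Linux host with working `taskset` and counters still classifies as `linux-constrained`, because no remote-memory binding can succeed there.
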

M44 owns the SPSC cursor false-sharing study. Run:

    make false-sharing-study

This builds the benchmark preset and runs `scripts/run_false_sharing_study.sh`, which records a
benchmark-only packed-vs-padded SPSC queue-cursor comparison in
`results/false_sharing_study.txt`. The study uses the same producer-owned `tail` /
consumer-owned `head` release/acquire observation pattern as the production `SpscRing`, but it does
not change the production ring layout. Treat the artifact as host-local cache-line contention
evidence; scheduler placement, CPU topology, and OS behavior can move the result.
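
To make "packed vs padded" concrete, here is a minimal layout sketch. It is not the study's code or the `SpscRing` layout: the struct and function names are hypothetical, and the 64-byte line size is an assumption that should be checked on the target host, since some cores use larger lines.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; verify on the target host before relying on it.
constexpr std::size_t kAssumedCacheLine = 64;

// Packed: the two cursors are adjacent, so they usually share one cache line
// and producer writes invalidate the line the consumer is reading from.
struct PackedCursors {
    std::atomic<std::uint64_t> head{0};  // consumer-owned
    std::atomic<std::uint64_t> tail{0};  // producer-owned
};

// Padded: each cursor starts on its own assumed cache line.
struct PaddedCursors {
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> head{0};
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> tail{0};
};

static_assert(sizeof(PackedCursors) <= kAssumedCacheLine,
              "packed cursors fit within a single assumed line");
static_assert(offsetof(PaddedCursors, tail) - offsetof(PaddedCursors, head) == kAssumedCacheLine,
              "padded cursors are one assumed line apart");

// Producer side: publish a new tail with release ordering.
template <typename Cursors>
void publish_tail(Cursors& c, std::uint64_t next) {
    c.tail.store(next, std::memory_order_release);
}

// Consumer side: observe the producer's tail with acquire ordering.
template <typename Cursors>
std::uint64_t observe_tail(const Cursors& c) {
    return c.tail.load(std::memory_order_acquire);
}
```

A comparison would run the same producer and consumer loops against both structs. Any measured difference is then attributable to layout on that host, subject to the scheduler and topology caveats above.
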

M48 owns late-stage DPDK research. Run:

    make dpdk-check

This writes `results/dpdk_environment.txt`. It is a non-mutating support check: it does not reserve
hugepages, load kernel modules, bind NICs, or send packets. Treat it as research/environment
evidence only unless a later prototype artifact records DPDK version, EAL arguments, hugepage
state, device binding, packet workload, and source provenance.

M49 owns NIC offload, RSS, and hardware timestamping research. Run:

    make nic-offload-check

This writes `results/nic_offload_environment.txt`. It is a non-mutating capability check: it does
not change offload flags, RSS tables, queue counts, timestamp filters, driver bindings, IRQ
affinity, or CPU affinity, and it does not send packets. Treat it as environment classification or
read-only device capability observation only unless a future artifact records a real NIC workload,
timestamp source, queue/IRQ placement, packet shape, drops/backpressure, and source provenance.

These are in-process microbenchmarks on a commodity machine with the standard library and a
general-purpose allocator. They are useful for regression detection and honest,
order-of-magnitude framing, not evidence of production trading-system latency.