diff --git a/AGENTS.md b/AGENTS.md
index 0a7abc4..2111146 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -385,6 +385,7 @@ Keep this synchronized with the Makefile.
 - `make bench-recovery` — run M46 recovery benchmarking (full-replay restart vs book rebuild)
 - `make perf-stat` — run Linux `perf stat` workflow where supported
 - `make perf-record` — run Linux `perf record/report` workflow where supported
+- `make flamegraph` — render a Linux `perf` call-graph flamegraph (SVG) where supported
 - `make numa-study` — run Linux CPU-affinity / scheduler-migration / NUMA-locality study where supported
 - `make false-sharing-study` — run benchmark-only packed-vs-padded SPSC cursor contention study
 - `make profile-io` — run Linux syscall/socket-path profiling where supported
diff --git a/CLAUDE.md b/CLAUDE.md
index 5fbdf6c..46e95c2 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -385,6 +385,7 @@ Keep this synchronized with the Makefile.
 - `make bench-recovery` — run M46 recovery benchmarking (full-replay restart vs book rebuild)
 - `make perf-stat` — run Linux `perf stat` workflow where supported
 - `make perf-record` — run Linux `perf record/report` workflow where supported
+- `make flamegraph` — render a Linux `perf` call-graph flamegraph (SVG) where supported
 - `make numa-study` — run Linux CPU-affinity / scheduler-migration / NUMA-locality study where supported
 - `make false-sharing-study` — run benchmark-only packed-vs-padded SPSC cursor contention study
 - `make profile-io` — run Linux syscall/socket-path profiling where supported
diff --git a/MILESTONES.md b/MILESTONES.md
index 01d240c..c32aec8 100644
--- a/MILESTONES.md
+++ b/MILESTONES.md
@@ -484,7 +484,9 @@ Do not pull backlog items into earlier PRs.
 - FIX-like text protocol adapter. (#29)
 - Web dashboard for visualization. (#30)
 - Docker packaging. (#31)
-- Perf/flamegraph docs. (#32)
+- Perf/flamegraph docs. (#32) — **done**: `make flamegraph` renders a perf call-graph flamegraph
+  via the dependency-free `scripts/flamegraph.py` (`results/flamegraph.svg` + `.txt`), unit-tested in
+  `tests/shell/test_flamegraph.sh`. Full hardware cache-PMU evidence stays in #90.
 - GitHub Pages documentation site. (#33)
 
 ### Differential-testing follow-ups (prioritized)
diff --git a/Makefile b/Makefile
index 426e0bb..8c2e932 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,4 @@
-.PHONY: configure build test check fmt fmt-check tidy bench bench-diff bench-allocator bench-storage bench-recovery perf-stat perf-record numa-study false-sharing-study profile-io socket-stress socket-load dpdk-check nic-offload-check crash-recovery concurrency-stress asan tsan demo check-fixtures check-manifest determinism divergence-demo clean
+.PHONY: configure build test check fmt fmt-check tidy bench bench-diff bench-allocator bench-storage bench-recovery perf-stat perf-record flamegraph numa-study false-sharing-study profile-io socket-stress socket-load dpdk-check nic-offload-check crash-recovery concurrency-stress asan tsan demo check-fixtures check-manifest determinism divergence-demo clean
 
 BUILD_DIR := build/dev
 
@@ -63,6 +63,13 @@ perf-record:
 	cmake --build --preset bench --target qsl-bench
 	QSL_BENCH_BIN=build/bench/qsl-bench bash scripts/perf_record.sh
 
+# Issue #32: render a perf call-graph flamegraph (SVG) from the benchmark harness. Linux-only.
+flamegraph:
+	@test "$$(uname -s)" = "Linux" || { echo "error: make flamegraph requires Linux perf; current OS is $$(uname -s)." >&2; exit 2; }
+	cmake --preset bench
+	cmake --build --preset bench --target qsl-bench
+	QSL_BENCH_BIN=build/bench/qsl-bench bash scripts/flamegraph.sh
+
 # M43: CPU-affinity / scheduler-migration / NUMA locality study. Linux-only.
 numa-study:
 	@if test "$$(uname -s)" != "Linux"; then \
diff --git a/PROGRESS.md b/PROGRESS.md
index 5888c0e..2919d13 100644
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -370,6 +370,22 @@ Lower priority:
   (E-core) PMU carries live counts — the `apple_blizzard_pmu/...` rows read `` in
   `results/perf_stat_linux.txt` because the single-threaded benchmark stays on the Avalanche
   P-cores. Docs/memory only; no code or artifacts changed.
+- [2026-06-21] Issue #32 flamegraph profiling artifact (`perf/flamegraph-artifact`, stacked on the
+  Codex-followup branch). Added `make flamegraph` → `scripts/flamegraph.sh`, which records
+  `perf record --call-graph dwarf -F 4000 -g -e cpu-clock` on `qsl-bench` and renders
+  `results/flamegraph.svg` (+ `results/flamegraph.txt` provenance/classification companion). The
+  fold + SVG render live in `scripts/flamegraph.py`, a dependency-free stdlib-only stackcollapse +
+  flamegraph renderer (no vendored Perl FlameGraph toolkit), deterministic by design (frames sorted
+  by name; colors a pure function of the name; no RNG/timestamps in the drawn body). DWARF call
+  graphs are used because the Release `bench` preset omits frame pointers; application symbols
+  (`OrderBook::add_limit`, `MatchingEngine::new_limit`, the replay path, …) still resolve from the
+  symtab. Added `tests/shell/test_flamegraph.sh` (CTest-registered, python3-only, skips cleanly if
+  absent) covering folding (offset/dso stripping, perf-order reversal, comm-at-base, count
+  aggregation, sortedness), SVG well-formedness, XML escaping, determinism, and empty-input
+  handling; `make check` 242/242. The committed `results/flamegraph.svg`/`.txt` were generated on
+  the bare-metal Fedora Asahi host (aarch64) from the clean committed tree (`Dirty inputs: no`).
+  This is a software cpu-clock sampling hot-symbol profile, not a latency/throughput claim; full
+  hardware cache-PMU evidence stays in #90. Do not merge from automation; human squash-merges.
 - [2026-06-03] M35: implemented a multi-client TCP connection-scaling load test (`scripts/socket_load.sh`, `make socket-load`, Linux-only) driving N concurrent `qsl-client`s against the portable TCP and epoll (M34) gateways; `results/socket_load_summary.txt` is Docker-generated and constrained. A `/code-review` (3 finder agents) caught and fixed real measurement-integrity bugs before the PR: a failed trial's `wall=0` no longer poisons the reported best (only trials whose gateway served count toward the min); the `completed` column reports the WORST per-trial completion, not the last, so partial/total trial failures are surfaced rather than masked; a per-client `timeout` bounds a hang if the gateway dies; and `QSL_LOAD_TRIALS` is validated. Post-PR hardening uses fresh monotonic ports per gateway start, retries transient startup/serve failures on new ports, and refuses to write a partial artifact unless `QSL_LOAD_ALLOW_PARTIAL=1` is set intentionally; the refreshed artifact records `Dirty tree: no`. The scaling-shape claim remains constrained to loopback connection setup, not a demonstrated production-capacity advantage for either transport. Deferred follow-up: a shared `scripts/lib` to remove the dirty-tree / `wait_ready` / gateway-stop duplication across the three socket scripts.
 - [2026-06-03] M35: started after M34 (#98) squash-merged (commit 9e3750b). Scope: multi-client load / socket-pressure testing of the gateway/feed path (TCP/UDP stress, socket-buffer pressure, connection scaling, backpressure) building on M34's epoll multi-client path and M30's socket tooling. Constraints: scripts/tests document load shape + environment; results must distinguish kernel/socket pressure from user-space engine cost; no production-capacity claims (honest constrained-environment framing, like M29/M30).
 - [2026-06-04] M35: PR #100 squash-merged to `main` as a86b701 after all CI jobs and review checks were green. M35 is now landed; original M36 NUMA remains deferred until the repository-health refactor analysis is completed or explicitly skipped by the human.
diff --git a/README.md b/README.md
index 4332532..d1bb19d 100644
--- a/README.md
+++ b/README.md
@@ -109,6 +109,23 @@ Reproduce with `make bench` (numbers will differ by machine). The differential-t
 [`results/differential.txt`](results/differential.txt) — kept separate so it does not disturb the
 core numbers above.
 
+### Flamegraph
+
+Where on-CPU time goes in the `qsl-bench` synthetic suite, rendered by `make flamegraph`
+(`scripts/flamegraph.sh` → the dependency-free `scripts/flamegraph.py` — no external FlameGraph
+toolchain):
+
+[![qsl-bench cpu-clock flamegraph](results/flamegraph.svg)](results/flamegraph.svg)
+
+This is a **software cpu-clock sampling** hot-symbol profile, **not** PMU evidence: frame width is
+proportional to on-CPU samples (329 folded across 159 stacks on this run), not wall-clock latency or
+throughput, and it is hardware/kernel/compiler/build dependent. The hot frames are protocol
+`decode_new_order`, gateway session framing, `MatchingEngine::new_limit`, and order-book
+cancel/allocation. Provenance and classification are in
+[`results/flamegraph.txt`](results/flamegraph.txt); methodology in
+[docs/perf_analysis.md](docs/perf_analysis.md). GitHub renders the SVG statically; download the raw
+file for interactive zoom and search.
+
 ## Limitations
 
 - **Synthetic and local.** No real market data, no real venue connectivity, no order types
diff --git a/docs/perf_analysis.md b/docs/perf_analysis.md
index 3ab3881..7400f02 100644
--- a/docs/perf_analysis.md
+++ b/docs/perf_analysis.md
@@ -55,6 +55,30 @@ default is intentional: many CI, VM, and container environments do not expose ha
 to unprivileged processes, and the benchmark harness is short enough that a lower frequency can miss
 the minimum sample count needed for meaningful hot-symbol ordering.
 
+Render a flamegraph (issue #32):
+
+```bash
+make flamegraph
+```
+
+This runs `scripts/flamegraph.sh`, which records call-graph samples
+(`perf record --call-graph dwarf -F 4000 -g -e cpu-clock`), folds them, and renders an SVG to
+`results/flamegraph.svg` plus a text companion `results/flamegraph.txt` (provenance, classification,
+and the top folded stacks). DWARF call graphs are used so stacks unwind correctly even though the
+`bench` (Release) preset omits frame pointers — the application symbols (`OrderBook::add_limit`,
+`MatchingEngine::new_limit`, the replay path, …) resolve from the symbol table without changing the
+optimization level under measurement.
+
+The folding and SVG rendering live in `scripts/flamegraph.py`, a dependency-free Python script
+(standard library only) that reimplements the `stackcollapse` + flamegraph data model rather than
+vendoring Brendan Gregg's Perl toolkit, so the artifact is reproducible from this repository alone.
+The renderer is deterministic — frames are sorted by name and colors are a pure function of the
+frame name (no RNG, no timestamps in the drawn body) — and is unit-tested in
+`tests/shell/test_flamegraph.sh` (registered with CTest, runs under `make check`). Frame width is
+proportional to on-CPU samples; this is a software cpu-clock sampling profile for **hot-symbol
+investigation**, not a latency or throughput measurement. Set `QSL_FLAMEGRAPH_EVENT=cycles` to
+sample the hardware PMU cycles event instead, where the host exposes it.
+
 ## Required Environment
 
 Both scripts are Linux-only and fail before running on non-Linux hosts. `perf stat` also fails
@@ -113,8 +137,14 @@ counters, permission-limited sampling, or a sample report that is explicitly mar
 - `results/perf_report_linux.txt` records benchmark output, `perf record` stderr, and `perf report
   --stdio` output. It is useful as a hot-symbol profile only when `No samples: no`, `Insufficient
   samples: no`, and `Sample count` is at least `Minimum samples for hot profile`.
-- `build/perf/qsl-bench.perf.data` is generated by `make perf-record` and is intentionally not
-  committed; it is host-specific binary profiler data.
+- `results/flamegraph.svg` is the rendered flamegraph from `make flamegraph`; `results/flamegraph.txt`
+  is its provenance/classification companion (and lists the top folded stacks). Treat frame widths as
+  a hot-symbol guide only when the `.txt` reports a `flamegraph (...)` `Artifact:` and a `Sample
+  count` at least `Minimum samples for hot profile`; a `constrained-environment validation` label
+  means sampling did not capture enough stacks to trust.
+- `build/perf/qsl-bench.perf.data` and `build/perf/qsl-bench.flame.data` are generated by
+  `make perf-record` / `make flamegraph` and are intentionally not committed; they are host-specific
+  binary profiler data.
 
 Each artifact includes hardware, kernel, compiler, perf version, build type, dataset, command, event
 set, and source-digest provenance. The `Source digest` is the authoritative source identity;
diff --git a/results/README.md b/results/README.md
index 49bd9d2..0f8b7aa 100644
--- a/results/README.md
+++ b/results/README.md
@@ -23,6 +23,12 @@ Benchmark results produced by `make bench` and scripts under `scripts/`.
 - `perf_report_linux.txt` — Linux `perf record/report` hot-symbol output for the benchmark harness
   (`make perf-record`). It is useful as a hot-symbol profile only when the file says `No samples:
   no`, `Insufficient samples: no`, and the sample count meets the reported minimum.
+- `flamegraph.svg` / `flamegraph.txt` — Linux `perf` call-graph flamegraph (`make flamegraph`,
+  issue #32) rendered by the dependency-free `scripts/flamegraph.py`.
The `.svg` is the visual + (frame width ∝ on-CPU samples) with provenance in a leading XML comment; the `.txt` carries + provenance, the `Artifact:` classification, and the top folded stacks. It is a software cpu-clock + sampling profile for hot-symbol investigation, not a latency/throughput claim — trust frame widths + only when the `.txt` reports a `flamegraph (...)` artifact with enough samples. - `numa_affinity_study.txt` — Linux CPU-affinity / scheduler-migration / NUMA-locality study output (`make numa-study`). It must self-classify as `full-linux-numa`, `linux-constrained`, or `unsupported-host`; only `full-linux-numa` is full NUMA evidence. diff --git a/results/flamegraph.svg b/results/flamegraph.svg new file mode 100644 index 0000000..80466d2 --- /dev/null +++ b/results/flamegraph.svg @@ -0,0 +1,31 @@ + + +QSL Matching-Engine Flame Graph (qsl-bench)flamegraph (software cpu-clock sampling hot-symbol profile) | Linux aarch64 | cpu-clock @ 4000Hz | 329 samples | 159 stacks | 2026-06-22T02:18:23ZSearch all (329 cpu-clock samples, 100.00%)allqsl-bench (329 cpu-clock samples, 100.00%)qsl-bench[unknown] (251 cpu-clock samples, 76.29%)[unknown][unknown] (237 cpu-clock samples, 72.04%)[unknown][unknown] (201 cpu-clock samples, 61.09%)[unknown][unknown] (2 cpu-clock samples, 0.61%)[unknown] (2 cpu-clock samples, 0.61%)[unknown] (2 cpu-clock samples, 0.61%)[unknown] (2 cpu-clock samples, 0.61%)[unknown] (1 cpu-clock samples, 0.30%)do_lookup_x (1 cpu-clock samples, 0.30%)_dl_lookup_symbol_x (1 cpu-clock samples, 0.30%)_dl_new_hash (1 cpu-clock samples, 0.30%)__libc_start_call_main (199 cpu-clock samples, 60.49%)__libc_start_call_mainmain (199 cpu-clock samples, 60.49%)maincfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (20 cpu-clock samples, 6.08%)qsl::en..decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::contains(unsigned long) 
const::{lambda()#1}, qsl::engine::OrderBook::contains(unsigned long) const::{lambda(qsl::engine::OrderBook::IntrusiveStore const&)#1}, qsl::engine::OrderBook::contains(unsigned long) const::{lambda(qsl::engine::OrderBook::ContiguousStore const&)#1}>(qsl::engine::OrderBook::contains(unsigned long) const::{lambda()#1}&&, qsl::engine::OrderBook::contains(unsigned long) const::{lambda(qsl::engine::OrderBook::IntrusiveStore const&)#1}&&, qsl::engine::OrderBook::contains(unsigned long) const::{lambda(qsl::engine::OrderBook::ContiguousStore const&)#1}&&) const [clone .isra.0] (2 cpu-clock samples, 0.61%)qsl::engine::OrderBook::match_baseline(qsl::core::Side, qsl::engine::OrderBook::MatchContext&) (2 cpu-clock samples, 0.61%)qsl::engine::OrderBook::rest(unsigned long, qsl::core::Side, long, unsigned int) (13 cpu-clock samples, 3.95%)qsl..operator new(unsigned long, std::align_val_t) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::level_for[abi:cxx11](qsl::core::Side, long) (4 cpu-clock samples, 1.22%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::greater<long>, std::pmr::polymorphic_allocator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (3 cpu-clock samples, 0.91%)std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 
cpu-clock samples, 0.30%)std::__detail::_Map_base<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](unsigned long const&) (7 cpu-clock samples, 2.13%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*, unsigned long) (3 cpu-clock samples, 0.91%)std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::cancel(unsigned long) (18 cpu-clock samples, 5.47%)qsl::e..decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] 
(18 cpu-clock samples, 5.47%)declty..qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&) (13 cpu-clock samples, 3.95%)qsl..std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (4 cpu-clock samples, 1.22%)std::__detail::_List_node_base::_M_unhook() (1 cpu-clock samples, 0.30%)std::pmr::(anonymous namespace)::newdel_res_t::do_deallocate(void*, unsigned long, unsigned long) (1 cpu-clock samples, 0.30%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*) (3 cpu-clock samples, 0.91%)cfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::modify(unsigned long, long, unsigned int) (2 cpu-clock samples, 0.61%)qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>) (56 cpu-clock samples, 17.02%)qsl::gateway::Session::on..qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long) (56 cpu-clock samples, 17.02%)qsl::gateway::Session::on..qsl::gateway::Session::process_frame(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long) (53 cpu-clock samples, 16.11%)qsl::gateway::Session::p..cfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)qsl::gateway::(anonymous namespace)::emit_result(unsigned long, qsl::gateway::GatewayResult const&, std::vector<std::byte, std::allocator<std::byte> >&, 
unsigned long) (13 cpu-clock samples, 3.95%)qsl..cfree@GLIBC_2.17 (3 cpu-clock samples, 0.91%)qsl::gateway::(anonymous namespace)::append(std::vector<std::byte, std::allocator<std::byte> >&, std::vector<std::byte, std::allocator<std::byte> > const&, unsigned long) [clone .isra.0] (5 cpu-clock samples, 1.52%)__memcpy_generic (3 cpu-clock samples, 0.91%)qsl::protocol::encode(qsl::protocol::Fill const&) (2 cpu-clock samples, 0.61%)operator new(unsigned long) (1 cpu-clock samples, 0.30%)qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (33 cpu-clock samples, 10.03%)qsl::gateway::..qsl::engine::MatchingEngine::can_store_limit(unsigned int, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) const (2 cpu-clock samples, 0.61%)qsl::engine::MatchingEngine::contains(unsigned int, unsigned long) const (4 cpu-clock samples, 1.22%)qsl::engine::MatchingEngine::has_symbol(unsigned int) const (1 cpu-clock samples, 0.30%)qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (16 cpu-clock samples, 4.86%)qsl::..qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (3 cpu-clock samples, 0.91%)qsl::engine::OrderBook::match_baseline(qsl::core::Side, qsl::engine::OrderBook::MatchContext&) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::can_store_limit(qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) const (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::contains(unsigned long) const (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::can_store_limit(qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) const (2 cpu-clock samples, 0.61%)qsl::engine::OrderBook::contains(unsigned long) const (1 cpu-clock samples, 0.30%)qsl::engine::check_limit(qsl::engine::RiskConfig const&, qsl::core::Side, long, unsigned int) (1 cpu-clock samples, 
0.30%)qsl::protocol::decode_header(std::span<std::byte const, 18446744073709551615ul>) (3 cpu-clock samples, 0.91%)qsl::protocol::decode_new_order(std::span<std::byte const, 18446744073709551615ul>) (3 cpu-clock samples, 0.91%)qsl::protocol::decode_header(std::span<std::byte const, 18446744073709551615ul>) (1 cpu-clock samples, 0.30%)qsl::protocol::decode_new_order(std::span<std::byte const, 18446744073709551615ul>) (15 cpu-clock samples, 4.56%)qsl:..qsl::protocol::encode(qsl::protocol::NewOrder const&, unsigned long) (1 cpu-clock samples, 0.30%)qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant<qsl::replay::RegisterSymbol, qsl::replay::NewLimit, qsl::replay::NewMarket, qsl::replay::Cancel, qsl::replay::Modify> const&) (33 cpu-clock samples, 10.03%)qsl::replay::a..qsl::engine::MatchingEngine::cancel(unsigned int, unsigned long) (4 cpu-clock samples, 1.22%)qsl::engine::OrderBook::cancel(unsigned long) (3 cpu-clock samples, 0.91%)decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] (3 cpu-clock samples, 0.91%)qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&) (2 cpu-clock samples, 0.61%)std::pmr::(anonymous namespace)::newdel_res_t::do_deallocate(void*, unsigned long, unsigned long) (1 cpu-clock samples, 0.30%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, 
qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*) (1 cpu-clock samples, 0.30%)qsl::engine::MatchingEngine::modify(unsigned int, unsigned long, long, unsigned int) (5 cpu-clock samples, 1.52%)qsl::engine::OrderBook::modify(unsigned long, long, unsigned int) (5 cpu-clock samples, 1.52%)decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] (3 cpu-clock samples, 0.91%)qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&) (2 cpu-clock samples, 0.61%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, 
qsl::engine::OrderBook::Locator>, false>*) (1 cpu-clock samples, 0.30%)qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (17 cpu-clock samples, 5.17%)qsl::..qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (11 cpu-clock samples, 3.34%)qs..qsl::engine::OrderBook::match_baseline(qsl::core::Side, qsl::engine::OrderBook::MatchContext&) (7 cpu-clock samples, 2.13%)qsl::engine::OrderBook::fill_front_order(std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&, long, qsl::engine::OrderBook::MatchContext&) (2 cpu-clock samples, 0.61%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*) (2 cpu-clock samples, 0.61%)std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::rest(unsigned long, qsl::core::Side, long, unsigned int) (3 cpu-clock samples, 0.91%)qsl::engine::OrderBook::level_for[abi:cxx11](qsl::core::Side, long) (3 cpu-clock samples, 0.91%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, 
std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::greater<long>, std::pmr::polymorphic_allocator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (2 cpu-clock samples, 0.61%)std::_Rb_tree_decrement(std::_Rb_tree_node_base*) (1 cpu-clock samples, 0.30%)std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 cpu-clock samples, 0.30%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::less<long>, std::pmr::polymorphic_allocator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (1 cpu-clock samples, 0.30%)qsl::engine::MatchingEngine::new_market(unsigned int, unsigned long, qsl::core::Side, unsigned int) (3 cpu-clock samples, 0.91%)qsl::engine::OrderBook::add_market(unsigned long, qsl::core::Side, unsigned int) (2 cpu-clock samples, 0.61%)qsl::engine::OrderBook::match_baseline(qsl::core::Side, qsl::engine::OrderBook::MatchContext&) (2 cpu-clock samples, 
0.61%)std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 cpu-clock samples, 0.30%)qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long) (18 cpu-clock samples, 5.47%)qsl::r..qsl::engine::MatchingEngine::contains(unsigned int, unsigned long) const (11 cpu-clock samples, 3.34%)qs..qsl::engine::OrderBook::contains(unsigned long) const (5 cpu-clock samples, 1.52%)qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant<qsl::replay::RegisterSymbol, qsl::replay::NewLimit, qsl::replay::NewMarket, qsl::replay::Cancel, qsl::replay::Modify> const&) (2 cpu-clock samples, 0.61%)qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (2 cpu-clock samples, 0.61%)qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::rest(unsigned long, qsl::core::Side, long, unsigned int) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::level_for[abi:cxx11](qsl::core::Side, long) (1 cpu-clock samples, 0.30%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::less<long>, std::pmr::polymorphic_allocator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (1 cpu-clock samples, 
0.30%)std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 cpu-clock samples, 0.30%)qsl::replay::replay(qsl::engine::MatchingEngine&, std::vector<qsl::replay::LogRecord, std::allocator<qsl::replay::LogRecord> > const&) (34 cpu-clock samples, 10.33%)qsl::replay::r..operator delete(void*, unsigned long) (1 cpu-clock samples, 0.30%)qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant<qsl::replay::RegisterSymbol, qsl::replay::NewLimit, qsl::replay::NewMarket, qsl::replay::Cancel, qsl::replay::Modify> const&) (26 cpu-clock samples, 7.90%)qsl::repla..qsl::engine::MatchingEngine::cancel(unsigned int, unsigned long) (3 cpu-clock samples, 0.91%)qsl::engine::OrderBook::cancel(unsigned long) (1 cpu-clock samples, 0.30%)decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] (1 cpu-clock samples, 0.30%)qsl::engine::MatchingEngine::modify(unsigned int, unsigned long, long, unsigned int) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::modify(unsigned long, long, unsigned int) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::match_baseline(qsl::core::Side, qsl::engine::OrderBook::MatchContext&) (1 cpu-clock samples, 0.30%)std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 cpu-clock 
samples, 0.30%)qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (19 cpu-clock samples, 5.78%)qsl::e..qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (17 cpu-clock samples, 5.17%)qsl::..operator delete(void*, unsigned long) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::match_baseline(qsl::core::Side, qsl::engine::OrderBook::MatchContext&) (4 cpu-clock samples, 1.22%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*) (1 cpu-clock samples, 0.30%)std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::rest(unsigned long, qsl::core::Side, long, unsigned int) (11 cpu-clock samples, 3.34%)qs..qsl::engine::OrderBook::level_for[abi:cxx11](qsl::core::Side, long) (7 cpu-clock samples, 2.13%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::greater<long>, std::pmr::polymorphic_allocator<std::pair<long const, 
std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (4 cpu-clock samples, 1.22%)std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) (2 cpu-clock samples, 0.61%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::less<long>, std::pmr::polymorphic_allocator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (3 cpu-clock samples, 0.91%)std::_Rb_tree_decrement(std::_Rb_tree_node_base*) (1 cpu-clock samples, 0.30%)std::__detail::_Map_base<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](unsigned long const&) (3 cpu-clock samples, 0.91%)std::_Hashtable<unsigned long, std::pair<unsigned 
long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*, unsigned long) (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::contains(unsigned long) const (2 cpu-clock samples, 0.61%)qsl::engine::MatchingEngine::new_market(unsigned int, unsigned long, qsl::core::Side, unsigned int) (1 cpu-clock samples, 0.30%)qsl::replay::decode_command(std::span<std::byte const, 18446744073709551615ul>) (3 cpu-clock samples, 0.91%)operator new(unsigned long) (5 cpu-clock samples, 1.52%)malloc@plt (5 cpu-clock samples, 1.52%)operator new(unsigned long, std::align_val_t) (2 cpu-clock samples, 0.61%)posix_memalign@plt (2 cpu-clock samples, 0.61%)qsl::engine::OrderBook::level_for[abi:cxx11](qsl::core::Side, long) (4 cpu-clock samples, 1.22%)[unknown] (4 cpu-clock samples, 1.22%)[unknown] (4 cpu-clock samples, 1.22%)[unknown] (2 cpu-clock samples, 0.61%)__posix_memalign (2 cpu-clock samples, 0.61%)malloc (2 cpu-clock samples, 0.61%)operator new(unsigned long, std::align_val_t) (2 cpu-clock samples, 0.61%)__posix_memalign (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::rest(unsigned long, qsl::core::Side, long, unsigned int) (7 cpu-clock samples, 2.13%)[unknown] (5 cpu-clock samples, 1.52%)[unknown] (5 cpu-clock samples, 1.52%)[unknown] (5 cpu-clock samples, 1.52%)[unknown] (1 cpu-clock samples, 0.30%)_mid_memalign (1 cpu-clock samples, 0.30%)__posix_memalign (4 cpu-clock samples, 1.22%)malloc (3 cpu-clock samples, 0.91%)operator new(unsigned long, std::align_val_t)@plt (2 cpu-clock samples, 
0.61%)qsl::gateway::(anonymous namespace)::emit_result(unsigned long, qsl::gateway::GatewayResult const&, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long) (10 cpu-clock samples, 3.04%)qs..[unknown] (9 cpu-clock samples, 2.74%)[..[unknown] (9 cpu-clock samples, 2.74%)[..cfree@GLIBC_2.17 (3 cpu-clock samples, 0.91%)operator new(unsigned long) (6 cpu-clock samples, 1.82%)malloc (4 cpu-clock samples, 1.22%)operator delete(void*)@plt (1 cpu-clock samples, 0.30%)qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (8 cpu-clock samples, 2.43%)q..[unknown] (8 cpu-clock samples, 2.43%)[..[unknown] (8 cpu-clock samples, 2.43%)[..cfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)operator new(unsigned long) (7 cpu-clock samples, 2.13%)malloc (4 cpu-clock samples, 1.22%)decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)cfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)main (1 cpu-clock samples, 0.30%)decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned 
long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)cfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)operator new(unsigned long) (1 cpu-clock samples, 0.30%)malloc@plt (1 cpu-clock samples, 0.30%)operator new(unsigned long, std::align_val_t) (1 cpu-clock samples, 0.30%)posix_memalign@plt (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::level_for[abi:cxx11](qsl::core::Side, long) (3 cpu-clock samples, 0.91%)[unknown] (3 cpu-clock samples, 0.91%)[unknown] (3 cpu-clock samples, 0.91%)[unknown] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)_mid_memalign (1 cpu-clock samples, 0.30%)cfree@GLIBC_2.17 (1 cpu-clock samples, 0.30%)operator new(unsigned long, std::align_val_t) (1 cpu-clock samples, 0.30%)__posix_memalign (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::rest(unsigned long, qsl::core::Side, long, unsigned int) (3 cpu-clock samples, 0.91%)[unknown] (3 cpu-clock samples, 0.91%)[unknown] (3 cpu-clock samples, 0.91%)[unknown] (3 cpu-clock samples, 0.91%)[unknown] (1 cpu-clock samples, 0.30%)_mid_memalign (1 cpu-clock samples, 0.30%)__posix_memalign (2 cpu-clock samples, 0.61%)malloc (1 cpu-clock samples, 0.30%)qsl::gateway::Session::process_frame(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long) (3 cpu-clock samples, 0.91%)[unknown] (2 cpu-clock samples, 0.61%)[unknown] (2 cpu-clock samples, 0.61%)cfree@GLIBC_2.17 (2 cpu-clock samples, 0.61%)free@plt (1 cpu-clock samples, 0.30%)std::__detail::_Map_base<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, 
std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](unsigned long const&) (1 cpu-clock samples, 0.30%)operator new(unsigned long, std::align_val_t)@plt (1 cpu-clock samples, 0.30%)__libc_start_call_main (9 cpu-clock samples, 2.74%)_..[unknown] (9 cpu-clock samples, 2.74%)[..[unknown] (9 cpu-clock samples, 2.74%)[..[unknown] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)unlink_chunk.isra.0 (1 cpu-clock samples, 0.30%)cfree@GLIBC_2.17 (8 cpu-clock samples, 2.43%)c..decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0] (4 cpu-clock samples, 1.22%)[unknown] (4 cpu-clock samples, 1.22%)[unknown] (4 cpu-clock samples, 1.22%)cfree@GLIBC_2.17 (4 cpu-clock samples, 1.22%)main (11 cpu-clock samples, 3.34%)main[unknown] (5 cpu-clock samples, 1.52%)[unknown] (5 cpu-clock samples, 1.52%)[unknown] (1 cpu-clock samples, 0.30%)_int_free_merge_chunk (1 cpu-clock samples, 0.30%)operator new(unsigned long) (4 cpu-clock samples, 1.22%)malloc (4 cpu-clock samples, 1.22%)free@plt (2 cpu-clock samples, 0.61%)operator delete(void*)@plt (3 cpu-clock samples, 0.91%)operator delete(void*, unsigned long)@plt (1 cpu-clock samples, 0.30%)operator new(unsigned long) (4 cpu-clock samples, 1.22%)malloc@plt (4 cpu-clock samples, 
1.22%)qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (8 cpu-clock samples, 2.43%)q..[unknown] (3 cpu-clock samples, 0.91%)[unknown] (3 cpu-clock samples, 0.91%)cfree@GLIBC_2.17 (2 cpu-clock samples, 0.61%)operator new(unsigned long) (1 cpu-clock samples, 0.30%)malloc (1 cpu-clock samples, 0.30%)free@plt (1 cpu-clock samples, 0.30%)operator delete(void*)@plt (1 cpu-clock samples, 0.30%)operator delete(void*, unsigned long)@plt (1 cpu-clock samples, 0.30%)operator new(unsigned long)@plt (2 cpu-clock samples, 0.61%)qsl::engine::MatchingEngine::new_market(unsigned int, unsigned long, qsl::core::Side, unsigned int) (1 cpu-clock samples, 0.30%)operator new(unsigned long)@plt (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) (12 cpu-clock samples, 3.65%)qsl..[unknown] (10 cpu-clock samples, 3.04%)[u..[unknown] (10 cpu-clock samples, 3.04%)[u..[unknown] (7 cpu-clock samples, 2.13%)[unknown] (1 cpu-clock samples, 0.30%)_mid_memalign (1 cpu-clock samples, 0.30%)__posix_memalign (6 cpu-clock samples, 1.82%)malloc (4 cpu-clock samples, 1.22%)operator new(unsigned long, std::align_val_t) (3 cpu-clock samples, 0.91%)__posix_memalign (2 cpu-clock samples, 0.61%)memcpy@plt (1 cpu-clock samples, 0.30%)operator delete(void*)@plt (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&) (11 cpu-clock samples, 3.34%)qs..operator delete(void*, std::align_val_t)@plt (5 cpu-clock samples, 1.52%)operator delete(void*, unsigned long, std::align_val_t)@plt (5 cpu-clock samples, 1.52%)std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)@plt (1 cpu-clock samples, 0.30%)qsl::engine::OrderBook::fill_front_order(std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&, long, 
qsl::engine::OrderBook::MatchContext&) (1 cpu-clock samples, 0.30%)operator new(unsigned long)@plt (1 cpu-clock samples, 0.30%)qsl::gateway::(anonymous namespace)::append(std::vector<std::byte, std::allocator<std::byte> >&, std::vector<std::byte, std::allocator<std::byte> > const&, unsigned long) [clone .isra.0] (1 cpu-clock samples, 0.30%)operator delete(void*)@plt (1 cpu-clock samples, 0.30%)qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long) (3 cpu-clock samples, 0.91%)[unknown] (2 cpu-clock samples, 0.61%)[unknown] (2 cpu-clock samples, 0.61%)cfree@GLIBC_2.17 (2 cpu-clock samples, 0.61%)memcpy@plt (1 cpu-clock samples, 0.30%)qsl::protocol::encode(qsl::protocol::Ack const&) (1 cpu-clock samples, 0.30%)operator new(unsigned long)@plt (1 cpu-clock samples, 0.30%)qsl::protocol::encode(qsl::protocol::NewOrder const&, unsigned long) (1 cpu-clock samples, 0.30%)operator new(unsigned long)@plt (1 cpu-clock samples, 0.30%)qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant<qsl::replay::RegisterSymbol, qsl::replay::NewLimit, qsl::replay::NewMarket, qsl::replay::Cancel, qsl::replay::Modify> const&) (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)[unknown] (1 cpu-clock samples, 0.30%)operator new(unsigned long) (1 cpu-clock samples, 0.30%)malloc (1 cpu-clock samples, 0.30%)qsl::replay::replay(qsl::engine::MatchingEngine&, std::vector<qsl::replay::LogRecord, std::allocator<qsl::replay::LogRecord> > const&) (1 cpu-clock samples, 0.30%)memcpy@plt (1 cpu-clock samples, 0.30%)std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, 
std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*) (7 cpu-clock samples, 2.13%)free@plt (2 cpu-clock samples, 0.61%)operator delete(void*, unsigned long, std::align_val_t)@plt (5 cpu-clock samples, 1.52%)std::pair<std::_Rb_tree_iterator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, bool> std::_Rb_tree<long, std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >, std::_Select1st<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > >, std::greater<long>, std::pmr::polymorphic_allocator<std::pair<long const, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > > > >::_M_emplace_unique<long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> > >(long&, std::__cxx11::list<qsl::engine::Order, std::pmr::polymorphic_allocator<qsl::engine::Order> >&&) (2 cpu-clock samples, 0.61%)free@plt (1 cpu-clock samples, 0.30%)std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)@plt (1 cpu-clock samples, 0.30%) diff --git a/results/flamegraph.txt b/results/flamegraph.txt new file mode 100644 index 0000000..4969a22 --- /dev/null +++ b/results/flamegraph.txt @@ -0,0 +1,59 @@ +Command: make flamegraph +Artifact: flamegraph (software cpu-clock sampling hot-symbol profile) +Hardware: aarch64 +OS: Linux 6.19.14-400.asahi.fc44.aarch64+16k +CPU: Avalanche-M2 +Compiler: c++ (GCC) 16.1.1 20260515 (Red Hat 16.1.1-2) +Perf: perf version 6.19.14-400.asahi.fc44.aarch64 +Perf paranoid: 2 +Build type: Release +Provenance version: 1 +Git commit (informational): 
31070b1 +Source digest: sha256:6aa521e6295a99f9dbf7dee9e5bcef04e93174ed12c3e8de9b991a8bfc14c809 +Source digest scope: flamegraph-benchmark +Dirty inputs: no +Generated output: results/flamegraph.svg +Date: 2026-06-22T02:18:23Z +Benchmark binary: build/bench/qsl-bench +Dataset: qsl-bench default synthetic benchmark suite +Call graph: dwarf +Record event: cpu-clock +Sample freq: 4000 Hz +Sample count (folded total): 329 +Sample count (perf record est.): 329 +Folded stacks: 159 +Minimum samples for hot profile: 200 +Insufficient samples: no +Record status: 0 +Script status: 0 +Perf access limitation: no +Flamegraph SVG: results/flamegraph.svg +Perf data: build/perf/qsl-bench.flame.data (generated, not intended for commit) + +Caveat: this flamegraph is a software cpu-clock sampling profile for hot-symbol +investigation. Frame width is proportional to on-CPU samples, not wall-clock +latency or throughput, and is hardware/kernel/compiler/build dependent. + +Top 15 folded stacks (count stack): + 15 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::protocol::decode_new_order(std::span) + 11 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::gateway::Session::on_bytes(std::span);qsl::gateway::Session::on_bytes(std::span, std::vector >&, unsigned long);qsl::gateway::Session::process_frame(std::span, std::vector >&, unsigned long);qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce);qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) + 11 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long);qsl::engine::MatchingEngine::contains(unsigned int, unsigned long) const + 8 qsl-bench;__libc_start_call_main;[unknown];[unknown];cfree@GLIBC_2.17 + 7 
qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::engine::OrderBook::cancel(unsigned long);decltype(auto) qsl::engine::OrderBook::dispatch_storage(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0];qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&) + 6 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::gateway::Session::on_bytes(std::span);qsl::gateway::Session::on_bytes(std::span, std::vector >&, unsigned long);qsl::gateway::Session::process_frame(std::span, std::vector >&, unsigned long);qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) + 6 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant const&);qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce) + 5 qsl-bench;qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&);operator delete(void*, std::align_val_t)@plt + 5 qsl-bench;qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&);operator delete(void*, unsigned long, std::align_val_t)@plt + 5 qsl-bench;std::_Hashtable, std::pmr::polymorphic_allocator >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node, false>*);operator delete(void*, unsigned long, std::align_val_t)@plt + 5 qsl-bench;[unknown];[unknown];operator new(unsigned long);malloc@plt + 5 
qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long);qsl::engine::OrderBook::contains(unsigned long) const + 4 qsl-bench;decltype(auto) qsl::engine::OrderBook::dispatch_storage(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0];[unknown];[unknown];cfree@GLIBC_2.17 + 4 qsl-bench;main;[unknown];[unknown];operator new(unsigned long);malloc + 4 qsl-bench;operator new(unsigned long);malloc@plt + +Benchmark output: +order_book add/mod/cancel 200000 ops 132.8 ns/op 7531861 ops/sec +protocol encode+decode 500000 ops 20.5 ns/op 48773893 ops/sec +gateway session (fill) 200000 ops 127.4 ns/op 7848348 ops/sec +matching engine flow 5004 items 101.6 ns/item 9840697 items/sec +replay command log 5004 items 112.0 ns/item 8928265 items/sec diff --git a/scripts/flamegraph.py b/scripts/flamegraph.py new file mode 100755 index 0000000..a9cc7f3 --- /dev/null +++ b/scripts/flamegraph.py @@ -0,0 +1,364 @@ +#!/usr/bin/env python3 +"""Self-contained flamegraph generator for QSL perf profiles. + +Reads `perf script` output on stdin, folds it into collapsed stacks +(stackcollapse), and renders a deterministic SVG flamegraph on stdout. + +This is intentionally dependency-free (Python standard library only) so the +profiling artifact is reproducible from the repository alone, without vendoring +Brendan Gregg's Perl FlameGraph toolkit. The data model is identical: a +"collapsed stack" is `root;...;leaf count`, and the flamegraph is a +proportional, sorted, recursive layout of those stacks.
+ +Modes: + flamegraph.py perf script (stdin) -> SVG (stdout) + flamegraph.py --collapse-only perf script (stdin) -> collapsed stacks (stdout) + flamegraph.py --from-collapsed collapsed stacks (stdin) -> SVG (stdout) + +The rendering is deterministic: frames are sorted by name, and colors are a pure +function of the frame name (no RNG, no timestamps in the drawn body). The driver +script (scripts/flamegraph.sh) records run provenance separately so the SVG stays +reproducible for a given input. +""" + +from __future__ import annotations + +import argparse +import html +import re +import sys +import zlib +from dataclasses import dataclass + +# SVG layout constants (pixels). +_SIDE = 10 # left/right margin +_PAD_TOP = 54 # space above the frames for title/subtitle +_PAD_BOTTOM = 16 # space below the frames for the detail line + +# perf-script stack frame line: leading whitespace, hex address, symbol, "(dso)". +# C++ symbols contain spaces and parentheses, so the dso is taken as the final +# parenthesized group and the symbol is everything between the address and it. +_FRAME_RE = re.compile(r"^\s+(?P<addr>[0-9a-fA-F]+)\s+(?P<rest>.*\S)\s*$") +_OFFSET_RE = re.compile(r"\+0x[0-9a-fA-F]+$") +# Trailing " (dso)" group. perf prints a space before the dso, and dso strings +# (paths or "[unknown]") never contain parens, so a non-nested match is exact and +# avoids stripping a C++ signature's own "(...)" (which has no preceding space). +_DSO_RE = re.compile(r"\s+\([^()]*\)$") + + +def _clean_symbol(rest: str) -> str: + """Turn a perf-script frame body into a folded frame name. + + Drops the trailing `(dso)` and the `+0xoffset`, matching stackcollapse-perf. + """ + rest = _DSO_RE.sub("", rest) + rest = _OFFSET_RE.sub("", rest).strip() + return rest if rest else "[unknown]" + + +class _Folder: + """Accumulates `perf script` samples into collapsed {stack: count} pairs.
+ + Keeping the per-line state transitions as small methods keeps the parsing + loop flat (one if/elif/else) instead of a deeply nested block. + """ + + def __init__(self) -> None: + self.folded: dict[str, int] = {} + self._comm = "" + self._stack: list[str] = [] + + def start_sample(self, header: str) -> None: + # Header line: "comm pid timestamp: period event:". Finalize any prior + # sample (perf usually separates with a blank line, but not always). + self._flush() + self._comm = header.split()[0] + + def add_frame(self, line: str) -> None: + m = _FRAME_RE.match(line) + if m: + self._stack.append(_clean_symbol(m.group("rest"))) + + def end_sample(self) -> None: + self._flush() + self._comm = "" + + def _flush(self) -> None: + if self._stack: + frames = list(reversed(self._stack)) # perf prints leaf-first + if self._comm: + frames.insert(0, self._comm) + key = ";".join(frames) + self.folded[key] = self.folded.get(key, 0) + 1 + self._stack = [] + + def result(self) -> dict[str, int]: + self._flush() + return self.folded + + +def fold_perf_script(lines) -> dict[str, int]: + """Collapse `perf script` output into {stack_string: sample_count}.""" + folder = _Folder() + for raw in lines: + line = raw.rstrip("\n") + if not line.strip(): + folder.end_sample() + elif line[0].isspace(): + folder.add_frame(line) + else: + folder.start_sample(line) + return folder.result() + + +def parse_collapsed(lines) -> dict[str, int]: + """Parse pre-collapsed `stack count` lines. + + The canonical folded separator is a space, but a tab is tolerated. Tab is + preferred when present so a stack containing spaces (C++ signatures) still + splits on the trailing count rather than on an interior space. Non-positive + counts are ignored.
+ """ + folded: dict[str, int] = {} + for raw in lines: + line = raw.rstrip("\n") + if not line.strip(): + continue + sep = "\t" if "\t" in line else " " + stack, found, count = line.rpartition(sep) + if not found: + continue + try: + n = int(count) + except ValueError: + continue + if n <= 0: + continue + folded[stack] = folded.get(stack, 0) + n + return folded + + +class _Node: + __slots__ = ("name", "value", "children") + + def __init__(self, name: str) -> None: + self.name = name + self.value = 0 + self.children: dict[str, _Node] = {} + + +def build_tree(folded: dict[str, int], root_name: str) -> _Node: + root = _Node(root_name) + for stack, count in folded.items(): + root.value += count + node = root + for frame in stack.split(";"): + if not frame: + continue + child = node.children.get(frame) + if child is None: + child = _Node(frame) + node.children[frame] = child + child.value += count + node = child + return root + + +def _color(name: str) -> str: + """Deterministic warm 'hot' palette derived purely from the frame name.""" + h = zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF + r = 205 + (h % 51) + g = (h >> 8) % 231 + b = (h >> 16) % 56 + return f"rgb({r},{g},{b})" + + +def _layout(node: _Node, depth: int, x: int, out: list) -> None: + """Pre-order walk assigning each node a (depth, x-offset-in-samples).""" + out.append((node, depth, x)) + cursor = x + for name in sorted(node.children): + child = node.children[name] + _layout(child, depth + 1, cursor, out) + cursor += child.value + + +@dataclass +class FlameOptions: + """Styling/labelling knobs for an SVG render.""" + + title: str = "QSL Flame Graph" + subtitle: str = "" + countname: str = "samples" + width: int = 1200 + frame_height: int = 16 + min_px: float = 0.1 + + +@dataclass +class _Canvas: + """Derived geometry passed to per-frame rendering.""" + + total: int + max_depth: int + height: int + plot_width: int + frame_height: int + min_px: float + countname: str + + +def _append_chrome(parts: list, 
opts: FlameOptions, height: int) -> None: + """Append the static page furniture: SVG root, style, title, controls.""" + width = opts.width + parts.append( + f'\n' + f'' + ) + parts.append( + '' + ) + parts.append(_SEARCH_JS) + parts.append(f'') + parts.append( + f'{html.escape(opts.title)}' + ) + parts.append( + f'' + f'{html.escape(opts.subtitle)}' + ) + parts.append( + f'Search' + ) + parts.append( + f' ' + ) + + +def _truncate(label: str, width_px: float) -> str: + """Fit a label into a frame, ~7px/char with 6px padding (else nothing).""" + maxchars = int((width_px - 6) / 7) + if maxchars < 3: + return "" + return label if len(label) <= maxchars else label[: maxchars - 2] + ".." + + +def _frame_svg(c: _Canvas, node: _Node, depth: int, x: int) -> str: + """Render one frame's group, or "" when narrower than the cutoff.""" + w = node.value / c.total * c.plot_width + if w < c.min_px: + return "" + x_px = _SIDE + x / c.total * c.plot_width + y = _PAD_TOP + (c.max_depth - depth) * c.frame_height + pct = node.value / c.total * 100.0 + tip = f"{node.name} ({node.value} {c.countname}, {pct:.2f}%)" + out = [ + f'', + f"{html.escape(tip)}", + f'', + ] + text = _truncate(node.name, w) + if text: + out.append( + f'{html.escape(text)}' + ) + out.append("") + return "".join(out) + + +def render_svg(root: _Node, opts: FlameOptions | None = None) -> str: + opts = opts or FlameOptions() + total = root.value or 1 + placed: list = [] + _layout(root, 0, 0, placed) + max_depth = max((d for _, d, _ in placed), default=0) + height = _PAD_TOP + (max_depth + 1) * opts.frame_height + _PAD_BOTTOM + canvas = _Canvas( + total=total, + max_depth=max_depth, + height=height, + plot_width=opts.width - 2 * _SIDE, + frame_height=opts.frame_height, + min_px=opts.min_px, + countname=opts.countname, + ) + + parts: list[str] = [] + _append_chrome(parts, opts, height) + for node, depth, x in placed: + parts.append(_frame_svg(canvas, node, depth, x)) + parts.append("\n") + return "".join(parts) + + +# 
Minimal, self-contained search affordance (highlight matches, report % of +# matched samples). No external assets; deterministic; no zoom to keep the +# artifact robust across renderers. +_SEARCH_JS = ( + "" +) + + +def main(argv=None) -> int: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--collapse-only", action="store_true", + help="emit collapsed stacks instead of SVG") + ap.add_argument("--from-collapsed", action="store_true", + help="read collapsed stacks instead of perf script output") + ap.add_argument("--title", default="QSL Flame Graph") + ap.add_argument("--subtitle", default="") + ap.add_argument("--countname", default="samples") + ap.add_argument("--root-name", default="all") + ap.add_argument("--width", type=int, default=1200) + args = ap.parse_args(argv) + + if args.from_collapsed: + folded = parse_collapsed(sys.stdin) + else: + folded = fold_perf_script(sys.stdin) + + if args.collapse_only: + for stack in sorted(folded): + sys.stdout.write(f"{stack} {folded[stack]}\n") + return 0 + + if not folded: + sys.stderr.write("flamegraph.py: no stacks parsed from input\n") + return 1 + + root = build_tree(folded, args.root_name) + opts = FlameOptions( + title=args.title, + subtitle=args.subtitle, + countname=args.countname, + width=args.width, + ) + sys.stdout.write(render_svg(root, opts)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/flamegraph.sh b/scripts/flamegraph.sh new file mode 100755 index 0000000..3d7dbfa --- /dev/null +++ b/scripts/flamegraph.sh @@ -0,0 +1,274 @@ +#!/usr/bin/env bash +# Generate a Linux perf flamegraph from the benchmark harness. 
+# +# Records call-graph samples with `perf record --call-graph dwarf`, folds them +# with scripts/flamegraph.py (a dependency-free stackcollapse + SVG renderer), +# and writes: +# results/flamegraph.svg -- the visual flamegraph (provenance embedded as a +# leading XML comment + a visible subtitle) +# results/flamegraph.txt -- provenance + classification + top folded stacks +# +# Defaults to software cpu-clock sampling so the artifact stays a portable +# hot-symbol *investigation* aid, not a latency/throughput claim. This is the +# missing-flamegraph follow-up tracked by issue #32 (the perf stat/record text +# workflow already exists; full hardware-PMU cache evidence stays in #90). +set -euo pipefail + +cd "$(dirname "$0")/.." +# shellcheck source=scripts/qsl_common.sh +source scripts/qsl_common.sh + +BIN="${QSL_BENCH_BIN:-build/bench/qsl-bench}" +OUT_SVG="${QSL_FLAMEGRAPH_SVG:-results/flamegraph.svg}" +OUT_TXT="${QSL_FLAMEGRAPH_TXT:-results/flamegraph.txt}" +DATA="${QSL_FLAMEGRAPH_DATA:-build/perf/qsl-bench.flame.data}" +EVENT="${QSL_FLAMEGRAPH_EVENT:-cpu-clock}" +FREQ="${QSL_FLAMEGRAPH_FREQ:-4000}" +CALLGRAPH="${QSL_FLAMEGRAPH_CALLGRAPH:-dwarf}" +MIN_SAMPLES="${QSL_FLAMEGRAPH_MIN_SAMPLES:-200}" +TOP_STACKS="${QSL_FLAMEGRAPH_TOP_STACKS:-15}" +BUILD_DIR="$(dirname "$BIN")" +PROVENANCE_SCOPE="flamegraph-benchmark" +PROVENANCE_INPUTS=( + Makefile + CMakeLists.txt + CMakePresets.json + cmake + include + src + apps/qsl-bench + benchmarks + scripts/flamegraph.sh + scripts/flamegraph.py + scripts/qsl_common.sh +) + +perf_version_line() { + perf --version 2>&1 | head -1 || true +} + +parse_sample_count_token() { + awk -v raw="$1" ' + BEGIN { + gsub(/,/, "", raw) + suffix = substr(raw, length(raw), 1) + mult = 1 + if (suffix == "K" || suffix == "k") { mult = 1000; raw = substr(raw, 1, length(raw) - 1) } + else if (suffix == "M" || suffix == "m") { mult = 1000000; raw = substr(raw, 1, length(raw) - 1) } + if (raw ~ /^[0-9]+([.][0-9]+)?$/) printf "%d\n", raw * mult + }' +} + 
+qsl_require_linux "scripts/flamegraph.sh" "perf" + +if ! command -v perf >/dev/null 2>&1; then + echo "error: perf not found. Install linux perf tooling for this kernel." >&2 + exit 2 +fi +if ! command -v python3 >/dev/null 2>&1; then + echo "error: python3 is required to render the flamegraph." >&2 + exit 2 +fi +if [[ ! -x "$BIN" ]]; then + echo "error: $BIN not found; build the benchmark preset first (make flamegraph)." >&2 + exit 1 +fi + +mkdir -p "$(dirname "$OUT_SVG")" "$(dirname "$DATA")" + +BENCH_OUT="$(mktemp)" +RECORD_BENCH_OUT="$(mktemp)" +RECORD_ERR="$(mktemp)" +SCRIPT_OUT="$(mktemp)" +SCRIPT_ERR="$(mktemp)" +FOLDED="$(mktemp)" +COLLAPSE_ERR="$(mktemp)" +SVG_TMP="$(mktemp)" +TXT_TMP="$(mktemp)" +trap 'rm -f "$BENCH_OUT" "$RECORD_BENCH_OUT" "$RECORD_ERR" "$SCRIPT_OUT" "$SCRIPT_ERR" "$FOLDED" "$COLLAPSE_ERR" "$SVG_TMP" "$TXT_TMP"' EXIT + +# Fail fast if the benchmark itself is broken (partial mode must not mask this). +BENCH_STATUS=0 +"$BIN" >"$BENCH_OUT" 2>&1 || BENCH_STATUS=$? +if [[ "$BENCH_STATUS" -ne 0 ]]; then + echo "error: benchmark command failed before perf record (status $BENCH_STATUS); partial mode cannot override this." >&2 + cat "$BENCH_OUT" >&2 + exit 4 +fi + +RECORD_STATUS=0 +perf record --call-graph "$CALLGRAPH" -F "$FREQ" -g -e "$EVENT" -o "$DATA" -- "$BIN" \ + >"$RECORD_BENCH_OUT" 2>"$RECORD_ERR" || RECORD_STATUS=$? + +SCRIPT_STATUS=0 +if [[ "$RECORD_STATUS" -eq 0 ]]; then + perf script -i "$DATA" >"$SCRIPT_OUT" 2>"$SCRIPT_ERR" || SCRIPT_STATUS=$? +fi + +PERF_LIMITATION=no +# `zero-sized data` is how `perf script` reports a no-sample capture; classify it +# as a perf limitation here exactly as scripts/perf_record.sh does, so the +# documented constrained-host (QSL_PERF_ALLOW_PARTIAL=1) path works instead of +# tripping the unexpected-failure exit. 
+if grep -Eiq 'zero-sized data|No samples|failed to open|Permission denied|Operation not permitted|perf_event_open|not supported|Operation not supported|perf not found for kernel|linux-tools' \ + "$RECORD_ERR" "$SCRIPT_ERR"; then + PERF_LIMITATION=yes +fi + +# perf record prints its sample summary as "(N samples)" or, on some versions, +# "(~N samples)" — and that count is only its own estimate. Accept the optional +# `~` so the token is not dropped, but keep this value informational; the sample +# gate below uses the authoritative folded total, not this estimate. +SAMPLE_TOKEN="$(sed -nE 's/.*\(~?([0-9][0-9.,]*[KkMm]?) samples\).*/\1/p' "$RECORD_ERR" | head -1)" +PERF_EST_SAMPLES="$(parse_sample_count_token "$SAMPLE_TOKEN")" +[[ -z "$PERF_EST_SAMPLES" ]] && PERF_EST_SAMPLES=0 + +# Fold to collapsed stacks for the text summary and as an SVG precondition. A +# nonzero COLLAPSE_STATUS means the renderer/parser itself failed (a generator +# regression), which is handled as an unexpected failure below — never masked as +# a perf sampling limitation. FOLDED_SAMPLES is the real sample total carried by +# the folded stacks (sum of trailing counts), the authoritative gate input. +STACK_COUNT=0 +FOLDED_SAMPLES=0 +COLLAPSE_STATUS=0 +if [[ "$SCRIPT_STATUS" -eq 0 && -s "$SCRIPT_OUT" ]]; then + python3 scripts/flamegraph.py --collapse-only <"$SCRIPT_OUT" >"$FOLDED" 2>"$COLLAPSE_ERR" || + COLLAPSE_STATUS=$? + STACK_COUNT="$(wc -l <"$FOLDED" | tr -d ' ')" + FOLDED_SAMPLES="$(awk '{ s += $NF } END { printf "%d\n", s + 0 }' "$FOLDED")" +fi + +INSUFFICIENT_SAMPLES=no +if [[ "$RECORD_STATUS" -eq 0 && "$SCRIPT_STATUS" -eq 0 && "$COLLAPSE_STATUS" -eq 0 && + "$FOLDED_SAMPLES" -lt "$MIN_SAMPLES" ]]; then + INSUFFICIENT_SAMPLES=yes +fi + +# Describe the sampling source once so every label/caveat (artifact type, SVG +# comment, text companion) stays consistent: software timers vs a hardware PMU +# event. cpu-clock/task-clock are software; cycles/instructions/etc. are PMU. 
+case "$EVENT" in +cpu-clock | task-clock) SAMPLE_KIND="software $EVENT sampling" ;; +*) SAMPLE_KIND="$EVENT hardware-PMU sampling" ;; +esac +ARTIFACT_TYPE="flamegraph ($SAMPLE_KIND hot-symbol profile)" +if [[ "$RECORD_STATUS" -ne 0 || "$SCRIPT_STATUS" -ne 0 || "$STACK_COUNT" -eq 0 ]]; then + ARTIFACT_TYPE="constrained-environment validation (partial; no clean sample report)" +elif [[ "$INSUFFICIENT_SAMPLES" == "yes" ]]; then + ARTIFACT_TYPE="constrained-environment validation (partial; insufficient samples for hot-symbol conclusions)" +fi + +PROVENANCE="$(qsl_emit_provenance "$PROVENANCE_SCOPE" "$OUT_SVG" "${PROVENANCE_INPUTS[@]}")" +HOST="$(uname -s) $(uname -m)" +DATE="$(qsl_utc_timestamp)" +SUBTITLE="$ARTIFACT_TYPE | $HOST | $EVENT @ ${FREQ}Hz | ${FOLDED_SAMPLES} samples | ${STACK_COUNT} stacks | $DATE" + +# Render the SVG (deterministic for a fixed folded input + fixed subtitle). +if [[ "$STACK_COUNT" -gt 0 ]]; then + { + echo '' + # Keep the delimiters on their own lines and squeeze any "--" + # out of the interior: a double hyphen is illegal inside an XML comment. + echo "" + # Drop the renderer's own XML declaration; we emitted ours above. + python3 scripts/flamegraph.py \ + --title "QSL Matching-Engine Flame Graph (qsl-bench)" \ + --subtitle "$SUBTITLE" \ + --countname "$EVENT samples" \ + --from-collapsed <"$FOLDED" | tail -n +2 + } >"$SVG_TMP" + qsl_publish_artifact "$SVG_TMP" "$OUT_SVG" +else + # No clean folded stacks. Remove any prior SVG so a constrained rerun cannot + # leave a previous host's flamegraph beside a .txt that says there is no + # sample report — which could be committed as if the two still matched. + rm -f "$OUT_SVG" +fi + +# Text companion: provenance + classification + top folded stacks (human/queryable). 
+{ + echo "Command: make flamegraph" + echo "Artifact: $ARTIFACT_TYPE" + echo "Hardware: $(uname -m)" + echo "OS: $(uname -s) $(uname -r)" + echo "CPU: $(qsl_cpu_model)" + echo "Compiler: $(qsl_build_compiler_version "$BUILD_DIR")" + echo "Perf: $(perf_version_line)" + echo "Perf paranoid: $(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo unknown)" + echo "Build type: $(qsl_build_type "$BUILD_DIR")" + echo "$PROVENANCE" + echo "Benchmark binary: $BIN" + echo "Dataset: qsl-bench default synthetic benchmark suite" + echo "Call graph: $CALLGRAPH" + echo "Record event: $EVENT" + echo "Sample freq: $FREQ Hz" + echo "Sample count (folded total): $FOLDED_SAMPLES" + echo "Sample count (perf record est.): $PERF_EST_SAMPLES" + echo "Folded stacks: $STACK_COUNT" + echo "Minimum samples for hot profile: $MIN_SAMPLES" + echo "Insufficient samples: $INSUFFICIENT_SAMPLES" + echo "Record status: $RECORD_STATUS" + echo "Script status: $SCRIPT_STATUS" + echo "Perf access limitation: $PERF_LIMITATION" + echo "Flamegraph SVG: $(qsl_repo_relative_or_empty "$OUT_SVG")" + echo "Perf data: $DATA (generated, not intended for commit)" + echo + if [[ "$ARTIFACT_TYPE" == flamegraph* ]]; then + echo "Caveat: this flamegraph is a $SAMPLE_KIND profile for hot-symbol" + echo "investigation. Frame width is proportional to on-CPU samples, not wall-clock" + echo "latency or throughput, and is hardware/kernel/compiler/build dependent." + else + echo "Caveat: constrained/partial perf validation, not a hot-symbol flamegraph. Treat" + echo "frame widths as unusable until sampling succeeds and the folded sample total" + echo "meets the Minimum samples for hot profile." + fi + echo + echo "Top $TOP_STACKS folded stacks (count stack):" + if [[ -s "$FOLDED" ]]; then + # The final awk limits to $TOP_STACKS rows by reading all input (NR<=top) + # rather than `head`, so `sort` is never sent SIGPIPE under `pipefail`. 
+ awk '{ n=$NF; $NF=""; sub(/[[:space:]]+$/,""); printf "%s\t%s\n", n, $0 }' "$FOLDED" | + sort -t"$(printf '\t')" -k1,1nr | + awk -F"$(printf '\t')" -v top="$TOP_STACKS" 'NR<=top { printf "%8d %s\n", $1, $2 }' + else + echo " (none)" + fi + echo + echo "Benchmark output:" + cat "$BENCH_OUT" +} >"$TXT_TMP" +qsl_publish_artifact "$TXT_TMP" "$OUT_TXT" +echo "wrote $OUT_TXT" +[[ "$STACK_COUNT" -gt 0 ]] && echo "wrote $OUT_SVG" + +# A renderer/parser failure (perf script succeeded but flamegraph.py errored) is +# a generator bug, not a perf sampling limitation — fail hard so partial mode +# cannot publish a Python/parser regression as a constrained-environment artifact. +if [[ "$SCRIPT_STATUS" -eq 0 && "$COLLAPSE_STATUS" -ne 0 ]]; then + echo "error: flamegraph.py --collapse-only failed (status $COLLAPSE_STATUS); this is a renderer/parser failure, not a perf limitation, and partial mode cannot mask it." >&2 + cat "$COLLAPSE_ERR" >&2 + exit 4 +fi +if [[ ("$RECORD_STATUS" -ne 0 || "$SCRIPT_STATUS" -ne 0) && "$PERF_LIMITATION" != "yes" ]]; then + echo "error: perf record/script failed for a reason other than a perf access limitation." >&2 + exit 3 +fi +if [[ "$STACK_COUNT" -eq 0 || "$INSUFFICIENT_SAMPLES" == "yes" ]]; then + if [[ "${QSL_PERF_ALLOW_PARTIAL:-0}" != "1" ]]; then + echo "error: flamegraph did not capture enough samples for a clean profile." >&2 + echo " Re-run on Linux with perf sampling access, or set QSL_PERF_ALLOW_PARTIAL=1" >&2 + echo " only when intentionally documenting a constrained environment." 
>&2 + exit 3 + fi +fi diff --git a/tests/CMakeLists.txt b/tests/CMakeLists.txt index 4e95e46..cb617a9 100644 --- a/tests/CMakeLists.txt +++ b/tests/CMakeLists.txt @@ -89,6 +89,13 @@ add_test( NAME qsl_common_publish_artifact COMMAND bash "${CMAKE_CURRENT_LIST_DIR}/shell/test_qsl_common.sh") +# Shell unit tests for the dependency-free flamegraph renderer (scripts/flamegraph.py: +# perf-script folding + deterministic SVG rendering) behind `make flamegraph` (#32). +# Portable: needs only python3 (skips cleanly if absent); does not require perf. +add_test( + NAME qsl_flamegraph_render + COMMAND bash "${CMAKE_CURRENT_LIST_DIR}/shell/test_flamegraph.sh") + if(EXISTS "/dev/full") add_test( NAME qsl_replay_generate_append_failure diff --git a/tests/shell/test_flamegraph.sh b/tests/shell/test_flamegraph.sh new file mode 100644 index 0000000..585ba34 --- /dev/null +++ b/tests/shell/test_flamegraph.sh @@ -0,0 +1,153 @@ +#!/usr/bin/env bash +# Unit tests for scripts/flamegraph.py — the dependency-free stackcollapse + SVG +# renderer behind `make flamegraph` (issue #32). +# +# The shell driver (scripts/flamegraph.sh) needs Linux `perf`, which CI does not +# have, so these tests exercise the deterministic, portable core instead: +# 1. `perf script` output folds into correct collapsed stacks (innermost-first +# perf order reversed to root-first, comm at the base, dso + "+0xoffset" +# stripped, C++ symbols with spaces/parens preserved). +# 2. identical stacks aggregate their counts. +# 3. collapsed output is sorted and deterministic. +# 4. the SVG render is well-formed, escapes XML metacharacters, contains the +# expected frames, and is byte-identical across runs (no RNG, no timestamps). +# 5. empty input is handled (exit 1 for SVG, empty for --collapse-only). +# +# Registered with CTest (see tests/CMakeLists.txt); runs under `make check`. 
+# Run directly: bash tests/shell/test_flamegraph.sh + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" +FG="$REPO_ROOT/scripts/flamegraph.py" + +if ! command -v python3 >/dev/null 2>&1; then + echo "SKIP: python3 not found; flamegraph renderer tests skipped" + exit 0 +fi + +PASS=0 +FAIL=0 + +expect_eq() { + local name="$1" expected="$2" actual="$3" + if [[ "$actual" == "$expected" ]]; then + printf 'PASS: %s\n' "$name" + PASS=$((PASS + 1)) + else + printf 'FAIL: %s\n expected: %q\n actual: %q\n' "$name" "$expected" "$actual" + FAIL=$((FAIL + 1)) + fi +} + +expect_contains() { + local name="$1" needle="$2" haystack="$3" + if [[ "$haystack" == *"$needle"* ]]; then + printf 'PASS: %s\n' "$name" + PASS=$((PASS + 1)) + else + printf 'FAIL: %s\n missing: %q\n' "$name" "$needle" + FAIL=$((FAIL + 1)) + fi +} + +expect_not_contains() { + local name="$1" needle="$2" haystack="$3" + if [[ "$haystack" != *"$needle"* ]]; then + printf 'PASS: %s\n' "$name" + PASS=$((PASS + 1)) + else + printf 'FAIL: %s\n unexpected: %q\n' "$name" "$needle" + FAIL=$((FAIL + 1)) + fi +} + +# Build a synthetic `perf script` block. Frame lines must start with a TAB; the +# header line for each sample must start in column 0. 
+TAB=$'\t' +make_perf_script() { + printf '%s\n' \ + "qsl-bench 100 1.0: 1000 cpu-clock:u:" \ + "${TAB}415cd0 qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side)+0x310 (/path/qsl-bench)" \ + "${TAB}402887 main+0x127 (/path/qsl-bench)" \ + "" \ + "qsl-bench 100 2.0: 1000 cpu-clock:u:" \ + "${TAB}415cd0 qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side)+0x300 (/path/qsl-bench)" \ + "${TAB}402887 main+0x100 (/path/qsl-bench)" \ + "" \ + "qsl-bench 100 3.0: 1000 cpu-clock:u:" \ + "${TAB}aaaa cfree+0x5 (/usr/lib64/libc.so.6)" \ + "${TAB}402887 main+0x10 (/path/qsl-bench)" \ + "" +} + +# --- Folding (stackcollapse) ------------------------------------------------ + +FOLDED="$(make_perf_script | python3 "$FG" --collapse-only)" + +# Innermost-first perf order is reversed to root-first, comm prepended, dso and +# "+0xoffset" stripped. The two add_limit samples (different offsets) collapse to +# one stack with count 2. +expect_contains "add_limit stack folds with comm at base, offset+dso stripped, count 2" \ + 'qsl-bench;main;qsl::engine::OrderBook::add_limit(unsigned long, qsl::core::Side) 2' \ + "$FOLDED" +expect_contains "libc leaf folds to one sample" \ + 'qsl-bench;main;cfree 1' \ + "$FOLDED" +expect_not_contains "dso paths are stripped from frames" "/usr/lib64/libc.so.6" "$FOLDED" +expect_not_contains "raw +0x offsets are stripped from frames" "+0x" "$FOLDED" + +# Collapsed output is sorted (deterministic) and stable across runs. 
+FOLDED2="$(make_perf_script | python3 "$FG" --collapse-only)" +expect_eq "collapse-only is deterministic" "$FOLDED" "$FOLDED2" +SORTED="$(printf '%s\n' "$FOLDED" | LC_ALL=C sort)" +expect_eq "collapse-only output is sorted" "$SORTED" "$FOLDED" + +# --- SVG rendering ---------------------------------------------------------- + +SVG="$(make_perf_script | python3 "$FG" --title "T" --subtitle "S")" +expect_contains "svg has XML declaration" '<?xml' "$SVG" +expect_contains "svg carries the title" '>T' "$SVG" +expect_contains "svg renders the add_limit frame" 'add_limit' "$SVG" +expect_contains "svg renders rect frames" 'class="frame"' "$SVG" + +# Deterministic: byte-identical across two renders of the same input. +SVG2="$(make_perf_script | python3 "$FG" --title "T" --subtitle "S")" +expect_eq "svg render is deterministic" "$SVG" "$SVG2" + +# XML metacharacters in frame names are escaped, not emitted raw. +ESC_SVG="$(printf 'bench;a<b>&c 3\n' | python3 "$FG" --from-collapsed)" +expect_contains "frame names are XML-escaped" '&lt;b&gt;&amp;c' "$ESC_SVG" +expect_not_contains "raw unescaped angle bracket is not emitted in a frame title" 'a<b>' "$ESC_SVG" + +# --- Collapsed input parsing ------------------------------------------------ + +# A tab-separated stack that itself contains spaces must split on the count, not +# on an interior space. +TAB_COLLAPSED="$(printf 'main;foo(unsigned int)\t7\n' | python3 "$FG" --from-collapsed --collapse-only)" +expect_eq "tab-separated collapsed line keeps its count" \ + 'main;foo(unsigned int) 7' "$TAB_COLLAPSED" + +# Non-positive counts are ignored; a stack with only such counts yields nothing. +NONPOS="$(printf 'a;b 0\nc;d -3\n' | python3 "$FG" --from-collapsed --collapse-only)" +expect_eq "non-positive collapsed counts are dropped" "" "$NONPOS" + +printf 'a;b 0\n' | python3 "$FG" --from-collapsed >/dev/null 2>&1 +rc=$? 
+expect_eq "all-non-positive collapsed input fails SVG with exit 1" "1" "$rc" + +# --- Empty input ------------------------------------------------------------ + +EMPTY_COLLAPSE="$(printf '' | python3 "$FG" --collapse-only)" +expect_eq "empty input yields empty collapse" "" "$EMPTY_COLLAPSE" + +printf '' | python3 "$FG" >/dev/null 2>&1 +rc=$? +expect_eq "empty input fails SVG render with exit 1" "1" "$rc" + +# --- Summary ---------------------------------------------------------------- + +printf '\nResults: %d passed, %d failed\n' "$PASS" "$FAIL" +[[ "$FAIL" -eq 0 ]]
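
Reviewer note: the folding contract pinned down by the tests above (innermost-first perf order reversed to root-first, comm at the base, dso and `+0xoffset` stripped, identical stacks aggregated) can be sketched as below. This is a minimal illustration, not the code in `scripts/flamegraph.py`: the names `frame_symbol` and `fold` are hypothetical, and the sketch assumes each sample starts with a header line in column 0 whose first whitespace-separated token is the comm, followed by indented frame lines of the form `<addr> <symbol>+0x<off> (<dso>)`.

```python
import re
from collections import Counter

# Trailing "+0x<hex>" instruction offset appended to a symbol by perf script.
_OFFSET = re.compile(r"\+0x[0-9a-fA-F]+$")


def frame_symbol(line: str) -> str:
    """Return the bare symbol from one indented frame line (hypothetical helper)."""
    parts = line.strip().split(None, 1)  # "<addr>", "<symbol+0xoff> (<dso>)"
    rest = parts[1] if len(parts) == 2 else "[unknown]"
    if rest.endswith(")"):
        # Assumption: the dso is the last " (...)" group; C++ signatures such as
        # "f(unsigned long)" have no space before their opening parenthesis.
        cut = rest.rfind(" (")
        if cut != -1:
            rest = rest[:cut]
    return _OFFSET.sub("", rest)


def fold(lines) -> Counter:
    """Fold perf-script-like text into {"comm;root;...;leaf": sample_count}."""
    folded: Counter = Counter()
    comm = None
    frames: list[str] = []

    def flush() -> None:
        nonlocal comm, frames
        if comm is not None and frames:
            # perf lists the innermost frame first; reverse to root-first.
            folded[";".join([comm] + frames[::-1])] += 1
        comm, frames = None, []

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            flush()  # a blank line ends the current sample
        elif line[0] in " \t":
            frames.append(frame_symbol(line))
        else:
            flush()  # a header in column 0 starts a new sample
            comm = line.split()[0]
    flush()
    return folded
```

Feeding the three-sample fixture from `make_perf_script` through `fold` and printing `sorted(folded.items())` as `stack count` lines should correspond to the collapsed form the shell tests compare against. Comms containing spaces and frames without a resolvable symbol are deliberately out of scope for this sketch.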