Merged
17 commits
587778d
docs: sync resume anchors and PMU claims to v0.2.0 (Codex #127/#128 f…
div0rce Jun 21, 2026
0c3b401
perf: add flamegraph generator and make target (#32)
div0rce Jun 21, 2026
beec2d0
perf: add generated flamegraph artifact on bare-metal Fedora Asahi (#32)
div0rce Jun 21, 2026
872600a
perf: harden flamegraph collapsed-stack parsing (Codex review)
div0rce Jun 21, 2026
0201d54
perf: regenerate flamegraph artifact after parser hardening
div0rce Jun 22, 2026
52de5b8
refactor: improve flamegraph.py code health (CodeScene gate)
div0rce Jun 22, 2026
d4be2da
perf: regenerate flamegraph artifact after code-health refactor
div0rce Jun 22, 2026
4aec1d0
refactor: flatten flamegraph.py remaining complexity (CodeScene)
div0rce Jun 22, 2026
3905059
perf: regenerate flamegraph artifact after complexity flattening
div0rce Jun 22, 2026
dfa4da2
docs: record resume-anchor sync in PROGRESS current-state (Codex #129)
div0rce Jun 22, 2026
6ef5015
Merge branch 'docs/codex-resume-anchor-sync' into perf/flamegraph-art…
div0rce Jun 22, 2026
31070b1
perf: harden flamegraph.sh classification + sample gating (Codex #130)
div0rce Jun 22, 2026
06b7675
perf: regenerate flamegraph artifact after classification hardening
div0rce Jun 22, 2026
4a2aa67
docs: scope partial-PMU claim to perf-stat; perf-record is a software…
div0rce Jun 22, 2026
2199820
Merge branch 'docs/codex-resume-anchor-sync' into perf/flamegraph-art…
div0rce Jun 22, 2026
5093beb
docs: embed the flamegraph as a visible image in the README
div0rce Jun 22, 2026
b8351de
Merge remote-tracking branch 'origin/main' into perf/flamegraph-artifact
div0rce Jun 22, 2026
1 change: 1 addition & 0 deletions AGENTS.md
@@ -385,6 +385,7 @@ Keep this synchronized with the Makefile.
- `make bench-recovery` — run M46 recovery benchmarking (full-replay restart vs book rebuild)
- `make perf-stat` — run Linux `perf stat` workflow where supported
- `make perf-record` — run Linux `perf record/report` workflow where supported
- `make flamegraph` — render a Linux `perf` call-graph flamegraph (SVG) where supported
- `make numa-study` — run Linux CPU-affinity / scheduler-migration / NUMA-locality study where supported
- `make false-sharing-study` — run benchmark-only packed-vs-padded SPSC cursor contention study
- `make profile-io` — run Linux syscall/socket-path profiling where supported
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -385,6 +385,7 @@ Keep this synchronized with the Makefile.
- `make bench-recovery` — run M46 recovery benchmarking (full-replay restart vs book rebuild)
- `make perf-stat` — run Linux `perf stat` workflow where supported
- `make perf-record` — run Linux `perf record/report` workflow where supported
- `make flamegraph` — render a Linux `perf` call-graph flamegraph (SVG) where supported
- `make numa-study` — run Linux CPU-affinity / scheduler-migration / NUMA-locality study where supported
- `make false-sharing-study` — run benchmark-only packed-vs-padded SPSC cursor contention study
- `make profile-io` — run Linux syscall/socket-path profiling where supported
4 changes: 3 additions & 1 deletion MILESTONES.md
@@ -484,7 +484,9 @@ Do not pull backlog items into earlier PRs.
- FIX-like text protocol adapter. (#29)
- Web dashboard for visualization. (#30)
- Docker packaging. (#31)
- Perf/flamegraph docs. (#32)
- Perf/flamegraph docs. (#32) — **done**: `make flamegraph` renders a perf call-graph flamegraph
via the dependency-free `scripts/flamegraph.py` (`results/flamegraph.svg` + `.txt`), unit-tested in
`tests/shell/test_flamegraph.sh`. Full hardware cache-PMU evidence stays in #90.
- GitHub Pages documentation site. (#33)

### Differential-testing follow-ups (prioritized)
9 changes: 8 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: configure build test check fmt fmt-check tidy bench bench-diff bench-allocator bench-storage bench-recovery perf-stat perf-record numa-study false-sharing-study profile-io socket-stress socket-load dpdk-check nic-offload-check crash-recovery concurrency-stress asan tsan demo check-fixtures check-manifest determinism divergence-demo clean
.PHONY: configure build test check fmt fmt-check tidy bench bench-diff bench-allocator bench-storage bench-recovery perf-stat perf-record flamegraph numa-study false-sharing-study profile-io socket-stress socket-load dpdk-check nic-offload-check crash-recovery concurrency-stress asan tsan demo check-fixtures check-manifest determinism divergence-demo clean

BUILD_DIR := build/dev

@@ -63,6 +63,13 @@ perf-record:
cmake --build --preset bench --target qsl-bench
QSL_BENCH_BIN=build/bench/qsl-bench bash scripts/perf_record.sh

# Issue #32: render a perf call-graph flamegraph (SVG) from the benchmark harness. Linux-only.
flamegraph:
@test "$$(uname -s)" = "Linux" || { echo "error: make flamegraph requires Linux perf; current OS is $$(uname -s)." >&2; exit 2; }
cmake --preset bench
cmake --build --preset bench --target qsl-bench
QSL_BENCH_BIN=build/bench/qsl-bench bash scripts/flamegraph.sh

# M43: CPU-affinity / scheduler-migration / NUMA locality study. Linux-only.
numa-study:
@if test "$$(uname -s)" != "Linux"; then \
16 changes: 16 additions & 0 deletions PROGRESS.md
@@ -370,6 +370,22 @@ Lower priority:
(E-core) PMU carries live counts — the `apple_blizzard_pmu/...` rows read `<not counted>` in
`results/perf_stat_linux.txt` because the single-threaded benchmark stays on the Avalanche P-cores.
Docs/memory only; no code or artifacts changed.
- [2026-06-21] Issue #32 flamegraph profiling artifact (`perf/flamegraph-artifact`, stacked on the
Codex-followup branch). Added `make flamegraph` → `scripts/flamegraph.sh`, which records
`perf record --call-graph dwarf -F 4000 -g -e cpu-clock` on `qsl-bench` and renders
`results/flamegraph.svg` (+ `results/flamegraph.txt` provenance/classification companion). The
fold + SVG render live in `scripts/flamegraph.py`, a dependency-free stdlib-only stackcollapse +
flamegraph renderer (no vendored Perl FlameGraph toolkit), deterministic by design (frames sorted
by name; colors a pure function of the name; no RNG/timestamps in the drawn body). DWARF call
graphs are used because the Release `bench` preset omits frame pointers; application symbols
(`OrderBook::add_limit`, `MatchingEngine::new_limit`, the replay path, …) still resolve from the
symtab. Added `tests/shell/test_flamegraph.sh` (CTest-registered, python3-only, skips cleanly if
absent) covering folding (offset/dso stripping, perf-order reversal, comm-at-base, count
aggregation, sortedness), SVG well-formedness, XML escaping, determinism, and empty-input
handling; `make check` 242/242. The committed `results/flamegraph.svg`/`.txt` were generated on
the bare-metal Fedora Asahi host (aarch64) from the clean committed tree (`Dirty inputs: no`).
This is a software cpu-clock sampling hot-symbol profile, not a latency/throughput claim; full
hardware cache-PMU evidence stays in #90. Do not merge from automation; human squash-merges.
- [2026-06-03] M35: implemented a multi-client TCP connection-scaling load test (`scripts/socket_load.sh`, `make socket-load`, Linux-only) driving N concurrent `qsl-client`s against the portable TCP and epoll (M34) gateways; `results/socket_load_summary.txt` is Docker-generated and constrained. A `/code-review` (3 finder agents) caught and fixed real measurement-integrity bugs before the PR: a failed trial's `wall=0` no longer poisons the reported best (only trials whose gateway served count toward the min); the `completed` column reports the WORST per-trial completion, not the last, so partial/total trial failures are surfaced rather than masked; a per-client `timeout` bounds a hang if the gateway dies; and `QSL_LOAD_TRIALS` is validated. Post-PR hardening uses fresh monotonic ports per gateway start, retries transient startup/serve failures on new ports, and refuses to write a partial artifact unless `QSL_LOAD_ALLOW_PARTIAL=1` is set intentionally; the refreshed artifact records `Dirty tree: no`. The scaling-shape claim remains constrained to loopback connection setup, not a demonstrated production-capacity advantage for either transport. Deferred follow-up: a shared `scripts/lib` to remove the dirty-tree / `wait_ready` / gateway-stop duplication across the three socket scripts.
- [2026-06-03] M35: started after M34 (#98) squash-merged (commit 9e3750b). Scope: multi-client load / socket-pressure testing of the gateway/feed path (TCP/UDP stress, socket-buffer pressure, connection scaling, backpressure) building on M34's epoll multi-client path and M30's socket tooling. Constraints: scripts/tests document load shape + environment; results must distinguish kernel/socket pressure from user-space engine cost; no production-capacity claims (honest constrained-environment framing, like M29/M30).
- [2026-06-04] M35: PR #100 squash-merged to `main` as a86b701 after all CI jobs and review checks were green. M35 is now landed; original M36 NUMA remains deferred until the repository-health refactor analysis is completed or explicitly skipped by the human.
17 changes: 17 additions & 0 deletions README.md
@@ -109,6 +109,23 @@ Reproduce with `make bench` (numbers will differ by machine). The differential-t
[`results/differential.txt`](results/differential.txt) — kept separate so it does not disturb
the core numbers above.

### Flamegraph

Where on-CPU time goes in the `qsl-bench` synthetic suite, rendered by `make flamegraph`
(`scripts/flamegraph.sh` → the dependency-free `scripts/flamegraph.py` — no external FlameGraph
toolchain):

[![qsl-bench cpu-clock flamegraph](results/flamegraph.svg)](results/flamegraph.svg)

This is a **software cpu-clock sampling** hot-symbol profile, **not** PMU evidence: frame width is
proportional to on-CPU samples (329 samples folded into 159 distinct stacks on this run), not
wall-clock latency or throughput, and it depends on the hardware, kernel, compiler, and build. The hot frames are protocol
`decode_new_order`, gateway session framing, `MatchingEngine::new_limit`, and order-book
cancel/allocation. Provenance and classification are in
[`results/flamegraph.txt`](results/flamegraph.txt); methodology in
[docs/perf_analysis.md](docs/perf_analysis.md). GitHub renders the SVG statically; download the raw
file for interactive zoom and search.
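
As an illustration of "frame width is proportional to on-CPU samples", the sketch below computes each frame's share of the total from folded stacks. It is a minimal sketch and not the code in `scripts/flamegraph.py`; the function name `frame_widths` and the toy stacks are hypothetical, and it assumes frame names contain no `;`.

```python
from collections import Counter


def frame_widths(folded):
    """Map each frame (identified by its stack prefix) to its share of samples.

    `folded` is an iterable of (stack, count) pairs, where `stack` is a
    root-first, semicolon-separated frame list. A frame's width is the sum
    of the counts of every stack that passes through it (inclusive samples).
    """
    folded = list(folded)
    total = sum(count for _, count in folded)
    if total == 0:
        return {}
    inclusive = Counter()
    for stack, count in folded:
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            inclusive[tuple(frames[:depth])] += count
    return {prefix: n / total for prefix, n in inclusive.items()}


# Hypothetical toy input, not taken from results/flamegraph.txt.
toy = [
    ("qsl-bench;main;decode", 15),
    ("qsl-bench;main;new_limit", 6),
    ("qsl-bench;malloc", 4),
]
widths = frame_widths(toy)
for prefix, share in sorted(widths.items()):
    print(f"{share:6.1%}  {';'.join(prefix)}")
```

A wide frame therefore means many samples landed in that function or its callees, which says nothing about how long a single call takes.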

## Limitations

- **Synthetic and local.** No real market data, no real venue connectivity, no order types
34 changes: 32 additions & 2 deletions docs/perf_analysis.md
@@ -55,6 +55,30 @@ default is intentional: many CI, VM, and container environments do not expose ha
to unprivileged processes, and the benchmark harness is short enough that a lower frequency can
miss the minimum sample count needed for meaningful hot-symbol ordering.

Render a flamegraph (issue #32):

```bash
make flamegraph
```

This runs `scripts/flamegraph.sh`, which records call-graph samples
(`perf record --call-graph dwarf -F 4000 -g -e cpu-clock`), folds them, and renders an SVG to
`results/flamegraph.svg` plus a text companion `results/flamegraph.txt` (provenance, classification,
and the top folded stacks). DWARF call graphs are used so stacks unwind correctly even though the
`bench` (Release) preset omits frame pointers — the application symbols (`OrderBook::add_limit`,
`MatchingEngine::new_limit`, the replay path, …) resolve from the symbol table without changing the
optimization level under measurement.

The folding and SVG rendering live in `scripts/flamegraph.py`, a dependency-free Python script
(standard library only) that reimplements the `stackcollapse` + flamegraph data model rather than
vendoring Brendan Gregg's Perl toolkit, so the artifact is reproducible from this repository alone.
The renderer is deterministic — frames are sorted by name and colors are a pure function of the
frame name (no RNG, no timestamps in the drawn body) — and is unit-tested in
`tests/shell/test_flamegraph.sh` (registered with CTest, runs under `make check`). Frame width is
proportional to on-CPU samples; this is a software cpu-clock sampling profile for **hot-symbol
investigation**, not a latency or throughput measurement. Set `QSL_FLAMEGRAPH_EVENT=cycles` to
sample the hardware PMU cycles event instead, where the host exposes it.
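
The folding step and the deterministic coloring described above can be sketched as follows. This is a hedged illustration, not the contents of `scripts/flamegraph.py`: the names `fold` and `color_for` are hypothetical, the color formula is arbitrary, and the parser assumes the common `perf script` layout (an unindented header line whose first token is the command name, indented `address symbol+0xoffset (dso)` frame lines listed leaf-first, and a blank line between samples).

```python
import hashlib
import re
from collections import Counter

# Assumed frame-line shape: "<hex address> <symbol>[+0xoffset] [(dso)]".
_FRAME = re.compile(r"^\s+[0-9a-fA-F]+\s+(?P<sym>.+?)(?:\s+\((?P<dso>[^()]*)\))?\s*$")
_OFFSET = re.compile(r"\+0x[0-9a-fA-F]+$")


def fold(perf_script_text):
    """Collapse perf-script samples into sorted (stack, count) pairs.

    Stacks are emitted root-first with the command name at the base, so the
    leaf-first order printed by perf is reversed. Offsets and DSO names are
    stripped so identical call paths aggregate into one entry.
    """
    counts = Counter()
    comm, frames = None, []

    def flush():
        nonlocal comm, frames
        if comm is not None and frames:
            counts[";".join([comm] + frames[::-1])] += 1
        comm, frames = None, []

    for line in perf_script_text.splitlines():
        if not line.strip():
            flush()                      # blank line ends a sample
        elif not line[0].isspace():
            flush()                      # unindented line starts a new sample
            comm = line.split()[0]
        else:
            match = _FRAME.match(line)
            if match:
                frames.append(_OFFSET.sub("", match.group("sym")))
    flush()
    return sorted(counts.items())        # sorted by stack name


def color_for(name):
    """Derive a warm RGB color from the frame name alone.

    hashlib is used instead of the built-in hash(), whose string hashing is
    randomized per process and would break run-to-run determinism.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return f"rgb({205 + digest[0] % 50},{digest[1] % 230},{digest[2] % 55})"
```

Because both functions depend only on their input text, rendering the same samples twice yields byte-identical output, which is the property the determinism test relies on.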

## Required Environment

Both scripts are Linux-only and fail before running on non-Linux hosts. `perf stat` also fails
@@ -113,8 +137,14 @@ counters, permission-limited sampling, or a sample report that is explicitly mar
- `results/perf_report_linux.txt` records benchmark output, `perf record` stderr, and
`perf report --stdio` output. It is useful as a hot-symbol profile only when `No samples: no`,
`Insufficient samples: no`, and `Sample count` is at least `Minimum samples for hot profile`.
- `build/perf/qsl-bench.perf.data` is generated by `make perf-record` and is intentionally not
committed; it is host-specific binary profiler data.
- `results/flamegraph.svg` is the rendered flamegraph from `make flamegraph`; `results/flamegraph.txt`
is its provenance/classification companion (and lists the top folded stacks). Treat frame widths as
a hot-symbol guide only when the `.txt` reports an `Artifact:` of `flamegraph (...)` and a
`Sample count` of at least `Minimum samples for hot profile`; a `constrained-environment validation`
label means sampling did not capture enough stacks to trust.
- `build/perf/qsl-bench.perf.data` and `build/perf/qsl-bench.flame.data` are generated by
`make perf-record` / `make flamegraph` and are intentionally not committed; they are host-specific
binary profiler data.
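
The trust rule for `results/flamegraph.txt` can be checked mechanically. The sketch below is an assumption-laden illustration rather than part of the repository: the helper names are hypothetical, it reads the `Key: value` header up to the first blank line, and it assumes `Sample count (folded total)` is the count the rule refers to.

```python
def parse_header(text):
    """Read 'Key: value' lines up to the first blank line into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break  # the provenance header ends at the first blank line
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_hot_symbol_guide(fields):
    """Return True only when frame widths are usable as a hot-symbol guide."""
    try:
        samples = int(fields["Sample count (folded total)"])
        minimum = int(fields["Minimum samples for hot profile"])
    except (KeyError, ValueError):
        return False  # missing or malformed counts are never trusted
    return fields.get("Artifact", "").startswith("flamegraph (") and samples >= minimum


# Excerpt of the header fields shown in results/flamegraph.txt.
excerpt = (
    "Artifact: flamegraph (software cpu-clock sampling hot-symbol profile)\n"
    "Sample count (folded total): 329\n"
    "Minimum samples for hot profile: 200\n"
    "\n"
    "Caveat: text after the blank line is ignored by parse_header.\n"
)
print(is_hot_symbol_guide(parse_header(excerpt)))
```

Failing closed on a missing field keeps a truncated or hand-edited artifact from being read as valid evidence.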

Each artifact includes hardware, kernel, compiler, perf version, build type, dataset, command,
event set, and source-digest provenance. The `Source digest` is the authoritative source identity;
6 changes: 6 additions & 0 deletions results/README.md
@@ -23,6 +23,12 @@ Benchmark results produced by `make bench` and scripts under `scripts/`.
- `perf_report_linux.txt` — Linux `perf record/report` hot-symbol output for the benchmark
harness (`make perf-record`). It is useful as a hot-symbol profile only when the file says
`No samples: no`, `Insufficient samples: no`, and the sample count meets the reported minimum.
- `flamegraph.svg` / `flamegraph.txt` — Linux `perf` call-graph flamegraph (`make flamegraph`,
issue #32) rendered by the dependency-free `scripts/flamegraph.py`. The `.svg` is the visual
(frame width ∝ on-CPU samples) with provenance in a leading XML comment; the `.txt` carries
provenance, the `Artifact:` classification, and the top folded stacks. It is a software cpu-clock
sampling profile for hot-symbol investigation, not a latency/throughput claim — trust frame widths
only when the `.txt` reports a `flamegraph (...)` artifact with enough samples.
- `numa_affinity_study.txt` — Linux CPU-affinity / scheduler-migration / NUMA-locality study
output (`make numa-study`). It must self-classify as `full-linux-numa`, `linux-constrained`, or
`unsupported-host`; only `full-linux-numa` is full NUMA evidence.
31 changes: 31 additions & 0 deletions results/flamegraph.svg
59 changes: 59 additions & 0 deletions results/flamegraph.txt
@@ -0,0 +1,59 @@
Command: make flamegraph
Artifact: flamegraph (software cpu-clock sampling hot-symbol profile)
Hardware: aarch64
OS: Linux 6.19.14-400.asahi.fc44.aarch64+16k
CPU: Avalanche-M2
Compiler: c++ (GCC) 16.1.1 20260515 (Red Hat 16.1.1-2)
Perf: perf version 6.19.14-400.asahi.fc44.aarch64
Perf paranoid: 2
Build type: Release
Provenance version: 1
Git commit (informational): 31070b1
Source digest: sha256:6aa521e6295a99f9dbf7dee9e5bcef04e93174ed12c3e8de9b991a8bfc14c809
Source digest scope: flamegraph-benchmark
Dirty inputs: no
Generated output: results/flamegraph.svg
Date: 2026-06-22T02:18:23Z
Benchmark binary: build/bench/qsl-bench
Dataset: qsl-bench default synthetic benchmark suite
Call graph: dwarf
Record event: cpu-clock
Sample freq: 4000 Hz
Sample count (folded total): 329
Sample count (perf record est.): 329
Folded stacks: 159
Minimum samples for hot profile: 200
Insufficient samples: no
Record status: 0
Script status: 0
Perf access limitation: no
Flamegraph SVG: results/flamegraph.svg
Perf data: build/perf/qsl-bench.flame.data (generated, not intended for commit)

Caveat: this flamegraph is a software cpu-clock sampling profile for hot-symbol
investigation. Frame width is proportional to on-CPU samples, not wall-clock
latency or throughput, and is hardware/kernel/compiler/build dependent.

Top 15 folded stacks (count stack):
15 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::protocol::decode_new_order(std::span<std::byte const, 18446744073709551615ul>)
11 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>);qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::Session::process_frame(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce);qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce)
11 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long);qsl::engine::MatchingEngine::contains(unsigned int, unsigned long) const
8 qsl-bench;__libc_start_call_main;[unknown];[unknown];cfree@GLIBC_2.17
7 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::engine::OrderBook::cancel(unsigned long);decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0];qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&)
6 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>);qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::Session::process_frame(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce)
6 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant<qsl::replay::RegisterSymbol, qsl::replay::NewLimit, qsl::replay::NewMarket, qsl::replay::Cancel, qsl::replay::Modify> const&);qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce)
5 qsl-bench;qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&);operator delete(void*, std::align_val_t)@plt
5 qsl-bench;qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&);operator delete(void*, unsigned long, std::align_val_t)@plt
5 qsl-bench;std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*);operator delete(void*, unsigned long, std::align_val_t)@plt
5 qsl-bench;[unknown];[unknown];operator new(unsigned long);malloc@plt
5 qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long);qsl::engine::OrderBook::contains(unsigned long) const
4 qsl-bench;decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0];[unknown];[unknown];cfree@GLIBC_2.17
4 qsl-bench;main;[unknown];[unknown];operator new(unsigned long);malloc
4 qsl-bench;operator new(unsigned long);malloc@plt

Benchmark output:
order_book add/mod/cancel 200000 ops 132.8 ns/op 7531861 ops/sec
protocol encode+decode 500000 ops 20.5 ns/op 48773893 ops/sec
gateway session (fill) 200000 ops 127.4 ns/op 7848348 ops/sec
matching engine flow 5004 items 101.6 ns/item 9840697 items/sec
replay command log 5004 items 112.0 ns/item 8928265 items/sec