div0rce · Jun 22, 2026 · Jun 21, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -385,6 +385,7 @@ Keep this synchronized with the Makefile.
 - `make bench-recovery` — run M46 recovery benchmarking (full-replay restart vs book rebuild)
 - `make perf-stat` — run Linux `perf stat` workflow where supported
 - `make perf-record` — run Linux `perf record/report` workflow where supported
+- `make flamegraph` — render a Linux `perf` call-graph flamegraph (SVG) where supported
 - `make numa-study` — run Linux CPU-affinity / scheduler-migration / NUMA-locality study where supported
 - `make false-sharing-study` — run benchmark-only packed-vs-padded SPSC cursor contention study
 - `make profile-io` — run Linux syscall/socket-path profiling where supported

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -385,6 +385,7 @@ Keep this synchronized with the Makefile.
 - `make bench-recovery` — run M46 recovery benchmarking (full-replay restart vs book rebuild)
 - `make perf-stat` — run Linux `perf stat` workflow where supported
 - `make perf-record` — run Linux `perf record/report` workflow where supported
+- `make flamegraph` — render a Linux `perf` call-graph flamegraph (SVG) where supported
 - `make numa-study` — run Linux CPU-affinity / scheduler-migration / NUMA-locality study where supported
 - `make false-sharing-study` — run benchmark-only packed-vs-padded SPSC cursor contention study
 - `make profile-io` — run Linux syscall/socket-path profiling where supported

diff --git a/MILESTONES.md b/MILESTONES.md
@@ -484,7 +484,9 @@ Do not pull backlog items into earlier PRs.
 - FIX-like text protocol adapter. (#29)
 - Web dashboard for visualization. (#30)
 - Docker packaging. (#31)
-- Perf/flamegraph docs. (#32)
+- Perf/flamegraph docs. (#32) — **done**: `make flamegraph` renders a perf call-graph flamegraph
+  via the dependency-free `scripts/flamegraph.py` (`results/flamegraph.svg` + `.txt`), unit-tested in
+  `tests/shell/test_flamegraph.sh`. Full hardware cache-PMU evidence stays in #90.
 - GitHub Pages documentation site. (#33)
 
 ### Differential-testing follow-ups (prioritized)

diff --git a/Makefile b/Makefile
@@ -1,4 +1,4 @@
-.PHONY: configure build test check fmt fmt-check tidy bench bench-diff bench-allocator bench-storage bench-recovery perf-stat perf-record numa-study false-sharing-study profile-io socket-stress socket-load dpdk-check nic-offload-check crash-recovery concurrency-stress asan tsan demo check-fixtures check-manifest determinism divergence-demo clean
+.PHONY: configure build test check fmt fmt-check tidy bench bench-diff bench-allocator bench-storage bench-recovery perf-stat perf-record flamegraph numa-study false-sharing-study profile-io socket-stress socket-load dpdk-check nic-offload-check crash-recovery concurrency-stress asan tsan demo check-fixtures check-manifest determinism divergence-demo clean
 
 BUILD_DIR := build/dev
 
@@ -63,6 +63,13 @@ perf-record:
 	cmake --build --preset bench --target qsl-bench
 	QSL_BENCH_BIN=build/bench/qsl-bench bash scripts/perf_record.sh
 
+# Issue #32: render a perf call-graph flamegraph (SVG) from the benchmark harness. Linux-only.
+flamegraph:
+	@test "$$(uname -s)" = "Linux" || { echo "error: make flamegraph requires Linux perf; current OS is $$(uname -s)." >&2; exit 2; }
+	cmake --preset bench
+	cmake --build --preset bench --target qsl-bench
+	QSL_BENCH_BIN=build/bench/qsl-bench bash scripts/flamegraph.sh
+
 # M43: CPU-affinity / scheduler-migration / NUMA locality study. Linux-only.
 numa-study:
 	@if test "$$(uname -s)" != "Linux"; then \

diff --git a/PROGRESS.md b/PROGRESS.md
@@ -370,6 +370,22 @@ Lower priority:
   (E-core) PMU carries live counts — the `apple_blizzard_pmu/...` rows read `<not counted>` in
   `results/perf_stat_linux.txt` because the single-threaded benchmark stays on the Avalanche P-cores.
   Docs/memory only; no code or artifacts changed.
+- [2026-06-21] Issue #32 flamegraph profiling artifact (`perf/flamegraph-artifact`, stacked on the
+  Codex-followup branch). Added `make flamegraph` → `scripts/flamegraph.sh`, which records
+  `perf record --call-graph dwarf -F 4000 -g -e cpu-clock` on `qsl-bench` and renders
+  `results/flamegraph.svg` (+ `results/flamegraph.txt` provenance/classification companion). The
+  fold + SVG render live in `scripts/flamegraph.py`, a dependency-free stdlib-only stackcollapse +
+  flamegraph renderer (no vendored Perl FlameGraph toolkit), deterministic by design (frames sorted
+  by name; colors a pure function of the name; no RNG/timestamps in the drawn body). DWARF call
+  graphs are used because the Release `bench` preset omits frame pointers; application symbols
+  (`OrderBook::add_limit`, `MatchingEngine::new_limit`, the replay path, …) still resolve from the
+  symtab. Added `tests/shell/test_flamegraph.sh` (CTest-registered, python3-only, skips cleanly if
+  absent) covering folding (offset/dso stripping, perf-order reversal, comm-at-base, count
+  aggregation, sortedness), SVG well-formedness, XML escaping, determinism, and empty-input
+  handling; `make check` 242/242. The committed `results/flamegraph.svg`/`.txt` were generated on
+  the bare-metal Fedora Asahi host (aarch64) from the clean committed tree (`Dirty inputs: no`).
+  This is a software cpu-clock sampling hot-symbol profile, not a latency/throughput claim; full
+  hardware cache-PMU evidence stays in #90. Do not merge from automation; human squash-merges.
 - [2026-06-03] M35: implemented a multi-client TCP connection-scaling load test (`scripts/socket_load.sh`, `make socket-load`, Linux-only) driving N concurrent `qsl-client`s against the portable TCP and epoll (M34) gateways; `results/socket_load_summary.txt` is Docker-generated and constrained. A `/code-review` (3 finder agents) caught and fixed real measurement-integrity bugs before the PR: a failed trial's `wall=0` no longer poisons the reported best (only trials whose gateway served count toward the min); the `completed` column reports the WORST per-trial completion, not the last, so partial/total trial failures are surfaced rather than masked; a per-client `timeout` bounds a hang if the gateway dies; and `QSL_LOAD_TRIALS` is validated. Post-PR hardening uses fresh monotonic ports per gateway start, retries transient startup/serve failures on new ports, and refuses to write a partial artifact unless `QSL_LOAD_ALLOW_PARTIAL=1` is set intentionally; the refreshed artifact records `Dirty tree: no`. The scaling-shape claim remains constrained to loopback connection setup, not a demonstrated production-capacity advantage for either transport. Deferred follow-up: a shared `scripts/lib` to remove the dirty-tree / `wait_ready` / gateway-stop duplication across the three socket scripts.
 - [2026-06-03] M35: started after M34 (#98) squash-merged (commit 9e3750b). Scope: multi-client load / socket-pressure testing of the gateway/feed path (TCP/UDP stress, socket-buffer pressure, connection scaling, backpressure) building on M34's epoll multi-client path and M30's socket tooling. Constraints: scripts/tests document load shape + environment; results must distinguish kernel/socket pressure from user-space engine cost; no production-capacity claims (honest constrained-environment framing, like M29/M30).
 - [2026-06-04] M35: PR #100 squash-merged to `main` as a86b701 after all CI jobs and review checks were green. M35 is now landed; original M36 NUMA remains deferred until the repository-health refactor analysis is completed or explicitly skipped by the human.
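
The folding steps named in the PROGRESS.md entry above (offset/dso stripping, perf-order reversal, comm-at-base, count aggregation, sortedness) can be sketched as follows. This is an illustrative sketch only, not the contents of `scripts/flamegraph.py`: the function name `fold` is hypothetical, and it assumes the default `perf script` text layout (one unindented header line per sample whose first token is the comm, followed by indented `address symbol+offset (dso)` lines, leaf first, ending at a blank line).

```python
import re
from collections import Counter

# Matches a trailing "+0x1c"-style symbol offset.
_OFFSET = re.compile(r"\+0x[0-9a-fA-F]+$")


def fold(perf_script_text):
    """Collapse `perf script` text into sorted (folded_stack, count) pairs."""
    counts = Counter()
    comm, frames = None, []

    def flush():
        # Emit the current sample root-first with the comm at the base.
        # Samples that carry no frames are skipped in this sketch.
        nonlocal comm, frames
        if comm is not None and frames:
            counts[";".join([comm] + frames[::-1])] += 1
        comm, frames = None, []

    for line in perf_script_text.splitlines():
        if not line.strip():
            flush()                      # blank line ends a sample
        elif not line[0].isspace():
            flush()
            comm = line.split()[0]       # assumption: comm has no spaces
        else:
            _addr, _, rest = line.strip().partition(" ")
            rest = rest.strip()
            # Assumption: every frame line ends with " (dso)".
            if rest.endswith(")") and " (" in rest:
                rest = rest.rpartition(" (")[0]
            frames.append(_OFFSET.sub("", rest) or "[unknown]")
    flush()
    return sorted(counts.items())        # sorted for deterministic output
```

Identical stacks are aggregated into one entry with a count, which is what lets the renderer size each frame by its share of samples.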

diff --git a/README.md b/README.md
@@ -109,6 +109,23 @@ Reproduce with `make bench` (numbers will differ by machine). The differential-t
 [`results/differential.txt`](results/differential.txt) — kept separate so it does not disturb
 the core numbers above.
 
+### Flamegraph
+
+The flamegraph below shows where on-CPU time goes in the `qsl-bench` synthetic suite. It is rendered
+by `make flamegraph` (`scripts/flamegraph.sh` → the dependency-free `scripts/flamegraph.py`, with no
+external FlameGraph toolchain):
+
+[![qsl-bench cpu-clock flamegraph](results/flamegraph.svg)](results/flamegraph.svg)
+
+This is a **software cpu-clock sampling** hot-symbol profile, **not** PMU evidence: frame width is
+proportional to on-CPU samples (329 folded across 159 stacks on this run), not wall-clock latency or
+throughput, and it is hardware/kernel/compiler/build dependent. The hot frames are protocol
+`decode_new_order`, gateway session framing, `MatchingEngine::new_limit`, and order-book
+cancel/allocation. Provenance and classification are in
+[`results/flamegraph.txt`](results/flamegraph.txt); methodology in
+[docs/perf_analysis.md](docs/perf_analysis.md). GitHub renders the SVG statically; download the raw
+file for interactive zoom and search.
+
 ## Limitations
 
 - **Synthetic and local.** No real market data, no real venue connectivity, no order types

diff --git a/docs/perf_analysis.md b/docs/perf_analysis.md
@@ -55,6 +55,30 @@ default is intentional: many CI, VM, and container environments do not expose ha
 to unprivileged processes, and the benchmark harness is short enough that a lower frequency can
 miss the minimum sample count needed for meaningful hot-symbol ordering.
 
+Render a flamegraph (issue #32):
+
+```bash
+make flamegraph
+```
+
+This runs `scripts/flamegraph.sh`, which records call-graph samples
+(`perf record --call-graph dwarf -F 4000 -g -e cpu-clock`), folds them, and renders an SVG to
+`results/flamegraph.svg` plus a text companion `results/flamegraph.txt` (provenance, classification,
+and the top folded stacks). DWARF call graphs are used so stacks unwind correctly even though the
+`bench` (Release) preset omits frame pointers — the application symbols (`OrderBook::add_limit`,
+`MatchingEngine::new_limit`, the replay path, …) resolve from the symbol table without changing the
+optimization level under measurement.
+
+The folding and SVG rendering live in `scripts/flamegraph.py`, a dependency-free Python script
+(standard library only) that reimplements the `stackcollapse` + flamegraph data model rather than
+vendoring Brendan Gregg's Perl toolkit, so the artifact is reproducible from this repository alone.
+The renderer is deterministic — frames are sorted by name and colors are a pure function of the
+frame name (no RNG, no timestamps in the drawn body) — and is unit-tested in
+`tests/shell/test_flamegraph.sh` (registered with CTest, runs under `make check`). Frame width is
+proportional to on-CPU samples; this is a software cpu-clock sampling profile for **hot-symbol
+investigation**, not a latency or throughput measurement. Set `QSL_FLAMEGRAPH_EVENT=cycles` to
+sample the hardware PMU cycles event instead, where the host exposes it.
+
 ## Required Environment
 
 Both scripts are Linux-only and fail before running on non-Linux hosts. `perf stat` also fails
@@ -113,8 +137,14 @@ counters, permission-limited sampling, or a sample report that is explicitly mar
 - `results/perf_report_linux.txt` records benchmark output, `perf record` stderr, and
   `perf report --stdio` output. It is useful as a hot-symbol profile only when `No samples: no`,
   `Insufficient samples: no`, and `Sample count` is at least `Minimum samples for hot profile`.
-- `build/perf/qsl-bench.perf.data` is generated by `make perf-record` and is intentionally not
-  committed; it is host-specific binary profiler data.
+- `results/flamegraph.svg` is the rendered flamegraph from `make flamegraph`; `results/flamegraph.txt`
+  is its provenance/classification companion (and lists the top folded stacks). Treat frame widths as
+  a hot-symbol guide only when the `.txt` reports an `Artifact:` of `flamegraph (...)` and a `Sample
+  count` of at least `Minimum samples for hot profile`; a `constrained-environment validation` label
+  means sampling did not capture enough stacks to trust.
+- `build/perf/qsl-bench.perf.data` and `build/perf/qsl-bench.flame.data` are generated by
+  `make perf-record` / `make flamegraph` and are intentionally not committed; they are host-specific
+  binary profiler data.
 
 Each artifact includes hardware, kernel, compiler, perf version, build type, dataset, command,
 event set, and source-digest provenance. The `Source digest` is the authoritative source identity;
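
The determinism property described above (colors as a pure function of the frame name, no RNG) and the XML escaping that C++ symbol names require can be illustrated with a short sketch. The helper names `frame_color` and `frame_rect`, the CRC32-based palette, and the SVG layout are assumptions made for illustration; the actual scheme in `scripts/flamegraph.py` may differ.

```python
import zlib
from xml.sax.saxutils import escape


def frame_color(name):
    """Map a frame name to a warm RGB color, identically on every run.

    zlib.crc32 is used instead of the built-in hash() because str hashes
    are salted per process unless PYTHONHASHSEED is fixed.
    """
    h = zlib.crc32(name.encode("utf-8"))
    r = 205 + h % 50          # 205..254
    g = (h >> 8) % 230        # 0..229
    b = (h >> 16) % 55        # 0..54
    return f"rgb({r},{g},{b})"


def frame_rect(name, x, y, width, height=16):
    """Render one frame as an SVG group with an escaped tooltip title."""
    title = escape(name)      # escapes &, <, > found in template names
    return (
        f"<g><title>{title}</title>"
        f'<rect x="{x:.1f}" y="{y:.1f}" width="{width:.1f}" '
        f'height="{height}" fill="{frame_color(name)}"/></g>'
    )
```

Because both functions depend only on their arguments, rendering the same folded input twice yields byte-identical output, which is the property a determinism test can assert.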

diff --git a/results/README.md b/results/README.md
@@ -23,6 +23,12 @@ Benchmark results produced by `make bench` and scripts under `scripts/`.
 - `perf_report_linux.txt` — Linux `perf record/report` hot-symbol output for the benchmark
   harness (`make perf-record`). It is useful as a hot-symbol profile only when the file says
   `No samples: no`, `Insufficient samples: no`, and the sample count meets the reported minimum.
+- `flamegraph.svg` / `flamegraph.txt` — Linux `perf` call-graph flamegraph (`make flamegraph`,
+  issue #32) rendered by the dependency-free `scripts/flamegraph.py`. The `.svg` is the visual
+  (frame width ∝ on-CPU samples) with provenance in a leading XML comment; the `.txt` carries
+  provenance, the `Artifact:` classification, and the top folded stacks. It is a software cpu-clock
+  sampling profile for hot-symbol investigation, not a latency/throughput claim — trust frame widths
+  only when the `.txt` reports a `flamegraph (...)` artifact with enough samples.
 - `numa_affinity_study.txt` — Linux CPU-affinity / scheduler-migration / NUMA-locality study
   output (`make numa-study`). It must self-classify as `full-linux-numa`, `linux-constrained`, or
   `unsupported-host`; only `full-linux-numa` is full NUMA evidence.
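
The trust rule for `flamegraph.txt` stated above can be checked mechanically. The sketch below is a hypothetical helper, not part of the repository; it assumes the `Key: value` layout and the field names that appear in the committed `results/flamegraph.txt` (`Artifact`, `Sample count (folded total)`, `Minimum samples for hot profile`).

```python
def parse_fields(text):
    """Collect top-level 'Key: value' lines; indented stack rows are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and not line[0].isspace():
            fields.setdefault(key.strip(), value.strip())
    return fields


def is_hot_profile(text):
    """Return True only for a flamegraph artifact with enough samples."""
    fields = parse_fields(text)
    try:
        samples = int(fields["Sample count (folded total)"])
        minimum = int(fields["Minimum samples for hot profile"])
    except (KeyError, ValueError):
        return False          # missing or malformed fields: do not trust
    return fields.get("Artifact", "").startswith("flamegraph (") and samples >= minimum
```

A caller might run `is_hot_profile(open("results/flamegraph.txt").read())` before quoting frame widths from the SVG.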

diff --git a/results/flamegraph.svg b/results/flamegraph.svg
diff --git a/results/flamegraph.txt b/results/flamegraph.txt
@@ -0,0 +1,59 @@
+Command:       make flamegraph
+Artifact:      flamegraph (software cpu-clock sampling hot-symbol profile)
+Hardware:      aarch64
+OS:            Linux 6.19.14-400.asahi.fc44.aarch64+16k
+CPU:           Avalanche-M2
+Compiler:      c++ (GCC) 16.1.1 20260515 (Red Hat 16.1.1-2)
+Perf:          perf version 6.19.14-400.asahi.fc44.aarch64
+Perf paranoid: 2
+Build type:    Release
+Provenance version: 1
+Git commit (informational): 31070b1
+Source digest: sha256:6aa521e6295a99f9dbf7dee9e5bcef04e93174ed12c3e8de9b991a8bfc14c809
+Source digest scope: flamegraph-benchmark
+Dirty inputs: no
+Generated output: results/flamegraph.svg
+Date: 2026-06-22T02:18:23Z
+Benchmark binary: build/bench/qsl-bench
+Dataset:       qsl-bench default synthetic benchmark suite
+Call graph:    dwarf
+Record event:  cpu-clock
+Sample freq:   4000 Hz
+Sample count (folded total):      329
+Sample count (perf record est.):  329
+Folded stacks: 159
+Minimum samples for hot profile: 200
+Insufficient samples: no
+Record status: 0
+Script status: 0
+Perf access limitation: no
+Flamegraph SVG: results/flamegraph.svg
+Perf data:     build/perf/qsl-bench.flame.data (generated, not intended for commit)
+
+Caveat: this flamegraph is a software cpu-clock sampling profile for hot-symbol
+investigation. Frame width is proportional to on-CPU samples, not wall-clock
+latency or throughput, and is hardware/kernel/compiler/build dependent.
+
+Top 15 folded stacks (count  stack):
+      15  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::protocol::decode_new_order(std::span<std::byte const, 18446744073709551615ul>)
+      11  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>);qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::Session::process_frame(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce);qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce)
+      11  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long);qsl::engine::MatchingEngine::contains(unsigned int, unsigned long) const
+       8  qsl-bench;__libc_start_call_main;[unknown];[unknown];cfree@GLIBC_2.17
+       7  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::engine::OrderBook::cancel(unsigned long);decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0];qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&)
+       6  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>);qsl::gateway::Session::on_bytes(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::Session::process_frame(std::span<std::byte const, 18446744073709551615ul>, std::vector<std::byte, std::allocator<std::byte> >&, unsigned long);qsl::gateway::OrderGateway::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce)
+       6  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::apply(qsl::engine::MatchingEngine&, std::variant<qsl::replay::RegisterSymbol, qsl::replay::NewLimit, qsl::replay::NewMarket, qsl::replay::Cancel, qsl::replay::Modify> const&);qsl::engine::MatchingEngine::new_limit(unsigned int, unsigned long, qsl::core::Side, long, unsigned int, qsl::core::TimeInForce)
+       5  qsl-bench;qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&);operator delete(void*, std::align_val_t)@plt
+       5  qsl-bench;qsl::engine::OrderBook::erase_resting_order(qsl::engine::OrderBook::Locator const&);operator delete(void*, unsigned long, std::align_val_t)@plt
+       5  qsl-bench;std::_Hashtable<unsigned long, std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, std::pmr::polymorphic_allocator<std::pair<unsigned long const, qsl::engine::OrderBook::Locator> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<unsigned long const, qsl::engine::OrderBook::Locator>, false>*);operator delete(void*, unsigned long, std::align_val_t)@plt
+       5  qsl-bench;[unknown];[unknown];operator new(unsigned long);malloc@plt
+       5  qsl-bench;[unknown];[unknown];[unknown];__libc_start_call_main;main;qsl::replay::generate_flow(unsigned long, unsigned int, unsigned long);qsl::engine::OrderBook::contains(unsigned long) const
+       4  qsl-bench;decltype(auto) qsl::engine::OrderBook::dispatch_storage<qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}>(qsl::engine::OrderBook::cancel(unsigned long)::{lambda()#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::IntrusiveStore&)#1}&&, qsl::engine::OrderBook::cancel(unsigned long)::{lambda(qsl::engine::OrderBook::ContiguousStore&)#1}&&) [clone .isra.0];[unknown];[unknown];cfree@GLIBC_2.17
+       4  qsl-bench;main;[unknown];[unknown];operator new(unsigned long);malloc
+       4  qsl-bench;operator new(unsigned long);malloc@plt
+
+Benchmark output:
+order_book add/mod/cancel     200000 ops        132.8 ns/op        7531861 ops/sec
+protocol encode+decode        500000 ops         20.5 ns/op       48773893 ops/sec
+gateway session (fill)        200000 ops        127.4 ns/op        7848348 ops/sec
+matching engine flow            5004 items      101.6 ns/item      9840697 items/sec
+replay command log              5004 items      112.0 ns/item      8928265 items/sec