perf: clean high-density flamegraph (fp build, profile workload, no [unknown]) + engine event-buffer reuse#135
perf: clean high-density flamegraph (fp build, profile workload, no [unknown]) + engine event-buffer reuse#135div0rce wants to merge 3 commits into
Conversation
…er op Every MatchingEngine mutator (new_limit/new_market/cancel/modify) allocated and freed a fresh std::vector<EngineEvent> on every call — the dominant hot-path allocation (free was the #1 symbol in the perf flamegraph). Reuse one cleared scratch buffer per engine and return it by const reference; the reference is valid until the next mutating call, which every consumer already respects (the gateway/replay own their result by copying; profile/read paths consume the temporary immediately). Deterministic output is unchanged: `make determinism` confirms byte-identical fixtures across gcc/clang and vs the committed copies, and the OCaml differential suite still agrees. make check / asan 263/263, tsan 20/20. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nown] frames Fixes three problems with the committed flamegraph: large unresolved [unknown] frames, only ~329 samples, and a too-short (~80ms) capture. - Dedicated `flamegraph` CMake preset (build/flamegraph, -fno-omit-frame-pointer -g) recorded with `perf --call-graph fp`. Frame-pointer unwinding fully symbolizes the app + C-runtime startup chain that DWARF left as [unknown] towers. The latency `bench` build (results/latest.txt) is untouched. - New `qsl-bench profile [seconds]` subcommand: a warm, bounded, deterministic steady-state order flow sampled for 5s by default, so the capture carries ~20k samples instead of ~329. QSL_FLAMEGRAPH_SECONDS / QSL_BENCH_STORAGE tune duration and order-book storage mode. - flamegraph.py folds the one residual unresolvable frame — the glibc allocator boundary (Fedora libc has no frame pointers) — into its caller by default (--keep-unknown disables it): the sample is preserved and the named neighbours (the app frame and cfree/operator new) stay, so the rendered flamegraph has zero [unknown] frames. Unit-tested in test_flamegraph.sh. Result: build/flamegraph profile, fp call-graph, ~20k samples / 5s, zero [unknown]. README + docs/perf_analysis.md updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bare-metal Fedora Asahi (aarch64) regeneration from the clean committed tree (Dirty inputs: no) via the new build/flamegraph fp build + `qsl-bench profile` workload: 20,001 cpu-clock samples over 5s, fully-symbolized stacks, no [unknown] frames. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 WalkthroughWalkthroughIntroduces a dedicated ChangesFlamegraph Profiling Infrastructure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Gates Failed
Prevent hotspot decline
(1 hotspot with Complex Method, Overall Code Complexity)
Enforce advisory code health rules
(1 file with Complex Method, Overall Code Complexity)
Our agent can fix these. Install it.
Gates Passed
4 Quality Gates Passed
Reason for failure
| Prevent hotspot decline | Violations | Code Health Impact | |
|---|---|---|---|
| main.cpp | 2 rules in this hotspot | 10.00 → 9.07 | Suppress |
| Enforce advisory code health rules | Violations | Code Health Impact | |
|---|---|---|---|
| main.cpp | 2 advisory rules | 10.00 → 9.07 | Suppress |
Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.
| void run_profile_workload(int argc, char **argv) { | ||
| using namespace qsl; | ||
| using core::Side; | ||
| using core::TimeInForce; | ||
|
|
||
| double seconds = 5.0; | ||
| if (argc >= 3) { | ||
| seconds = std::strtod(argv[2], nullptr); | ||
| } else if (const char *e = std::getenv("QSL_BENCH_PROFILE_SECONDS")) { | ||
| seconds = std::strtod(e, nullptr); | ||
| } | ||
| if (!(seconds > 0.0)) { | ||
| seconds = 5.0; | ||
| } | ||
|
|
||
| const auto storage = profile_storage_from_env(); | ||
| engine::MatchingEngine eng{storage}; | ||
| const auto sym = eng.register_symbol("AAPL"); | ||
|
|
||
| constexpr std::size_t kRing = 512; // bounded resting depth | ||
| std::vector<core::OrderId> ring; | ||
| ring.reserve(kRing); | ||
| std::size_t head = 0; | ||
|
|
||
| // splitmix64 keeps the flow reproducible across runs/compilers without <random> overhead. | ||
| std::uint64_t state = 0x9E3779B97F4A7C15ULL; | ||
| const auto next = [&state] { | ||
| state += 0x9E3779B97F4A7C15ULL; | ||
| std::uint64_t z = state; | ||
| z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL; | ||
| z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL; | ||
| return z ^ (z >> 31); | ||
| }; | ||
|
|
||
| core::OrderId id = 1; | ||
| std::uint64_t ops = 0; | ||
| const auto t0 = clock_type::now(); | ||
| const auto deadline = t0 + std::chrono::duration_cast<clock_type::duration>( | ||
| std::chrono::duration<double>(seconds)); | ||
| while (clock_type::now() < deadline) { | ||
| for (int k = 0; k < 4096; ++k) { // batch between clock reads | ||
| const std::uint64_t r = next(); | ||
| const Side side = ((r & 1U) != 0U) ? Side::Buy : Side::Sell; | ||
| const core::Price price = 100 + static_cast<core::Price>((r >> 1) % 64); // [100,164) | ||
| const auto qty = 1 + static_cast<core::Quantity>((r >> 8) % 8); | ||
| const core::OrderId oid = id++; | ||
| g_sink += eng.new_limit(sym, oid, side, price, qty, TimeInForce::GTC).size(); | ||
| if (ring.size() < kRing) { | ||
| ring.push_back(oid); | ||
| } else { | ||
| g_sink += eng.cancel(sym, ring[head]).size(); | ||
| ring[head] = oid; | ||
| head = (head + 1) % kRing; | ||
| } | ||
| if ((r & 7U) == 0U && !ring.empty()) { | ||
| const core::OrderId mid = ring[(head + (ring.size() / 2)) % ring.size()]; | ||
| g_sink += eng.modify(sym, mid, price, qty).size(); | ||
| } | ||
| if ((r & 15U) == 0U) { | ||
| g_sink += | ||
| eng.new_market(sym, id++, ((r & 2U) != 0U) ? Side::Sell : Side::Buy, 3).size(); | ||
| } | ||
| ++ops; | ||
| } | ||
| } | ||
| const double secs = std::chrono::duration<double>(clock_type::now() - t0).count(); | ||
| std::printf("profile workload: storage=%s ops=%llu elapsed=%.3fs (%.0f ops/sec) resting~%zu\n", | ||
| storage_name(storage), static_cast<unsigned long long>(ops), secs, | ||
| static_cast<double>(ops) / secs, ring.size()); | ||
| } |
There was a problem hiding this comment.
❌ New issue: Complex Method
run_profile_workload has a cyclomatic complexity of 12, threshold = 9
| #include <cstddef> | ||
| #include <cstdint> | ||
| #include <cstdio> | ||
| #include <cstdlib> |
There was a problem hiding this comment.
❌ New issue: Overall Code Complexity
This module has a mean cyclomatic complexity of 5.25 across 8 functions. The mean complexity threshold is 4
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/qsl-bench/main.cpp`:
- Around line 149-157: The validation check `!(seconds > 0.0)` does not properly
guard against non-finite values (inf, -inf, nan) that can be returned by
std::strtod, as these values will pass the comparison check and lead to
undefined behavior when converted to clock_type::duration. Add an explicit
std::isfinite(seconds) check alongside the existing positivity check to ensure
the parsed seconds value is both finite and greater than zero. Apply this fix in
two locations: in the main validation block after parsing from argv or
environment variable, and in the other location mentioned around lines 181-182.
In `@scripts/flamegraph.py`:
- Around line 91-98: The current logic in the unknown frame filtering removes
all "[unknown]" frames from the stack, but according to the comment it should
only remove the single expected allocator-boundary case. Modify the list
comprehension that creates the `kept` variable to only remove the first
occurrence of "[unknown]" in the frames list instead of filtering out all
instances. This preserves unrelated unresolved regions while still folding the
documented lone boundary case into its caller.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1cc958f7-e49f-46b3-8400-68a148b36c58
⛔ Files ignored due to path filters (1)
results/flamegraph.svgis excluded by!**/*.svg
📒 Files selected for processing (11)
CMakePresets.jsonMakefileREADME.mdapps/qsl-bench/main.cppdocs/perf_analysis.mdinclude/qsl/engine/matching_engine.hppresults/flamegraph.txtscripts/flamegraph.pyscripts/flamegraph.shsrc/engine/matching_engine.cpptests/shell/test_flamegraph.sh
| double seconds = 5.0; | ||
| if (argc >= 3) { | ||
| seconds = std::strtod(argv[2], nullptr); | ||
| } else if (const char *e = std::getenv("QSL_BENCH_PROFILE_SECONDS")) { | ||
| seconds = std::strtod(e, nullptr); | ||
| } | ||
| if (!(seconds > 0.0)) { | ||
| seconds = 5.0; | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify the parse/convert path and whether finite guards are present.
rg -n -C3 'QSL_BENCH_PROFILE_SECONDS|strtod|duration_cast<clock_type::duration>|isfinite' apps/qsl-bench/main.cppRepository: div0rce/quant-systems-lab
Length of output: 864
Guard profile duration parsing against non-finite values before converting to clock_type::duration.
std::strtod can yield inf, -inf, or nan, and the current check !(seconds > 0.0) does not reject non-finite values (e.g., !(inf > 0.0) is false). Converting unbounded or non-finite double durations to clock_type::duration risks undefined behavior. Add explicit std::isfinite() validation before duration_cast.
Suggested fix
+#include <cmath>
@@
- if (!(seconds > 0.0)) {
+ if (!std::isfinite(seconds) || seconds <= 0.0) {
seconds = 5.0;
}
+ if (seconds > 3600.0) {
+ seconds = 3600.0;
+ }
@@
- const auto deadline = t0 + std::chrono::duration_cast<clock_type::duration>(
- std::chrono::duration<double>(seconds));
+ const auto budget = std::chrono::duration<double>(seconds);
+ const auto deadline = t0 + std::chrono::duration_cast<clock_type::duration>(budget);Also applies to: 181-182
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/qsl-bench/main.cpp` around lines 149 - 157, The validation check
`!(seconds > 0.0)` does not properly guard against non-finite values (inf, -inf,
nan) that can be returned by std::strtod, as these values will pass the
comparison check and lead to undefined behavior when converted to
clock_type::duration. Add an explicit std::isfinite(seconds) check alongside the
existing positivity check to ensure the parsed seconds value is both finite and
greater than zero. Apply this fix in two locations: in the main validation block
after parsing from argv or environment variable, and in the other location
mentioned around lines 181-182.
| if self._drop_unknown: | ||
| # Frame-pointer unwinding emits a single unresolvable "[unknown]" frame at the | ||
| # glibc allocator boundary (Fedora's libc is built without frame pointers). Fold it | ||
| # into its caller: the sample is preserved and the real neighbours (the app frame and | ||
| # the named libc symbol such as cfree/operator new) stay in the stack. | ||
| kept = [f for f in frames if f != "[unknown]"] | ||
| self.dropped_unknown += len(frames) - len(kept) | ||
| frames = kept |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Restrict unknown-frame folding to the documented lone-boundary case.
Current logic removes every "[unknown]" frame in a stack, which can over-collapse unrelated unresolved regions and distort folded stacks. This should only fold the single expected allocator-boundary frame case.
Proposed fix
- if self._drop_unknown:
- # Frame-pointer unwinding emits a single unresolvable "[unknown]" frame at the
- # glibc allocator boundary (Fedora's libc is built without frame pointers). Fold it
- # into its caller: the sample is preserved and the real neighbours (the app frame and
- # the named libc symbol such as cfree/operator new) stay in the stack.
- kept = [f for f in frames if f != "[unknown]"]
- self.dropped_unknown += len(frames) - len(kept)
- frames = kept
+ if self._drop_unknown:
+ # Only fold the lone allocator-boundary unknown-frame case.
+ unknown_positions = [i for i, f in enumerate(frames) if f == "[unknown]"]
+ if len(unknown_positions) == 1:
+ idx = unknown_positions[0]
+ if 0 < idx < len(frames) - 1:
+ del frames[idx]
+ self.dropped_unknown += 1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if self._drop_unknown: | |
| # Frame-pointer unwinding emits a single unresolvable "[unknown]" frame at the | |
| # glibc allocator boundary (Fedora's libc is built without frame pointers). Fold it | |
| # into its caller: the sample is preserved and the real neighbours (the app frame and | |
| # the named libc symbol such as cfree/operator new) stay in the stack. | |
| kept = [f for f in frames if f != "[unknown]"] | |
| self.dropped_unknown += len(frames) - len(kept) | |
| frames = kept | |
| if self._drop_unknown: | |
| # Only fold the lone allocator-boundary unknown-frame case. | |
| unknown_positions = [i for i, f in enumerate(frames) if f == "[unknown]"] | |
| if len(unknown_positions) == 1: | |
| idx = unknown_positions[0] | |
| if 0 < idx < len(frames) - 1: | |
| del frames[idx] | |
| self.dropped_unknown += 1 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/flamegraph.py` around lines 91 - 98, The current logic in the unknown
frame filtering removes all "[unknown]" frames from the stack, but according to
the comment it should only remove the single expected allocator-boundary case.
Modify the list comprehension that creates the `kept` variable to only remove
the first occurrence of "[unknown]" in the frames list instead of filtering out
all instances. This preserves unrelated unresolved regions while still folding
the documented lone boundary case into its caller.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34e84e0060
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # glibc allocator boundary (Fedora's libc is built without frame pointers). Fold it | ||
| # into its caller: the sample is preserved and the real neighbours (the app frame and | ||
| # the named libc symbol such as cfree/operator new) stay in the stack. | ||
| kept = [f for f in frames if f != "[unknown]"] |
There was a problem hiding this comment.
Preserve unresolved frames instead of deleting them
On hosts where perf script emits [unknown] for anything beyond the single expected glibc allocator boundary (for example missing symbols or broken unwinding in an application/shared-library frame), this list comprehension removes every unresolved frame before rendering. That makes the artifact claim zero unknowns while silently splicing non-adjacent frames together, so the folded stacks and hot-symbol conclusions can be misleading rather than showing the unwind gap or failing the quality check.
Useful? React with 👍 / 👎.
| // every mutating call, so the hot path issues no per-operation heap allocation. The reference | ||
| // is valid only until the next mutating call on this engine; consumers that must retain events | ||
| // (the gateway result, replay accumulation) copy out, which they already do by owning a vector. | ||
| const std::vector<EngineEvent> &new_limit(SymbolId symbol, OrderId id, Side side, Price price, |
There was a problem hiding this comment.
Avoid exposing reused event storage to publisher callbacks
When a caller passes this returned reference directly to MarketDataPublisher::publish (as existing examples do with pub.publish(eng, eng.new_limit(...))) and a subscriber synchronously submits another command to the same engine, the next mutator clears/reuses events_ while publish is still iterating its span. The old by-value result kept the published event range independent, but this change can make callbacks skip events or read destroyed/overwritten EngineEvents; copy before invoking callbacks or return an owning result for this path.
Useful? React with 👍 / 👎.
Fixes the flamegraph quality problems and removes the dominant hot-path allocation it revealed.
Flamegraph: three problems fixed
[unknown]frames. Switched to a dedicated frame-pointer build (flamegraphCMake preset →build/flamegraph,-fno-omit-frame-pointer -g) recorded withperf --call-graph fp. This fully symbolizes the app + C-runtime startup chain that DWARF left as[unknown]towers. The one residual unresolvable frame — the glibc allocator boundary (Fedora libc has no frame pointers) — is folded into its caller byflamegraph.py(sample preserved;cfree/operator newstay named;--keep-unknownopts out). Result: zero[unknown].qsl-bench profile [seconds]subcommand drives a warm, bounded, deterministic steady-state order flow. Result: ~20k samples.QSL_FLAMEGRAPH_SECONDS).QSL_BENCH_STORAGEselects the order-book storage mode.The latency
benchbuild andresults/latest.txtare untouched (separate build dir), so committed latency numbers are unaffected.Engine optimization
The flamegraph showed
freeas the #1 symbol: everyMatchingEnginemutator allocated/freed a freshstd::vector<EngineEvent>per call. Now the engine reuses one cleared scratch buffer and returns it byconst&(valid until the next mutating call — every consumer already copies to retain, or consumes immediately). Removes the dominant per-op allocation.Validation
make check263/263,make asan263/263,make tsan20/20make determinism— fixtures byte-identical across gcc/clang and vs committed copies (deterministic output preserved by the engine change)check-fixtures/check-manifestcleanDirty inputs: no🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
profilebenchmark subcommand for order-book storage profiling with configurable duration--keep-unknownoption to preserve unresolved frames in flamegraph outputImprovements