Aroesler1 · Apr 6, 2026
diff --git a/README.md b/README.md
@@ -7,6 +7,7 @@ This repository implements a small, deterministic C++ limit-order-book engine fo
 - aggregated bid/ask levels plus order-ID lookup
 - two price-level backends: `std::map` and flat sorted `std::vector`
 - rolling analytics and CSV export after every processed message
+- optional post-replay prediction summary reporting by message horizon
 - deterministic C++ and Python integration tests
 - replay benchmark tooling and a hand-maintained benchmark reproducibility note
 
@@ -61,6 +62,19 @@ Export analytics rows after every processed message:
 
 If `--backend both` is selected, the CLI writes one CSV per backend by suffixing the output path.
 
+Emit a separate prediction summary after replay without changing the analytics CSV rows:
+
+```bash
+"$build_dir/lob_engine" \
+  data/AAPL_sample_messages.csv \
+  --backend map \
+  --analytics-out "$build_dir/analytics.csv" \
+  --prediction-report-out "$build_dir/prediction_report.csv" \
+  --prediction-horizons 100,500
+```
+
+`--prediction-report-out` requires `--prediction-horizons`. If both flags are omitted, prediction work stays disabled.
+
 ## Analytics
 
 Each processed message produces a row with:
@@ -78,6 +92,8 @@ The default rolling windows match the project objective:
 - trailing `1000` messages for trade-based metrics
 - trailing `300` seconds for realized volatility
 
+Prediction reporting is written to a separate CSV keyed by message horizon `H`. For each row `t`, the label is the sign of the first non-zero mid-price move found in rows `t+1 ... t+H`, measured relative to the mid at `t`. Rows with an invalid current mid, or with no non-zero future move inside the horizon, are skipped. The report includes labeled sample counts, up/down move counts, the hit rate of `sign(order_imbalance_top5)` on non-zero-signal rows, and an information coefficient computed as the Pearson correlation between the raw top-5 imbalance value and the future move sign. Zero-signal rows stay in the labeled sample and the IC calculation, but they increment `skipped_zero_signal` and are excluded from the hit-rate denominator.
+
 ## Backends
 
 Two backends are implemented behind the same `OrderBook` interface:
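
The labeling rule described in the README hunk above can be sketched in a few lines. This is a minimal, self-contained illustration and not the code from this PR: `Snap`, `Summary`, `first_move_sign`, and `summarize` are hypothetical names, the imbalance is assumed to be always present, and future rows without a mid are assumed to be passed over when searching for the first move.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a prediction snapshot; not the project's type.
struct Snap {
    std::optional<double> mid;  // empty when the book has no valid mid
    double imbalance{0.0};      // top-5 order imbalance (assumed always present)
};

// Hypothetical summary holding a subset of the counters named in the README.
struct Summary {
    std::size_t labeled{0}, up{0}, down{0}, skipped_zero_signal{0};
    std::size_t correct{0}, incorrect{0};
    double hit_rate{0.0};
    double ic{0.0};
};

// Sign of the first non-zero mid move in t+1 .. t+horizon relative to mid at t.
// Returns 0 when the current mid is invalid or no move occurs in the horizon.
int first_move_sign(const std::vector<Snap>& s, std::size_t t, std::size_t horizon) {
    assert(t < s.size());
    if (!s[t].mid) return 0;
    const std::size_t last = std::min(s.size() - 1, t + horizon);
    for (std::size_t k = t + 1; k <= last; ++k) {
        if (!s[k].mid) continue;  // assumption: rows without a mid are passed over
        const double delta = *s[k].mid - *s[t].mid;
        if (delta > 0.0) return 1;
        if (delta < 0.0) return -1;
    }
    return 0;
}

Summary summarize(const std::vector<Snap>& s, std::size_t horizon) {
    Summary out;
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (std::size_t t = 0; t < s.size(); ++t) {
        const int label = first_move_sign(s, t, horizon);
        if (label == 0) continue;  // unlabeled rows are skipped entirely
        ++out.labeled;
        if (label > 0) ++out.up; else ++out.down;

        // Every labeled row feeds the Pearson sums, including zero-signal rows.
        const double x = s[t].imbalance;
        const double y = static_cast<double>(label);
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;

        // Zero-signal rows are counted but left out of the hit-rate denominator.
        if (x == 0.0) { ++out.skipped_zero_signal; continue; }
        const int predicted = x > 0.0 ? 1 : -1;
        if (predicted == label) ++out.correct; else ++out.incorrect;
    }
    const std::size_t scored = out.correct + out.incorrect;
    if (scored > 0) out.hit_rate = static_cast<double>(out.correct) / static_cast<double>(scored);

    // Pearson correlation from running sums; stays 0 when either side is constant.
    const double n = static_cast<double>(out.labeled);
    const double var_x = n * sxx - sx * sx;
    const double var_y = n * syy - sy * sy;
    if (var_x > 0.0 && var_y > 0.0) out.ic = (n * sxy - sx * sy) / std::sqrt(var_x * var_y);
    return out;
}
```

The sketch keeps running sums so that the hit rate and the correlation come out of a single pass per horizon. Whether the PR's implementation does the same is not shown in this diff.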

diff --git a/include/lob/analytics.hpp b/include/lob/analytics.hpp
@@ -1,21 +1,221 @@
 #pragma once
 
 #include <cstddef>
+#include <initializer_list>
+#include <limits>
 #include <memory>
 #include <optional>
 #include <string>
+#include <utility>
 #include <vector>
 
 #include "lob/order_book.hpp"
 #include "lob/types.hpp"
 
 namespace lob {
 
+struct OptionalStringSetting {
+    OptionalStringSetting() = default;
+    OptionalStringSetting(std::nullopt_t) noexcept {}
+    OptionalStringSetting(const std::optional<std::string>& text)
+        : value_(text) {}
+    OptionalStringSetting(std::optional<std::string>&& text) noexcept
+        : value_(std::move(text)) {}
+    OptionalStringSetting(const std::string& text)
+        : value_(text) {}
+    OptionalStringSetting(std::string&& text) noexcept
+        : value_(std::move(text)) {}
+    OptionalStringSetting(const char* text)
+        : value_(text == nullptr ? std::optional<std::string>{} : std::optional<std::string>{text}) {}
+
+    OptionalStringSetting& operator=(std::nullopt_t) noexcept {
+        value_.reset();
+        return *this;
+    }
+
+    OptionalStringSetting& operator=(const std::optional<std::string>& text) {
+        value_ = text;
+        return *this;
+    }
+
+    OptionalStringSetting& operator=(std::optional<std::string>&& text) noexcept {
+        value_ = std::move(text);
+        return *this;
+    }
+
+    OptionalStringSetting& operator=(const std::string& text) {
+        value_ = text;
+        return *this;
+    }
+
+    OptionalStringSetting& operator=(std::string&& text) noexcept {
+        value_ = std::move(text);
+        return *this;
+    }
+
+    OptionalStringSetting& operator=(const char* text) {
+        value_ = text == nullptr ? std::optional<std::string>{} : std::optional<std::string>{text};
+        return *this;
+    }
+
+    bool has_value() const noexcept {
+        return value_.has_value();
+    }
+
+    bool empty() const noexcept {
+        return !value_.has_value() || value_->empty();
+    }
+
+    void reset() noexcept {
+        value_.reset();
+    }
+
+    const std::string& value() const {
+        return value_.value();
+    }
+
+    std::string value_or(std::string default_value) const {
+        return value_.value_or(std::move(default_value));
+    }
+
+    const std::string& operator*() const {
+        return value();
+    }
+
+    std::string& operator*() {
+        return value_.value();
+    }
+
+    const std::string* operator->() const {
+        return &value();
+    }
+
+    std::string* operator->() {
+        return &value_.value();
+    }
+
+    explicit operator bool() const noexcept {
+        return value_.has_value();
+    }
+
+    friend bool operator==(const OptionalStringSetting& lhs, std::nullopt_t) noexcept {
+        return !lhs.value_.has_value();
+    }
+
+    friend bool operator==(std::nullopt_t, const OptionalStringSetting& rhs) noexcept {
+        return rhs == std::nullopt;
+    }
+
+    friend bool operator!=(const OptionalStringSetting& lhs, std::nullopt_t) noexcept {
+        return !(lhs == std::nullopt);
+    }
+
+    friend bool operator!=(std::nullopt_t, const OptionalStringSetting& rhs) noexcept {
+        return !(rhs == std::nullopt);
+    }
+
+    friend bool operator==(const OptionalStringSetting& lhs, const std::string& rhs) {
+        return lhs.value_ == rhs;
+    }
+
+    friend bool operator==(const std::string& lhs, const OptionalStringSetting& rhs) {
+        return rhs == lhs;
+    }
+
+    friend bool operator!=(const OptionalStringSetting& lhs, const std::string& rhs) {
+        return !(lhs == rhs);
+    }
+
+    friend bool operator!=(const std::string& lhs, const OptionalStringSetting& rhs) {
+        return !(rhs == lhs);
+    }
+
+    friend bool operator==(const OptionalStringSetting& lhs, const char* rhs) {
+        return rhs == nullptr ? !lhs.value_.has_value() : lhs == std::string(rhs);
+    }
+
+    friend bool operator==(const char* lhs, const OptionalStringSetting& rhs) {
+        return rhs == lhs;
+    }
+
+    friend bool operator!=(const OptionalStringSetting& lhs, const char* rhs) {
+        return !(lhs == rhs);
+    }
+
+    friend bool operator!=(const char* lhs, const OptionalStringSetting& rhs) {
+        return !(rhs == lhs);
+    }
+
+    friend bool operator==(const OptionalStringSetting& lhs, const std::optional<std::string>& rhs) {
+        return lhs.value_ == rhs;
+    }
+
+    friend bool operator==(const std::optional<std::string>& lhs, const OptionalStringSetting& rhs) {
+        return rhs == lhs;
+    }
+
+    friend bool operator!=(const OptionalStringSetting& lhs, const std::optional<std::string>& rhs) {
+        return !(lhs == rhs);
+    }
+
+    friend bool operator!=(const std::optional<std::string>& lhs, const OptionalStringSetting& rhs) {
+        return !(rhs == lhs);
+    }
+
+private:
+    std::optional<std::string> value_{};
+};
+
 struct AnalyticsConfig {
     std::size_t trade_window_messages{1000};
     double realized_vol_window_seconds{300.0};
     std::size_t depth_levels{10};
     std::size_t expected_messages{0};
+    std::vector<std::size_t> prediction_horizons{};
+    OptionalStringSetting prediction_report_out{};
+    std::vector<int> prediction_horizons_messages{};
+
+    bool prediction_report_output_enabled() const noexcept {
+        return prediction_report_out.has_value() && !prediction_report_out.empty();
+    }
+
+    std::vector<std::size_t> resolved_prediction_horizons() const {
+        const bool use_message_horizons = !prediction_horizons_messages.empty();
+        std::vector<std::size_t> horizons;
+
+        if (use_message_horizons) {
+            horizons.reserve(prediction_horizons_messages.size());
+            for (const int horizon : prediction_horizons_messages) {
+                if (horizon > 0) {
+                    horizons.push_back(static_cast<std::size_t>(horizon));
+                }
+            }
+            return horizons;
+        }
+
+        horizons.reserve(prediction_horizons.size());
+        for (const std::size_t horizon : prediction_horizons) {
+            if (horizon > 0 &&
+                horizon <= static_cast<std::size_t>(std::numeric_limits<int>::max())) {
+                horizons.push_back(horizon);
+            }
+        }
+        return horizons;
+    }
+
+    std::vector<int> resolved_prediction_horizons_messages() const {
+        const std::vector<std::size_t> resolved_horizons = resolved_prediction_horizons();
+        std::vector<int> horizons;
+        horizons.reserve(resolved_horizons.size());
+        for (const std::size_t horizon : resolved_horizons) {
+            horizons.push_back(static_cast<int>(horizon));
+        }
+        return horizons;
+    }
+
+    bool prediction_reporting_enabled() const {
+        return prediction_report_output_enabled() && !resolved_prediction_horizons().empty();
+    }
 };
 
 struct AnalyticsRow {
@@ -36,6 +236,29 @@ struct AnalyticsRow {
     std::optional<double> rolling_realized_vol;
 };
 
+struct PredictionSnapshot {
+    std::size_t message_index{0};
+    std::optional<double> mid_price;
+    std::optional<double> order_imbalance_top5;
+};
+
+struct PredictionSummaryRow {
+    std::size_t horizon_messages{0};
+    std::size_t total_rows_seen{0};
+    std::size_t eligible_rows_with_valid_mid{0};
+    std::size_t labeled_rows{0};
+    std::size_t skipped_no_valid_mid{0};
+    std::size_t skipped_no_future_move_within_horizon{0};
+    std::size_t skipped_zero_signal{0};
+    std::size_t up_moves{0};
+    std::size_t down_moves{0};
+    std::size_t correct_predictions{0};
+    std::size_t incorrect_predictions{0};
+    double hit_rate{0.0};
+    double information_coefficient{0.0};
+    double coverage_vs_total{0.0};
+};
+
 class AnalyticsEngine {
 public:
     explicit AnalyticsEngine(AnalyticsConfig config = {});
@@ -64,4 +287,22 @@ std::vector<AnalyticsRow> replay_with_analytics(
 
 void write_analytics_csv(const std::vector<AnalyticsRow>& rows, const std::string& output_path);
 
+std::vector<PredictionSnapshot> collect_prediction_snapshots(const std::vector<AnalyticsRow>& rows);
+
+std::vector<PredictionSummaryRow> summarize_prediction_horizons(
+    const std::vector<PredictionSnapshot>& snapshots,
+    const std::vector<std::size_t>& horizons);
+
+std::vector<PredictionSummaryRow> summarize_prediction_horizons(
+    const std::vector<PredictionSnapshot>& snapshots,
+    const std::vector<int>& horizons);
+
+std::vector<PredictionSummaryRow> summarize_prediction_horizons(
+    const std::vector<PredictionSnapshot>& snapshots,
+    std::initializer_list<int> horizons);
+
+void write_prediction_report_csv(
+    const std::vector<PredictionSummaryRow>& rows,
+    const std::string& output_path);
+
 }  // namespace lob
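
The header above declares `write_prediction_report_csv` but this diff does not show its body. As a rough sketch of what writing one line per horizon could look like, the example below uses a trimmed, hypothetical `SummaryRow` and a hypothetical `write_report` helper. The column names and their order are assumptions borrowed from the struct's field names, not the report's confirmed layout.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, trimmed copy of a few summary fields so the sketch compiles
// on its own; the struct in the header carries more counters.
struct SummaryRow {
    std::size_t horizon_messages{0};
    std::size_t labeled_rows{0};
    std::size_t skipped_zero_signal{0};
    double hit_rate{0.0};
    double information_coefficient{0.0};
};

// Writes a header line followed by one line per horizon.
// Column names and order are assumed for illustration only.
void write_report(std::ostream& out, const std::vector<SummaryRow>& rows) {
    assert(out.good());  // the caller is expected to pass a usable stream
    out << "horizon_messages,labeled_rows,skipped_zero_signal,"
           "hit_rate,information_coefficient\n";
    for (const SummaryRow& row : rows) {
        out << row.horizon_messages << ','
            << row.labeled_rows << ','
            << row.skipped_zero_signal << ','
            << row.hit_rate << ','
            << row.information_coefficient << '\n';
    }
}

// Convenience wrapper that renders the report into a string.
std::string report_to_string(const std::vector<SummaryRow>& rows) {
    std::ostringstream buffer;
    write_report(buffer, rows);
    return buffer.str();
}
```

Taking a `std::ostream&` instead of a path keeps the formatting testable without touching the filesystem. A path-based wrapper like the declared function could open a `std::ofstream` and forward to it.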
diff --git a/report/benchmark_report.md b/report/benchmark_report.md
@@ -43,6 +43,27 @@ ctest --test-dir "$build_dir" --output-on-failure -C Release
 python -m pytest tests -q --tb=short
 ```
 
+## Prediction Reporting Feature Gate
+
+The new prediction labeling/reporting path is outside the replay-only benchmark timer and remains optional. The core `lob_benchmark` command is still the replay hot-path check:
+
+```bash
+taskset -c 0 "$build_dir/lob_benchmark" --dataset data/AAPL_sample_messages.csv --backend both --reserve on --depth 5 --repeat 100000
+```
+
+To exercise the same dataset through the normal replay CLI with prediction reporting disabled versus enabled:
+
+```bash
+"$build_dir/lob_engine" data/AAPL_sample_messages.csv --backend map --analytics-out "$build_dir/analytics.csv"
+"$build_dir/lob_engine" data/AAPL_sample_messages.csv --backend map --analytics-out "$build_dir/analytics.csv" --prediction-report-out "$build_dir/prediction_report.csv" --prediction-horizons 100
+```
+
+Expected behavior:
+
+- without prediction flags, the CLI emits the existing analytics CSV only
+- with prediction flags, the analytics CSV stays unchanged and a separate prediction report CSV is added
+- any extra work is feature-gated to the prediction-enabled CLI path; the replay-only benchmark command above remains valid and unchanged
+
 ## Measurement methodology
 
 - baseline variant: clean `origin/main` tree at commit `d627b73`