dativo-io · sergeyenin · Jun 3, 2026 · Jun 3, 2026 · cursor · Jun 3, 2026
diff --git a/LIMITATIONS.md b/LIMITATIONS.md
@@ -53,4 +53,4 @@ A valid signature proves that this evidence record was signed with the deploymen
 - [Evidence store](docs/explanation/evidence-store.md) — how records are created, signed, and verified
 - [Evidence integrity specification](docs/reference/evidence-integrity-spec.md) — byte-exact fields, serialization, signing, and independent verification
 - [Threat model](docs/reference/threat-model.md) — attack surface, trust boundaries, and key-management assumptions
-- Reproducible benchmarks are forthcoming.
+- [Reproducible benchmarks](docs/reference/benchmarks.md) — run `make benchmarks` on your hardware; retry/fallback overhead not included until Epic #113 lands.
diff --git a/Makefile b/Makefile
@@ -16,7 +16,7 @@ ifeq ($(UNAME_S),Darwin)
   GO_ENV := env -u CC CC=/usr/bin/clang CGO_ENABLED=1
 endif
 
-.PHONY: help build install test test-integration test-e2e test-smoke test-all test-ssot-gate conformance lint fmt clean vet mod-tidy check docker-build demo-gateway demo-full demo-clean verify-flow0 nosec-count
+.PHONY: help build install test test-integration test-e2e test-smoke test-all test-ssot-gate conformance benchmarks lint fmt clean vet mod-tidy check docker-build demo-gateway demo-full demo-clean verify-flow0 nosec-count
 
 # Conformance suite: the evidence + policy paths whose passing test/subtest
 # count is published as Talon's honest conformance number. See
@@ -63,6 +63,9 @@ conformance: ## Run the evidence + policy conformance suite and print the passin
 	if [ $$rc -ne 0 ]; then printf '%s\n' "$$out" | tail -20; echo "conformance: FAILED ($$count passing before failure)"; exit 1; fi; \
 	echo "Conformance: $$count passing tests across evidence + policy paths ($(CONFORMANCE_PKGS))"
 
+benchmarks: ## Run reproducible micro-benchmarks (gateway overhead, PII scan, evidence write)
+	@bash scripts/run-benchmarks.sh
+
 lint: ## Run linter
 	@golangci-lint run ./...
 

diff --git a/README.md b/README.md
@@ -127,7 +127,7 @@ sequenceDiagram
     Talon-->>Client: GovernedResponse
 ```
 
-Pipeline overhead is typically under 15ms excluding upstream latency. Full byte-level breakdown: [What Talon does to your request](docs/explanation/what-talon-does-to-your-request.md).
+Pipeline overhead is typically under 15ms excluding upstream latency. Reproduce on your machine: `make benchmarks`. Full byte-level breakdown: [What Talon does to your request](docs/explanation/what-talon-does-to-your-request.md) · [Benchmarks](docs/reference/benchmarks.md).
 
 ---
 

diff --git a/docs/README.md b/docs/README.md
@@ -70,6 +70,7 @@ Choose the shortest path for your situation:
 | [Evidence integrity specification](reference/evidence-integrity-spec.md) | Normative signed-record spec: fields, canonical serialization, HMAC-SHA256 signing, and the independent verification procedure. |
 | [Threat model](reference/threat-model.md) | STRIDE-style attack surface, trust boundaries, threats/mitigations, and key-management assumptions for the gateway path. |
 | [Conformance suite & count](reference/conformance.md) | What counts as a conformance test for the evidence + policy paths, and how to reproduce the published count with `make conformance`. |
+| [Reproducible benchmarks](reference/benchmarks.md) | Gateway pipeline overhead, PII scan latency, and evidence write throughput (`make benchmarks`). |
 | [Authentication and key scopes](reference/authentication-and-key-scopes.md) | Which keys authenticate which endpoint families (gateway vs control plane vs dashboard). |
 | [Gateway dashboard](reference/gateway-dashboard.md) | Dashboard endpoints, metrics API schema, snapshot fields, and authentication. |
 | [Operational control plane](reference/operational-control-plane.md) | Run management (list/kill/pause/resume), tenant lockdown, runtime overrides, tool approval gates. |
@@ -98,6 +99,7 @@ Choose the shortest path for your situation:
 | [Evidence store](explanation/evidence-store.md) | HMAC integrity model and verification flow. |
 | [Evidence integrity specification](reference/evidence-integrity-spec.md) | Byte-exact spec so a third party can independently verify a record. |
 | [Conformance suite & count](reference/conformance.md) | Reproducible passing-test count for the evidence + policy paths (`make conformance`). |
+| [Reproducible benchmarks](reference/benchmarks.md) | `make benchmarks` — gateway overhead, PII scan, evidence write on your hardware. |
 | [Evidence integrity 5-minute proof](tutorials/evidence-integrity-demo.md) | Fast proof moment for auditors/operators, including offline signed-export verification. |
 | [Threat model](reference/threat-model.md) | Attack surface, trust boundaries, and what the HMAC signature does and does not prove. |
 | [Security policy](../SECURITY.md) | Vulnerability reporting process and security scope. |

diff --git a/docs/explanation/what-talon-does-to-your-request.md b/docs/explanation/what-talon-does-to-your-request.md
@@ -243,7 +243,13 @@ caller's daily/monthly accumulator (in-memory counter, periodically flushed).
 
 ## Throughput And Benchmarking
 
-Use this quick benchmark harness to measure your own environment. Throughput depends on message size, PII pattern density, and upstream provider latency.
+**Micro-benchmarks (reproducible from a clean checkout):** run `make benchmarks` or see
+[Reproducible benchmarks](../reference/benchmarks.md) for gateway pipeline overhead,
+PII scan latency, and evidence write throughput on your hardware.
+
+**End-to-end load (optional):** use this harness when you need concurrent requests
+through a running gateway. Throughput depends on message size, PII pattern density,
+and upstream provider latency.
 
 ```bash
 # 1) Start local proof environment

diff --git a/docs/reference/benchmarks.md b/docs/reference/benchmarks.md
@@ -0,0 +1,66 @@
+# Reproducible Benchmarks
+
+**Status:** stable · **Scope:** gateway pipeline overhead, PII scan latency, evidence write throughput.
+
+The README states that pipeline overhead is typically **under 15 ms excluding upstream
+latency**. This document defines how to reproduce the micro-benchmarks behind that claim,
+what each number measures, and what is intentionally out of scope.
+
+The authoritative numbers for a given machine are whatever `make benchmarks` prints
+when you run it locally. Results vary with CPU, Go version, SQLite build, and load;
+do not treat a single snapshot as an SLA.
+
+## Quick start
+
+```bash
+make benchmarks
+```
+
+Or with a saved snapshot file:
+
+```bash
+scripts/run-benchmarks.sh -o /tmp/talon-benchmarks.md
+```
+
+Requirements: Go 1.22+ (project pins 1.25.x in CI), CGO enabled (SQLite), repo root checkout.
+
+## What we measure
+
+| Metric | Go benchmark | Package | What it includes |
+|--------|--------------|---------|------------------|
+| **Gateway pipeline overhead** | `BenchmarkGatewayPipelineOverhead` | `internal/gateway` | One non-streaming `ServeHTTP` round trip: route, caller auth, request extract, PII scan, OPA policy evaluation, forward to a **local** `httptest` mock upstream, response PII scan, signed evidence write, metrics. Representative payload includes EU email + IBAN patterns. |
+| **PII scan latency** | `BenchmarkPIIScan` | `internal/classifier` | One `Scanner.Scan` on fixed text (email, IBAN, card). Isolates classifier cost without HTTP or SQLite. |
+| **Evidence write throughput** | `BenchmarkEvidenceStore` | `internal/evidence` | One `Generator.Generate` (HMAC-signed SQLite insert) per iteration. Isolates evidence path without gateway HTTP. |
+
+### What is excluded
+
+- **WAN upstream RTT** — the gateway benchmark uses an in-process mock server; add your provider latency separately.
+- **Retry / fallback routing** — not benchmarked until Epic #113 ([#138](https://github.com/dativo-io/talon/issues/138) / [#139](https://github.com/dativo-io/talon/issues/139)) lands.
+- **Streaming responses** — benchmarks use non-streaming JSON completions only.
+- **Attachment extraction / injection scan** — not in the default payload; add fixtures if you need that dimension.
+
+## Method
+
+1. **Toolchain:** `go test -bench=… -benchmem -benchtime=2s -count=5 -run=^$` over `./internal/gateway/...`, `./internal/classifier/...`, and `./internal/evidence/...`.
+2. **Repetition:** `-count=5` repeats each benchmark five times; the script reports the **last** `ns/op` line per benchmark, not a computed median. To check stability, compare the raw lines printed on stderr (their median and spread).
+3. **Hardware:** `scripts/run-benchmarks.sh` records `go version`, `uname`, and CPU model in the emitted table. Paste that block when publishing numbers externally.
+4. **Comparison to the 15 ms budget:** See the step table in [What Talon does to your request](../explanation/what-talon-does-to-your-request.md). Gateway overhead should be **below 15 ms** on a modern laptop/desktop when upstream is local; production adds network, disk contention, and concurrent load.
+
+## Interpreting results
+
+- **Gateway ms/req** — wall-clock per governed request with mock upstream. If this is consistently above 15 ms on your hardware, profile before citing the README claim in customer-facing material.
+- **PII ms/scan** — scales with input length and pattern density; the fixed benchmark string is a regression anchor, not a worst case.
+- **Evidence writes/s** — inverse of `ns/op` for `BenchmarkEvidenceStore`; useful for capacity planning on evidence-heavy workloads.
+
+## Source locations
+
+- Gateway: [`internal/gateway/bench_test.go`](../../internal/gateway/bench_test.go)
+- PII: [`internal/classifier/pii_test.go`](../../internal/classifier/pii_test.go) (`BenchmarkPIIScan`)
+- Evidence: [`internal/evidence/store_test.go`](../../internal/evidence/store_test.go) (`BenchmarkEvidenceStore`)
+- Runner: [`scripts/run-benchmarks.sh`](../../scripts/run-benchmarks.sh)
+
+## Related proof-bar docs
+
+- [Conformance suite & count](conformance.md) — reproducible test count for evidence + policy paths
+- [Evidence integrity specification](evidence-integrity-spec.md) — signed record format
+- [Threat model](threat-model.md) — trust boundaries the benchmarks do not replace
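
The Method section notes that the runner reports the last `ns/op` line of the repeated runs. If a true median is wanted, the raw `go test -bench` lines can be reduced along the lines of the sketch below. This is an illustration only and not part of the PR: `medianNsPerOp` is a hypothetical helper, the sample lines in `main` are made up, and the parser assumes the standard result layout (`name[-GOMAXPROCS]  iterations  value ns/op ...`).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// medianNsPerOp (hypothetical helper) collects the ns/op value from every
// result line of one benchmark and returns the median across repeated runs.
// Assumed line layout: BenchmarkName-8   1234   5678 ns/op   ...
func medianNsPerOp(output, name string) (float64, bool) {
	var samples []float64
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 4 || f[3] != "ns/op" {
			continue // not a benchmark result line
		}
		// Strip a numeric "-GOMAXPROCS" suffix, then require an exact match.
		base := f[0]
		if i := strings.LastIndex(base, "-"); i > 0 {
			if _, err := strconv.Atoi(base[i+1:]); err == nil {
				base = base[:i]
			}
		}
		if base != "Benchmark"+name {
			continue
		}
		v, err := strconv.ParseFloat(f[2], 64)
		if err != nil {
			continue
		}
		samples = append(samples, v)
	}
	if len(samples) == 0 {
		return 0, false
	}
	sort.Float64s(samples)
	mid := len(samples) / 2
	if len(samples)%2 == 1 {
		return samples[mid], true
	}
	// Even number of runs: average the two middle values.
	return (samples[mid-1] + samples[mid]) / 2, true
}

func main() {
	// Made-up sample output, for illustration only (not measured results).
	sample := `BenchmarkPIIScan-8   	  100000	     21000 ns/op	    512 B/op	      9 allocs/op
BenchmarkPIIScan-8   	  100000	     19000 ns/op	    512 B/op	      9 allocs/op
BenchmarkPIIScan-8   	  100000	     25000 ns/op	    512 B/op	      9 allocs/op`
	if m, ok := medianNsPerOp(sample, "PIIScan"); ok {
		fmt.Printf("median: %.0f ns/op\n", m)
	}
}
```

Feeding it the stderr section of the runner (everything after `--- raw go test -bench output ---`) would give one median per benchmark name. For publishable comparisons, a dedicated tool such as `benchstat` is likely a better fit than a hand-rolled parser.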
diff --git a/internal/gateway/bench_test.go b/internal/gateway/bench_test.go
@@ -0,0 +1,108 @@
+package gateway
+
+import (
+	"bytes"
+	"context"
+	"net/http"
+	"net/http/httptest"
+	"path/filepath"
+	"testing"
+
+	"github.com/dativo-io/talon/internal/classifier"
+	"github.com/dativo-io/talon/internal/evidence"
+	"github.com/dativo-io/talon/internal/policy"
+	"github.com/dativo-io/talon/internal/secrets"
+	"github.com/dativo-io/talon/internal/testutil"
+	"github.com/go-chi/chi/v5"
+)
+
+// BenchmarkGatewayPipelineOverhead measures end-to-end gateway wall time for one
+// non-streaming chat completion through ServeHTTP, with a local mock upstream (no
+// WAN RTT). This approximates Talon pipeline overhead for a typical payload:
+// route, caller auth, PII scan, policy evaluation, forward, response PII scan,
+// evidence write, and metrics.
+func BenchmarkGatewayPipelineOverhead(b *testing.B) {
+	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Content-Type", "application/json")
+		w.WriteHeader(http.StatusOK)
+		_, _ = w.Write([]byte(`{"choices":[{"message":{"content":"ok"}}],"usage":{"prompt_tokens":10,"completion_tokens":5}}`))
+	}))
+	defer upstream.Close()
+
+	dir := b.TempDir()
+	cfg := &GatewayConfig{
+		Enabled:      true,
+		ListenPrefix: "/v1/proxy",
+		Mode:         ModeEnforce,
+		Providers: map[string]ProviderConfig{
+			"ollama": {Enabled: true, BaseURL: upstream.URL},
+		},
+		Callers: []CallerConfig{
+			{
+				Name: "bench-caller", TenantKey: "talon-gw-bench", TenantID: "default",
+				PolicyOverrides: &CallerPolicyOverrides{
+					AllowedModels: []string{"llama2"},
+					MaxDailyCost:  1000,
+				},
+			},
+		},
+		ServerDefaults: ServerDefaults{DefaultPIIAction: "warn"},
+		RateLimits: RateLimitsConfig{
+			GlobalRequestsPerMin:    1_000_000,
+			PerCallerRequestsPerMin: 1_000_000,
+		},
+		Timeouts: TimeoutsConfig{
+			ConnectTimeout:    "5s",
+			RequestTimeout:    "30s",
+			StreamIdleTimeout: "60s",
+		},
+	}
+
+	evStore, err := evidence.NewStore(filepath.Join(dir, "e.db"), testutil.TestSigningKey)
+	if err != nil {
+		b.Fatal(err)
+	}
+	defer evStore.Close()
+
+	secStore, err := secrets.NewSecretStore(filepath.Join(dir, "s.db"), testutil.TestEncryptionKey)
+	if err != nil {
+		b.Fatal(err)
+	}
+	defer secStore.Close()
+
+	cls := classifier.MustNewScanner()
+	policyEngine, err := policy.NewGatewayEngine(context.Background())
+	if err != nil {
+		b.Fatal(err)
+	}
+
+	gw, err := NewGateway(cfg, cls, evStore, secStore, policyEngine, nil)
+	if err != nil {
+		b.Fatal(err)
+	}
+
+	router := chi.NewRouter()
+	router.Route("/v1/proxy", func(r chi.Router) {
+		r.Handle("/*", gw)
+	})
+
+	// Representative user text with EU PII patterns (email + IBAN).
+	body := []byte(`{"model":"llama2","messages":[{"role":"user","content":"Contact hans.mueller@acme.de about IBAN DE89370400440532013000"}]}`)
+
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		req, err := http.NewRequestWithContext(context.Background(), http.MethodPost,
+			"http://test/v1/proxy/ollama/v1/chat/completions", bytes.NewReader(body))
+		if err != nil {
+			b.Fatal(err)
+		}
+		req.Header.Set("Authorization", "Bearer talon-gw-bench")
+		req.Header.Set("Content-Type", "application/json")
+
+		w := httptest.NewRecorder()
+		router.ServeHTTP(w, req)
+		if w.Code != http.StatusOK {
+			b.Fatalf("status %d: %s", w.Code, w.Body.String())
+		}
+	}
+}
diff --git a/scripts/run-benchmarks.sh b/scripts/run-benchmarks.sh
@@ -0,0 +1,128 @@
+#!/usr/bin/env bash
+#
+# run-benchmarks.sh — reproducible micro-benchmarks for Talon proof-bar metrics.
+#
+# Measures:
+#   - Gateway pipeline overhead (ServeHTTP + local mock upstream, no WAN RTT)
+#   - PII scan latency (classifier)
+#   - Evidence write throughput (signed SQLite record per op)
+#
+# Usage:
+#   scripts/run-benchmarks.sh              # print markdown table to stdout
+#   scripts/run-benchmarks.sh -o FILE.md   # also write table to FILE
+#
+# See docs/reference/benchmarks.md for methodology and how to interpret results.
+#
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+OUTPUT=""
+BENCH_TIME="${BENCH_TIME:-2s}"
+BENCH_COUNT="${BENCH_COUNT:-5}"
+BENCH_PKGS="./internal/gateway/... ./internal/classifier/... ./internal/evidence/..."
+BENCH_REGEX='Benchmark(GatewayPipelineOverhead|PIIScan|EvidenceStore)$'
+
+while getopts "o:" opt; do
+  case "$opt" in
+    o) OUTPUT="$OPTARG" ;;
+    *) echo "Usage: $0 [-o outfile.md]" >&2; exit 2 ;;
+  esac
+done
+
+if [ "$(uname -s)" = "Darwin" ]; then
+  GO_ENV=(env -u CC CC=/usr/bin/clang CGO_ENABLED=1)
+else
+  GO_ENV=(env CGO_ENABLED=1)
+fi
+
+bench_out=$("${GO_ENV[@]}" go test \
+  -bench="$BENCH_REGEX" \
+  -benchmem \
+  -benchtime="$BENCH_TIME" \
+  -count="$BENCH_COUNT" \
+  -run='^$' \
+  $BENCH_PKGS 2>&1) || {
+  echo "$bench_out" >&2
+  exit 1
+}
+
+# Parse last result line per benchmark name (go test -count repeats runs).
+parse_ns_per_op() {
+  local name="$1"
+  printf '%s\n' "$bench_out" \
+    | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" { last=$3 } END { print last+0 }'
+}
+
+parse_allocs() {
+  local name="$1"
+  printf '%s\n' "$bench_out" \
+    | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" && $0 ~ /allocs\/op/ { last=$(NF-1)" "$NF } END { print last }'
+}
+
+ns_to_ms() {
+  awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.2f", ns/1e6 }'
+}
+
+ns_to_ops_per_sec() {
+  awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.0f", 1e9/ns }'
+}
+
+gw_ns=$(parse_ns_per_op GatewayPipelineOverhead)
+pii_ns=$(parse_ns_per_op PIIScan)
+ev_ns=$(parse_ns_per_op EvidenceStore)
+
+gw_ms=$(ns_to_ms "$gw_ns")
+pii_ms=$(ns_to_ms "$pii_ns")
+ev_ms=$(ns_to_ms "$ev_ns")
+ev_ops=$(ns_to_ops_per_sec "$ev_ns")
+
+gw_allocs=$(parse_allocs GatewayPipelineOverhead)
+pii_allocs=$(parse_allocs PIIScan)
+ev_allocs=$(parse_allocs EvidenceStore)
+
+commit=$(git rev-parse --short HEAD 2>/dev/null || echo "unknown")
+generated=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
+go_ver=$("${GO_ENV[@]}" go version | sed 's/^go version //')
+os_info=$(uname -srm 2>/dev/null || uname -a)
+cpu_info=$(sysctl -n machdep.cpu.brand_string 2>/dev/null || grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | xargs || echo "unknown")
+
+table=$(cat <<EOF
+## Benchmark results (generated)
+
+| Metric | Benchmark | Last of ${BENCH_COUNT} runs | Allocs/op |
+|--------|-----------|-----------------------------|-----------|
+| Gateway pipeline overhead | \`BenchmarkGatewayPipelineOverhead\` | **${gw_ms} ms**/req | ${gw_allocs:-n/a} |
+| PII scan latency | \`BenchmarkPIIScan\` | **${pii_ms} ms**/scan | ${pii_allocs:-n/a} |
+| Evidence write throughput | \`BenchmarkEvidenceStore\` | **${ev_ops} writes/s** (~${ev_ms} ms/write) | ${ev_allocs:-n/a} |
+
+**Environment:** ${go_ver} · ${os_info} · ${cpu_info}  
+**Commit:** \`${commit}\` · **Generated:** ${generated}  
+**Settings:** \`-benchtime=${BENCH_TIME}\` \`-count=${BENCH_COUNT}\` \`-benchmem\`
+
+Gateway overhead uses a local \`httptest\` upstream (no WAN RTT). Compare to the README
+"< 15 ms excluding upstream" claim and the step budget in
+[What Talon does to your request](../explanation/what-talon-does-to-your-request.md).
+
+Retry/fallback decision overhead is **not** included until Epic #113 (#138/#139) lands.
+EOF
+)
+
+echo "$table"
+
+if [ -n "$OUTPUT" ]; then
+  mkdir -p "$(dirname "$OUTPUT")"
+  {
+    echo "# Talon benchmark snapshot"
+    echo ""
+    echo "$table"
+    echo ""
+    echo "Reproduce: \`make benchmarks\` or \`scripts/run-benchmarks.sh\`."
+  } >"$OUTPUT"
+  echo "Wrote $OUTPUT" >&2
+fi
+
+# Raw bench output for auditors who want the full go test lines.
+echo "" >&2
+echo "--- raw go test -bench output ---" >&2
+printf '%s\n' "$bench_out" | grep -E '^Benchmark|^[0-9]+ ' >&2
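
The script's `ns_to_ms` and `ns_to_ops_per_sec` helpers are plain unit conversions on `ns/op`. For readers who post-process results in Go rather than awk, a rough equivalent is sketched below, together with a check against the README's 15 ms figure. The function names, the `budgetMs` constant, and the sample value are illustrative assumptions and do not appear in the PR.

```go
package main

import "fmt"

// budgetMs mirrors the README claim of under 15 ms excluding upstream latency.
const budgetMs = 15.0

// nsToMs converts nanoseconds per operation to milliseconds per operation.
func nsToMs(nsPerOp float64) float64 {
	return nsPerOp / 1e6
}

// nsToOpsPerSec converts nanoseconds per operation to operations per second.
// It returns 0 for non-positive input, where the script prints "n/a".
func nsToOpsPerSec(nsPerOp float64) float64 {
	if nsPerOp <= 0 {
		return 0
	}
	return 1e9 / nsPerOp
}

// withinBudget reports whether a measured ns/op is a valid sample that
// falls below the 15 ms budget.
func withinBudget(nsPerOp float64) bool {
	return nsPerOp > 0 && nsToMs(nsPerOp) < budgetMs
}

func main() {
	// Made-up sample value, for illustration only (not a measured result).
	const sampleNs = 2_500_000
	fmt.Printf("%.2f ms/req, %.0f ops/s, within budget: %t\n",
		nsToMs(sampleNs), nsToOpsPerSec(sampleNs), withinBudget(sampleNs))
}
```

Treating a missing or zero measurement as outside the budget keeps a failed parse from being read as a pass.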