diff --git a/LIMITATIONS.md b/LIMITATIONS.md index 4207bfe..f6719d7 100644 --- a/LIMITATIONS.md +++ b/LIMITATIONS.md @@ -53,4 +53,4 @@ A valid signature proves that this evidence record was signed with the deploymen - [Evidence store](docs/explanation/evidence-store.md) — how records are created, signed, and verified - [Evidence integrity specification](docs/reference/evidence-integrity-spec.md) — byte-exact fields, serialization, signing, and independent verification - [Threat model](docs/reference/threat-model.md) — attack surface, trust boundaries, and key-management assumptions -- Reproducible benchmarks are forthcoming. +- [Reproducible benchmarks](docs/reference/benchmarks.md) — run `make benchmarks` on your hardware; retry/fallback overhead not included until Epic #113 lands. diff --git a/Makefile b/Makefile index 0eeac47..ed70f3d 100644 --- a/Makefile +++ b/Makefile @@ -16,7 +16,7 @@ ifeq ($(UNAME_S),Darwin) GO_ENV := env -u CC CC=/usr/bin/clang CGO_ENABLED=1 endif -.PHONY: help build install test test-integration test-e2e test-smoke test-all test-ssot-gate conformance lint fmt clean vet mod-tidy check docker-build demo-gateway demo-full demo-clean verify-flow0 nosec-count +.PHONY: help build install test test-integration test-e2e test-smoke test-all test-ssot-gate conformance benchmarks lint fmt clean vet mod-tidy check docker-build demo-gateway demo-full demo-clean verify-flow0 nosec-count # Conformance suite: the evidence + policy paths whose passing test/subtest # count is published as Talon's honest conformance number. 
See @@ -63,6 +63,9 @@ conformance: ## Run the evidence + policy conformance suite and print the passin if [ $$rc -ne 0 ]; then printf '%s\n' "$$out" | tail -20; echo "conformance: FAILED ($$count passing before failure)"; exit 1; fi; \ echo "Conformance: $$count passing tests across evidence + policy paths ($(CONFORMANCE_PKGS))" +benchmarks: ## Run reproducible micro-benchmarks (gateway overhead, PII scan, evidence write) + @bash scripts/run-benchmarks.sh + lint: ## Run linter @golangci-lint run ./... diff --git a/README.md b/README.md index 49db1f1..89a2473 100644 --- a/README.md +++ b/README.md @@ -127,7 +127,7 @@ sequenceDiagram Talon-->>Client: GovernedResponse ``` -Pipeline overhead is typically under 15ms excluding upstream latency. Full byte-level breakdown: [What Talon does to your request](docs/explanation/what-talon-does-to-your-request.md). +Pipeline overhead is typically under 15ms excluding upstream latency. Reproduce on your machine: `make benchmarks`. Full byte-level breakdown: [What Talon does to your request](docs/explanation/what-talon-does-to-your-request.md) · [Benchmarks](docs/reference/benchmarks.md). --- diff --git a/docs/README.md b/docs/README.md index 46d8478..1826e5c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -70,6 +70,7 @@ Choose the shortest path for your situation: | [Evidence integrity specification](reference/evidence-integrity-spec.md) | Normative signed-record spec: fields, canonical serialization, HMAC-SHA256 signing, and the independent verification procedure. | | [Threat model](reference/threat-model.md) | STRIDE-style attack surface, trust boundaries, threats/mitigations, and key-management assumptions for the gateway path. | | [Conformance suite & count](reference/conformance.md) | What counts as a conformance test for the evidence + policy paths, and how to reproduce the published count with `make conformance`. 
| +| [Reproducible benchmarks](reference/benchmarks.md) | Gateway pipeline overhead, PII scan latency, and evidence write throughput (`make benchmarks`). | | [Authentication and key scopes](reference/authentication-and-key-scopes.md) | Which keys authenticate which endpoint families (gateway vs control plane vs dashboard). | | [Gateway dashboard](reference/gateway-dashboard.md) | Dashboard endpoints, metrics API schema, snapshot fields, and authentication. | | [Operational control plane](reference/operational-control-plane.md) | Run management (list/kill/pause/resume), tenant lockdown, runtime overrides, tool approval gates. | @@ -98,6 +99,7 @@ Choose the shortest path for your situation: | [Evidence store](explanation/evidence-store.md) | HMAC integrity model and verification flow. | | [Evidence integrity specification](reference/evidence-integrity-spec.md) | Byte-exact spec so a third party can independently verify a record. | | [Conformance suite & count](reference/conformance.md) | Reproducible passing-test count for the evidence + policy paths (`make conformance`). | +| [Reproducible benchmarks](reference/benchmarks.md) | `make benchmarks` — gateway overhead, PII scan, evidence write on your hardware. | | [Evidence integrity 5-minute proof](tutorials/evidence-integrity-demo.md) | Fast proof moment for auditors/operators, including offline signed-export verification. | | [Threat model](reference/threat-model.md) | Attack surface, trust boundaries, and what the HMAC signature does and does not prove. | | [Security policy](../SECURITY.md) | Vulnerability reporting process and security scope. | diff --git a/docs/explanation/what-talon-does-to-your-request.md b/docs/explanation/what-talon-does-to-your-request.md index c132738..6fcc7c1 100644 --- a/docs/explanation/what-talon-does-to-your-request.md +++ b/docs/explanation/what-talon-does-to-your-request.md @@ -243,7 +243,13 @@ caller's daily/monthly accumulator (in-memory counter, periodically flushed). 
## Throughput And Benchmarking -Use this quick benchmark harness to measure your own environment. Throughput depends on message size, PII pattern density, and upstream provider latency. +**Micro-benchmarks (reproducible from a clean checkout):** run `make benchmarks` or see +[Reproducible benchmarks](../reference/benchmarks.md) for gateway pipeline overhead, +PII scan latency, and evidence write throughput on your hardware. + +**End-to-end load (optional):** use this harness when you need concurrent requests +through a running gateway. Throughput depends on message size, PII pattern density, +and upstream provider latency. ```bash # 1) Start local proof environment diff --git a/docs/reference/benchmarks.md b/docs/reference/benchmarks.md new file mode 100644 index 0000000..85d8a17 --- /dev/null +++ b/docs/reference/benchmarks.md @@ -0,0 +1,66 @@ +# Reproducible Benchmarks + +**Status:** stable · **Scope:** gateway pipeline overhead, PII scan latency, evidence write throughput. + +The README states that pipeline overhead is typically **under 15 ms excluding upstream +latency**. This document defines how to reproduce the micro-benchmarks behind that claim, +what each number measures, and what is intentionally out of scope. + +The authoritative numbers for a given machine are whatever `make benchmarks` prints +when you run it locally. Results vary with CPU, Go version, SQLite build, and load; +do not treat a single snapshot as an SLA. + +## Quick start + +```bash +make benchmarks +``` + +Or with a saved snapshot file: + +```bash +scripts/run-benchmarks.sh -o /tmp/talon-benchmarks.md +``` + +Requirements: Go 1.22+ (project pins 1.25.x in CI), CGO enabled (SQLite), repo root checkout. 
+ +## What we measure + +| Metric | Go benchmark | Package | What it includes | +|--------|--------------|---------|------------------| +| **Gateway pipeline overhead** | `BenchmarkGatewayPipelineOverhead` | `internal/gateway` | One non-streaming `ServeHTTP` round trip: route, caller auth, request extract, PII scan, OPA policy evaluation, forward to a **local** `httptest` mock upstream, response PII scan, signed evidence write, metrics. Representative payload includes EU email + IBAN patterns. | +| **PII scan latency** | `BenchmarkPIIScan` | `internal/classifier` | One `Scanner.Scan` on fixed text (email, IBAN, card). Isolates classifier cost without HTTP or SQLite. | +| **Evidence write throughput** | `BenchmarkEvidenceStore` | `internal/evidence` | One `Generator.Generate` (HMAC-signed SQLite insert) per iteration. Isolates evidence path without gateway HTTP. | + +### What is excluded + +- **WAN upstream RTT** — the gateway benchmark uses an in-process mock server; add your provider latency separately. +- **Retry / fallback routing** — not benchmarked until Epic #113 ([#138](https://github.com/dativo-io/talon/issues/138) / [#139](https://github.com/dativo-io/talon/issues/139)) lands. +- **Streaming responses** — benchmarks use non-streaming JSON completions only. +- **Attachment extraction / injection scan** — not in the default payload; add fixtures if you need that dimension. + +## Method + +1. **Toolchain:** `go test -bench=… -benchmem -benchtime=2s -count=5 -run=^$` over `./internal/gateway/...`, `./internal/classifier/...`, and `./internal/evidence/...`. +2. **Repetitions:** `-count=5` runs each benchmark five times; the script reports only the **last** `ns/op` line per benchmark, not a median (comparing it against the median of the five runs is a reasonable stability check; inspect raw output in stderr for spread). +3. **Hardware:** `scripts/run-benchmarks.sh` records `go version`, `uname`, and CPU model in the emitted table. Paste that block when publishing numbers externally. +4. 
**Comparison to the 15 ms budget:** See the step table in [What Talon does to your request](../explanation/what-talon-does-to-your-request.md). Gateway overhead should be **below 15 ms** on a modern laptop/desktop when upstream is local; production adds network, disk contention, and concurrent load. + +## Interpreting results + +- **Gateway ms/req** — wall-clock per governed request with mock upstream. If this is consistently above 15 ms on your hardware, profile before citing the README claim in customer-facing material. +- **PII ms/scan** — scales with input length and pattern density; the fixed benchmark string is a regression anchor, not a worst case. +- **Evidence writes/s** — inverse of `ns/op` for `BenchmarkEvidenceStore`; useful for capacity planning on evidence-heavy workloads. + +## Source locations + +- Gateway: [`internal/gateway/bench_test.go`](../../internal/gateway/bench_test.go) +- PII: [`internal/classifier/pii_test.go`](../../internal/classifier/pii_test.go) (`BenchmarkPIIScan`) +- Evidence: [`internal/evidence/store_test.go`](../../internal/evidence/store_test.go) (`BenchmarkEvidenceStore`) +- Runner: [`scripts/run-benchmarks.sh`](../../scripts/run-benchmarks.sh) + +## Related proof-bar docs + +- [Conformance suite & count](conformance.md) — reproducible test count for evidence + policy paths +- [Evidence integrity specification](evidence-integrity-spec.md) — signed record format +- [Threat model](threat-model.md) — trust boundaries the benchmarks do not replace diff --git a/internal/gateway/bench_test.go b/internal/gateway/bench_test.go new file mode 100644 index 0000000..26f9f30 --- /dev/null +++ b/internal/gateway/bench_test.go @@ -0,0 +1,108 @@ +package gateway + +import ( + "bytes" + "context" + "net/http" + "net/http/httptest" + "path/filepath" + "testing" + + "github.com/dativo-io/talon/internal/classifier" + "github.com/dativo-io/talon/internal/evidence" + "github.com/dativo-io/talon/internal/policy" + 
"github.com/dativo-io/talon/internal/secrets" + "github.com/dativo-io/talon/internal/testutil" + "github.com/go-chi/chi/v5" +) + +// BenchmarkGatewayPipelineOverhead measures end-to-end gateway wall time for one +// non-streaming chat completion through ServeHTTP, with a local mock upstream (no +// WAN RTT). This approximates Talon pipeline overhead for a typical payload: +// route, caller auth, PII scan, policy evaluation, forward, response PII scan, +// evidence write, and metrics. +func BenchmarkGatewayPipelineOverhead(b *testing.B) { + upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"choices":[{"message":{"content":"ok"}}],"usage":{"prompt_tokens":10,"completion_tokens":5}}`)) + })) + defer upstream.Close() + + dir := b.TempDir() + cfg := &GatewayConfig{ + Enabled: true, + ListenPrefix: "/v1/proxy", + Mode: ModeEnforce, + Providers: map[string]ProviderConfig{ + "ollama": {Enabled: true, BaseURL: upstream.URL}, + }, + Callers: []CallerConfig{ + { + Name: "bench-caller", TenantKey: "talon-gw-bench", TenantID: "default", + PolicyOverrides: &CallerPolicyOverrides{ + AllowedModels: []string{"llama2"}, + MaxDailyCost: 1000, + }, + }, + }, + ServerDefaults: ServerDefaults{DefaultPIIAction: "warn"}, + RateLimits: RateLimitsConfig{ + GlobalRequestsPerMin: 1_000_000, + PerCallerRequestsPerMin: 1_000_000, + }, + Timeouts: TimeoutsConfig{ + ConnectTimeout: "5s", + RequestTimeout: "30s", + StreamIdleTimeout: "60s", + }, + } + + evStore, err := evidence.NewStore(filepath.Join(dir, "e.db"), testutil.TestSigningKey) + if err != nil { + b.Fatal(err) + } + defer evStore.Close() + + secStore, err := secrets.NewSecretStore(filepath.Join(dir, "s.db"), testutil.TestEncryptionKey) + if err != nil { + b.Fatal(err) + } + defer secStore.Close() + + cls := classifier.MustNewScanner() + policyEngine, err := 
policy.NewGatewayEngine(context.Background()) + if err != nil { + b.Fatal(err) + } + + gw, err := NewGateway(cfg, cls, evStore, secStore, policyEngine, nil) + if err != nil { + b.Fatal(err) + } + + router := chi.NewRouter() + router.Route("/v1/proxy", func(r chi.Router) { + r.Handle("/*", gw) + }) + + // Representative user text with EU PII patterns (email + IBAN). + body := []byte(`{"model":"llama2","messages":[{"role":"user","content":"Contact hans.mueller@acme.de about IBAN DE89370400440532013000"}]}`) + + b.ResetTimer() + for i := 0; i < b.N; i++ { + req, err := http.NewRequestWithContext(context.Background(), http.MethodPost, + "http://test/v1/proxy/ollama/v1/chat/completions", bytes.NewReader(body)) + if err != nil { + b.Fatal(err) + } + req.Header.Set("Authorization", "Bearer talon-gw-bench") + req.Header.Set("Content-Type", "application/json") + + w := httptest.NewRecorder() + router.ServeHTTP(w, req) + if w.Code != http.StatusOK { + b.Fatalf("status %d: %s", w.Code, w.Body.String()) + } + } +} diff --git a/scripts/run-benchmarks.sh b/scripts/run-benchmarks.sh new file mode 100755 index 0000000..e0b3254 --- /dev/null +++ b/scripts/run-benchmarks.sh @@ -0,0 +1,128 @@ +#!/usr/bin/env bash +# +# run-benchmarks.sh — reproducible micro-benchmarks for Talon proof-bar metrics. +# +# Measures: +# - Gateway pipeline overhead (ServeHTTP + local mock upstream, no WAN RTT) +# - PII scan latency (classifier) +# - Evidence write throughput (signed SQLite record per op) +# +# Usage: +# scripts/run-benchmarks.sh # print markdown table to stdout +# scripts/run-benchmarks.sh -o FILE.md # also write table to FILE +# +# See docs/reference/benchmarks.md for methodology and how to interpret results. +# +set -euo pipefail + +cd "$(dirname "$0")/.." + +OUTPUT="" +BENCH_TIME="${BENCH_TIME:-2s}" +BENCH_COUNT="${BENCH_COUNT:-5}" +BENCH_PKGS="./internal/gateway/... ./internal/classifier/... ./internal/evidence/..." 
+BENCH_REGEX='Benchmark(GatewayPipelineOverhead|PIIScan|EvidenceStore)$' + +while getopts "o:" opt; do + case "$opt" in + o) OUTPUT="$OPTARG" ;; + *) echo "Usage: $0 [-o outfile.md]" >&2; exit 2 ;; + esac +done + +if [ "$(uname -s)" = "Darwin" ]; then + GO_ENV=(env -u CC CC=/usr/bin/clang CGO_ENABLED=1) +else + GO_ENV=(env CGO_ENABLED=1) +fi + +bench_out=$("${GO_ENV[@]}" go test \ + -bench="$BENCH_REGEX" \ + -benchmem \ + -benchtime="$BENCH_TIME" \ + -count="$BENCH_COUNT" \ + -run='^$' \ + $BENCH_PKGS 2>&1) || { + echo "$bench_out" >&2 + exit 1 +} + +# Parse last result line per benchmark name (go test -count repeats runs). +parse_ns_per_op() { + local name="$1" + printf '%s\n' "$bench_out" \ + | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" { last=$3 } END { print last+0 }' +} + +parse_allocs() { + local name="$1" + printf '%s\n' "$bench_out" \ + | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" && $0 ~ /allocs\/op/ { last=$(NF-1)" "$NF } END { print last }' +} + +ns_to_ms() { + awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.2f", ns/1e6 }' +} + +ns_to_ops_per_sec() { + awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.0f", 1e9/ns }' +} + +gw_ns=$(parse_ns_per_op GatewayPipelineOverhead) +pii_ns=$(parse_ns_per_op PIIScan) +ev_ns=$(parse_ns_per_op EvidenceStore) + +gw_ms=$(ns_to_ms "$gw_ns") +pii_ms=$(ns_to_ms "$pii_ns") +ev_ms=$(ns_to_ms "$ev_ns") +ev_ops=$(ns_to_ops_per_sec "$ev_ns") + +gw_allocs=$(parse_allocs GatewayPipelineOverhead) +pii_allocs=$(parse_allocs PIIScan) +ev_allocs=$(parse_allocs EvidenceStore) + +commit=$(git rev-parse --short HEAD 2>/dev/null || echo "unknown") +generated=$(date -u +"%Y-%m-%dT%H:%M:%SZ") +go_ver=$("${GO_ENV[@]}" go version | sed 's/^go version //') +os_info=$(uname -srm 2>/dev/null || uname -a) +cpu_info=$(sysctl -n machdep.cpu.brand_string 2>/dev/null || grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | xargs || echo "unknown") + +table=$(cat <"$OUTPUT" + echo 
"Wrote $OUTPUT" >&2 +fi + +# Raw bench output for auditors who want the full go test lines. +echo "" >&2 +echo "--- raw go test -bench output ---" >&2 +printf '%s\n' "$bench_out" | grep -E '^Benchmark|^[0-9]+ ' >&2