docs(benchmarks): reproducible gateway, PII, and evidence benchmarks (#119) #164
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| # Reproducible Benchmarks | ||
|
|
||
| **Status:** stable · **Scope:** gateway pipeline overhead, PII scan latency, evidence write throughput. | ||
|
|
||
| The README states that pipeline overhead is typically **under 15 ms excluding upstream | ||
| latency**. This document defines how to reproduce the micro-benchmarks behind that claim, | ||
| what each number measures, and what is intentionally out of scope. | ||
|
|
||
| The authoritative numbers for a given machine are whatever `make benchmarks` prints | ||
| when you run it locally. Results vary with CPU, Go version, SQLite build, and load; | ||
| do not treat a single snapshot as an SLA. | ||
|
|
||
| ## Quick start | ||
|
|
||
| ```bash | ||
| make benchmarks | ||
| ``` | ||
|
|
||
| Or with a saved snapshot file: | ||
|
|
||
| ```bash | ||
| scripts/run-benchmarks.sh -o /tmp/talon-benchmarks.md | ||
| ``` | ||
|
|
||
| Requirements: Go 1.22+ (project pins 1.25.x in CI), CGO enabled (SQLite), repo root checkout. | ||
|
|
||
| ## What we measure | ||
|
|
||
| | Metric | Go benchmark | Package | What it includes | | ||
| |--------|--------------|---------|------------------| | ||
| | **Gateway pipeline overhead** | `BenchmarkGatewayPipelineOverhead` | `internal/gateway` | One non-streaming `ServeHTTP` round trip: route, caller auth, request extract, PII scan, OPA policy evaluation, forward to a **local** `httptest` mock upstream, response PII scan, signed evidence write, metrics. Representative payload includes EU email + IBAN patterns. | | ||
| | **PII scan latency** | `BenchmarkPIIScan` | `internal/classifier` | One `Scanner.Scan` on fixed text (email, IBAN, card). Isolates classifier cost without HTTP or SQLite. | | ||
| | **Evidence write throughput** | `BenchmarkEvidenceStore` | `internal/evidence` | One `Generator.Generate` (HMAC-signed SQLite insert) per iteration. Isolates evidence path without gateway HTTP. | | ||
|
|
||
| ### What is excluded | ||
|
|
||
| - **WAN upstream RTT** — the gateway benchmark uses an in-process mock server; add your provider latency separately. | ||
| - **Retry / fallback routing** — not benchmarked until Epic #113 ([#138](https://github.com/dativo-io/talon/issues/138) / [#139](https://github.com/dativo-io/talon/issues/139)) lands. | ||
| - **Streaming responses** — benchmarks use non-streaming JSON completions only. | ||
| - **Attachment extraction / injection scan** — not in the default payload; add fixtures if you need that dimension. | ||
|
|
||
| ## Method | ||
|
|
||
| 1. **Toolchain:** `go test -bench=… -benchmem -benchtime=2s -count=5 -run=^$` over `./internal/gateway/...`, `./internal/classifier/...`, and `./internal/evidence/...`. | ||
| 2. **Repetitions:** `-count=5` runs each benchmark five times; the script reports the **last** `ns/op` line per benchmark, not a median. Inspect the raw output on stderr for spread; taking the median of the five runs is a reasonable stability check. | ||
| 3. **Hardware:** `scripts/run-benchmarks.sh` records `go version`, `uname`, and CPU model in the emitted table. Paste that block when publishing numbers externally. | ||
| 4. **Comparison to the 15 ms budget:** See the step table in [What Talon does to your request](../explanation/what-talon-does-to-your-request.md). Gateway overhead should be **below 15 ms** on a modern laptop/desktop when upstream is local; production adds network, disk contention, and concurrent load. | ||
|
|
||
| ## Interpreting results | ||
|
|
||
| - **Gateway ms/req** — wall-clock per governed request with mock upstream. If this is consistently above 15 ms on your hardware, profile before citing the README claim in customer-facing material. | ||
| - **PII ms/scan** — scales with input length and pattern density; the fixed benchmark string is a regression anchor, not a worst case. | ||
| - **Evidence writes/s** — inverse of `ns/op` for `BenchmarkEvidenceStore`; useful for capacity planning on evidence-heavy workloads. | ||
|
|
||
| ## Source locations | ||
|
|
||
| - Gateway: [`internal/gateway/bench_test.go`](../../internal/gateway/bench_test.go) | ||
| - PII: [`internal/classifier/pii_test.go`](../../internal/classifier/pii_test.go) (`BenchmarkPIIScan`) | ||
| - Evidence: [`internal/evidence/store_test.go`](../../internal/evidence/store_test.go) (`BenchmarkEvidenceStore`) | ||
| - Runner: [`scripts/run-benchmarks.sh`](../../scripts/run-benchmarks.sh) | ||
|
|
||
| ## Related proof-bar docs | ||
|
|
||
| - [Conformance suite & count](conformance.md) — reproducible test count for evidence + policy paths | ||
| - [Evidence integrity specification](evidence-integrity-spec.md) — signed record format | ||
| - [Threat model](threat-model.md) — trust boundaries the benchmarks do not replace |
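The conversions described under "Interpreting results" in the document above (milliseconds per request, and writes per second as the inverse of `ns/op`) can be sketched in Go as follows. This is a minimal illustration, not code from the PR: the function names and the sample figures are hypothetical, and the 15 ms threshold is taken from the README claim quoted in the document.

```go
package main

import "fmt"

// nsToMs converts a Go benchmark ns/op figure to milliseconds per operation.
func nsToMs(nsPerOp float64) float64 {
	return nsPerOp / 1e6
}

// opsPerSec converts ns/op to operations per second: the inverse of ns/op,
// scaled by the number of nanoseconds in one second.
func opsPerSec(nsPerOp float64) float64 {
	if nsPerOp <= 0 {
		return 0
	}
	return 1e9 / nsPerOp
}

// withinBudget reports whether a measured ns/op is below a budget given in ms.
func withinBudget(nsPerOp, budgetMs float64) bool {
	return nsToMs(nsPerOp) < budgetMs
}

func main() {
	// Hypothetical figures, chosen only to make the arithmetic easy to follow.
	gatewayNs := 2_500_000.0 // 2.5 ms per request
	evidenceNs := 250_000.0  // 0.25 ms per write

	fmt.Printf("gateway: %.2f ms/req, under 15 ms: %v\n",
		nsToMs(gatewayNs), withinBudget(gatewayNs, 15))
	fmt.Printf("evidence: %.0f writes/s\n", opsPerSec(evidenceNs))
}
```

The writes-per-second figure describes a single sequential writer, because each benchmark iteration performs one write after another; it says nothing about concurrent throughput.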
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| package gateway | ||
|
|
||
| import ( | ||
| "bytes" | ||
| "context" | ||
| "net/http" | ||
| "net/http/httptest" | ||
| "path/filepath" | ||
| "testing" | ||
|
|
||
| "github.com/dativo-io/talon/internal/classifier" | ||
| "github.com/dativo-io/talon/internal/evidence" | ||
| "github.com/dativo-io/talon/internal/policy" | ||
| "github.com/dativo-io/talon/internal/secrets" | ||
| "github.com/dativo-io/talon/internal/testutil" | ||
| "github.com/go-chi/chi/v5" | ||
| ) | ||
|
|
||
| // BenchmarkGatewayPipelineOverhead measures end-to-end gateway wall time for one | ||
| // non-streaming chat completion through ServeHTTP, with a local mock upstream (no | ||
| // WAN RTT). This approximates Talon pipeline overhead for a typical payload: | ||
| // route, caller auth, PII scan, policy evaluation, forward, response PII scan, | ||
| // evidence write, and metrics. | ||
| func BenchmarkGatewayPipelineOverhead(b *testing.B) { | ||
| upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { | ||
| w.Header().Set("Content-Type", "application/json") | ||
| w.WriteHeader(http.StatusOK) | ||
| _, _ = w.Write([]byte(`{"choices":[{"message":{"content":"ok"}}],"usage":{"prompt_tokens":10,"completion_tokens":5}}`)) | ||
| })) | ||
| defer upstream.Close() | ||
|
|
||
| dir := b.TempDir() | ||
| cfg := &GatewayConfig{ | ||
| Enabled: true, | ||
| ListenPrefix: "/v1/proxy", | ||
| Mode: ModeEnforce, | ||
| Providers: map[string]ProviderConfig{ | ||
| "ollama": {Enabled: true, BaseURL: upstream.URL}, | ||
| }, | ||
| Callers: []CallerConfig{ | ||
| { | ||
| Name: "bench-caller", TenantKey: "talon-gw-bench", TenantID: "default", | ||
| PolicyOverrides: &CallerPolicyOverrides{ | ||
| AllowedModels: []string{"llama2"}, | ||
| MaxDailyCost: 1000, | ||
| }, | ||
| }, | ||
| }, | ||
| ServerDefaults: ServerDefaults{DefaultPIIAction: "warn"}, | ||
| RateLimits: RateLimitsConfig{ | ||
| GlobalRequestsPerMin: 1_000_000, | ||
| PerCallerRequestsPerMin: 1_000_000, | ||
| }, | ||
| Timeouts: TimeoutsConfig{ | ||
| ConnectTimeout: "5s", | ||
| RequestTimeout: "30s", | ||
| StreamIdleTimeout: "60s", | ||
| }, | ||
| } | ||
|
|
||
| evStore, err := evidence.NewStore(filepath.Join(dir, "e.db"), testutil.TestSigningKey) | ||
| if err != nil { | ||
| b.Fatal(err) | ||
| } | ||
| defer evStore.Close() | ||
|
|
||
| secStore, err := secrets.NewSecretStore(filepath.Join(dir, "s.db"), testutil.TestEncryptionKey) | ||
| if err != nil { | ||
| b.Fatal(err) | ||
| } | ||
| defer secStore.Close() | ||
|
|
||
| cls := classifier.MustNewScanner() | ||
| policyEngine, err := policy.NewGatewayEngine(context.Background()) | ||
| if err != nil { | ||
| b.Fatal(err) | ||
| } | ||
|
|
||
| gw, err := NewGateway(cfg, cls, evStore, secStore, policyEngine, nil) | ||
| if err != nil { | ||
| b.Fatal(err) | ||
| } | ||
|
|
||
| router := chi.NewRouter() | ||
| router.Route("/v1/proxy", func(r chi.Router) { | ||
| r.Handle("/*", gw) | ||
| }) | ||
|
|
||
| // Representative user text with EU PII patterns (email + IBAN). | ||
| body := []byte(`{"model":"llama2","messages":[{"role":"user","content":"Contact hans.mueller@acme.de about IBAN DE89370400440532013000"}]}`) | ||
|
|
||
| b.ResetTimer() | ||
| for i := 0; i < b.N; i++ { | ||
| req, err := http.NewRequestWithContext(context.Background(), http.MethodPost, | ||
| "http://test/v1/proxy/ollama/v1/chat/completions", bytes.NewReader(body)) | ||
| if err != nil { | ||
| b.Fatal(err) | ||
| } | ||
| req.Header.Set("Authorization", "Bearer talon-gw-bench") | ||
| req.Header.Set("Content-Type", "application/json") | ||
|
|
||
| w := httptest.NewRecorder() | ||
| router.ServeHTTP(w, req) | ||
| if w.Code != http.StatusOK { | ||
| b.Fatalf("status %d: %s", w.Code, w.Body.String()) | ||
| } | ||
| } | ||
| } | ||
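The measurement loop above follows a common pattern: build the request, serve it in-process through an `httptest.ResponseRecorder`, and fail fast on a non-200 status. A minimal standard-library sketch of that pattern, runnable outside `go test` via `testing.Benchmark`, might look like this. The handler, path, and payload are placeholders and not Talon code; `benchHandler` and `okHandler` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

// okHandler is a placeholder standing in for a real pipeline handler.
func okHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{"ok":true}`))
	})
}

// benchHandler times repeated in-process ServeHTTP calls against h.
// Setup happens before the loop; each iteration builds a fresh request
// and recorder, so their allocation cost is part of the measurement.
func benchHandler(h http.Handler, body []byte) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			req := httptest.NewRequest(http.MethodPost, "/v1/example", bytes.NewReader(body))
			w := httptest.NewRecorder()
			h.ServeHTTP(w, req)
			if w.Code != http.StatusOK {
				b.Fatalf("status %d", w.Code)
			}
		}
	})
}

func main() {
	res := benchHandler(okHandler(), []byte(`{"model":"example"}`))
	fmt.Printf("%d iterations, %d ns/op, %d allocs/op\n",
		res.N, res.NsPerOp(), res.AllocsPerOp())
}
```

Because the recorder bypasses the network stack on the inbound side, a loop like this measures handler cost rather than listener or TLS overhead; in the benchmark above, only the hop to the mock upstream goes over a local socket.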
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| #!/usr/bin/env bash | ||
| # | ||
| # run-benchmarks.sh — reproducible micro-benchmarks for Talon proof-bar metrics. | ||
| # | ||
| # Measures: | ||
| # - Gateway pipeline overhead (ServeHTTP + local mock upstream, no WAN RTT) | ||
| # - PII scan latency (classifier) | ||
| # - Evidence write throughput (signed SQLite record per op) | ||
| # | ||
| # Usage: | ||
| # scripts/run-benchmarks.sh # print markdown table to stdout | ||
| # scripts/run-benchmarks.sh -o FILE.md # also write table to FILE | ||
| # | ||
| # See docs/reference/benchmarks.md for methodology and how to interpret results. | ||
| # | ||
| set -euo pipefail | ||
|
|
||
| cd "$(dirname "$0")/.." | ||
|
|
||
| OUTPUT="" | ||
| BENCH_TIME="${BENCH_TIME:-2s}" | ||
| BENCH_COUNT="${BENCH_COUNT:-5}" | ||
| BENCH_PKGS="./internal/gateway/... ./internal/classifier/... ./internal/evidence/..." | ||
| BENCH_REGEX='Benchmark(GatewayPipelineOverhead|PIIScan|EvidenceStore)$' | ||
|
|
||
| while getopts "o:" opt; do | ||
| case "$opt" in | ||
| o) OUTPUT="$OPTARG" ;; | ||
| *) echo "Usage: $0 [-o outfile.md]" >&2; exit 2 ;; | ||
| esac | ||
| done | ||
|
|
||
| if [ "$(uname -s)" = "Darwin" ]; then | ||
| GO_ENV=(env -u CC CC=/usr/bin/clang CGO_ENABLED=1) | ||
| else | ||
| GO_ENV=(env CGO_ENABLED=1) | ||
| fi | ||
|
|
||
| bench_out=$("${GO_ENV[@]}" go test \ | ||
| -bench="$BENCH_REGEX" \ | ||
| -benchmem \ | ||
| -benchtime="$BENCH_TIME" \ | ||
| -count="$BENCH_COUNT" \ | ||
| -run='^$' \ | ||
| $BENCH_PKGS 2>&1) || { | ||
| echo "$bench_out" >&2 | ||
| exit 1 | ||
| } | ||
|
|
||
| # Parse last result line per benchmark name (go test -count repeats runs). | ||
| parse_ns_per_op() { | ||
| local name="$1" | ||
| printf '%s\n' "$bench_out" \ | ||
| | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" { last=$3 } END { print last+0 }' | ||
| } | ||
|
|
||
| parse_allocs() { | ||
| local name="$1" | ||
| printf '%s\n' "$bench_out" \ | ||
| | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" && $0 ~ /allocs\/op/ { last=$(NF-1)" "$NF } END { print last }' | ||
| } | ||
|
|
||
| ns_to_ms() { | ||
| awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.2f", ns/1e6 }' | ||
| } | ||
|
|
||
| ns_to_ops_per_sec() { | ||
| awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.0f", 1e9/ns }' | ||
| } | ||
|
|
||
| gw_ns=$(parse_ns_per_op GatewayPipelineOverhead) | ||
| pii_ns=$(parse_ns_per_op PIIScan) | ||
| ev_ns=$(parse_ns_per_op EvidenceStore) | ||
|
|
||
| gw_ms=$(ns_to_ms "$gw_ns") | ||
| pii_ms=$(ns_to_ms "$pii_ns") | ||
| ev_ms=$(ns_to_ms "$ev_ns") | ||
| ev_ops=$(ns_to_ops_per_sec "$ev_ns") | ||
|
|
||
| gw_allocs=$(parse_allocs GatewayPipelineOverhead) | ||
| pii_allocs=$(parse_allocs PIIScan) | ||
| ev_allocs=$(parse_allocs EvidenceStore) | ||
|
|
||
| commit=$(git rev-parse --short HEAD 2>/dev/null || echo "unknown") | ||
| generated=$(date -u +"%Y-%m-%dT%H:%M:%SZ") | ||
| go_ver=$("${GO_ENV[@]}" go version | sed 's/^go version //') | ||
| os_info=$(uname -srm 2>/dev/null || uname -a) | ||
| cpu_info=$(sysctl -n machdep.cpu.brand_string 2>/dev/null || grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | xargs || echo "unknown") | ||
|
|
||
| table=$(cat <<EOF | ||
| ## Benchmark results (generated) | ||
|
|
||
| | Metric | Benchmark | Last of ${BENCH_COUNT} runs | Allocs/op | | ||
| |--------|-----------|--------------------------------------|-----------| | ||
| | Gateway pipeline overhead | \`BenchmarkGatewayPipelineOverhead\` | **${gw_ms} ms**/req | ${gw_allocs:-n/a} | | ||
| | PII scan latency | \`BenchmarkPIIScan\` | **${pii_ms} ms**/scan | ${pii_allocs:-n/a} | | ||
| | Evidence write throughput | \`BenchmarkEvidenceStore\` | **${ev_ops} writes/s** (~${ev_ms} ms/write) | ${ev_allocs:-n/a} | | ||
|
|
||
| **Environment:** ${go_ver} · ${os_info} · ${cpu_info} | ||
| **Commit:** \`${commit}\` · **Generated:** ${generated} | ||
| **Settings:** \`-benchtime=${BENCH_TIME}\` \`-count=${BENCH_COUNT}\` \`-benchmem\` | ||
|
|
||
| Gateway overhead uses a local \`httptest\` upstream (no WAN RTT). Compare to the README | ||
| "< 15 ms excluding upstream" claim and the step budget in | ||
| [What Talon does to your request](../explanation/what-talon-does-to-your-request.md). | ||
|
|
||
| Retry/fallback decision overhead is **not** included until Epic #113 (#138/#139) lands. | ||
| EOF | ||
| ) | ||
|
|
||
| echo "$table" | ||
|
|
||
| if [ -n "$OUTPUT" ]; then | ||
| mkdir -p "$(dirname "$OUTPUT")" | ||
| { | ||
| echo "# Talon benchmark snapshot" | ||
| echo "" | ||
| echo "$table" | ||
| echo "" | ||
| echo "Reproduce: \`make benchmarks\` or \`scripts/run-benchmarks.sh\`." | ||
| } >"$OUTPUT" | ||
| echo "Wrote $OUTPUT" >&2 | ||
| fi | ||
|
|
||
| # Raw bench output for auditors who want the full go test lines. | ||
| echo "" >&2 | ||
| echo "--- raw go test -bench output ---" >&2 | ||
| printf '%s\n' "$bench_out" | grep -E '^Benchmark|^[0-9]+ ' >&2 |
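The script above keeps the last `ns/op` line per benchmark, while the methodology document mentions median-of-runs as a stability check. A minimal sketch of computing that median from raw `go test -bench` output is shown below. It is an illustration only: `medianNsPerOp` is a hypothetical helper, the sample lines are made up, and it assumes the standard result layout (name, iteration count, value, `ns/op`).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Made-up output for three repetitions of one benchmark.
const sampleOutput = `BenchmarkPIIScan-8   50000   41000 ns/op   1024 B/op   12 allocs/op
BenchmarkPIIScan-8   50000   39000 ns/op   1024 B/op   12 allocs/op
BenchmarkPIIScan-8   50000   60000 ns/op   1024 B/op   12 allocs/op
`

// medianNsPerOp returns the median ns/op across all result lines whose
// benchmark name (minus the "-<GOMAXPROCS>" suffix) equals name exactly,
// which also skips sub-benchmarks such as "BenchmarkX/case-8".
func medianNsPerOp(output, name string) (float64, bool) {
	var samples []float64
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 4 || f[3] != "ns/op" {
			continue // not a benchmark result line
		}
		base := f[0]
		if i := strings.LastIndex(base, "-"); i >= 0 {
			base = base[:i]
		}
		if base != name {
			continue
		}
		v, err := strconv.ParseFloat(f[2], 64)
		if err != nil {
			continue
		}
		samples = append(samples, v)
	}
	if len(samples) == 0 {
		return 0, false
	}
	sort.Float64s(samples)
	mid := len(samples) / 2
	if len(samples)%2 == 1 {
		return samples[mid], true
	}
	return (samples[mid-1] + samples[mid]) / 2, true
}

func main() {
	if m, ok := medianNsPerOp(sampleOutput, "BenchmarkPIIScan"); ok {
		fmt.Printf("median ns/op: %.0f\n", m)
	}
}
```

In the sample, the last run (60000 ns/op) is the slowest, so reporting the last line would overstate the typical cost relative to the median (41000 ns/op).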
Gateway bench cost query drift
Medium Severity
`BenchmarkGatewayPipelineOverhead` times repeated `ServeHTTP` calls against one SQLite evidence store that grows every iteration. Each request runs `callerCostTotals`, which scans accumulating rows via `CostByAgent`, so measured `ns/op` rises during the run and overstates steady per-request overhead versus a fixed-size store.
Reviewed by Cursor Bugbot for commit fa7270f.
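The drift described in this comment can be illustrated with a toy model. The sketch below is not Talon code and does not touch SQLite: it assumes, purely for illustration, that each request's aggregate query scans every row currently in the store and that each request then inserts one row. Whether the real `CostByAgent` query behaves this way depends on its indexes and filters. `rowsScanned` and its `resetEvery` parameter are hypothetical.

```go
package main

import "fmt"

// rowsScanned counts the total rows scanned across a benchmark-style loop in
// which every request first scans all existing rows and then inserts one.
// If resetEvery > 0, the store is emptied every resetEvery iterations, a
// stand-in for periodically swapping in a fresh store.
func rowsScanned(iterations, resetEvery int) int {
	rows, scanned := 0, 0
	for i := 0; i < iterations; i++ {
		if resetEvery > 0 && i > 0 && i%resetEvery == 0 {
			rows = 0
		}
		scanned += rows // modeled cost of the per-request aggregate query
		rows++          // evidence record written by this request
	}
	return scanned
}

func main() {
	const n = 1000
	growing := rowsScanned(n, 0)
	bounded := rowsScanned(n, 100)
	fmt.Printf("growing store: %d rows scanned, %.1f per request\n",
		growing, float64(growing)/n)
	fmt.Printf("store reset every 100: %d rows scanned, %.1f per request\n",
		bounded, float64(bounded)/n)
}
```

Under this model, the average per-request scan cost grows linearly with the iteration count, so a longer `-benchtime` yields a higher `ns/op` for the same code. Bounding the store size keeps the figure comparable across runs.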