Merged
2 changes: 1 addition & 1 deletion LIMITATIONS.md
@@ -53,4 +53,4 @@ A valid signature proves that this evidence record was signed with the deploymen
- [Evidence store](docs/explanation/evidence-store.md) — how records are created, signed, and verified
- [Evidence integrity specification](docs/reference/evidence-integrity-spec.md) — byte-exact fields, serialization, signing, and independent verification
- [Threat model](docs/reference/threat-model.md) — attack surface, trust boundaries, and key-management assumptions
-- Reproducible benchmarks are forthcoming.
+- [Reproducible benchmarks](docs/reference/benchmarks.md) — run `make benchmarks` on your hardware; retry/fallback overhead not included until Epic #113 lands.
5 changes: 4 additions & 1 deletion Makefile
@@ -16,7 +16,7 @@ ifeq ($(UNAME_S),Darwin)
GO_ENV := env -u CC CC=/usr/bin/clang CGO_ENABLED=1
endif

-.PHONY: help build install test test-integration test-e2e test-smoke test-all test-ssot-gate conformance lint fmt clean vet mod-tidy check docker-build demo-gateway demo-full demo-clean verify-flow0 nosec-count
+.PHONY: help build install test test-integration test-e2e test-smoke test-all test-ssot-gate conformance benchmarks lint fmt clean vet mod-tidy check docker-build demo-gateway demo-full demo-clean verify-flow0 nosec-count

# Conformance suite: the evidence + policy paths whose passing test/subtest
# count is published as Talon's honest conformance number. See
@@ -63,6 +63,9 @@ conformance: ## Run the evidence + policy conformance suite and print the passin
if [ $$rc -ne 0 ]; then printf '%s\n' "$$out" | tail -20; echo "conformance: FAILED ($$count passing before failure)"; exit 1; fi; \
echo "Conformance: $$count passing tests across evidence + policy paths ($(CONFORMANCE_PKGS))"

+benchmarks: ## Run reproducible micro-benchmarks (gateway overhead, PII scan, evidence write)
+	@bash scripts/run-benchmarks.sh
+
lint: ## Run linter
@golangci-lint run ./...

2 changes: 1 addition & 1 deletion README.md
@@ -127,7 +127,7 @@ sequenceDiagram
Talon-->>Client: GovernedResponse
```

-Pipeline overhead is typically under 15ms excluding upstream latency. Full byte-level breakdown: [What Talon does to your request](docs/explanation/what-talon-does-to-your-request.md).
+Pipeline overhead is typically under 15ms excluding upstream latency. Reproduce on your machine: `make benchmarks`. Full byte-level breakdown: [What Talon does to your request](docs/explanation/what-talon-does-to-your-request.md) · [Benchmarks](docs/reference/benchmarks.md).

---

2 changes: 2 additions & 0 deletions docs/README.md
@@ -70,6 +70,7 @@ Choose the shortest path for your situation:
| [Evidence integrity specification](reference/evidence-integrity-spec.md) | Normative signed-record spec: fields, canonical serialization, HMAC-SHA256 signing, and the independent verification procedure. |
| [Threat model](reference/threat-model.md) | STRIDE-style attack surface, trust boundaries, threats/mitigations, and key-management assumptions for the gateway path. |
| [Conformance suite & count](reference/conformance.md) | What counts as a conformance test for the evidence + policy paths, and how to reproduce the published count with `make conformance`. |
+| [Reproducible benchmarks](reference/benchmarks.md) | Gateway pipeline overhead, PII scan latency, and evidence write throughput (`make benchmarks`). |
| [Authentication and key scopes](reference/authentication-and-key-scopes.md) | Which keys authenticate which endpoint families (gateway vs control plane vs dashboard). |
| [Gateway dashboard](reference/gateway-dashboard.md) | Dashboard endpoints, metrics API schema, snapshot fields, and authentication. |
| [Operational control plane](reference/operational-control-plane.md) | Run management (list/kill/pause/resume), tenant lockdown, runtime overrides, tool approval gates. |
@@ -98,6 +99,7 @@ Choose the shortest path for your situation:
| [Evidence store](explanation/evidence-store.md) | HMAC integrity model and verification flow. |
| [Evidence integrity specification](reference/evidence-integrity-spec.md) | Byte-exact spec so a third party can independently verify a record. |
| [Conformance suite & count](reference/conformance.md) | Reproducible passing-test count for the evidence + policy paths (`make conformance`). |
+| [Reproducible benchmarks](reference/benchmarks.md) | `make benchmarks` — gateway overhead, PII scan, evidence write on your hardware. |
| [Evidence integrity 5-minute proof](tutorials/evidence-integrity-demo.md) | Fast proof moment for auditors/operators, including offline signed-export verification. |
| [Threat model](reference/threat-model.md) | Attack surface, trust boundaries, and what the HMAC signature does and does not prove. |
| [Security policy](../SECURITY.md) | Vulnerability reporting process and security scope. |
8 changes: 7 additions & 1 deletion docs/explanation/what-talon-does-to-your-request.md
@@ -243,7 +243,13 @@ caller's daily/monthly accumulator (in-memory counter, periodically flushed).

## Throughput And Benchmarking

-Use this quick benchmark harness to measure your own environment. Throughput depends on message size, PII pattern density, and upstream provider latency.
+**Micro-benchmarks (reproducible from a clean checkout):** run `make benchmarks` or see
+[Reproducible benchmarks](../reference/benchmarks.md) for gateway pipeline overhead,
+PII scan latency, and evidence write throughput on your hardware.
+
+**End-to-end load (optional):** use this harness when you need concurrent requests
+through a running gateway. Throughput depends on message size, PII pattern density,
+and upstream provider latency.

```bash
# 1) Start local proof environment
66 changes: 66 additions & 0 deletions docs/reference/benchmarks.md
@@ -0,0 +1,66 @@
# Reproducible Benchmarks

**Status:** stable · **Scope:** gateway pipeline overhead, PII scan latency, evidence write throughput.

The README states that pipeline overhead is typically **under 15 ms excluding upstream
latency**. This document defines how to reproduce the micro-benchmarks behind that claim,
what each number measures, and what is intentionally out of scope.

The authoritative numbers for a given machine are whatever `make benchmarks` prints
when you run it locally. Results vary with CPU, Go version, SQLite build, and load;
do not treat a single snapshot as an SLA.

## Quick start

```bash
make benchmarks
```

Or with a saved snapshot file:

```bash
scripts/run-benchmarks.sh -o /tmp/talon-benchmarks.md
```

Requirements: Go 1.22+ (project pins 1.25.x in CI), CGO enabled (SQLite), repo root checkout.

## What we measure

| Metric | Go benchmark | Package | What it includes |
|--------|--------------|---------|------------------|
| **Gateway pipeline overhead** | `BenchmarkGatewayPipelineOverhead` | `internal/gateway` | One non-streaming `ServeHTTP` round trip: route, caller auth, request extract, PII scan, OPA policy evaluation, forward to a **local** `httptest` mock upstream, response PII scan, signed evidence write, metrics. Representative payload includes EU email + IBAN patterns. |
| **PII scan latency** | `BenchmarkPIIScan` | `internal/classifier` | One `Scanner.Scan` on fixed text (email, IBAN, card). Isolates classifier cost without HTTP or SQLite. |
| **Evidence write throughput** | `BenchmarkEvidenceStore` | `internal/evidence` | One `Generator.Generate` (HMAC-signed SQLite insert) per iteration. Isolates evidence path without gateway HTTP. |

### What is excluded

- **WAN upstream RTT** — the gateway benchmark uses an in-process mock server; add your provider latency separately.
- **Retry / fallback routing** — not benchmarked until Epic #113 ([#138](https://github.com/dativo-io/talon/issues/138) / [#139](https://github.com/dativo-io/talon/issues/139)) lands.
- **Streaming responses** — benchmarks use non-streaming JSON completions only.
- **Attachment extraction / injection scan** — not in the default payload; add fixtures if you need that dimension.

## Method

1. **Toolchain:** `go test -bench=… -benchmem -benchtime=2s -count=5 -run=^$` over `./internal/gateway/...`, `./internal/classifier/...`, and `./internal/evidence/...`.
2. **Repetition:** `-count=5` runs each benchmark five times; the script reports the **last** `ns/op` line per benchmark, not a median. For a stability check, compute the median across the runs and inspect the raw output on stderr for spread.
3. **Hardware:** `scripts/run-benchmarks.sh` records `go version`, `uname`, and CPU model in the emitted table. Paste that block when publishing numbers externally.
4. **Comparison to the 15 ms budget:** See the step table in [What Talon does to your request](../explanation/what-talon-does-to-your-request.md). Gateway overhead should be **below 15 ms** on a modern laptop/desktop when upstream is local; production adds network, disk contention, and concurrent load.
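
The script reports the last run, but a median is less sensitive to a single noisy run. The following is a minimal sketch of a median over the repeated `ns/op` lines, assuming the standard `go test -bench` output layout in which the third field is the `ns/op` value. `median_ns_per_op` is a hypothetical helper that is not part of `scripts/run-benchmarks.sh`, and the sample lines are made up for illustration rather than measured.

```bash
#!/usr/bin/env bash
# Sketch: median ns/op across repeated `go test -bench -count=N` result lines.
# median_ns_per_op is a hypothetical helper, not part of scripts/run-benchmarks.sh.
set -euo pipefail

median_ns_per_op() {
  # $1 = benchmark name without the "Benchmark" prefix; bench output on stdin.
  local name="$1"
  awk -v n="$name" '
    # Match e.g. "BenchmarkPIIScan" or "BenchmarkPIIScan-8"; field 3 is ns/op.
    $1 ~ ("^Benchmark" n "(-[0-9]+)?$") { vals[++c] = $3 + 0 }
    END {
      if (c == 0) { print "n/a"; exit }
      # Insertion sort keeps this portable across awk implementations.
      for (i = 2; i <= c; i++) {
        v = vals[i]
        for (j = i - 1; j >= 1 && vals[j] > v; j--) vals[j + 1] = vals[j]
        vals[j + 1] = v
      }
      if (c % 2) printf "%.0f\n", vals[(c + 1) / 2]
      else       printf "%.0f\n", (vals[c / 2] + vals[c / 2 + 1]) / 2
    }'
}

# Made-up sample lines in go test layout (not measured results).
sample='BenchmarkPIIScan-8 100000 21000 ns/op 512 B/op 9 allocs/op
BenchmarkPIIScan-8 100000 19000 ns/op 512 B/op 9 allocs/op
BenchmarkPIIScan-8 100000 25000 ns/op 512 B/op 9 allocs/op'

printf '%s\n' "$sample" | median_ns_per_op PIIScan   # prints 21000
```

To apply it to a real run, pipe the raw benchmark lines that the script writes to stderr into the helper. For a more rigorous comparison between runs, a dedicated statistics tool may be a better fit than this sketch.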

## Interpreting results

- **Gateway ms/req** — wall-clock per governed request with mock upstream. If this is consistently above 15 ms on your hardware, profile before citing the README claim in customer-facing material.
- **PII ms/scan** — scales with input length and pattern density; the fixed benchmark string is a regression anchor, not a worst case.
- **Evidence writes/s** — inverse of `ns/op` for `BenchmarkEvidenceStore`; useful for capacity planning on evidence-heavy workloads.
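
The arithmetic behind these readings can be sketched as follows. This is an illustration only: `within_budget` and `writes_per_sec` are hypothetical helper names, and the inputs are made-up `ns/op` values rather than measured results. The conversions are 1 ms = 1e6 ns and writes/s = 1e9 / (ns/op).

```bash
#!/usr/bin/env bash
# Sketch: turn ns/op into the two headline readings (budget check, writes/s).
# Helper names are hypothetical; inputs below are illustrative, not measured.
set -euo pipefail

# within_budget NS_PER_OP BUDGET_MS: exit 0 if ns/op is positive and under the budget.
within_budget() {
  awk -v ns="$1" -v budget_ms="$2" 'BEGIN { exit !(ns > 0 && ns / 1e6 < budget_ms) }'
}

# writes_per_sec NS_PER_OP: inverse of ns/op, rounded to whole writes per second.
writes_per_sec() {
  awk -v ns="$1" 'BEGIN { if (ns + 0 <= 0) { print "n/a"; exit } printf "%.0f\n", 1e9 / ns }'
}

# 4,200,000 ns/op is 4.20 ms, which is under a 15 ms budget.
if within_budget 4200000 15; then
  echo "gateway: within 15 ms budget"
else
  echo "gateway: over budget, profile first"
fi

# 250,000 ns/op corresponds to 1e9 / 250000 = 4000 writes/s.
writes_per_sec 250000
```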

## Source locations

- Gateway: [`internal/gateway/bench_test.go`](../../internal/gateway/bench_test.go)
- PII: [`internal/classifier/pii_test.go`](../../internal/classifier/pii_test.go) (`BenchmarkPIIScan`)
- Evidence: [`internal/evidence/store_test.go`](../../internal/evidence/store_test.go) (`BenchmarkEvidenceStore`)
- Runner: [`scripts/run-benchmarks.sh`](../../scripts/run-benchmarks.sh)

## Related proof-bar docs

- [Conformance suite & count](conformance.md) — reproducible test count for evidence + policy paths
- [Evidence integrity specification](evidence-integrity-spec.md) — signed record format
- [Threat model](threat-model.md) — trust boundaries the benchmarks do not replace
108 changes: 108 additions & 0 deletions internal/gateway/bench_test.go
@@ -0,0 +1,108 @@
package gateway

import (
	"bytes"
	"context"
	"net/http"
	"net/http/httptest"
	"path/filepath"
	"testing"

	"github.com/dativo-io/talon/internal/classifier"
	"github.com/dativo-io/talon/internal/evidence"
	"github.com/dativo-io/talon/internal/policy"
	"github.com/dativo-io/talon/internal/secrets"
	"github.com/dativo-io/talon/internal/testutil"
	"github.com/go-chi/chi/v5"
)

// BenchmarkGatewayPipelineOverhead measures end-to-end gateway wall time for one
// non-streaming chat completion through ServeHTTP, with a local mock upstream (no
// WAN RTT). This approximates Talon pipeline overhead for a typical payload:
// route, caller auth, PII scan, policy evaluation, forward, response PII scan,
// evidence write, and metrics.
func BenchmarkGatewayPipelineOverhead(b *testing.B) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{"choices":[{"message":{"content":"ok"}}],"usage":{"prompt_tokens":10,"completion_tokens":5}}`))
	}))
	defer upstream.Close()

	dir := b.TempDir()
	cfg := &GatewayConfig{
		Enabled:      true,
		ListenPrefix: "/v1/proxy",
		Mode:         ModeEnforce,
		Providers: map[string]ProviderConfig{
			"ollama": {Enabled: true, BaseURL: upstream.URL},
		},
		Callers: []CallerConfig{
			{
				Name: "bench-caller", TenantKey: "talon-gw-bench", TenantID: "default",
				PolicyOverrides: &CallerPolicyOverrides{
					AllowedModels: []string{"llama2"},
					MaxDailyCost:  1000,
				},
			},
		},
		ServerDefaults: ServerDefaults{DefaultPIIAction: "warn"},
		RateLimits: RateLimitsConfig{
			GlobalRequestsPerMin:    1_000_000,
			PerCallerRequestsPerMin: 1_000_000,
		},
		Timeouts: TimeoutsConfig{
			ConnectTimeout:    "5s",
			RequestTimeout:    "30s",
			StreamIdleTimeout: "60s",
		},
	}

	evStore, err := evidence.NewStore(filepath.Join(dir, "e.db"), testutil.TestSigningKey)
	if err != nil {
		b.Fatal(err)
	}
	defer evStore.Close()

	secStore, err := secrets.NewSecretStore(filepath.Join(dir, "s.db"), testutil.TestEncryptionKey)
	if err != nil {
		b.Fatal(err)
	}
	defer secStore.Close()

	cls := classifier.MustNewScanner()
	policyEngine, err := policy.NewGatewayEngine(context.Background())
	if err != nil {
		b.Fatal(err)
	}

	gw, err := NewGateway(cfg, cls, evStore, secStore, policyEngine, nil)
	if err != nil {
		b.Fatal(err)
	}

	router := chi.NewRouter()
	router.Route("/v1/proxy", func(r chi.Router) {
		r.Handle("/*", gw)
	})

	// Representative user text with EU PII patterns (email + IBAN).
	body := []byte(`{"model":"llama2","messages":[{"role":"user","content":"Contact hans.mueller@acme.de about IBAN DE89370400440532013000"}]}`)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		req, err := http.NewRequestWithContext(context.Background(), http.MethodPost,
			"http://test/v1/proxy/ollama/v1/chat/completions", bytes.NewReader(body))
		if err != nil {
			b.Fatal(err)
		}
		req.Header.Set("Authorization", "Bearer talon-gw-bench")
		req.Header.Set("Content-Type", "application/json")

		w := httptest.NewRecorder()
		router.ServeHTTP(w, req)
		if w.Code != http.StatusOK {
			b.Fatalf("status %d: %s", w.Code, w.Body.String())
		}
	}

Gateway bench cost query drift

Medium Severity

BenchmarkGatewayPipelineOverhead times repeated ServeHTTP calls against one SQLite evidence store that grows every iteration. Each request runs callerCostTotals, which scans accumulating rows via CostByAgent, so measured ns/op rises during the run and overstates steady per-request overhead versus a fixed-size store.
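
One way to check whether this drift is material is to run the benchmark at two fixed iteration counts and compare `ns/op`. The sketch below assumes a Talon checkout and uses Go's standard `-benchtime=Nx` form to pin the iteration count. `ns_at_iterations` and `drift_ratio` are hypothetical helper names, `ns_at_iterations` is defined but not invoked here, and the numbers passed to `drift_ratio` are illustrative rather than measured.

```bash
#!/usr/bin/env bash
# Sketch: quantify ns/op drift between a short and a long benchmark run.
# Helper names are hypothetical; the demo values are illustrative, not measured.
set -euo pipefail

# ns_at_iterations N: run the gateway benchmark for exactly N iterations and
# print ns/op. Requires a Talon checkout (and CGO for SQLite); not invoked below.
ns_at_iterations() {
  go test -run='^$' -bench='BenchmarkGatewayPipelineOverhead$' -benchtime="${1}x" \
    ./internal/gateway/... \
    | awk '$1 ~ /^BenchmarkGatewayPipelineOverhead/ { print $3 + 0 }'
}

# drift_ratio SMALL_NS LARGE_NS: ns/op at the high iteration count divided by
# ns/op at the low iteration count. A ratio near 1.00 suggests little drift.
drift_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { if (a + 0 <= 0) { print "n/a"; exit } printf "%.2f\n", b / a }'
}

# Illustrative values only: 2.0 ms/op at a low count vs 3.0 ms/op at a high count.
drift_ratio 2000000 3000000   # prints 1.50
```

In a checkout, something like `drift_ratio "$(ns_at_iterations 200)" "$(ns_at_iterations 5000)"` would give a rough reading; the iteration counts are arbitrary choices for illustration.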

Reviewed by Cursor Bugbot for commit fa7270f.

}
128 changes: 128 additions & 0 deletions scripts/run-benchmarks.sh
@@ -0,0 +1,128 @@
#!/usr/bin/env bash
#
# run-benchmarks.sh — reproducible micro-benchmarks for Talon proof-bar metrics.
#
# Measures:
# - Gateway pipeline overhead (ServeHTTP + local mock upstream, no WAN RTT)
# - PII scan latency (classifier)
# - Evidence write throughput (signed SQLite record per op)
#
# Usage:
# scripts/run-benchmarks.sh # print markdown table to stdout
# scripts/run-benchmarks.sh -o FILE.md # also write table to FILE
#
# See docs/reference/benchmarks.md for methodology and how to interpret results.
#
set -euo pipefail

cd "$(dirname "$0")/.."

OUTPUT=""
BENCH_TIME="${BENCH_TIME:-2s}"
BENCH_COUNT="${BENCH_COUNT:-5}"
BENCH_PKGS="./internal/gateway/... ./internal/classifier/... ./internal/evidence/..."
BENCH_REGEX='Benchmark(GatewayPipelineOverhead|PIIScan|EvidenceStore)$'

while getopts "o:" opt; do
  case "$opt" in
    o) OUTPUT="$OPTARG" ;;
    *) echo "Usage: $0 [-o outfile.md]" >&2; exit 2 ;;
  esac
done

if [ "$(uname -s)" = "Darwin" ]; then
  GO_ENV=(env -u CC CC=/usr/bin/clang CGO_ENABLED=1)
else
  GO_ENV=(env CGO_ENABLED=1)
fi

bench_out=$("${GO_ENV[@]}" go test \
  -bench="$BENCH_REGEX" \
  -benchmem \
  -benchtime="$BENCH_TIME" \
  -count="$BENCH_COUNT" \
  -run='^$' \
  $BENCH_PKGS 2>&1) || {
  echo "$bench_out" >&2
  exit 1
}

# Parse last result line per benchmark name (go test -count repeats runs).
parse_ns_per_op() {
  local name="$1"
  printf '%s\n' "$bench_out" \
    | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" { last=$3 } END { print last+0 }'
}

parse_allocs() {
  local name="$1"
  printf '%s\n' "$bench_out" \
    | awk -v n="$name" '$1 ~ "^Benchmark" n && $1 !~ "/" && $0 ~ /allocs\/op/ { last=$(NF-1)" "$NF } END { print last }'
}

ns_to_ms() {
  awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.2f", ns/1e6 }'
}

ns_to_ops_per_sec() {
  awk -v ns="$1" 'BEGIN { if (ns+0 <= 0) { print "n/a"; exit } printf "%.0f", 1e9/ns }'
}

gw_ns=$(parse_ns_per_op GatewayPipelineOverhead)
pii_ns=$(parse_ns_per_op PIIScan)
ev_ns=$(parse_ns_per_op EvidenceStore)

gw_ms=$(ns_to_ms "$gw_ns")
pii_ms=$(ns_to_ms "$pii_ns")
ev_ms=$(ns_to_ms "$ev_ns")
ev_ops=$(ns_to_ops_per_sec "$ev_ns")

gw_allocs=$(parse_allocs GatewayPipelineOverhead)
pii_allocs=$(parse_allocs PIIScan)
ev_allocs=$(parse_allocs EvidenceStore)

commit=$(git rev-parse --short HEAD 2>/dev/null || echo "unknown")
generated=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
go_ver=$("${GO_ENV[@]}" go version | sed 's/^go version //')
os_info=$(uname -srm 2>/dev/null || uname -a)
cpu_info=$(sysctl -n machdep.cpu.brand_string 2>/dev/null || grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | xargs || echo "unknown")

table=$(cat <<EOF
## Benchmark results (generated)

| Metric | Benchmark | Result (last of ${BENCH_COUNT} runs) | Allocs/op |
|--------|-----------|--------------------------------------|-----------|
| Gateway pipeline overhead | \`BenchmarkGatewayPipelineOverhead\` | **${gw_ms} ms**/req | ${gw_allocs:-n/a} |
| PII scan latency | \`BenchmarkPIIScan\` | **${pii_ms} ms**/scan | ${pii_allocs:-n/a} |
| Evidence write throughput | \`BenchmarkEvidenceStore\` | **${ev_ops} writes/s** (~${ev_ms} ms/write) | ${ev_allocs:-n/a} |

**Environment:** ${go_ver} · ${os_info} · ${cpu_info}
**Commit:** \`${commit}\` · **Generated:** ${generated}
**Settings:** \`-benchtime=${BENCH_TIME}\` \`-count=${BENCH_COUNT}\` \`-benchmem\`

Gateway overhead uses a local \`httptest\` upstream (no WAN RTT). Compare to the README
"< 15 ms excluding upstream" claim and the step budget in
[What Talon does to your request](../explanation/what-talon-does-to-your-request.md).

Retry/fallback decision overhead is **not** included until Epic #113 (#138/#139) lands.
EOF
)

echo "$table"

if [ -n "$OUTPUT" ]; then
  mkdir -p "$(dirname "$OUTPUT")"
  {
    echo "# Talon benchmark snapshot"
    echo ""
    echo "$table"
    echo ""
    echo "Reproduce: \`make benchmarks\` or \`scripts/run-benchmarks.sh\`."
  } >"$OUTPUT"
  echo "Wrote $OUTPUT" >&2
fi

# Raw bench output for auditors who want the full go test lines.
echo "" >&2
echo "--- raw go test -bench output ---" >&2
printf '%s\n' "$bench_out" | grep -E '^Benchmark|^[0-9]+ ' >&2