soapbucket · rickcrawford · Jun 18, 2026
diff --git a/docs/ai-gateway.md b/docs/ai-gateway.md
@@ -1,6 +1,6 @@
 # SBproxy AI gateway guide
 
-*Last modified: 2026-06-17*
+*Last modified: 2026-06-18*
 
 SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including native Anthropic, Gemini, and Bedrock translators. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them.
 
@@ -577,25 +577,40 @@ Stores responses keyed by the SHA-256 of the messages array with TTL and capacit
 
 | Field | Type | Default | Notes |
 |-------|------|---------|-------|
-| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. |
-| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. |
+| `enabled` | bool | `false` | Opts an origin into semantic-cache lookup and storage. |
+| `threshold` | float | `0.85` | Minimum cosine similarity for a near-duplicate prompt to hit. |
+| `ttl_secs` | u64 | `3600` | Seconds before an entry is treated as a miss and removed. |
+| `max_entries` | usize | `1024` | Hard cap on cached responses. The oldest insert is evicted on overflow. |
+| `source` | string | `provider` | `provider`, `sidecar`, or `inprocess`. |
+| `embedding` | object | unset | Provider and model used when `source: provider`. |
+| `sidecar` | object | unset | gRPC endpoint, model, and timeout used when `source: sidecar`. |
+| `inprocess` | object | unset | ONNX model path, tokenizer path, and memory guard used when `source: inprocess`. |
 
-The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example:
+The semantic cache is configured on each AI origin under `action.semantic_cache`. The default `source: provider` calls the configured embedding provider's `/v1/embeddings` endpoint:
 
 ```yaml
 origins:
   ai.example.com:
     action:
       type: ai_proxy
-      providers: [...]
-    extensions:
+      providers:
+        - name: openai
+          api_key: ${OPENAI_API_KEY}
+          models: [gpt-4o, text-embedding-3-small]
+      routing:
+        strategy: round_robin
       semantic_cache:
         enabled: true
-        ttl_secs: 1200
-        key_template: "{embedding_model}:{lsh_bucket}"
+        threshold: 0.85
+        ttl_secs: 3600
+        max_entries: 1024
+        source: provider
+        embedding:
+          provider: openai
+          model: text-embedding-3-small
 ```
 
-The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it.
+For local embeddings with no provider egress, set `source: sidecar` and run the classifier sidecar with an embedding model. For single-process experiments, `source: inprocess` loads the ONNX model into the proxy process and should be paired with `max_model_bytes`. See [local-inference.md](local-inference.md) and [examples/semantic-cache-local](../examples/semantic-cache-local/sb.yml).
 
 ### Idempotency middleware (RFC 8594)
 

diff --git a/docs/llms-full.txt b/docs/llms-full.txt
@@ -7,7 +7,7 @@ Pairs with `/llms.txt` (the small AI-discoverable feature catalog at `docs/llms.
 Regenerated by `scripts/regen-llms-full.sh`. Generated; do not hand-edit.
 
 Source: https://github.com/soapbucket/sbproxy
-Generated: 2026-06-18T13:10:18Z
+Generated: 2026-06-18T13:46:06Z
 
 ---
 
@@ -3653,7 +3653,7 @@ Exemplars on `sbproxy_ledger_redeem_duration_seconds_bucket` let Grafana jump fr
 
 ## SBproxy AI gateway guide
 
-*Last modified: 2026-06-17*
+*Last modified: 2026-06-18*
 
 SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including native Anthropic, Gemini, and Bedrock translators. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them.
 
@@ -4230,25 +4230,40 @@ Stores responses keyed by the SHA-256 of the messages array with TTL and capacit
 
 | Field | Type | Default | Notes |
 |-------|------|---------|-------|
-| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. |
-| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. |
+| `enabled` | bool | `false` | Opts an origin into semantic-cache lookup and storage. |
+| `threshold` | float | `0.85` | Minimum cosine similarity for a near-duplicate prompt to hit. |
+| `ttl_secs` | u64 | `3600` | Seconds before an entry is treated as a miss and removed. |
+| `max_entries` | usize | `1024` | Hard cap on cached responses. The oldest insert is evicted on overflow. |
+| `source` | string | `provider` | `provider`, `sidecar`, or `inprocess`. |
+| `embedding` | object | unset | Provider and model used when `source: provider`. |
+| `sidecar` | object | unset | gRPC endpoint, model, and timeout used when `source: sidecar`. |
+| `inprocess` | object | unset | ONNX model path, tokenizer path, and memory guard used when `source: inprocess`. |
 
-The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example:
+The semantic cache is configured on each AI origin under `action.semantic_cache`. The default `source: provider` calls the configured embedding provider's `/v1/embeddings` endpoint:
 
 ```yaml
 origins:
   ai.example.com:
     action:
       type: ai_proxy
-      providers: [...]
-    extensions:
+      providers:
+        - name: openai
+          api_key: ${OPENAI_API_KEY}
+          models: [gpt-4o, text-embedding-3-small]
+      routing:
+        strategy: round_robin
       semantic_cache:
         enabled: true
-        ttl_secs: 1200
-        key_template: "{embedding_model}:{lsh_bucket}"
+        threshold: 0.85
+        ttl_secs: 3600
+        max_entries: 1024
+        source: provider
+        embedding:
+          provider: openai
+          model: text-embedding-3-small
 ```
 
-The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it.
+For local embeddings with no provider egress, set `source: sidecar` and run the classifier sidecar with an embedding model. For single-process experiments, `source: inprocess` loads the ONNX model into the proxy process and should be paired with `max_model_bytes`. See [local-inference.md](local-inference.md) and [examples/semantic-cache-local](../examples/semantic-cache-local/sb.yml).
 
 ### Idempotency middleware (RFC 8594)
 

diff --git a/e2e/cases/semantic-cache-sidecar/sb.yml b/e2e/cases/semantic-cache-sidecar/sb.yml
@@ -0,0 +1,32 @@
+# yaml-language-server: $schema=../../../schemas/sb-config.schema.json
+#
+# WOR-1226: local semantic-cache sidecar e2e fixture.
+# The test replaces __UPSTREAM__ and __SIDECAR__ with ephemeral local
+# endpoints before starting the proxy.
+
+proxy:
+  http_bind_port: 0
+
+origins:
+  "ai.localhost":
+    action:
+      type: ai_proxy
+      providers:
+        - name: openai
+          api_key: "stub-key"
+          base_url: "__UPSTREAM__"
+          allow_private_base_url: true
+          models:
+            - gpt-4o
+      routing:
+        strategy: round_robin
+      semantic_cache:
+        enabled: true
+        threshold: 0.6
+        ttl_secs: 60
+        max_entries: 64
+        source: sidecar
+        sidecar:
+          endpoint: "__SIDECAR__"
+          model: all-MiniLM-L6-v2
+          timeout_ms: 2000
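
The `threshold`, `ttl_secs`, and `max_entries` fields documented in this change can be illustrated with a small sketch. This is not the proxy's implementation, which the diff does not show. `SemanticCacheSketch`, its linear scan, the injectable `clock`, and the choice to return the highest-scoring entry are all hypothetical, and the embeddings are hand-written vectors rather than output from a provider, sidecar, or ONNX model. The sketch only mirrors the documented behavior: a lookup hits when cosine similarity meets the threshold, expired entries count as misses and are removed, and the oldest insert is evicted on overflow.

```python
import math
import time
from collections import OrderedDict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCacheSketch:
    """Hypothetical illustration of the documented fields, not SBproxy code."""

    def __init__(self, threshold=0.85, ttl_secs=3600, max_entries=1024,
                 clock=time.monotonic):
        self.threshold = threshold
        self.ttl_secs = ttl_secs
        self.max_entries = max_entries
        self.clock = clock  # injectable so TTL behavior can be tested
        # Insertion-ordered map: id -> (embedding, response, inserted_at)
        self._entries = OrderedDict()
        self._next_id = 0

    def put(self, embedding, response):
        if self.max_entries <= 0:
            return  # a zero-capacity cache stores nothing
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest insert
        self._entries[self._next_id] = (list(embedding), response, self.clock())
        self._next_id += 1

    def get(self, embedding):
        now = self.clock()
        best_score, best_response = -1.0, None
        for key in list(self._entries):
            stored, response, inserted_at = self._entries[key]
            if now - inserted_at >= self.ttl_secs:
                del self._entries[key]  # expired: treat as a miss and remove
                continue
            score = cosine_similarity(embedding, stored)
            # Assumption: among entries at or above the threshold, prefer the closest.
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, response
        return best_response


# Values borrowed from the e2e fixture above: threshold 0.6, ttl 60 s, 64 entries.
cache = SemanticCacheSketch(threshold=0.6, ttl_secs=60, max_entries=64)
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.9, 0.1]))  # near-duplicate vector: prints "cached answer"
print(cache.get([0.0, 1.0]))  # orthogonal vector: prints None
```

A lower threshold such as the fixture's `0.6` admits looser matches, which suits a deterministic test. The documented default of `0.85` is stricter.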