33 changes: 24 additions & 9 deletions docs/ai-gateway.md
@@ -1,6 +1,6 @@
 # SBproxy AI gateway guide

-*Last modified: 2026-06-17*
+*Last modified: 2026-06-18*

 SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including native Anthropic, Gemini, and Bedrock translators. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them.

@@ -577,25 +577,40 @@ Stores responses keyed by the SHA-256 of the messages array with TTL and capacit

 | Field | Type | Default | Notes |
 |-------|------|---------|-------|
-| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. |
-| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. |
+| `enabled` | bool | `false` | Opts an origin into semantic-cache lookup and storage. |
+| `threshold` | float | `0.85` | Minimum cosine similarity for a near-duplicate prompt to hit. |
+| `ttl_secs` | u64 | `3600` | Seconds before an entry is treated as a miss and removed. |
+| `max_entries` | usize | `1024` | Hard cap on cached responses. The oldest insert is evicted on overflow. |
+| `source` | string | `provider` | `provider`, `sidecar`, or `inprocess`. |
+| `embedding` | object | unset | Provider and model used when `source: provider`. |
+| `sidecar` | object | unset | gRPC endpoint, model, and timeout used when `source: sidecar`. |
+| `inprocess` | object | unset | ONNX model path, tokenizer path, and memory guard used when `source: inprocess`. |

-The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example:
+The semantic cache is configured on each AI origin under `action.semantic_cache`. The default `source: provider` calls the configured embedding provider's `/v1/embeddings` endpoint:

 ```yaml
 origins:
   ai.example.com:
     action:
       type: ai_proxy
-      providers: [...]
-    extensions:
+      providers:
+        - name: openai
+          api_key: ${OPENAI_API_KEY}
+          models: [gpt-4o, text-embedding-3-small]
+      routing:
+        strategy: round_robin
       semantic_cache:
         enabled: true
-        ttl_secs: 1200
-        key_template: "{embedding_model}:{lsh_bucket}"
+        threshold: 0.85
+        ttl_secs: 3600
+        max_entries: 1024
+        source: provider
+        embedding:
+          provider: openai
+          model: text-embedding-3-small
 ```

-The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it.
+For local embeddings with no provider egress, set `source: sidecar` and run the classifier sidecar with an embedding model. For single-process experiments, `source: inprocess` loads the ONNX model into the proxy process and should be paired with `max_model_bytes`. See [local-inference.md](local-inference.md) and [examples/semantic-cache-local](../examples/semantic-cache-local/sb.yml).

 ### Idempotency middleware (RFC 8594)

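Reviewer note: the interaction of `threshold`, `ttl_secs`, and `max_entries` in the table above may be easier to follow with a small model. The sketch below is a hypothetical Python illustration of the documented behaviour only (cosine-similarity hit test, expiry treated as a miss, oldest insert evicted on overflow). The class name, method names, and the explicit `now` argument are assumptions made for the example; none of this is SBproxy's actual implementation.

```python
import math
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCacheSketch:
    """Hypothetical model of the documented fields, not SBproxy's internals."""

    def __init__(self, threshold=0.85, ttl_secs=3600, max_entries=1024):
        self.threshold = threshold
        self.ttl_secs = ttl_secs
        self.max_entries = max_entries
        # Insertion-ordered map: key -> (embedding, response, stored_at).
        self._entries = OrderedDict()
        self._next_key = 0

    def put(self, embedding, response, now):
        # On overflow, evict the oldest insert before storing the new entry.
        if self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)
        self._entries[self._next_key] = (embedding, response, now)
        self._next_key += 1

    def get(self, embedding, now):
        best, best_score = None, self.threshold
        for key, (vec, response, stored_at) in list(self._entries.items()):
            if now - stored_at >= self.ttl_secs:
                # Expired entries count as a miss and are removed.
                del self._entries[key]
                continue
            score = cosine(embedding, vec)
            if score >= best_score:
                best, best_score = response, score
        return best
```

Passing `now` explicitly keeps the sketch deterministic. A real cache would read a monotonic clock and would likely use an approximate nearest-neighbour index rather than a linear scan.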
35 changes: 25 additions & 10 deletions docs/llms-full.txt
@@ -7,7 +7,7 @@ Pairs with `/llms.txt` (the small AI-discoverable feature catalog at `docs/llms.
 Regenerated by `scripts/regen-llms-full.sh`. Generated; do not hand-edit.

 Source: https://github.com/soapbucket/sbproxy
-Generated: 2026-06-18T13:10:18Z
+Generated: 2026-06-18T13:46:06Z

 ---

@@ -3653,7 +3653,7 @@ Exemplars on `sbproxy_ledger_redeem_duration_seconds_bucket` let Grafana jump fr

 ## SBproxy AI gateway guide

-*Last modified: 2026-06-17*
+*Last modified: 2026-06-18*

 SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including native Anthropic, Gemini, and Bedrock translators. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them.

@@ -4230,25 +4230,40 @@ Stores responses keyed by the SHA-256 of the messages array with TTL and capacit

 | Field | Type | Default | Notes |
 |-------|------|---------|-------|
-| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. |
-| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. |
+| `enabled` | bool | `false` | Opts an origin into semantic-cache lookup and storage. |
+| `threshold` | float | `0.85` | Minimum cosine similarity for a near-duplicate prompt to hit. |
+| `ttl_secs` | u64 | `3600` | Seconds before an entry is treated as a miss and removed. |
+| `max_entries` | usize | `1024` | Hard cap on cached responses. The oldest insert is evicted on overflow. |
+| `source` | string | `provider` | `provider`, `sidecar`, or `inprocess`. |
+| `embedding` | object | unset | Provider and model used when `source: provider`. |
+| `sidecar` | object | unset | gRPC endpoint, model, and timeout used when `source: sidecar`. |
+| `inprocess` | object | unset | ONNX model path, tokenizer path, and memory guard used when `source: inprocess`. |

-The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example:
+The semantic cache is configured on each AI origin under `action.semantic_cache`. The default `source: provider` calls the configured embedding provider's `/v1/embeddings` endpoint:

 ```yaml
 origins:
   ai.example.com:
     action:
       type: ai_proxy
-      providers: [...]
-    extensions:
+      providers:
+        - name: openai
+          api_key: ${OPENAI_API_KEY}
+          models: [gpt-4o, text-embedding-3-small]
+      routing:
+        strategy: round_robin
       semantic_cache:
         enabled: true
-        ttl_secs: 1200
-        key_template: "{embedding_model}:{lsh_bucket}"
+        threshold: 0.85
+        ttl_secs: 3600
+        max_entries: 1024
+        source: provider
+        embedding:
+          provider: openai
+          model: text-embedding-3-small
 ```

-The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it.
+For local embeddings with no provider egress, set `source: sidecar` and run the classifier sidecar with an embedding model. For single-process experiments, `source: inprocess` loads the ONNX model into the proxy process and should be paired with `max_model_bytes`. See [local-inference.md](local-inference.md) and [examples/semantic-cache-local](../examples/semantic-cache-local/sb.yml).

 ### Idempotency middleware (RFC 8594)

32 changes: 32 additions & 0 deletions e2e/cases/semantic-cache-sidecar/sb.yml
@@ -0,0 +1,32 @@
+# yaml-language-server: $schema=../../../schemas/sb-config.schema.json
+#
+# WOR-1226: local semantic-cache sidecar e2e fixture.
+# The test replaces __UPSTREAM__ and __SIDECAR__ with ephemeral local
+# endpoints before starting the proxy.
+
+proxy:
+  http_bind_port: 0
+
+origins:
+  "ai.localhost":
+    action:
+      type: ai_proxy
+      providers:
+        - name: openai
+          api_key: "stub-key"
+          base_url: "__UPSTREAM__"
+          allow_private_base_url: true
+          models:
+            - gpt-4o
+      routing:
+        strategy: round_robin
+      semantic_cache:
+        enabled: true
+        threshold: 0.6
+        ttl_secs: 60
+        max_entries: 64
+        source: sidecar
+        sidecar:
+          endpoint: "__SIDECAR__"
+          model: all-MiniLM-L6-v2
+          timeout_ms: 2000
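
Reviewer note: the fixture's header comment says the test swaps `__UPSTREAM__` and `__SIDECAR__` for ephemeral local endpoints before starting the proxy. The harness itself is not part of this diff, so the following is only a hypothetical Python sketch of that substitution step. The helper names `free_local_port` and `render_fixture`, and the `http://127.0.0.1:<port>` endpoint form, are assumptions for illustration.

```python
import socket

# Placeholder tokens taken from the fixture's header comment.
PLACEHOLDERS = ("__UPSTREAM__", "__SIDECAR__")


def free_local_port() -> int:
    """Ask the OS for an unused loopback port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def render_fixture(template: str, upstream: str, sidecar: str) -> str:
    """Replace both placeholders, failing loudly if the template lacks one."""
    for token in PLACEHOLDERS:
        if token not in template:
            raise ValueError(f"fixture is missing placeholder {token}")
    return template.replace("__UPSTREAM__", upstream).replace("__SIDECAR__", sidecar)
```

A harness along these lines would start stub upstream and sidecar servers on ports from `free_local_port()`, render the fixture to a temporary file, and only then launch the proxy. Checking for the tokens up front means a renamed placeholder fails the test immediately rather than leaving a literal `__SIDECAR__` in the running config.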