diff --git a/docs/ai-gateway.md b/docs/ai-gateway.md index 526e4152..28b3cd75 100644 --- a/docs/ai-gateway.md +++ b/docs/ai-gateway.md @@ -1,6 +1,6 @@ # SBproxy AI gateway guide -*Last modified: 2026-06-17* +*Last modified: 2026-06-18* SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including native Anthropic, Gemini, and Bedrock translators. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them. @@ -577,25 +577,40 @@ Stores responses keyed by the SHA-256 of the messages array with TTL and capacit | Field | Type | Default | Notes | |-------|------|---------|-------| -| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. | -| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. | +| `enabled` | bool | `false` | Opts an origin into semantic-cache lookup and storage. | +| `threshold` | float | `0.85` | Minimum cosine similarity for a near-duplicate prompt to hit. | +| `ttl_secs` | u64 | `3600` | Seconds before an entry is treated as a miss and removed. | +| `max_entries` | usize | `1024` | Hard cap on cached responses. The oldest insert is evicted on overflow. | +| `source` | string | `provider` | `provider`, `sidecar`, or `inprocess`. | +| `embedding` | object | unset | Provider and model used when `source: provider`. | +| `sidecar` | object | unset | gRPC endpoint, model, and timeout used when `source: sidecar`. | +| `inprocess` | object | unset | ONNX model path, tokenizer path, and memory guard used when `source: inprocess`. | -The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example: +The semantic cache is configured on each AI origin under `action.semantic_cache`. The default `source: provider` calls the configured embedding provider's `/v1/embeddings` endpoint: ```yaml origins: ai.example.com: action: type: ai_proxy - providers: [...] - extensions: + providers: + - name: openai + api_key: ${OPENAI_API_KEY} + models: [gpt-4o, text-embedding-3-small] + routing: + strategy: round_robin semantic_cache: enabled: true - ttl_secs: 1200 - key_template: "{embedding_model}:{lsh_bucket}" + threshold: 0.85 + ttl_secs: 3600 + max_entries: 1024 + source: provider + embedding: + provider: openai + model: text-embedding-3-small ``` -The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it. +For local embeddings with no provider egress, set `source: sidecar` and run the classifier sidecar with an embedding model. For single-process experiments, `source: inprocess` loads the ONNX model into the proxy process and should be paired with `max_model_bytes`. See [local-inference.md](local-inference.md) and [examples/semantic-cache-local](../examples/semantic-cache-local/sb.yml). ### Idempotency middleware (RFC 8594) diff --git a/docs/llms-full.txt b/docs/llms-full.txt index 078a3bc3..a9dc1000 100644 --- a/docs/llms-full.txt +++ b/docs/llms-full.txt @@ -7,7 +7,7 @@ Pairs with `/llms.txt` (the small AI-discoverable feature catalog at `docs/llms. Regenerated by `scripts/regen-llms-full.sh`. Generated; do not hand-edit. Source: https://github.com/soapbucket/sbproxy -Generated: 2026-06-18T13:10:18Z +Generated: 2026-06-18T13:46:06Z --- @@ -3653,7 +3653,7 @@ Exemplars on `sbproxy_ledger_redeem_duration_seconds_bucket` let Grafana jump fr ## SBproxy AI gateway guide -*Last modified: 2026-06-17* +*Last modified: 2026-06-18* SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including native Anthropic, Gemini, and Bedrock translators. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them. @@ -4230,25 +4230,40 @@ Stores responses keyed by the SHA-256 of the messages array with TTL and capacit | Field | Type | Default | Notes | |-------|------|---------|-------| -| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. | -| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. | +| `enabled` | bool | `false` | Opts an origin into semantic-cache lookup and storage. | +| `threshold` | float | `0.85` | Minimum cosine similarity for a near-duplicate prompt to hit. | +| `ttl_secs` | u64 | `3600` | Seconds before an entry is treated as a miss and removed. | +| `max_entries` | usize | `1024` | Hard cap on cached responses. The oldest insert is evicted on overflow. | +| `source` | string | `provider` | `provider`, `sidecar`, or `inprocess`. | +| `embedding` | object | unset | Provider and model used when `source: provider`. | +| `sidecar` | object | unset | gRPC endpoint, model, and timeout used when `source: sidecar`. | +| `inprocess` | object | unset | ONNX model path, tokenizer path, and memory guard used when `source: inprocess`. | -The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example: +The semantic cache is configured on each AI origin under `action.semantic_cache`. The default `source: provider` calls the configured embedding provider's `/v1/embeddings` endpoint: ```yaml origins: ai.example.com: action: type: ai_proxy - providers: [...] - extensions: + providers: + - name: openai + api_key: ${OPENAI_API_KEY} + models: [gpt-4o, text-embedding-3-small] + routing: + strategy: round_robin semantic_cache: enabled: true - ttl_secs: 1200 - key_template: "{embedding_model}:{lsh_bucket}" + threshold: 0.85 + ttl_secs: 3600 + max_entries: 1024 + source: provider + embedding: + provider: openai + model: text-embedding-3-small ``` -The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it. +For local embeddings with no provider egress, set `source: sidecar` and run the classifier sidecar with an embedding model. For single-process experiments, `source: inprocess` loads the ONNX model into the proxy process and should be paired with `max_model_bytes`. See [local-inference.md](local-inference.md) and [examples/semantic-cache-local](../examples/semantic-cache-local/sb.yml). ### Idempotency middleware (RFC 8594) diff --git a/e2e/cases/semantic-cache-sidecar/sb.yml b/e2e/cases/semantic-cache-sidecar/sb.yml new file mode 100644 index 00000000..a9ec5aad --- /dev/null +++ b/e2e/cases/semantic-cache-sidecar/sb.yml @@ -0,0 +1,32 @@ +# yaml-language-server: $schema=../../../schemas/sb-config.schema.json +# +# WOR-1226: local semantic-cache sidecar e2e fixture. +# The test replaces __UPSTREAM__ and __SIDECAR__ with ephemeral local +# endpoints before starting the proxy. + +proxy: + http_bind_port: 0 + +origins: + "ai.localhost": + action: + type: ai_proxy + providers: + - name: openai + api_key: "stub-key" + base_url: "__UPSTREAM__" + allow_private_base_url: true + models: + - gpt-4o + routing: + strategy: round_robin + semantic_cache: + enabled: true + threshold: 0.6 + ttl_secs: 60 + max_entries: 64 + source: sidecar + sidecar: + endpoint: "__SIDECAR__" + model: all-MiniLM-L6-v2 + timeout_ms: 2000 diff --git a/e2e/tests/semantic_cache_sidecar_e2e.rs b/e2e/tests/semantic_cache_sidecar_e2e.rs new file mode 100644 index 00000000..bada0dbf --- /dev/null +++ b/e2e/tests/semantic_cache_sidecar_e2e.rs @@ -0,0 +1,306 @@ +//! Local sidecar-backed semantic-cache e2e (WOR-1226). +//! +//! The normal e2e suite compiles this test but skips execution unless +//! `SBPROXY_TEST_EMBED_MODEL` and `SBPROXY_TEST_EMBED_TOKENIZER` point at +//! local ONNX/tokenizer artifacts. When enabled, it launches the release +//! classifier sidecar with `--embed-model`, starts the proxy through +//! `ProxyHarness`, then proves a near-duplicate prompt is served from the +//! semantic cache without a second upstream chat call. + +use std::io::{Read, Write}; +use std::net::{TcpListener, TcpStream}; +use std::path::PathBuf; +use std::process::{Child, Command, Stdio}; +use std::sync::atomic::{AtomicUsize, Ordering}; +use std::sync::{Arc, Mutex}; +use std::time::{Duration, Instant}; + +use sbproxy_e2e::ProxyHarness; + +const MODEL_ID: &str = "all-MiniLM-L6-v2"; +const CASE_CONFIG: &str = "e2e/cases/semantic-cache-sidecar/sb.yml"; +const SIDECAR_BIN_ENV: &str = "SBPROXY_CLASSIFIER_SIDECAR_BIN"; +const EMBED_MODEL_ENV: &str = "SBPROXY_TEST_EMBED_MODEL"; +const EMBED_TOKENIZER_ENV: &str = "SBPROXY_TEST_EMBED_TOKENIZER"; + +struct MockProvider { + port: u16, + chat_calls: Arc, + shutdown: Arc>, +} + +impl MockProvider { + fn start() -> Self { + let listener = TcpListener::bind("127.0.0.1:0").expect("bind mock provider"); + let port = listener.local_addr().unwrap().port(); + let chat_calls = Arc::new(AtomicUsize::new(0)); + let shutdown = Arc::new(Mutex::new(false)); + let chat_clone = Arc::clone(&chat_calls); + let shutdown_clone = Arc::clone(&shutdown); + + std::thread::spawn(move || { + for stream in listener.incoming() { + if *shutdown_clone.lock().unwrap() { + break; + } + let mut stream = match stream { + Ok(stream) => stream, + Err(_) => continue, + }; + let chat = Arc::clone(&chat_clone); + std::thread::spawn(move || { + let _ = handle_provider_conn(&mut stream, chat); + }); + } + }); + + Self { + port, + chat_calls, + shutdown, + } + } + + fn base_url(&self) -> String { + format!("http://127.0.0.1:{}", self.port) + } + + fn chat_calls(&self) -> usize { + self.chat_calls.load(Ordering::SeqCst) + } +} + +impl Drop for MockProvider { + fn drop(&mut self) { + *self.shutdown.lock().unwrap() = true; + let _ = TcpStream::connect(("127.0.0.1", self.port)); + } +} + +struct Sidecar { + child: Child, +} + +impl Sidecar { + fn start(model: PathBuf, tokenizer: PathBuf) -> anyhow::Result<(Self, String)> { + let port = pick_free_port()?; + let endpoint = format!("http://127.0.0.1:{port}"); + let spec = format!("{MODEL_ID}={}:{}", model.display(), tokenizer.display()); + let bin = sidecar_binary_path(); + if !bin.is_file() { + anyhow::bail!( + "classifier sidecar binary missing at {}; run `cargo build --release -p sbproxy-classifier-sidecar` or set {SIDECAR_BIN_ENV}", + bin.display() + ); + } + + let mut child = Command::new(&bin) + .arg("--listen") + .arg(format!("127.0.0.1:{port}")) + .arg("--embed-model") + .arg(spec) + .stdin(Stdio::null()) + .stdout(Stdio::null()) + .stderr(Stdio::null()) + .spawn() + .map_err(|e| anyhow::anyhow!("spawn sidecar {}: {e}", bin.display()))?; + + wait_for_sidecar_tcp(&mut child, port, Duration::from_secs(45))?; + Ok((Self { child }, endpoint)) + } +} + +impl Drop for Sidecar { + fn drop(&mut self) { + let _ = self.child.kill(); + let _ = self.child.wait(); + } +} + +fn handle_provider_conn( + stream: &mut TcpStream, + chat_calls: Arc, +) -> std::io::Result<()> { + read_request(stream)?; + let n = chat_calls.fetch_add(1, Ordering::SeqCst) + 1; + let json_body = format!( + r#"{{"id":"chatcmpl-{n}","object":"chat.completion","created":1700000000,"model":"gpt-4o","choices":[{{"index":0,"message":{{"role":"assistant","content":"reply-{n}"}},"finish_reason":"stop"}}],"usage":{{"prompt_tokens":1,"completion_tokens":1,"total_tokens":2}}}}"# + ); + let response = format!( + "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}", + json_body.len(), + json_body + ); + stream.write_all(response.as_bytes())?; + stream.flush()?; + Ok(()) +} + +fn read_request(stream: &mut TcpStream) -> std::io::Result<()> { + let mut buf = Vec::new(); + let mut tmp = [0u8; 4096]; + loop { + if let Some(end) = find_headers_end(&buf) { + let header_str = String::from_utf8_lossy(&buf[..end]); + let content_len = parse_content_length(&header_str); + if buf.len() >= end + 4 + content_len { + return Ok(()); + } + } + let n = stream.read(&mut tmp)?; + if n == 0 { + return Ok(()); + } + buf.extend_from_slice(&tmp[..n]); + } +} + +fn find_headers_end(buf: &[u8]) -> Option { + buf.windows(4).position(|w| w == b"\r\n\r\n") +} + +fn parse_content_length(headers: &str) -> usize { + headers + .lines() + .find_map(|line| { + line.to_ascii_lowercase() + .strip_prefix("content-length:") + .and_then(|rest| rest.trim().parse().ok()) + }) + .unwrap_or(0) +} + +fn pick_free_port() -> anyhow::Result { + let listener = TcpListener::bind("127.0.0.1:0")?; + Ok(listener.local_addr()?.port()) +} + +fn wait_for_sidecar_tcp(child: &mut Child, port: u16, timeout: Duration) -> anyhow::Result<()> { + let deadline = Instant::now() + timeout; + loop { + if TcpStream::connect(("127.0.0.1", port)).is_ok() { + return Ok(()); + } + if let Some(status) = child.try_wait()? { + anyhow::bail!("classifier sidecar exited before accepting connections: {status}"); + } + if Instant::now() >= deadline { + anyhow::bail!("classifier sidecar did not accept connections within {timeout:?}"); + } + std::thread::sleep(Duration::from_millis(100)); + } +} + +fn sidecar_binary_path() -> PathBuf { + if let Some(path) = std::env::var_os(SIDECAR_BIN_ENV).filter(|v| !v.is_empty()) { + return PathBuf::from(path); + } + workspace_root().join("target/release/sbproxy-classifier-sidecar") +} + +fn workspace_root() -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .parent() + .expect("e2e crate lives under workspace root") + .to_path_buf() +} + +fn embed_fixture_from_env() -> Option<(PathBuf, PathBuf)> { + let model = match std::env::var_os(EMBED_MODEL_ENV) { + Some(value) if !value.is_empty() => PathBuf::from(value), + _ => { + eprintln!("skipping sidecar semantic-cache e2e: {EMBED_MODEL_ENV} is not set"); + return None; + } + }; + let tokenizer = match std::env::var_os(EMBED_TOKENIZER_ENV) { + Some(value) if !value.is_empty() => PathBuf::from(value), + _ => { + eprintln!("skipping sidecar semantic-cache e2e: {EMBED_TOKENIZER_ENV} is not set"); + return None; + } + }; + assert!( + model.is_file(), + "{EMBED_MODEL_ENV} does not point at a file: {}", + model.display() + ); + assert!( + tokenizer.is_file(), + "{EMBED_TOKENIZER_ENV} does not point at a file: {}", + tokenizer.display() + ); + Some((model, tokenizer)) +} + +fn config_for(upstream: &str, sidecar: &str) -> String { + let path = workspace_root().join(CASE_CONFIG); + let raw = std::fs::read_to_string(&path) + .unwrap_or_else(|e| panic!("read case config {}: {e}", path.display())); + raw.replace("__UPSTREAM__", upstream) + .replace("__SIDECAR__", sidecar) +} + +fn chat(prompt: &str) -> serde_json::Value { + serde_json::json!({ + "model": "gpt-4o", + "messages": [{"role": "user", "content": prompt}] + }) +} + +#[test] +fn near_duplicate_prompt_hits_sidecar_semantic_cache() { + let Some((model, tokenizer)) = embed_fixture_from_env() else { + return; + }; + let upstream = MockProvider::start(); + let (_sidecar, sidecar_endpoint) = + Sidecar::start(model, tokenizer).expect("start classifier sidecar"); + let proxy = ProxyHarness::start_with_yaml(&config_for(&upstream.base_url(), &sidecar_endpoint)) + .expect("start proxy"); + + let r1 = proxy + .post_json( + "/v1/chat/completions", + "ai.localhost", + &chat("What is the capital city of France?"), + &[], + ) + .expect("first call"); + assert_eq!(r1.status, 200); + let b1 = String::from_utf8(r1.body).unwrap(); + assert!( + b1.contains("reply-1"), + "first reply should be reply-1: {b1}" + ); + assert_ne!( + r1.headers.get("x-semcache").map(|s| s.as_str()), + Some("HIT"), + "first call must be a miss" + ); + + let r2 = proxy + .post_json( + "/v1/chat/completions", + "ai.localhost", + &chat("What is France's capital city?"), + &[], + ) + .expect("second call"); + assert_eq!(r2.status, 200); + assert_eq!( + r2.headers.get("x-semcache").map(|s| s.as_str()), + Some("HIT"), + "near-duplicate prompt must be a semantic cache hit" + ); + let b2 = String::from_utf8(r2.body).unwrap(); + assert!( + b2.contains("reply-1"), + "hit must replay the cached reply-1, got: {b2}" + ); + assert_eq!( + upstream.chat_calls(), + 1, + "second call must not reach upstream" + ); +}