diff --git a/docs/how-to/cost-optimization.md b/docs/how-to/cost-optimization.md
new file mode 100644
index 000000000..f8e07db0c
--- /dev/null
+++ b/docs/how-to/cost-optimization.md
@@ -0,0 +1,361 @@
+---
+sidebar_position: 11
+---
+
+# Cost Optimization
+
+This is the long-form cost optimization guide for SynapseKit. The cheapest system is the one that *makes fewer expensive calls*; everything else is marginal. The four real levers are:
+
+1. **Cost-quality routing** (always pick the cheapest model that meets your quality bar).
+2. **Prompt compression** (send less context per call).
+3. **Caching layers** (avoid repeat calls entirely).
+4. **Fallback ladders** (try cheap models first and escalate to expensive ones, not the other way around).
+
+If you want only one rule: **route + compress + cache.**
+
+---
+
+## Prerequisites
+
+```bash
+pip install "synapsekit[openai]"
+# Optional for Redis caching
+pip install "synapsekit[redis]"
+```
+
+---
+
+# 1) Know your cost envelope
+
+**Per-request cost** is roughly:
+
+```
+(input_tokens * price_in) + (output_tokens * price_out)
+```
+
+You don’t need perfect accuracy to make good decisions. A rough estimate is enough to decide when to downgrade or compress.
+
+## 1.1 Use `CostTracker` to measure, not guess
+
+```python
+import time
+from synapsekit import CostTracker
+from synapsekit.observability.tracer import COST_TABLE
+
+tracker = CostTracker()
+
+# Example: record a call
+with tracker.scope("pipeline:qa"):
+    start = time.perf_counter()
+    # ... call an LLM ...
+    latency_ms = (time.perf_counter() - start) * 1000
+
+    # Use your real token counts if you have them
+    rec = tracker.record(
+        model="gpt-4o-mini",
+        input_tokens=1200,
+        output_tokens=400,
+        latency_ms=latency_ms,
+    )
+
+print(rec.cost_usd)
+print(tracker.summary())
+print(COST_TABLE["gpt-4o-mini"])  # price per token (input/output)
+```
+
+## 1.2 Enforce hard budgets with `BudgetGuard`
+
+```python
+from synapsekit import BudgetGuard, BudgetLimit
+
+# Hard limits in USD
+limits = BudgetLimit(per_request=0.02, per_user=1.00, daily=20.00)
+
+guard = BudgetGuard(limits)
+
+# Before call
+estimated_cost = 0.015
+user_id = "alice"
+
+guard.check_before(estimated_cost=estimated_cost, user_id=user_id)
+
+# After call
+actual_cost = 0.012
+
+guard.record_spend(actual_cost, user_id=user_id)
+```
+
+**Rule:** If you don’t set budgets, you *will* get bill shock.
+
+---
+
+# 2) Cost-quality routing
+
+Routing is the biggest cost lever. Use the cheapest model that meets your quality threshold, and escalate to a stronger model only when needed.
+
+## 2.1 Static routing with `CostRouter`
+
+`CostRouter` filters out models below a quality threshold, then picks the **cheapest** of the rest.
+
+```python
+from synapsekit import CostRouter, CostRouterConfig, RouterModelSpec
+
+router = CostRouter(CostRouterConfig(
+    models=[
+        RouterModelSpec(model="gpt-4o-mini", api_key="sk-...", provider="openai"),
+        RouterModelSpec(model="gpt-4o", api_key="sk-...", provider="openai"),
+        RouterModelSpec(model="claude-sonnet-4-6", api_key="sk-ant-...", provider="anthropic"),
+    ],
+    quality_threshold=0.78,  # Uses built-in QUALITY_TABLE
+    fallback_on_error=True,
+))
+
+answer = await router.generate("Summarize this doc")
+print(router.selected_model)
+```
+
+Use this when you already know your minimum acceptable quality.
+
+---
+
+## 2.2 Learning-based routing with `CostQualityRouter`
+
+`CostQualityRouter` explores, learns real cost/quality stats, then exploits the cheapest model that meets your threshold.
+
+```python
+from synapsekit import CostQualityRouter
+from synapsekit.llm.openai import OpenAILLM
+from synapsekit import LLMConfig
+
+cheap = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-...", provider="openai"))
+strong = OpenAILLM(LLMConfig(model="gpt-4o", api_key="sk-...", provider="openai"))
+
+router = CostQualityRouter(
+    candidates=[cheap, strong],
+    eval_suite="my_project.eval:eval_suite",  # Import path to any object with evaluate(question, answer)
+    quality_threshold=0.85,
+    budget_per_call_usd=0.02,
+    explore_n=50,
+)
+
+result = await router.generate("Explain transformer attention")
+print(router.stats())  # includes Pareto frontier
+```
+
+**Pattern:** run the router in exploration for 50–200 calls, then let it exploit. The Pareto frontier will show which models are actually worth paying for.
+
+---
+
+# 3) Prompt compression (send fewer tokens)
+
+You pay for every token you send and receive. Shrink the context first; don’t just downgrade models.
+
+## 3.1 Cut context size with `ContextualCompressionRetriever`
+
+This retrieves **more** documents (`fetch_k`) and compresses each one down to only what is relevant to the question.
+
+```python
+from synapsekit import (
+    RAGPipeline, RAGConfig, InMemoryVectorStore,
+    SynapsekitEmbeddings, ContextualCompressionRetriever,
+    Retriever, ConversationMemory,
+)
+from synapsekit.llm.openai import OpenAILLM
+from synapsekit import LLMConfig
+
+llm = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-...", provider="openai"))
+store = InMemoryVectorStore(SynapsekitEmbeddings())
+base_retriever = Retriever(store)
+retriever = ContextualCompressionRetriever(
+    retriever=base_retriever,
+    llm=llm,
+    fetch_k=12,  # pull more docs
+)
+
+memory = ConversationMemory(window=5)
+rag = RAGPipeline(RAGConfig(
+    llm=llm,
+    retriever=retriever,
+    memory=memory,
+    retrieval_top_k=4,  # but only keep top 4 compressed excerpts
+))
+```
+
+**Effect:** fewer context tokens per call, lower cost, often *higher* relevance.
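To gauge what a smaller context is worth, the per-request formula from section 1 can be applied to the input side before and after compression. The sketch below is plain Python, not SynapseKit API. The token counts (4 raw chunks of about 500 tokens versus 4 compressed excerpts of about 150) and the $0.15 per 1M input-token price are illustrative assumptions, and the sketch ignores the cost of any LLM calls the compressor itself makes.

```python
# Assumed input price in USD per token ($0.15 per 1M tokens); replace with your model's price.
PRICE_IN = 0.15 / 1_000_000


def input_cost(context_tokens: int, question_tokens: int = 50) -> float:
    """Input-side cost of one request: context plus question, times the input price."""
    return (context_tokens + question_tokens) * PRICE_IN


# Hypothetical sizes: 4 raw chunks of ~500 tokens vs. 4 compressed excerpts of ~150 tokens.
raw = input_cost(4 * 500)
compressed = input_cost(4 * 150)
reduction = 1 - compressed / raw

print(f"raw: ${raw:.6f}  compressed: ${compressed:.6f}  reduction: {reduction:.0%}")
```

Swap in token counts measured with `CostTracker` to see whether compression pays for itself on your own traffic.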
+
+---
+
+## 3.2 Summarize long conversations with `SummaryBufferMemory`
+
+For chat systems, history grows fast. Summarize the old turns instead of appending everything.
+
+```python
+from synapsekit import SummaryBufferMemory
+
+memory = SummaryBufferMemory(llm=llm, max_tokens=1200)
+
+memory.add("user", "long question...")
+memory.add("assistant", "long answer...")
+
+messages = await memory.get_messages()
+```
+
+---
+
+## 3.3 Tune RAG chunking and top_k
+
+```python
+from synapsekit import RAGConfig, ConversationMemory
+
+config = RAGConfig(
+    llm=llm,
+    retriever=retriever,
+    memory=ConversationMemory(window=5),
+    chunk_size=400,  # smaller chunks → fewer irrelevant tokens
+    chunk_overlap=40,
+    retrieval_top_k=3,  # fewer chunks per query
+)
+```
+
+Smaller chunks plus a lower `retrieval_top_k` typically reduce context tokens by 30–60%; measure on your own corpus before relying on that range.
+
+---
+
+# 4) Caching layers (avoid calls entirely)
+
+## 4.1 LLM response caching
+
+Enable the built-in cache: it’s the cheapest optimization you’ll ever make.
+
+```python
+from synapsekit import LLMConfig
+from synapsekit.llm.openai import OpenAILLM
+
+llm = OpenAILLM(LLMConfig(
+    model="gpt-4o-mini",
+    api_key="sk-...",
+    provider="openai",
+    cache=True,
+    cache_backend="redis",  # "memory" | "sqlite" | "filesystem" | "redis"
+    cache_db_path="redis://localhost:6379",
+))
+
+await llm.generate("What is SynapseKit?")
+await llm.generate("What is SynapseKit?")  # cache hit
+
+print(llm.cache_stats)  # {"hits": 1, "misses": 1, "size": 1}
+```
+
+## 4.2 Persist RAG indexes (skip re-embedding)
+
+```python
+from synapsekit import RAG
+
+rag = RAG(model="gpt-4o-mini", api_key="sk-...")
+
+await rag.add_async("doc 1")
+await rag.add_async("doc 2")
+rag.save("./index.npz")
+
+# Later
+rag.load("./index.npz")
+```
+
+---
+
+# 5) Model fallback ladders
+
+Always try cheap models first, then escalate only if the call fails or the answer is empty or too short.
+
+```python
+from synapsekit import FallbackChain, FallbackChainConfig
+from synapsekit.llm.openai import OpenAILLM
+from synapsekit import LLMConfig
+
+cheap = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-...", provider="openai"))
+strong = OpenAILLM(LLMConfig(model="gpt-4o", api_key="sk-...", provider="openai"))
+
+chain = FallbackChain(FallbackChainConfig(
+    models=[cheap, strong],
+    min_response_length=20,
+    fallback_on_empty=True,
+))
+
+answer = await chain.generate("Explain vector databases")
+print(chain.used_model)
+```
+
+**Pattern:** Use `FallbackChain` when your main model is cheap but occasionally underperforms.
+
+---
+
+# 6) Savings case studies (illustrative)
+
+These are **example calculations** using published per-token prices. Swap in your real numbers.
+
+## Case 1: Router + cheap models for 70% of traffic
+
+**Assumptions:**
+- 100,000 requests/month
+- Average tokens per request: 1,200 input + 400 output
+- GPT-4o-mini pricing: $0.15 / 1M input, $0.60 / 1M output
+- GPT-4o pricing: $2.50 / 1M input, $10.00 / 1M output
+- Routing sends 70% to GPT-4o-mini, 30% to GPT-4o
+
+**Cost per request:**
+- GPT-4o-mini: (1200 * 0.15 + 400 * 0.60) / 1e6 = **$0.00042**
+- GPT-4o: (1200 * 2.50 + 400 * 10.00) / 1e6 = **$0.00700**
+
+**Monthly cost (example):**
+```
+0.70 * 100,000 * 0.00042 = $29.40
+0.30 * 100,000 * 0.00700 = $210.00
+TOTAL = $239.40
+```
+
+If everything ran on GPT-4o, cost would be **$700**. Routing saves **~66%**.
+
+---
+
+## Case 2: Cache hits on 30% repeated queries
+
+**Assumptions:**
+- 50,000 requests/month
+- 30% are repeat questions
+- Average GPT-4o-mini cost: $0.00042 per request
+
+```
+Without cache: 50,000 * 0.00042 = $21.00
+With 30% cache hits: 35,000 * 0.00042 = $14.70
+Savings: $6.30 (30%)
+```
+
+Savings scale linearly with the cache hit rate. If repeat rates are high, caching is the biggest lever after routing.
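The arithmetic in Cases 1 and 2 can be recomputed directly from the listed prices, which makes it easy to swap in your own traffic and token counts. This is a minimal sketch in plain Python: the function names are illustrative and not part of SynapseKit, and the prices are the ones assumed in Case 1.

```python
# USD per 1M tokens as (input, output), copied from the Case 1 assumptions.
PRICES_PER_MILLION = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Per-request cost in USD: tokens times price, summed over input and output."""
    price_in, price_out = PRICES_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def routed_monthly_cost(requests: int, cheap_share: float,
                        input_tokens: int = 1200, output_tokens: int = 400) -> float:
    """Monthly cost when `cheap_share` of traffic goes to the cheap model."""
    cheap = cost_per_request("gpt-4o-mini", input_tokens, output_tokens)
    strong = cost_per_request("gpt-4o", input_tokens, output_tokens)
    return requests * (cheap_share * cheap + (1 - cheap_share) * strong)


def cached_monthly_cost(requests: int, hit_rate: float, cost_per_call: float) -> float:
    """Monthly cost when `hit_rate` of requests are served from cache at no LLM cost."""
    return requests * (1 - hit_rate) * cost_per_call


# Case 1: 70% of 100,000 requests routed to the cheap model.
routed = routed_monthly_cost(100_000, cheap_share=0.70)
all_strong = routed_monthly_cost(100_000, cheap_share=0.0)
print(f"routed: ${routed:.2f}  all GPT-4o: ${all_strong:.2f}  "
      f"saving: {1 - routed / all_strong:.0%}")

# Case 2: 30% cache hits on 50,000 GPT-4o-mini requests.
mini = cost_per_request("gpt-4o-mini", 1200, 400)
print(f"no cache: ${cached_monthly_cost(50_000, 0.0, mini):.2f}  "
      f"30% hits: ${cached_monthly_cost(50_000, 0.30, mini):.2f}")
```

Keeping the prices in one table means a price change or a different model mix only needs a one-line edit.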
+
+---
+
+## Case 3: Context compression reduces input tokens by 50%
+
+If your average prompt includes 2,000 context tokens and you compress it to 1,000, the context share of your input cost halves on any model. For GPT-4o-mini at $0.15 / 1M input tokens, that saves $0.00015 per request; output cost is unchanged.
+
+---
+
+# 7) Cost optimization checklist
+
+- [ ] Route requests with **CostRouter** or **CostQualityRouter**
+- [ ] Enforce budgets with **BudgetGuard**
+- [ ] Trim context (`retrieval_top_k`, `chunk_size`, compression)
+- [ ] Summarize long chat history with **SummaryBufferMemory**
+- [ ] Enable LLM caching (Redis for multi-worker)
+- [ ] Use **FallbackChain** to escalate only when needed
+- [ ] Track costs with **CostTracker** and log by scope
+
+---
+
+# Next steps
+
+- [CostRouter & FallbackChain](../llms/cost-router): static routing + ladders
+- [Cost-Aware LLM Router](../guides/llms/cost-router): full pipeline example
+- [Caching & Retries](../llms/caching-retries): cache backends and knobs
+- [Performance tuning](./performance-tuning): latency + throughput playbook
+- [Cost intelligence](../observability/cost-tracker): CostTracker + BudgetGuard API
diff --git a/docs/how-to/index.md b/docs/how-to/index.md
index 50daa54f4..dec2e2391 100644
--- a/docs/how-to/index.md
+++ b/docs/how-to/index.md
@@ -16,4 +16,5 @@ Task-focused guides for common SynapseKit patterns.
 | [Error handling](./error-handling) | Retries, fallbacks, budget errors |
 | [Testing](./testing) | Unit, integration, and eval testing |
 | [Production deployment](./production) | Docker, gunicorn, CI/CD |
+| [Cost optimization](./cost-optimization) | Routing, compression, caching, budgets |
 | [Migrate from LangChain](./migrate-from-langchain) | Side-by-side code comparison |
diff --git a/sidebars.ts b/sidebars.ts
index cffdf9538..5340fac00 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -277,6 +277,7 @@ const sidebars: SidebarsConfig = {
         'how-to/error-handling',
         'how-to/testing',
         'how-to/production',
+        'how-to/cost-optimization',
         'how-to/migrate-from-langchain',
       ],
     },