diff --git a/docs/how-to/index.md b/docs/how-to/index.md
index 50daa54f4..6e04b712e 100644
--- a/docs/how-to/index.md
+++ b/docs/how-to/index.md
@@ -16,4 +16,5 @@ Task-focused guides for common SynapseKit patterns.
 | [Error handling](./error-handling) | Retries, fallbacks, budget errors |
 | [Testing](./testing) | Unit, integration, and eval testing |
 | [Production deployment](./production) | Docker, gunicorn, CI/CD |
+| [Performance tuning](./performance-tuning) | Concurrency, caching, batching, profiling |
 | [Migrate from LangChain](./migrate-from-langchain) | Side-by-side code comparison |
diff --git a/docs/how-to/performance-tuning.md b/docs/how-to/performance-tuning.md
new file mode 100644
index 000000000..136902b8b
--- /dev/null
+++ b/docs/how-to/performance-tuning.md
@@ -0,0 +1,474 @@
+---
+sidebar_position: 10
+---
+
+# Performance Tuning
+
+This is the long-form performance playbook for SynapseKit. It focuses on the four levers that actually move the needle:
+
+1. **Concurrency knobs**: reduce wall-clock latency by running independent work in parallel.
+2. **Caching strategies**: avoid redundant LLM calls and re-embedding.
+3. **Batch APIs**: amortize overhead across many inputs.
+4. **Profiling cookbook**: measure, don’t guess.
+
+If you want only one takeaway: **framework overhead is tiny; network/LLM time dominates.** The biggest gains come from *fewer calls*, *parallel calls*, and *cache hits*.
+
+---
+
+## Prerequisites
+
+```bash
+pip install synapsekit[openai]
+# Optional (for caching + tracing examples below)
+pip install synapsekit[cache,redis,otel]
+```
+
+---
+
+## Performance model (where time goes)
+
+A typical SynapseKit request spends time in four buckets:
+
+```
+Total latency = retrieval + generation + tool calls + framework overhead
+```
+
+SynapseKit’s own overhead is usually sub-millisecond. The [Benchmarks](../benchmarks) page shows representative numbers:
+
+| Area | Metric | Baseline (overhead only) |
+|---|---|---|
+| LLM | `generate()` (OpenAI) | +0.3 ms |
+| LLM | `stream()` first token | +0.4 ms |
+| RAG | `RAGPipeline.add()` (100 chunks, InMemory) | ~12 ms |
+| RAG | `RAGPipeline.aquery()` retrieval (InMemory, k=5) | +1.2 ms |
+| Graph | 5-node parallel graph overhead | +1.1 ms |
+| Memory | In-memory read/write | <0.1 ms |
+| Memory | Redis read/write | ~1.0–1.5 ms |
+
+**Implication:** Don’t obsess over micro-optimizing the framework. **Optimize LLM calls, retrieval, and parallelization.**
+
+---
+
+# 1) Concurrency knobs
+
+## 1.1 Use async end-to-end
+
+If you’re calling multiple independent LLMs or pipelines, run them concurrently. Use `asyncio.gather()` and cap concurrency with a semaphore.
+
+```python
+import asyncio
+from synapsekit import LLMConfig
+from synapsekit.llm.openai import OpenAILLM
+
+llm = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-..."))
+
+sem = asyncio.Semaphore(8)  # cap concurrency to avoid rate limits
+
+async def guarded_generate(prompt: str) -> str:
+    async with sem:
+        return await llm.generate(prompt)
+
+async def main():
+    prompts = [f"Question {i}" for i in range(20)]
+    results = await asyncio.gather(*[guarded_generate(p) for p in prompts])
+    print(results[:2])
+
+asyncio.run(main())
+```
+
+**Benchmark note:** For parallelizable work, wall time approaches the slowest task, not the sum. Use this to reduce P99 latency for fan-out workflows.
+
+---
+
+## 1.2 Parallel graph nodes (fan-out / fan-in)
+
+Use graph parallelism when multiple steps don’t depend on each other. SynapseKit supports this directly via `add_parallel_edges()` and `add_join_edge()`.
+
+```python
+from dataclasses import dataclass
+from synapsekit.graph import StateGraph
+from synapsekit.llm.openai import OpenAILLM
+from synapsekit import LLMConfig
+
+llm = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-..."))
+
+@dataclass
+class State:
+    text: str
+    summary: str = ""
+    sentiment: str = ""
+    keywords: str = ""
+
+async def summarize(state: State) -> State:
+    state.summary = await llm.generate(f"Summarize: {state.text}")
+    return state
+
+async def sentiment(state: State) -> State:
+    state.sentiment = await llm.generate(f"Sentiment: {state.text}")
+    return state
+
+async def keywords(state: State) -> State:
+    state.keywords = await llm.generate(f"Keywords: {state.text}")
+    return state
+
+async def merge(state: State) -> State:
+    # Merge happens after all parallel branches finish
+    return state
+
+graph = StateGraph(State)
+graph.add_node("summarize", summarize)
+graph.add_node("sentiment", sentiment)
+graph.add_node("keywords", keywords)
+graph.add_node("merge", merge)
+graph.set_entry_point("summarize")
+graph.add_parallel_edges("summarize", ["sentiment", "keywords"])
+graph.add_join_edge(["summarize", "sentiment", "keywords"], "merge")
+compiled = graph.compile()
+```
+
+**Benchmark note:** The [Benchmarks](../benchmarks) page shows a **5-node parallel graph adds ~1.1 ms overhead**, essentially free compared to LLM latency.
+
+---
+
+## 1.3 Parallel agent execution
+
+If you have multiple specialist agents (analysis, retrieval, summarization), fan them out in parallel. See [Parallel Agent Execution](../guides/multi-agent/parallel-agent-execution) for a complete pattern.
+
+```python
+# Quick sketch
+results = await asyncio.gather(
+    agent_a.run("task A"),
+    agent_b.run("task B"),
+    agent_c.run("task C"),
+)
+```
+
+---
+
+## 1.4 Backpressure: rate limiting and retries
+
+Concurrency without rate limits can hurt performance due to provider throttling. Use `requests_per_minute` and retries in `LLMConfig`.
+
+```python
+llm = OpenAILLM(LLMConfig(
+    model="gpt-4o-mini",
+    api_key="sk-...",
+    max_retries=3,
+    retry_delay=0.5,
+    requests_per_minute=60,
+))
+```
+
+This prevents 429 storms and smooths throughput during spikes.
+
+---
+
+## 1.5 Streaming = better perceived latency
+
+Even if total latency is unchanged, streaming reduces **time-to-first-token**. This often matters more to users than absolute completion time.
+
+```python
+async for token in llm.stream("Explain RAG in one sentence"):
+    print(token, end="", flush=True)
+```
+
+---
+
+# 2) Caching strategies
+
+## 2.1 Exact-match response caching (LLMConfig)
+
+Enable built-in LRU caching for identical prompts. It’s the fastest win for repeated questions or deterministic pipelines.
+
+```python
+llm = OpenAILLM(LLMConfig(
+    model="gpt-4o-mini",
+    api_key="sk-...",
+    cache=True,
+    cache_maxsize=256,
+))
+
+# First call hits API
+await llm.generate("What is SynapseKit?")
+
+# Second call hits cache
+await llm.generate("What is SynapseKit?")
+```
+
+**Benchmark note:** In-memory reads/writes are **<0.1 ms** in the benchmarks, so cache hits are effectively instant compared to network calls.
+
+---
+
+## 2.2 Persistent caches (SQLite / filesystem / Redis)
+
+Use a persistent backend when you want cache hits across restarts or multiple workers.
+
+```python
+llm = OpenAILLM(LLMConfig(
+    model="gpt-4o-mini",
+    api_key="sk-...",
+    cache=True,
+    cache_backend="sqlite",
+    cache_db_path="llm_cache.db",
+))
+```
+
+| Backend | Best for | Latency (from benchmarks) |
+|---|---|---|
+| `memory` | single-process dev | <0.1 ms |
+| `sqlite` | local persistence | ~1.2 ms read/write |
+| `filesystem` | simplest disk cache | depends on disk I/O |
+| `redis` | shared cache across workers | ~1.0–1.5 ms |
+
+---
+
+## 2.3 Semantic caching (paraphrase hits)
+
+Exact-match caching misses paraphrases. Semantic caching returns a cached response when similarity exceeds a threshold.
+
+```python
+from synapsekit import LLMConfig
+from synapsekit.llm.openai import OpenAILLM
+from synapsekit.cache import SQLiteCache
+
+semantic_cache = SQLiteCache(
+    path="./llm_cache.db",
+    similarity_threshold=0.92,
+    max_entries=10_000,
+    ttl_seconds=86_400,
+)
+
+llm = OpenAILLM(LLMConfig(
+    model="gpt-4o-mini",
+    api_key="sk-...",
+    cache_backend=semantic_cache,
+))
+```
+
+**When to use:** FAQ-style apps, support bots, internal knowledge bases. **When not to use:** real-time data or questions where stale answers are unacceptable.
+
+---
+
+## 2.4 RAG persistence (skip re-embedding)
+
+Embedding is often the slowest *offline* step. Save your vector store once and reload on startup.
+
+```python
+from synapsekit import RAG, RAGConfig, InMemoryVectorStore, SynapsekitEmbeddings, LLMConfig
+from synapsekit.llm.openai import OpenAILLM
+
+llm = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-..."))
+store = InMemoryVectorStore(SynapsekitEmbeddings())
+rag = RAG(RAGConfig(llm=llm, vector_store=store, k=5))
+
+# Ingest once
+await rag.aadd(["Doc 1", "Doc 2"])
+rag.save("./index.npz")
+
+# Next boot
+rag.load("./index.npz")
+```
+
+---
+
+## 2.5 Measure cache effectiveness
+
+Track hit rates to confirm caching helps.
+
+```python
+await llm.generate("What is SynapseKit?")
+await llm.generate("What is SynapseKit?")
+print(llm.cache_stats)  # {"hits": 1, "misses": 1, "size": 1}
+```
+
+If hits are low, caching may be the wrong lever; improve prompt reuse or add semantic caching.
+
+---
+
+# 3) Batch APIs
+
+## 3.1 Batch document ingestion
+
+`RAG.add()` and `RAG.aadd()` accept a list of documents. This avoids per-document overhead and lets embeddings run in larger batches.
+
+```python
+await rag.aadd(
+    [
+        "Doc 1 text",
+        "Doc 2 text",
+        "Doc 3 text",
+    ],
+    metadata=[
+        {"source": "doc1"},
+        {"source": "doc2"},
+        {"source": "doc3"},
+    ],
+)
+```
+
+**Benchmark note:** InMemory RAG adds **100 chunks in ~12 ms** (see [Benchmarks](../benchmarks)).
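When a corpus is too large to hand over in a single call, the list-based ingestion shown above can be fed in fixed-size slices. The following is a minimal sketch and not part of SynapseKit: `chunked` and `ingest_in_batches` are hypothetical helpers, and `add_batch` stands in for any async callable that takes a list of texts plus a matching list of metadata dicts. A stub is used here so the example runs without a vector store.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def ingest_in_batches(
    texts: Sequence[str],
    metadata: Sequence[dict],
    add_batch: Callable[[list, list], Awaitable[None]],
    batch_size: int = 100,
) -> int:
    """Call `add_batch` once per slice and return the number of calls made."""
    if len(texts) != len(metadata):
        raise ValueError("texts and metadata must have the same length")
    calls = 0
    for text_slice, meta_slice in zip(
        chunked(texts, batch_size), chunked(metadata, batch_size)
    ):
        await add_batch(list(text_slice), list(meta_slice))
        calls += 1
    return calls


# Stand-in for a real ingestion call (assumption); it only records batch sizes.
seen_batch_sizes: list = []


async def fake_add(texts: list, metadata: list) -> None:
    seen_batch_sizes.append(len(texts))


docs = [f"Doc {i} text" for i in range(250)]
meta = [{"source": f"doc{i}"} for i in range(250)]
calls = asyncio.run(ingest_in_batches(docs, meta, fake_add, batch_size=100))
print(calls, seen_batch_sizes)  # 3 [100, 100, 50]
```

Slicing texts and metadata with the same helper keeps the two lists aligned, which matters when each metadata dict must describe the document at the same index.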
+
+---
+
+## 3.2 Batch embeddings
+
+The embeddings backend already supports lists; take advantage of it when you control ingestion pipelines.
+
+```python
+from synapsekit import SynapsekitEmbeddings
+
+embeddings = SynapsekitEmbeddings()
+vecs = await embeddings.embed([
+    "first text",
+    "second text",
+    "third text",
+])
+print(vecs.shape)  # (3, D)
+```
+
+Batching reduces Python overhead and maximizes vectorized compute.
+
+---
+
+## 3.3 Batch evaluation
+
+Evaluation supports concurrent batches so offline quality checks don’t take hours.
+
+```python
+from synapsekit.evaluation import EvaluationPipeline, FaithfulnessMetric, RelevancyMetric
+from synapsekit.llm.openai import OpenAILLM
+from synapsekit import LLMConfig
+
+judge_llm = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-..."))
+pipeline = EvaluationPipeline(
+    metrics=[
+        FaithfulnessMetric(llm=judge_llm),
+        RelevancyMetric(llm=judge_llm),
+    ]
+)
+results = await pipeline.evaluate_batch(samples, concurrency=8)
+```
+
+---
+
+## 3.4 Manual batch with `asyncio.gather()`
+
+If you’re processing multiple queries and there isn’t a dedicated batch API, treat a batch as a **concurrency window**.
+
+```python
+async def batch_generate(prompts: list[str], batch_size: int = 8):
+    out = []
+    for i in range(0, len(prompts), batch_size):
+        chunk = prompts[i : i + batch_size]
+        out.extend(await asyncio.gather(*[llm.generate(p) for p in chunk]))
+    return out
+```
+
+This approach keeps memory and rate limits under control while still giving you throughput gains.
+
+---
+
+# 4) Profiling cookbook
+
+## 4.1 Start with coarse timers
+
+Measure retrieval vs generation first; don’t guess.
+
+```python
+import time
+
+q = "What is SynapseKit?"
+
+# Retrieval-only
+start = time.perf_counter()
+contexts = await rag.get_relevant_documents(q, k=5)
+retrieval_ms = (time.perf_counter() - start) * 1000
+
+# Full query
+start = time.perf_counter()
+answer = await rag.aquery(q)
+full_ms = (time.perf_counter() - start) * 1000
+
+print(f"retrieval_ms={retrieval_ms:.1f} full_ms={full_ms:.1f}")
+```
+
+---
+
+## 4.2 Trace every LLM call
+
+Use `TracingMiddleware` to get spans with timings and metadata.
+
+```python
+from synapsekit import TracingMiddleware, TracingUI
+
+middleware = TracingMiddleware()
+traced_llm = middleware.wrap(llm)
+
+await traced_llm.generate("Hello!")
+
+# Save a local HTML trace viewer
+TracingUI(middleware.spans).save("traces.html")
+```
+
+---
+
+## 4.3 Token and cost tracking
+
+Token counts correlate strongly with latency and cost.
+
+```python
+from synapsekit.observability import TokenTracer
+
+tracer = TokenTracer(llm)
+await tracer.generate("Explain vector databases")
+print(tracer.tokens_used)
+```
+
+Use [CostTracker](../observability/cost-tracker) when you need dollar attribution.
+
+---
+
+## 4.4 Benchmark harness template
+
+Run controlled microbenchmarks with warm-up and consistent payloads.
+
+```python
+import asyncio, time
+
+async def bench(fn, n=20, warmup=3):
+    for _ in range(warmup):
+        await fn()
+    start = time.perf_counter()
+    for _ in range(n):
+        await fn()
+    return (time.perf_counter() - start) * 1000 / n
+
+async def llm_call():
+    await llm.generate("Summarize RAG in one sentence")
+
+ms = await bench(llm_call)
+print(f"avg_ms={ms:.1f}")
+```
+
+Use this harness to compare **before/after** when you change caching, parallelism, or prompt size.
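The harness above reports only an average, which can hide tail latency. As a hedged variation, the sketch below records each call separately and derives P50/P95/P99 with the standard-library `statistics.quantiles`. The names `bench_percentiles` and `fake_call` are illustrative assumptions, and the sleep is a placeholder for a real awaitable such as an LLM call.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable


async def bench_percentiles(
    fn: Callable[[], Awaitable[object]],
    n: int = 50,
    warmup: int = 3,
) -> dict:
    """Time `fn` n times and return P50/P95/P99 and mean latency in ms."""
    for _ in range(warmup):
        await fn()
    samples_ms = []
    for _ in range(n):
        start = time.perf_counter()
        await fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "mean": statistics.fmean(samples_ms),
    }


async def fake_call() -> None:
    # Placeholder workload (assumption); swap in the call you want to measure.
    await asyncio.sleep(0.001)


stats = asyncio.run(bench_percentiles(fake_call, n=30))
print({name: round(value, 2) for name, value in stats.items()})
```

With small sample counts the upper percentiles are interpolated from very few points, so a larger `n` gives a more trustworthy P99.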
+
+---
+
+## 4.5 Profiling checklist
+
+- [ ] Measure **P50 / P95 / P99** latencies (not just averages)
+- [ ] Split **retrieval vs generation** time
+- [ ] Track **tokens in/out** for every call
+- [ ] Enable **cache stats** and verify hit rate
+- [ ] Cap concurrency to avoid 429s
+- [ ] Use streaming when perceived latency matters
+
+---
+
+# Next steps
+
+- [Benchmarks](../benchmarks): official baseline numbers
+- [Streaming](./streaming): reduce time-to-first-token
+- [Parallel Agent Execution](../guides/multi-agent/parallel-agent-execution): fan-out patterns
+- [Caching & Retries](../llms/caching-retries): caching knobs and rate limits
+- [Observability](../observability/overview): tracing and cost tracking
diff --git a/sidebars.ts b/sidebars.ts
index cffdf9538..9519afbc1 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -277,6 +277,7 @@ const sidebars: SidebarsConfig = {
         'how-to/error-handling',
         'how-to/testing',
         'how-to/production',
+        'how-to/performance-tuning',
         'how-to/migrate-from-langchain',
       ],
     },