SDK Caching #78

@vantasnerdan

Switching to xai-sdk-python would directly solve (or substantially improve) your caching problems.
Current State in Your Repo (Confirmed)
From the code:

You have almost no response-level caching.
The only cache is a tiny StructureLoader that memoizes parsed YAML files based on directory mtime (just avoids re-reading disk on every request).
Every single request to xAI goes through raw httpx (or equivalent) with the full conversation history + enriched tool schemas + system prompt.
No use of previous_response_id, no store, no conversation ID header, no client-side response caching.

This is why you're "not caching much" — you're leaving xAI's built-in caching mechanisms on the table and manually rebuilding everything every turn.
How the Official xAI SDK Fixes This
The official SDK (xai-sdk-python) is the recommended way to talk to the Responses API. Here's exactly how it helps with caching:

Caching Problem | Current (raw httpx) | With xai-sdk-python + Responses API | Impact
-- | -- | -- | --
Full history on every turn | You resend everything every time | Use previous_response_id + store=True → xAI serves from server-side cache | Massive token + latency savings
Repeated prefixes (system + tools) | No automatic reuse | xAI's automatic prompt caching kicks in (same starting messages = cache hit) | Faster TTFT + lower cost
Cache key generation | Fragile (manual dicts, translations) | Clean Pydantic models → trivial, reliable cache keys | Easy to add client-side cache
Stateful agent sessions | You manage everything | SDK + Responses API handles state natively | Much simpler bridge code
Connection / retry overhead | Manual httpx handling | gRPC with built-in retries, timeouts, multiplexing | Fewer wasted calls

1. Server-Side Conversation Caching (Biggest Win)
The Responses API supports:

store: true (default) → xAI stores the full response + context server-side for 30 days.
previous_response_id → On the next turn you only send the new input + this ID. xAI reconstructs the conversation from cache.

This is perfect for Claude-Code-style agent loops (many turns, long context). Your current bridge sends the entire enriched history every single time.
The SDK makes this trivial:

response = await client.responses.create(
    model="grok-4.20-reasoning-latest",
    input=...,  # only the new turn
    previous_response_id=last_xai_response_id,
    store=True,
    # ... remaining request parameters
)

You just keep a mapping anthropic_conv_id → xai_response_id in your bridge and you're done.
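
The mapping described above can be sketched as a small in-memory store. This is a minimal sketch, not the bridge's actual code: `ResponseIdStore`, `remember`, and `build_request_kwargs` are hypothetical names, and the keyword arguments mirror the snippet above rather than a verified SDK signature.

```python
from typing import Any, Dict, Optional


class ResponseIdStore:
    """Tracks the latest xAI response ID per Anthropic conversation (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_id: Dict[str, str] = {}

    def get(self, anthropic_conv_id: str) -> Optional[str]:
        return self._last_id.get(anthropic_conv_id)

    def remember(self, anthropic_conv_id: str, xai_response_id: str) -> None:
        self._last_id[anthropic_conv_id] = xai_response_id


def build_request_kwargs(
    store: ResponseIdStore,
    anthropic_conv_id: str,
    model: str,
    new_input: Any,
    full_history: Any,
) -> Dict[str, Any]:
    """Send only the new turn when a previous response ID is known;
    otherwise fall back to the full history (first turn of a session)."""
    kwargs: Dict[str, Any] = {"model": model, "store": True}
    previous = store.get(anthropic_conv_id)
    if previous is None:
        kwargs["input"] = full_history
    else:
        kwargs["input"] = new_input
        kwargs["previous_response_id"] = previous
    return kwargs
```

After each successful call the bridge would call `store.remember(conv_id, ...)` with the ID of the returned response. Since stored responses expire (30 days, per the note above), it is probably worth falling back to the full history if the API rejects a stale ID.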
2. Automatic Prompt Caching
xAI automatically caches repeated prefixes (your system prompt + tool schemas are ideal candidates).
To maximize hits, set the x-grok-conv-id header (or let the SDK manage conversation IDs).
The SDK handles headers cleanly — no more manual httpx header dicts.
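
If the bridge ends up setting the header itself, one way to keep it stable across turns is to derive it deterministically from the Anthropic conversation ID. A minimal sketch, assuming the header accepts any stable string; `BRIDGE_NAMESPACE` and `conv_id_headers` are hypothetical names:

```python
import uuid

# Hypothetical fixed namespace for this bridge; any constant UUID works.
BRIDGE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def conv_id_headers(anthropic_conv_id: str) -> dict:
    """Derive a deterministic conversation ID so every turn of the same
    session sends the same x-grok-conv-id value."""
    conv_id = uuid.uuid5(BRIDGE_NAMESPACE, anthropic_conv_id)
    return {"x-grok-conv-id": str(conv_id)}
```

Because `uuid5` is a pure function of its inputs, the header survives bridge restarts without any persisted state.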
3. Much Easier Client-Side Caching
Because the SDK returns typed Pydantic models, you can safely do things like:

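import hashlib
import json

_response_cache = {}  # cache key -> awaited response

def make_cache_key(req):
    # Stable hash of the fields that determine the response.
    payload = json.dumps(
        [req.model, req.instructions, req.input, req.tools],
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

async def cached_response(req):
    # functools.lru_cache would cache the coroutine object, which can be
    # awaited only once, so the awaited result is stored in a dict instead.
    key = make_cache_key(req)
    if key not in _response_cache:
        _response_cache[key] = await client.responses.create(...)
    return _response_cache[key]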

This is extremely hard to do reliably with raw httpx + your translation layer.
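
One caveat for a client-side cache around an async client: `functools.lru_cache` memoizes the coroutine object rather than its result, so a second lookup would try to await an already-consumed coroutine. A bounded cache for awaited results can be sketched as follows; `AsyncLRUCache` and `get_or_fetch` are hypothetical names, and the `fetch` callable stands in for whatever client call the bridge makes:

```python
from collections import OrderedDict
from typing import Any, Awaitable, Callable


class AsyncLRUCache:
    """Small LRU cache for awaited results (hypothetical helper)."""

    def __init__(self, maxsize: int = 128) -> None:
        self._maxsize = maxsize
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    async def get_or_fetch(
        self, key: str, fetch: Callable[[], Awaitable[Any]]
    ) -> Any:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = await fetch()
        self._entries[key] = value
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Two concurrent misses on the same key will both call `fetch` in this sketch. Replaying a cached model response also only makes sense where an identical answer to an identical request is acceptable.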
4. Other Practical Benefits

gRPC under the hood → better connection reuse and lower latency than repeated httpx calls.
Built-in retries with backoff (fewer transient failures that would otherwise miss cache).
Cleaner separation: your enrichment/translation logic sits on top of a proper client instead of raw HTTP.
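
For comparison, this is roughly the retry logic the bridge would have to hand-roll on top of raw HTTP. It is a generic sketch of retry with exponential backoff, not the SDK's implementation; `retry_with_backoff` and its parameters are hypothetical:

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 4,
    base_delay: float = 0.5,
) -> T:
    """Retry `call` on transient errors, doubling the delay each attempt."""
    for attempt in range(attempts):
        try:
            return await call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            # Exponential backoff plus jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```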
