SDK Caching

Switching to xai-sdk-python would directly solve (or massively improve) your caching problems.

Current State in Your Repo (Confirmed)
From the code:

You have almost no response-level caching.
The only cache is a tiny StructureLoader that memoizes parsed YAML files based on directory mtime (just avoids re-reading disk on every request).
Every single request to xAI goes through raw httpx (or equivalent) with the full conversation history + enriched tool schemas + system prompt.
No use of previous_response_id, no store, no conversation ID header, no client-side response caching.

This is why you're "not caching much": you're leaving xAI's built-in caching mechanisms on the table and manually rebuilding everything every turn.
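For contrast, the one cache that does exist works roughly like the sketch below. This is not the repo's actual StructureLoader: the class name `MtimeCachedLoader`, the injected `parse` callable, and the `load` method are assumptions made for illustration. It only shows the technique described above, which is to re-parse files when the directory's mtime changes.

```python
import os
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class MtimeCachedLoader:
    """Hypothetical stand-in for StructureLoader: re-parse only when the
    directory's mtime changes."""

    def __init__(self, directory: str, parse: Callable[[Path], Any]) -> None:
        self._dir = Path(directory)
        self._parse = parse  # e.g. a YAML parser; injected to keep this stdlib-only
        self._mtime_ns: Optional[int] = None
        self._data: Dict[str, Any] = {}

    def load(self) -> Dict[str, Any]:
        mtime_ns = os.stat(self._dir).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # First call, or the directory changed: re-read every file.
            self._data = {
                p.name: self._parse(p)
                for p in sorted(self._dir.iterdir())
                if p.is_file()
            }
            self._mtime_ns = mtime_ns
        return self._data
```

A cache like this saves disk reads only. It has no effect on what is sent to xAI, so every request still carries the full payload.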

How the Official xAI SDK Fixes This
The official SDK (xai-sdk-python) is the recommended way to talk to the Responses API. Here's how it helps with caching:
Caching Problem | Current (raw httpx) | With xai-sdk-python + Responses API | Impact
-- | -- | -- | --
Full history on every turn | You resend everything every time | Use previous_response_id + store=True → xAI serves from server-side cache | Massive token + latency savings
Repeated prefixes (system + tools) | No automatic reuse | xAI's automatic prompt caching kicks in (same starting messages = cache hit) | Faster TTFT + lower cost
Cache key generation | Fragile (manual dicts, translations) | Typed request/response models → simpler, more reliable cache keys | Easy to add client-side cache
Stateful agent sessions | You manage everything | SDK + Responses API handles state natively | Much simpler bridge code
Connection / retry overhead | Manual httpx handling | gRPC with built-in retries, timeouts, multiplexing | Fewer wasted calls

1. Server-Side Conversation Caching (Biggest Win)
The Responses API supports:

store: true (default) → xAI stores the full response + context server-side for 30 days.
previous_response_id → On the next turn you only send the new input + this ID. xAI reconstructs the conversation from cache.

This is perfect for Claude-Code-style agent loops (many turns, long context). Your current bridge sends the entire enriched history every single time.
The SDK makes this trivial:

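# Runs inside an async function; `client` and `new_turn` are placeholders.
response = await client.responses.create(
    model="grok-4.20-reasoning-latest",
    input=new_turn,      # only the new turn, not the full history
    previous_response_id=last_xai_response_id,
    store=True,          # keep the response server-side so it can be chained
    # ...remaining request options
)
last_xai_response_id = response.id  # chain the next turn from this response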

You just keep a mapping anthropic_conv_id → xai_response_id in your bridge and you're done.
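The mapping mentioned above could look something like the following minimal sketch. `ResponseChain`, `build_request`, `record`, and `forget` are hypothetical names, not part of any SDK. The only request fields used are `store` and `previous_response_id` from the text above. The sketch builds the keyword arguments for the create call and leaves the call itself to the bridge.

```python
from typing import Any, Dict, List


class ResponseChain:
    """Tracks the latest xAI response ID per bridge conversation (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_id: Dict[str, str] = {}  # anthropic_conv_id -> xai_response_id

    def build_request(
        self,
        conv_id: str,
        model: str,
        new_input: List[Dict[str, Any]],
        full_history: List[Dict[str, Any]],
    ) -> Dict[str, Any]:
        """Return kwargs for the create call: delta-only once a prior ID is known."""
        request: Dict[str, Any] = {"model": model, "store": True}
        prev = self._last_id.get(conv_id)
        if prev is None:
            request["input"] = full_history  # first turn: send everything
        else:
            request["input"] = new_input  # later turns: only the new items
            request["previous_response_id"] = prev
        return request

    def record(self, conv_id: str, response_id: str) -> None:
        """Remember the ID returned by the most recent call for this conversation."""
        self._last_id[conv_id] = response_id

    def forget(self, conv_id: str) -> None:
        """Drop the chain, e.g. if the stored response has expired or history was edited."""
        self._last_id.pop(conv_id, None)
```

After each successful call the bridge would call `record(conv_id, response.id)`. If a request with a stale ID fails, `forget(conv_id)` makes the next turn fall back to sending the full history.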

2. Automatic Prompt Caching
xAI automatically caches repeated prefixes (your system prompt + tool schemas are ideal candidates).
To maximize hits, set the x-grok-conv-id header (or let the SDK manage conversation IDs).
The SDK handles headers cleanly, so there are no more manual httpx header dicts.
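Since prefix caching depends on the starting messages being the same from turn to turn, it helps to make the system prompt and tool list deterministic before sending them. The sketch below is one way to do that, under two assumptions: each tool is a dict with a top-level `name` key, and the helper names (`canonical_tools`, `prefix_fingerprint`, `conv_headers`) are invented for illustration. The fingerprint is a local debugging aid, not something xAI consumes.

```python
import hashlib
import json
from typing import Any, Dict, List


def canonical_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort tools by name so the serialized prefix is identical across turns."""
    # Assumes a top-level "name" key on each tool dict.
    return sorted(tools, key=lambda t: t.get("name", ""))


def prefix_fingerprint(system_prompt: str, tools: List[Dict[str, Any]]) -> str:
    """Hash of the cacheable prefix; a change between turns suggests a cache miss."""
    payload = json.dumps(
        {"system": system_prompt, "tools": canonical_tools(tools)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def conv_headers(conv_id: str) -> Dict[str, str]:
    """Header named in the text above; use one stable value per conversation."""
    return {"x-grok-conv-id": conv_id}
```

Logging `prefix_fingerprint(...)` per turn makes it easy to spot enrichment steps that reorder tools or rewrite the system prompt and thereby defeat the cache.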

3. Much Easier Client-Side Caching
Because the SDK returns typed request and response objects, you can safely do things like:

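import hashlib
import json

_response_cache = {}  # cache key -> response; add eviction/TTL for real use

def make_cache_key(model, instructions, input_items, tools):
    # Stable hash of the parts that determine the response. Canonical JSON
    # (sorted keys) keeps the key independent of dict ordering.
    payload = json.dumps(
        {"model": model, "instructions": instructions,
         "input": input_items, "tools": tools},
        sort_keys=True, default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# functools.lru_cache does not work on async functions (it would cache the
# coroutine object, which can only be awaited once), so cache the result.
async def cached_response(client, **request):
    key = make_cache_key(request["model"], request.get("instructions"),
                         request["input"], request.get("tools"))
    if key not in _response_cache:
        _response_cache[key] = await client.responses.create(**request)
    return _response_cache[key]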

This is extremely hard to do reliably with raw httpx + your translation layer.

4. Other Practical Benefits
gRPC under the hood → better connection reuse and lower latency than repeated httpx calls.
Built-in retries with backoff (fewer transient failures that would otherwise miss cache).
Cleaner separation: your enrichment/translation logic sits on top of a proper client instead of raw HTTP.

SDK Caching #78