Switching to xai-sdk-python would directly solve (or massively improve) your caching problems.
Current State in Your Repo (Confirmed)
From the code:
You have almost no response-level caching.
The only cache is a tiny StructureLoader that memoizes parsed YAML files based on directory mtime (just avoids re-reading disk on every request).
Every single request to xAI goes through raw httpx (or equivalent) with the full conversation history + enriched tool schemas + system prompt.
No use of previous_response_id, no store, no conversation ID header, no client-side response caching.
This is why you're "not caching much" — you're leaving xAI's built-in caching mechanisms on the table and manually rebuilding everything every turn.
How the Official xAI SDK Fixes This
The official SDK (xai-sdk-python) is the recommended way to talk to the Responses API. Here's exactly how it helps with caching:
Caching Problem | Current (raw httpx) | With xai-sdk-python + Responses API | Impact
-- | -- | -- | --
Full history on every turn | You resend everything every time | Use previous_response_id + store=True → xAI serves from server-side cache | Massive token + latency savings
Repeated prefixes (system + tools) | No automatic reuse | xAI's automatic prompt caching kicks in (same starting messages = cache hit) | Faster TTFT + lower cost
Cache key generation | Fragile (manual dicts, translations) | Clean Pydantic models → trivial, reliable cache keys | Easy to add client-side cache
Stateful agent sessions | You manage everything | SDK + Responses API handles state natively | Much simpler bridge code
Connection / retry overhead | Manual httpx handling | gRPC with built-in retries, timeouts, multiplexing | Fewer wasted calls
1. Server-Side Conversation Caching (Biggest Win)
The Responses API supports:
store: true (default) → xAI stores the full response + context server-side for 30 days.
previous_response_id → On the next turn you only send the new input + this ID. xAI reconstructs the conversation from cache.
This is perfect for Claude-Code-style agent loops (many turns, long context). Your current bridge sends the entire enriched history every single time.
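To put rough numbers on that, here is a back-of-the-envelope count of what the bridge transmits over one session. The sizes (4,000-token system prompt plus tool schemas, 500 new tokens per turn, 20 turns) are made up for illustration, and the function names are hypothetical. Whether the prefix still has to be resent alongside previous_response_id is treated as an open assumption, so both cases are shown. This counts request payload only; how stored context is billed is a separate question.

```python
def tokens_sent_full_history(prefix: int, per_turn: int, turns: int) -> int:
    """Every request carries the prefix plus all turns accumulated so far."""
    return sum(prefix + per_turn * t for t in range(1, turns + 1))


def tokens_sent_incremental(prefix: int, per_turn: int, turns: int,
                            resend_prefix: bool) -> int:
    """Each request carries only the new turn; the prefix is sent either
    once (resend_prefix=False) or on every request (resend_prefix=True)."""
    if resend_prefix:
        return turns * (prefix + per_turn)
    return prefix + per_turn * turns


# Illustrative sizes only, not measurements from the repo.
PREFIX, PER_TURN, TURNS = 4000, 500, 20

full = tokens_sent_full_history(PREFIX, PER_TURN, TURNS)
best = tokens_sent_incremental(PREFIX, PER_TURN, TURNS, resend_prefix=False)
worst = tokens_sent_incremental(PREFIX, PER_TURN, TURNS, resend_prefix=True)

print(full, best, worst)  # 185000 14000 90000
```

The full-history cost grows quadratically with the number of turns, while the incremental cost grows linearly in both variants.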
The SDK makes this trivial:
response = await client.responses.create(
    model="grok-4.20-reasoning-latest",
    input=...,  # only the new turn
    previous_response_id=last_xai_response_id,
    store=True,
    # ... other parameters
)
You just keep a mapping anthropic_conv_id → xai_response_id in your bridge and you're done.
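A minimal sketch of that mapping, assuming an in-memory dict is good enough for a single bridge process. `ResponseChain` and `build_request` are hypothetical names, not part of any SDK; the request keys mirror the parameters named above (`input`, `previous_response_id`, `store`).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ResponseChain:
    """Remembers the latest xAI response ID for each upstream conversation."""
    _latest: Dict[str, str] = field(default_factory=dict)

    def previous_id(self, conv_id: str) -> Optional[str]:
        return self._latest.get(conv_id)

    def record(self, conv_id: str, response_id: str) -> None:
        self._latest[conv_id] = response_id

    def forget(self, conv_id: str) -> None:
        # Call this when the stored chain can no longer be trusted.
        self._latest.pop(conv_id, None)


def build_request(chain: ResponseChain, conv_id: str,
                  new_input: List[Any], full_history: List[Any]) -> Dict[str, Any]:
    """Send only the new turn when a previous response ID is known,
    otherwise fall back to the full history."""
    prev = chain.previous_id(conv_id)
    if prev is None:
        return {"input": full_history, "store": True}
    return {"input": new_input, "previous_response_id": prev, "store": True}


chain = ResponseChain()
first = build_request(chain, "conv-1", ["turn 1"], ["turn 1"])
chain.record("conv-1", "resp_abc")  # ID taken from the first response
second = build_request(chain, "conv-1", ["turn 2"], ["turn 1", "reply 1", "turn 2"])
```

The fallback path matters in practice: if the stored response has expired, or the upstream client has rewritten or compacted its history so it no longer matches what was stored, calling `forget` and resending the full history once keeps the two sides consistent.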
2. Automatic Prompt Caching
xAI automatically caches repeated prefixes (your system prompt + tool schemas are ideal candidates).
To maximize hits, set the x-grok-conv-id header (or let the SDK manage conversation IDs).
The SDK handles headers cleanly — no more manual httpx header dicts.
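Prefix caching only helps if the prefix really is identical from one request to the next, so it is worth checking that the enrichment layer does not reorder tool schemas or dict keys between turns. The sketch below fingerprints the prefix using canonical JSON and derives a stable value for the x-grok-conv-id header. The helper names are hypothetical, and using a UUID as the header value is an assumption, not something the text above specifies.

```python
import hashlib
import json
import uuid
from typing import Any, Dict, List


def canonical_tools(tools: List[Dict[str, Any]]) -> str:
    """Serialize tool schemas deterministically: tools sorted by name,
    keys sorted, no insignificant whitespace."""
    ordered = sorted(tools, key=lambda t: t.get("name", ""))
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(system_prompt: str, tools: List[Dict[str, Any]]) -> str:
    """Hash of the cacheable prefix; log it per request to spot drift."""
    payload = system_prompt + "\n" + canonical_tools(tools)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def conv_headers(anthropic_conv_id: str) -> Dict[str, str]:
    """uuid5 is deterministic, so one conversation always maps to one value."""
    conv = uuid.uuid5(uuid.NAMESPACE_URL, "bridge-conv:" + anthropic_conv_id)
    return {"x-grok-conv-id": str(conv)}
```

If the fingerprint changes between two turns of the same conversation, something in the translation layer is perturbing the prefix and cache hits are likely being lost.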
3. Much Easier Client-Side Caching
Because the SDK returns typed Pydantic models, you can safely do things like:
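import hashlib
import json

_cache = {}  # cache key -> response

def make_cache_key(request):
    # canonical JSON (sorted keys) so equal requests always hash the same
    payload = json.dumps(request, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

async def cached_response(**request):
    # functools.lru_cache is not used here: on an async function it would
    # cache the coroutine object, which can only be awaited once
    key = make_cache_key(request)
    if key not in _cache:
        _cache[key] = await client.responses.create(**request)
    return _cache[key]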
This is extremely hard to do reliably with raw httpx + your translation layer.
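For a version that can be run without network access, here is a small bounded async cache exercised against a stand-in client. `AsyncResponseCache` and `fake_create` are hypothetical; `fake_create` only imitates the shape of an async create call and is not the SDK. Results are stored after awaiting because a coroutine object can be awaited only once, so a memoizer that caches the un-awaited call (as `functools.lru_cache` would on an `async def`) fails on the second hit.

```python
import asyncio
import hashlib
import json
from collections import OrderedDict
from typing import Any, Awaitable, Callable, Dict


class AsyncResponseCache:
    """Bounded LRU cache for an async request function."""

    def __init__(self, fetch: Callable[..., Awaitable[Any]], maxsize: int = 128):
        self._fetch = fetch
        self._maxsize = maxsize
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    @staticmethod
    def key_for(request: Dict[str, Any]) -> str:
        # Sorted keys make the hash independent of dict ordering.
        payload = json.dumps(request, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    async def get(self, request: Dict[str, Any]) -> Any:
        key = self.key_for(request)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        result = await self._fetch(**request)
        self._entries[key] = result
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return result


calls = []  # records every request that reaches the stand-in backend


async def fake_create(**request: Any) -> Dict[str, Any]:
    calls.append(request)
    return {"id": f"resp_{len(calls)}", "echo": request["input"]}


async def demo():
    cache = AsyncResponseCache(fake_create, maxsize=2)
    a = await cache.get({"model": "m", "input": "hi"})
    b = await cache.get({"input": "hi", "model": "m"})  # same request, keys reordered
    return a, b


first, second = asyncio.run(demo())
```

Two caveats with this sketch: concurrent identical requests can both miss before either result is stored, and replaying a cached answer is only appropriate where returning the same model output for the same input is actually wanted.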
4. Other Practical Benefits
gRPC under the hood → better connection reuse and lower latency than repeated httpx calls.
Built-in retries with backoff (fewer transient failures that would otherwise miss cache).
Cleaner separation: your enrichment/translation logic sits on top of a proper client instead of raw HTTP.