fix(llm): strip Vertex Gemini thought signatures from archival history#3581
fix(llm): strip Vertex Gemini thought signatures from archival history#3581juanmichelini wants to merge 2 commits into
Conversation
LiteLLM smuggles Vertex Gemini's `thoughtSignature` blob through the OpenAI-shaped `tool_call.id` field as `call_<hex>__thought__<base64>`. The signature is only required on the immediately-following tool-result turn so the model can resume reasoning; on every later turn it is dead weight that gets re-shipped in every prompt. On a real swe-bench-verified instance (`django__django-11999`, $11.11 with gemini-3.5-flash + reasoning_effort=high) the cumulative tool_call_id payload was 1.21 MB; only 3.5 KB of that is the actual canonical id. The remaining ~1.2 MB is the same signatures replayed across 47 turns. This commit adds a post-processing pass at the bottom of `events_to_messages` that: * Identifies the tool_call ids on the most recent assistant turn that has tool calls. * Strips the `__thought__<blob>` suffix from every other assistant `tool_call.id` and every matching `tool` message `tool_call_id`, so the paired ids stay consistent. * Is a no-op for Anthropic, OpenAI, and ACP ids that do not contain the `__thought__` marker. The pass mutates only the produced `Message` objects (via `MessageToolCall.model_copy(update=...)` and a plain string reassignment on `tool_call_id`); the underlying `ActionEvent` / `ObservationEvent` data is untouched, so on-disk event logs preserve the signatures. Tests added: * `tests/sdk/llm/test_thought_signature.py` — unit tests for the classifier and stripping helpers. * `tests/sdk/event/test_events_to_messages.py::TestThoughtSignatureStripping` — five integration tests covering older-turn stripping, paired consistency, source-event immutability, the non-Gemini no-op case, and parallel tool calls within the most-recent assistant turn. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Ooh wow…! Sometimes I think we might be happier to implement Claude and Gemini native APIs, and maybe use liteLLM only for openai-compatible providers… On a different note, I think they should have been cached anyway, and maybe @VascoSch92 ’s fix on that addresses a lot of the problem. |
|
@enyst interesting take! |
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
The SDK behavior change works as intended: archival Gemini __thought__ signatures are stripped from older history while the latest tool-call turn, matching tool results, source events, parallel calls, and non-Gemini IDs remain correct.
Does this PR achieve its stated goal?
Yes. I exercised the SDK as a library user by constructing real ActionEvent/ObservationEvent histories and calling LLMConvertibleEvent.events_to_messages(). On main, all 6 synthetic Gemini tool-call IDs were re-emitted with __thought__ blobs (message_tool_id_bytes=120288); on this PR, only the latest assistant/tool pair kept signatures (message_marker_count=2, message_tool_id_bytes=40244), earlier assistant/tool pairs stayed consistent after stripping, and the original event log still retained the full first ID.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully and installed editable SDK packages via uv sync --dev. |
| CI Status | Validate PR description is failing and qa-changes is still in progress at review time. |
| Functional Verification | ✅ Before/after SDK execution confirms the claimed stripping behavior and no-op behavior for plain IDs. |
Functional Verification
Test 1: Archival Gemini signatures are stripped only after the immediate tool-result turn
Step 1 — Reproduce baseline without the fix:
Checked out origin/main and ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_thought_signature.py, which constructs three real SDK action/observation pairs with large call_*__thought__... IDs and converts them through LLMConvertibleEvent.events_to_messages():
{
"archival_stripping": {
"assistant_markers_by_turn": [true, true, true],
"message_marker_count": 6,
"message_tool_id_bytes": 120288,
"pairs_consistent": [true, true, true],
"raw_tool_id_bytes": 120288,
"source_event_first_kept_full": true,
"tool_markers_by_turn": [true, true, true]
},
"parallel_latest_turn": {
"latest_parallel_markers": [true, true],
"latest_parallel_pairs_consistent": [true, true],
"old_turn_stripped": false
},
"non_gemini_ids_unchanged": {"unchanged": true, "marker_count": 0}
}This confirms the pre-fix problem: every previous assistant/tool message re-ships the thought signature, and message ID bytes equal the raw event-log ID bytes.
Step 2 — Apply the PR's changes:
Checked out f5efa635ab520c11155a3ee82629330be0f60452.
Step 3 — Re-run with the fix in place:
Ran the same command on the PR commit:
{
"archival_stripping": {
"assistant_markers_by_turn": [false, false, true],
"message_marker_count": 2,
"message_tool_id_bytes": 40244,
"pairs_consistent": [true, true, true],
"raw_tool_id_bytes": 120288,
"source_event_first_kept_full": true,
"tool_markers_by_turn": [false, false, true]
},
"parallel_latest_turn": {
"latest_parallel_markers": [true, true],
"latest_parallel_pairs_consistent": [true, true],
"old_turn_stripped": true
},
"non_gemini_ids_unchanged": {"unchanged": true, "marker_count": 0}
}This shows the fix works: older assistant and tool-result IDs are stripped together, the latest turn keeps signatures, source events are not mutated, and prompt-history ID bytes dropped from 120,288 to 40,244 in this reproduction.
Test 2: Related behavior remains intact
The same script also verified two side paths on the PR commit: a latest assistant message with two parallel tool calls kept both signatures and both tool results matched their assistant IDs, while plain call_plain_* IDs without __thought__ were byte-for-byte unchanged.
Issues Found
- 🟡 Minor: CI is not fully green at review time because
PR Description Check / Validate PR descriptionis failing andqa-changesis still in progress. I did not inspect or edit the human-only PR description fields.
This review was created by an AI agent (OpenHands) on behalf of the user.
Final verdict: PASS WITH ISSUES
Why
While investigating the high cost of running
swebenchonlitellm_proxy/gemini-3.5-flash(mean $2.81/instance on a slice where all 10 instances resolved; one instance at $11.11), I pulled the conversation event logs and found the dominant cost driver is not iteration count, not the condenser (which never fired on this run), and not the agent flailing. It is Vertex Gemini'sthoughtSignatureblob being re-shipped in every subsequent prompt.Mechanism
When Vertex Gemini is used with
reasoning_effort, the provider returns athoughtSignaturefield on each function-calling turn that encodes the model's internal reasoning state. The signature must be passed back on the immediately following tool-result turn so the model can resume. LiteLLM smuggles it through the OpenAI-shapedtool_call.idas:The SDK currently stores those ids verbatim on
ActionEventandObservationEvent, so whenevents_to_messagesbuilds the history for the next LLM call, every prior signature is re-serialised into every prompt — once in the assistant message'stool_callsand once in the matchingtoolresult'stool_call_id.Empirical impact
Decompressed
results.tar.gzfrom a real eval run and replayed the events fromdjango__django-11999:The accumulated
Metrics.usage_to_metrics["default"]for that instance reports 5,063,835 prompt tokens with only 26 % cache hit rate and 0 cache writes — i.e. 74 % of those tokens are billed at $1.50/M uncached. Re-shipping 1.2 MB of dead signatures across 47 turns is roughly half of that prompt bill.The same pattern (smaller magnitudes) holds across the other 9 instances on that run — every Gemini turn that uses reasoning is affected.
What this PR does
Adds a post-processing pass at the bottom of
LLMConvertibleEvent.events_to_messages:tool_calls, and records those ids as "kept".tool_call.idand everytoolmessagetool_call_idnot in the kept set, strips everything from the literal marker__thought__onwards.MessageToolCallviamodel_copy(update={"id": ...})so the underlyingActionEvent.tool_callis untouched — the on-disk event log still has the full signature for forensic / replay use.The marker check (
__thought__substring) is a no-op for Anthropictoolu_*, OpenAIcall_*without signatures, ACP ids, and anything else that doesn't carry the marker.Files
openhands-sdk/openhands/sdk/llm/utils/thought_signature.py—THOUGHT_SIGNATURE_MARKER,has_thought_signature(id),strip_thought_signature(id).openhands-sdk/openhands/sdk/event/base.py— adds_strip_archival_thought_signatures(messages)and calls it at the end ofevents_to_messages.tests/sdk/llm/test_thought_signature.py— 13 unit tests for the classifier and stripper (Gemini ids, OpenAI ids, Anthropic ids, empty/None, the 278 KB pathological case, idempotence, multiple markers).tests/sdk/event/test_events_to_messages.py— 5 new integration tests inTestThoughtSignatureStripping:tool_call.id== tooltool_call_id).ActionEvent.tool_call.idis unchanged after conversion.Test plan
Also did the real-world replay above as a sanity check.
Scope / what this does not fix
__thought__<blob>suffixes; it does not change how Vertex prompt caching works.cache_write_tokens=0and the 26 % implicit-cache hit rate are a separate problem and need a follow-up to wire actualCachedContent.createexplicit caching for Vertex.reasoning_effortdefaults. Lowering it forgemini-3.5-flashis a separate model-config change.This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, following an investigation triggered by the cost analysis in OpenHands/benchmarks#741.
@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:f5efa63-pythonRun
All tags pushed for this build
About Multi-Architecture Support
f5efa63-python) is a multi-arch manifest supporting both amd64 and arm64f5efa63-python-amd64) are also available if needed