Skip to content

fix(llm): strip Vertex Gemini thought signatures from archival history#3581

Open
juanmichelini wants to merge 2 commits into
mainfrom
fix/strip-gemini-thought-signatures-from-history
Open

fix(llm): strip Vertex Gemini thought signatures from archival history#3581
juanmichelini wants to merge 2 commits into
mainfrom
fix/strip-gemini-thought-signatures-from-history

Conversation

@juanmichelini

@juanmichelini juanmichelini commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Why

While investigating the high cost of running swebench on litellm_proxy/gemini-3.5-flash (mean $2.81/instance on a slice where all 10 instances resolved; one instance at $11.11), I pulled the conversation event logs and found the dominant cost driver is not iteration count, not the condenser (which never fired on this run), and not the agent flailing. It is Vertex Gemini's thoughtSignature blob being re-shipped in every subsequent prompt.

Mechanism

When Vertex Gemini is used with reasoning_effort, the provider returns a thoughtSignature field on each function-calling turn that encodes the model's internal reasoning state. The signature must be passed back on the immediately following tool-result turn so the model can resume. LiteLLM smuggles it through the OpenAI-shaped tool_call.id as:

call_f0be918123f4462bb482dd9df123__thought__AY89a18oWjPi7IVOiw5FIMB22r9...

The SDK currently stores those ids verbatim on ActionEvent and ObservationEvent, so when events_to_messages builds the history for the next LLM call, every prior signature is re-serialised into every prompt — once in the assistant message's tool_calls and once in the matching tool result's tool_call_id.

Empirical impact

Decompressed results.tar.gz from a real eval run and replayed the events from django__django-11999:

actions: 47, observations: 47
raw tool_call_id bytes:      1,210,168
stripped tool_call_id bytes:     3,492
saved:                       1,206,676 (99.7%)

signatures larger than 1 KB: 14 actions / 14 observations
biggest tcid: 278,100 bytes

The accumulated Metrics.usage_to_metrics["default"] for that instance reports 5,063,835 prompt tokens with only 26 % cache hit rate and 0 cache writes — i.e. 74 % of those tokens are billed at $1.50/M uncached. Re-shipping 1.2 MB of dead signatures across 47 turns is roughly half of that prompt bill.

The same pattern (smaller magnitudes) holds across the other 9 instances on that run — every Gemini turn that uses reasoning is affected.

What this PR does

Adds a post-processing pass at the bottom of LLMConvertibleEvent.events_to_messages:

  1. Walks the produced messages from the end, finds the most recent assistant message that has tool_calls, and records those ids as "kept".
  2. For every other assistant tool_call.id and every tool message tool_call_id not in the kept set, strips everything from the literal marker __thought__ onwards.
  3. The pair stays consistent: assistant and matching tool result are both stripped, or both kept.
  4. Stripping creates a new MessageToolCall via model_copy(update={"id": ...}) so the underlying ActionEvent.tool_call is untouched — the on-disk event log still has the full signature for forensic / replay use.

The marker check (__thought__ substring) is a no-op for Anthropic toolu_*, OpenAI call_* without signatures, ACP ids, and anything else that doesn't carry the marker.

Files

  • New: openhands-sdk/openhands/sdk/llm/utils/thought_signature.pyTHOUGHT_SIGNATURE_MARKER, has_thought_signature(id), strip_thought_signature(id).
  • Modified: openhands-sdk/openhands/sdk/event/base.py — adds _strip_archival_thought_signatures(messages) and calls it at the end of events_to_messages.
  • New: tests/sdk/llm/test_thought_signature.py — 13 unit tests for the classifier and stripper (Gemini ids, OpenAI ids, Anthropic ids, empty/None, the 278 KB pathological case, idempotence, multiple markers).
  • Modified: tests/sdk/event/test_events_to_messages.py — 5 new integration tests in TestThoughtSignatureStripping:
    • Older turns get stripped, latest turn keeps its signature.
    • Stripped pairs stay consistent (assistant tool_call.id == tool tool_call_id).
    • Source ActionEvent.tool_call.id is unchanged after conversion.
    • No-op for ids without the marker (Anthropic / OpenAI shape).
    • Parallel tool calls within the most-recent assistant turn all keep their signatures.

Test plan

uv run pytest tests/sdk/llm/test_thought_signature.py -v        # 13 passed
uv run pytest tests/sdk/event/test_events_to_messages.py -v     # 20 passed (15 existing + 5 new)
uv run pytest tests/sdk/event/ tests/sdk/llm/ -q                # 942 passed
uv run ruff format <files> && uv run ruff check <files>         # clean
uv run pyright <files>                                          # 0 errors

Also did the real-world replay above as a sanity check.

Scope / what this does not fix

  • It only strips __thought__<blob> suffixes; it does not change how Vertex prompt caching works. cache_write_tokens=0 and the 26 % implicit-cache hit rate are a separate problem and need a follow-up to wire actual CachedContent.create explicit caching for Vertex.
  • It does not change reasoning_effort defaults. Lowering it for gemini-3.5-flash is a separate model-config change.
  • For non-Gemini models the behaviour is byte-for-byte identical to today (the marker is never present).

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, following an investigation triggered by the cost analysis in OpenHands/benchmarks#741.

@juanmichelini can click here to continue refining the PR


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:f5efa63-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-f5efa63-python \
  ghcr.io/openhands/agent-server:f5efa63-python

All tags pushed for this build

ghcr.io/openhands/agent-server:f5efa63-golang-amd64
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-golang-amd64
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-golang-amd64
ghcr.io/openhands/agent-server:f5efa63-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:f5efa63-golang-arm64
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-golang-arm64
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-golang-arm64
ghcr.io/openhands/agent-server:f5efa63-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:f5efa63-java-amd64
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-java-amd64
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-java-amd64
ghcr.io/openhands/agent-server:f5efa63-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:f5efa63-java-arm64
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-java-arm64
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-java-arm64
ghcr.io/openhands/agent-server:f5efa63-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:f5efa63-python-amd64
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-python-amd64
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-python-amd64
ghcr.io/openhands/agent-server:f5efa63-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:f5efa63-python-arm64
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-python-arm64
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-python-arm64
ghcr.io/openhands/agent-server:f5efa63-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:f5efa63-golang
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-golang
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-golang
ghcr.io/openhands/agent-server:f5efa63-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:f5efa63-java
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-java
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-java
ghcr.io/openhands/agent-server:f5efa63-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:f5efa63-python
ghcr.io/openhands/agent-server:f5efa635ab520c11155a3ee82629330be0f60452-python
ghcr.io/openhands/agent-server:fix-strip-gemini-thought-signatures-from-history-python
ghcr.io/openhands/agent-server:f5efa63-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., f5efa63-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., f5efa63-python-amd64) are also available if needed

LiteLLM smuggles Vertex Gemini's `thoughtSignature` blob through the
OpenAI-shaped `tool_call.id` field as `call_<hex>__thought__<base64>`. The
signature is only required on the immediately-following tool-result turn
so the model can resume reasoning; on every later turn it is dead weight
that gets re-shipped in every prompt.

On a real swe-bench-verified instance (`django__django-11999`, $11.11
with gemini-3.5-flash + reasoning_effort=high) the cumulative tool_call_id
payload was 1.21 MB; only 3.5 KB of that is the actual canonical id. The
remaining ~1.2 MB is the same signatures replayed across 47 turns.

This commit adds a post-processing pass at the bottom of
`events_to_messages` that:
* Identifies the tool_call ids on the most recent assistant turn that has
  tool calls.
* Strips the `__thought__<blob>` suffix from every other assistant
  `tool_call.id` and every matching `tool` message `tool_call_id`, so the
  paired ids stay consistent.
* Is a no-op for Anthropic, OpenAI, and ACP ids that do not contain the
  `__thought__` marker.

The pass mutates only the produced `Message` objects (via
`MessageToolCall.model_copy(update=...)` and a plain string reassignment
on `tool_call_id`); the underlying `ActionEvent` / `ObservationEvent`
data is untouched, so on-disk event logs preserve the signatures.

Tests added:
* `tests/sdk/llm/test_thought_signature.py` — unit tests for the
  classifier and stripping helpers.
* `tests/sdk/event/test_events_to_messages.py::TestThoughtSignatureStripping`
  — five integration tests covering older-turn stripping, paired
  consistency, source-event immutability, the non-Gemini no-op case, and
  parallel tool calls within the most-recent assistant turn.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/event
   base.py108991%52, 63, 75–76, 82, 85–86, 88, 121
TOTAL29671834671% 

@enyst

enyst commented Jun 9, 2026

Copy link
Copy Markdown
Member

LiteLLM smuggles it through the OpenAI-shaped tool_call.id

Ooh wow…! Sometimes I think we might be happier to implement Claude and Gemini native APIs, and maybe use liteLLM only for openai-compatible providers…
/me ducks

On a different note, I think they should have been cached anyway, and maybe @VascoSch92 ’s fix on that addresses a lot of the problem.

@juanmichelini

Copy link
Copy Markdown
Collaborator Author

@enyst interesting take!
@VascoSch92 #3586 fix reduces costs a lot, I'm testing this other fixes on top of that.

@juanmichelini juanmichelini marked this pull request as ready for review June 9, 2026 21:36

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ QA Report: PASS WITH ISSUES

The SDK behavior change works as intended: archival Gemini __thought__ signatures are stripped from older history while the latest tool-call turn, matching tool results, source events, parallel calls, and non-Gemini IDs remain correct.

Does this PR achieve its stated goal?

Yes. I exercised the SDK as a library user by constructing real ActionEvent/ObservationEvent histories and calling LLMConvertibleEvent.events_to_messages(). On main, all 6 synthetic Gemini tool-call IDs were re-emitted with __thought__ blobs (message_tool_id_bytes=120288); on this PR, only the latest assistant/tool pair kept signatures (message_marker_count=2, message_tool_id_bytes=40244), earlier assistant/tool pairs stayed consistent after stripping, and the original event log still retained the full first ID.

Phase Result
Environment Setup make build completed successfully and installed editable SDK packages via uv sync --dev.
CI Status ⚠️ Most checks are green, but Validate PR description is failing and qa-changes is still in progress at review time.
Functional Verification ✅ Before/after SDK execution confirms the claimed stripping behavior and no-op behavior for plain IDs.
Functional Verification

Test 1: Archival Gemini signatures are stripped only after the immediate tool-result turn

Step 1 — Reproduce baseline without the fix:
Checked out origin/main and ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_thought_signature.py, which constructs three real SDK action/observation pairs with large call_*__thought__... IDs and converts them through LLMConvertibleEvent.events_to_messages():

{
  "archival_stripping": {
    "assistant_markers_by_turn": [true, true, true],
    "message_marker_count": 6,
    "message_tool_id_bytes": 120288,
    "pairs_consistent": [true, true, true],
    "raw_tool_id_bytes": 120288,
    "source_event_first_kept_full": true,
    "tool_markers_by_turn": [true, true, true]
  },
  "parallel_latest_turn": {
    "latest_parallel_markers": [true, true],
    "latest_parallel_pairs_consistent": [true, true],
    "old_turn_stripped": false
  },
  "non_gemini_ids_unchanged": {"unchanged": true, "marker_count": 0}
}

This confirms the pre-fix problem: every previous assistant/tool message re-ships the thought signature, and message ID bytes equal the raw event-log ID bytes.

Step 2 — Apply the PR's changes:
Checked out f5efa635ab520c11155a3ee82629330be0f60452.

Step 3 — Re-run with the fix in place:
Ran the same command on the PR commit:

{
  "archival_stripping": {
    "assistant_markers_by_turn": [false, false, true],
    "message_marker_count": 2,
    "message_tool_id_bytes": 40244,
    "pairs_consistent": [true, true, true],
    "raw_tool_id_bytes": 120288,
    "source_event_first_kept_full": true,
    "tool_markers_by_turn": [false, false, true]
  },
  "parallel_latest_turn": {
    "latest_parallel_markers": [true, true],
    "latest_parallel_pairs_consistent": [true, true],
    "old_turn_stripped": true
  },
  "non_gemini_ids_unchanged": {"unchanged": true, "marker_count": 0}
}

This shows the fix works: older assistant and tool-result IDs are stripped together, the latest turn keeps signatures, source events are not mutated, and prompt-history ID bytes dropped from 120,288 to 40,244 in this reproduction.

Test 2: Related behavior remains intact

The same script also verified two side paths on the PR commit: a latest assistant message with two parallel tool calls kept both signatures and both tool results matched their assistant IDs, while plain call_plain_* IDs without __thought__ were byte-for-byte unchanged.

Issues Found

  • 🟡 Minor: CI is not fully green at review time because PR Description Check / Validate PR description is failing and qa-changes is still in progress. I did not inspect or edit the human-only PR description fields.

This review was created by an AI agent (OpenHands) on behalf of the user.

Final verdict: PASS WITH ISSUES

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants