Resumed sessions can crash with orphaned tool_result when no recent user message exists #799

@kylejryan

Summary

When a session is resumed (or trimmed mid-session) after a long autonomous run with no operator input in the recent tail, getResumeMessages can return a slice whose first message is a tool message (a tool_result with no preceding tool_use). The Anthropic API then rejects every subsequent submit on that session with:

messages.0.content.0: unexpected `tool_use_id` found in `tool_result` blocks: <id>.
Each `tool_result` block must have a corresponding `tool_use` block in the previous message.

The error is sticky: each new submit fails identically, and the rollback path re-persists the broken slice to messages.json, so the session is effectively bricked until the file is manually repaired.

Reproduction (deterministic)

  1. Start a session in operator/auto mode and let the agent run autonomously past ~200 model messages without typing any operator input.
  2. Close and reopen the session (or trigger any code path that re-runs getResumeMessages on the persisted history — e.g. the post-step sync at src/tui/components/operator-dashboard/index.tsx:1314).
  3. Type any prompt.
  4. The request to the model fails with the orphaned tool_result error above. Every retry fails identically with the same tool_use_id.

Root cause

src/core/session/index.ts, getResumeMessages:

if (messages.length <= limit) return messages;

let cutIndex = messages.length - limit;
while (cutIndex < messages.length) {
  if (messages[cutIndex].role === "user") break;
  cutIndex++;
}
if (cutIndex >= messages.length) {
  cutIndex = messages.length - limit; // raw fallback — can land on a `tool` message
}
return messages.slice(cutIndex);

  • The walk only searches forward for a user boundary.
  • When the recent limit messages contain no user role (common in long autonomous runs), the fallback does a raw cut at messages.length - limit.
  • That index can land on a tool message, putting an orphaned tool-result at result[0]. The matching assistant tool-call has been trimmed off the front.
  • The AI SDK converts a leading tool role to an Anthropic user message containing a tool_result block with no preceding tool_use — exactly the error condition.
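
The fallback can be reproduced in isolation. The sketch below copies the cut logic quoted above into a standalone function and runs it on a synthetic history. The `Msg` type, the function name `getResumeMessagesSketch`, and the `limit` of 3 are simplifications assumed for illustration; they are not the project's real types or default.

```typescript
// Simplified message shape, assumed for illustration only.
type Msg = { role: "user" | "assistant" | "tool"; content: string };

// Same cut logic as the excerpt above, with `limit` passed in explicitly.
function getResumeMessagesSketch(messages: Msg[], limit: number): Msg[] {
  if (messages.length <= limit) return messages;

  let cutIndex = messages.length - limit;
  while (cutIndex < messages.length) {
    if (messages[cutIndex].role === "user") break;
    cutIndex++;
  }
  if (cutIndex >= messages.length) {
    cutIndex = messages.length - limit; // raw fallback
  }
  return messages.slice(cutIndex);
}

// One operator prompt, then an autonomous run of
// assistant tool-call / tool-result pairs.
const history: Msg[] = [{ role: "user", content: "start" }];
for (let i = 0; i < 4; i++) {
  history.push({ role: "assistant", content: `call ${i}` });
  history.push({ role: "tool", content: `result ${i}` });
}

// The last 3 messages are tool, assistant, tool: no user boundary exists,
// so the raw fallback is taken and the slice starts on a tool message.
const resumed = getResumeMessagesSketch(history, 3);
console.log(resumed[0].role); // prints "tool"
```

Whether the raw cut lands on a tool or an assistant message depends only on the parity of the tail, which is why the bug appears deterministic for a given history and limit.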

normalizeMessages does not repair this; it only merges consecutive user messages and upgrades raw-string output fields to { type: "text", value: ... }. It does not enforce the tool_use/tool_result pairing invariant.

Why the broken state is sticky

Two paths re-persist the broken slice on disk:

  1. User submit: src/tui/components/operator-dashboard/index.tsx:817-826 writes [...conversationRef.current, { role: "user", content: prompt }] to messages.json. If conversationRef.current[0] is a tool message, the orphan is at the head of the persisted file.
  2. Error rollback: src/tui/components/operator-dashboard/index.tsx:1342-1354. When the API rejects the request, the catch block rolls conversationRef.current back to prevMessages and writes that to disk, i.e. the orphan-headed state without the new user message. Subsequent submits append a user message at the end, but the orphan at the head persists.

Existing test gap

src/core/session/persistence.test.ts ("handles conversations with no user messages after cut point") only asserts result.length === 5. It never asserts that result[0] is a safe role, so the regression slipped past existing coverage.

Suggested fix shape

In getResumeMessages, after picking cutIndex, advance past any leading tool messages and any leading assistant message that begins with a tool-call whose paired tool-result was trimmed. The chosen slice must start with either a user message or an assistant whose content has no orphan tool-call parts.
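
A minimal sketch of that fix shape follows. It is not the project's implementation: the `Part` and `Msg` types are loosely modeled on AI SDK message parts and are assumptions, as are the helper names `partIds`, `isSafeHead`, and `getResumeMessagesSafe`. It also assumes a tool message directly follows the assistant message that issued its calls.

```typescript
// Minimal message shapes (assumed for this sketch, not the real types).
type Part =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string }
  | { type: "tool-result"; toolCallId: string };
type Msg = { role: "user" | "assistant" | "tool"; content: string | Part[] };

// Collect the tool call ids of all parts of one kind in a message.
function partIds(msg: Msg, kind: "tool-call" | "tool-result"): string[] {
  if (typeof msg.content === "string") return [];
  const ids: string[] = [];
  for (const part of msg.content) {
    if (part.type !== "text" && part.type === kind) ids.push(part.toolCallId);
  }
  return ids;
}

// True when messages[start] can be the first message of a resumed slice.
function isSafeHead(messages: Msg[], start: number): boolean {
  const head = messages[start];
  if (head.role === "user") return true;
  if (head.role === "tool") return false; // its tool-call was trimmed
  // Assistant: every tool-call needs its result in the next message.
  const calls = partIds(head, "tool-call");
  if (calls.length === 0) return true;
  const next = messages[start + 1];
  if (!next || next.role !== "tool") return false;
  const results = new Set(partIds(next, "tool-result"));
  return calls.every((id) => results.has(id));
}

function getResumeMessagesSafe(messages: Msg[], limit: number): Msg[] {
  if (messages.length <= limit) return messages;
  const rawCut = messages.length - limit;
  // Prefer a user boundary, as the current code does.
  let cutIndex = messages.findIndex((m, i) => i >= rawCut && m.role === "user");
  if (cutIndex === -1) cutIndex = rawCut;
  // New step: advance past heads that the API would reject.
  while (cutIndex < messages.length && !isSafeHead(messages, cutIndex)) {
    cutIndex++;
  }
  return messages.slice(cutIndex);
}
```

Two caveats with this sketch: it can return fewer than `limit` messages, and if no safe head exists in the tail it returns an empty slice, which the caller would need to handle. It also leaves the `messages.length <= limit` early return untouched, so a file that is already orphan-headed on disk is not repaired by this path.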

Hardening (defense in depth): have normalizeMessages strip leading orphaned tool messages and leading tool-call-only assistant messages, so any caller that constructs a conversation prefix (not just resume) gets the same invariant for free.

Also worth adding: an explicit test that asserts the slice is API-valid (no orphaned tool_use/tool_result at head) for the all-tool-and-assistant tail case.
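
One way to express that assertion is a small checker that lists orphaned tool-result ids, so a test can require the list to be empty. The sketch below is an assumption-laden illustration: `findOrphanToolResults` is a hypothetical helper, the `Part` and `Msg` types are simplified stand-ins for the real message types, and it assumes each tool message directly follows the assistant message that issued its calls.

```typescript
// Simplified message shapes (assumed, not the project's real types).
type Part =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string }
  | { type: "tool-result"; toolCallId: string };
type Msg = { role: "user" | "assistant" | "tool"; content: string | Part[] };

// Returns the ids of tool-result parts whose tool-call is not present
// in the message immediately before them.
function findOrphanToolResults(messages: Msg[]): string[] {
  const orphans: string[] = [];
  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    if (msg.role !== "tool" || typeof msg.content === "string") continue;

    // Gather the tool-call ids issued by the preceding assistant message.
    const calls = new Set<string>();
    const prev = i > 0 ? messages[i - 1] : undefined;
    if (prev && prev.role === "assistant" && typeof prev.content !== "string") {
      for (const part of prev.content) {
        if (part.type === "tool-call") calls.add(part.toolCallId);
      }
    }

    for (const part of msg.content) {
      if (part.type === "tool-result" && !calls.has(part.toolCallId)) {
        orphans.push(part.toolCallId);
      }
    }
  }
  return orphans;
}
```

In the existing "handles conversations with no user messages after cut point" test, the extra assertion would be that this function returns an empty array for the result, alongside the current length check.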

Workaround for affected sessions

The session is recoverable: edit ~/.pensar/sessions/<sessionId>/messages.json so the array starts at the first assistant message with no tool-call content parts (or the first user/safe-assistant further in). Deleting only messages[0] is not enough — the next message is typically also an orphan.
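
The manual edit can be scripted. The sketch below applies the rule above to the file's contents as a string and returns the repaired JSON; reading and writing the file (and taking a backup first) is left to the caller. It assumes messages.json is a top-level array of objects with `role` and `content`, and that tool calls appear as content parts with `type: "tool-call"`. The names `hasToolCall`, `firstSafeStart`, and `repairMessagesJson` are hypothetical.

```typescript
// Loose shapes: only the fields this sketch inspects are typed.
type Part = { type: string; [key: string]: unknown };
type Msg = { role: string; content: string | Part[] };

// True when the message carries at least one tool-call content part.
function hasToolCall(msg: Msg): boolean {
  return (
    typeof msg.content !== "string" &&
    msg.content.some((part) => part.type === "tool-call")
  );
}

// Index of the first message that is safe to start from: a user message,
// or an assistant message with no tool-call parts. Returns -1 if none.
function firstSafeStart(messages: Msg[]): number {
  return messages.findIndex(
    (m) => m.role === "user" || (m.role === "assistant" && !hasToolCall(m)),
  );
}

// Takes the raw contents of messages.json and returns repaired JSON text.
function repairMessagesJson(raw: string): string {
  const messages = JSON.parse(raw) as Msg[];
  const start = firstSafeStart(messages);
  if (start === -1) {
    throw new Error("no safe starting message found; leaving file untouched");
  }
  return JSON.stringify(messages.slice(start), null, 2);
}
```

This is deliberately more conservative than necessary: it also skips an assistant message whose tool calls are correctly paired, which matches the manual rule above and keeps the check simple.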

Impact

  • Affects any long-running session resumed in auto/autopilot mode without recent operator input.
  • Once triggered, the session cannot be used until messages.json is repaired by hand.
  • Silent for the operator: the failure mode looks like an unrelated 400 from the model provider.

Labels: bug
