Report exact backend token usage, including prompt-cache reads #2

Open
TheStreamCode wants to merge 1 commit into chutesai:master from TheStreamCode:fix/streaming-usage-accuracy

Summary

In streaming mode the proxy reported estimated token counts instead of the backend's real ones:

  • output_tokens fell back to a character-based estimate (text.len() / CHARS_PER_TOKEN).
  • input_tokens was omitted from the final message_delta.
  • Prompt-cache reads were never surfaced, so input_tokens overstated the billable input on cache hits.

Claude Code uses these counts for its context-window accounting, auto-compaction threshold, and cost reporting.

Root cause

Two issues combined:

  1. The proxy never sent stream_options: {"include_usage": true}, so OpenAI-compatible backends (vLLM/SGLang) did not emit a usage chunk during streaming.
  2. When a usage chunk did arrive, it was a usage-only final chunk of the form {"choices": [], "usage": {...}}. The if chunk.choices.is_empty() guard ran before the usage read, so the chunk was discarded and the code fell back to the estimate. The existing tests did not catch this because their mock emitted usage on a chunk with a non-empty choices array.
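
The ordering problem in item 2 can be illustrated with a minimal, std-only sketch. The `Chunk` and `Usage` types and both function names are hypothetical simplifications for illustration, not the proxy's actual types; only the `choices.is_empty()` guard and the usage-only chunk shape come from the description above.

```rust
/// Hypothetical, simplified usage payload (not the proxy's real struct).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Usage {
    prompt_tokens: u64,
    completion_tokens: u64,
}

/// Hypothetical, simplified streaming chunk: each choice is reduced to its text delta.
struct Chunk {
    choices: Vec<String>,
    usage: Option<Usage>,
}

/// Old ordering: the empty-choices guard runs first, so a usage-only
/// final chunk (`choices: []`) is skipped before its usage is read.
fn consume_guard_first(chunks: &[Chunk]) -> (String, Option<Usage>) {
    let mut text = String::new();
    let mut usage = None;
    for chunk in chunks {
        if chunk.choices.is_empty() {
            continue; // usage on this chunk is never seen
        }
        text.push_str(&chunk.choices[0]);
        if chunk.usage.is_some() {
            usage = chunk.usage;
        }
    }
    (text, usage)
}

/// Fixed ordering: usage is captured before the guard, so the
/// usage-only final chunk still contributes its counts.
fn consume_usage_first(chunks: &[Chunk]) -> (String, Option<Usage>) {
    let mut text = String::new();
    let mut usage = None;
    for chunk in chunks {
        if chunk.usage.is_some() {
            usage = chunk.usage;
        }
        if chunk.choices.is_empty() {
            continue; // nothing to append, but usage was already captured
        }
        text.push_str(&chunk.choices[0]);
    }
    (text, usage)
}
```

Fed a text chunk followed by a usage-only chunk, the first function returns no usage while the second returns the backend's counts; the accumulated text is identical in both cases.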

Changes

  • Add stream_options to the outgoing request and set include_usage: true when the client requests streaming.
  • Read usage before the empty-choices guard, capturing prompt_tokens, completion_tokens, and prompt_tokens_details.cached_tokens.
  • Map to Anthropic usage semantics in the final message_delta and in the non-streaming response:
    • output_tokens from completion_tokens
    • input_tokens from prompt_tokens − cached_tokens
    • cache_read_input_tokens from prompt_tokens_details.cached_tokens (emitted when greater than 0)
  • The character estimate remains as a fallback when the backend reports no usage, so behavior is unchanged for backends that do not support stream_options.
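
The mapping and fallback listed above might look roughly like the following std-only sketch. The struct names, the `map_usage` function, the value of `CHARS_PER_TOKEN`, the use of `saturating_sub`, and leaving `input_tokens` unset in the fallback path are all assumptions made for illustration; the PR text only fixes the field mapping itself.

```rust
/// Assumed value for illustration; the PR does not state the real constant.
const CHARS_PER_TOKEN: usize = 4;

/// Hypothetical flattened view of the backend's OpenAI-style usage object,
/// with `prompt_tokens_details.cached_tokens` lifted to a single optional field.
struct BackendUsage {
    prompt_tokens: u64,
    completion_tokens: u64,
    cached_tokens: Option<u64>,
}

/// Hypothetical Anthropic-style usage carried in the final message_delta.
#[derive(Debug, PartialEq, Eq)]
struct AnthropicUsage {
    input_tokens: Option<u64>,
    output_tokens: u64,
    cache_read_input_tokens: Option<u64>,
}

fn map_usage(backend: Option<&BackendUsage>, streamed_text: &str) -> AnthropicUsage {
    match backend {
        Some(u) => {
            let cached = u.cached_tokens.unwrap_or(0);
            AnthropicUsage {
                // input_tokens = prompt_tokens - cached_tokens; saturating_sub is a
                // defensive assumption in case a backend reports cached > prompt.
                input_tokens: Some(u.prompt_tokens.saturating_sub(cached)),
                output_tokens: u.completion_tokens,
                // Emitted only when greater than 0.
                cache_read_input_tokens: if cached > 0 { Some(cached) } else { None },
            }
        }
        // No usage from the backend: keep the character-based estimate.
        None => AnthropicUsage {
            input_tokens: None,
            output_tokens: (streamed_text.len() / CHARS_PER_TOKEN) as u64,
            cache_read_input_tokens: None,
        },
    }
}
```

For example, a backend report of 1000 prompt tokens with 800 cached and 50 completion tokens maps to `input_tokens` 200, `cache_read_input_tokens` 800, and `output_tokens` 50.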

Testing

  • cargo test --all-targets passes, including two added unit tests (stream_options serialization; prompt_tokens_details deserialization). cargo fmt --check and cargo clippy --all-targets -- -D warnings are clean for the changed files.
  • Mock backend: on a cold call, output_tokens changes from a character estimate to the real completion_tokens; on a cache hit the split is correct (input_tokens = prompt_tokens − cached_tokens, cache_read_input_tokens = cached_tokens).
  • End-to-end with the claude CLI pointed at the proxy against a live Chutes backend: on a cache-warm turn the client reported cache_read_input_tokens and the reduced input_tokens, both of which are 0 or absent without this change.

Note

message_start is emitted before the backend responds, so its input_tokens necessarily remains the pre-flight estimate; the authoritative counts are carried in the final message_delta, whose usage is cumulative per the Messages API streaming contract.

@TheStreamCode TheStreamCode force-pushed the fix/streaming-usage-accuracy branch from 9392c87 to 60ca2dc Compare June 17, 2026 19:46
@TheStreamCode TheStreamCode force-pushed the fix/streaming-usage-accuracy branch from 60ca2dc to 8bbf5af Compare June 17, 2026 19:48