Report exact backend token usage, including prompt-cache reads #2
Open
TheStreamCode wants to merge 1 commit into
Summary
In streaming mode the proxy reported estimated token counts instead of the backend's real ones:
- `output_tokens` fell back to a character-based estimate (`text.len() / CHARS_PER_TOKEN`).
- `input_tokens` was omitted from the final `message_delta`.
- `input_tokens` overstated the billable input on cache hits.

Claude Code uses these counts for its context-window accounting, auto-compaction threshold, and cost reporting.
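A minimal sketch of the fallback described in the first bullet, to show why it is only an approximation. The value of `CHARS_PER_TOKEN` and the helper names `estimate_tokens` and `output_tokens` are assumptions made for illustration; they are not taken from the proxy's source.

```rust
/// Hypothetical value; the constant the proxy actually uses is not shown here.
const CHARS_PER_TOKEN: usize = 4;

/// Estimate of the form `text.len() / CHARS_PER_TOKEN`.
/// `str::len` returns the length in bytes, not in characters, so the
/// result also depends on how the text is encoded.
fn estimate_tokens(text: &str) -> usize {
    text.len() / CHARS_PER_TOKEN
}

/// Prefer the backend's exact count; fall back to the estimate only when
/// the backend reported nothing.
fn output_tokens(backend_completion_tokens: Option<usize>, text: &str) -> usize {
    backend_completion_tokens.unwrap_or_else(|| estimate_tokens(text))
}
```

Under these assumptions, `output_tokens(None, "hello world!")` yields the estimate, while `output_tokens(Some(7), "hello world!")` returns the backend's count unchanged.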
Root cause
Two issues combined:
1. The outgoing request did not set `stream_options: {"include_usage": true}`, so OpenAI-compatible backends (vLLM/SGLang) did not emit a usage chunk during streaming.
2. When a usage chunk is emitted, it has the shape `{"choices": [], "usage": {...}}`. The `if chunk.choices.is_empty()` guard ran before the usage read, so the chunk was discarded and the code fell back to the estimate. The existing tests did not catch this because their mock emits usage on a chunk with a non-empty `choices` array.

Changes
- Add `stream_options` to the outgoing request and set `include_usage: true` when the client requests streaming.
- Read `usage` before the empty-choices guard, capturing `prompt_tokens`, `completion_tokens`, and `prompt_tokens_details.cached_tokens`.
- Report the backend's counts in the final `message_delta` and in the non-streaming response:
  - `output_tokens` from `completion_tokens`
  - `input_tokens` from `prompt_tokens − cached_tokens`
  - `cache_read_input_tokens` from `prompt_tokens_details.cached_tokens` (emitted when greater than 0)
- `stream_options`.

Testing
- `cargo test --all-targets` passes, including two added unit tests (`stream_options` serialization; `prompt_tokens_details` deserialization).
- `cargo fmt --check` and `cargo clippy --all-targets -- -D warnings` are clean for the changed files.
- `output_tokens` goes from a character estimate to the real `completion_tokens`; on a cache hit the split is correct (`input_tokens = prompt − cached`, `cache_read_input_tokens = cached`).
- With the `claude` CLI pointed at the proxy against a live Chutes backend: on a cache-warm turn the client reported `cache_read_input_tokens` and the reduced `input_tokens`, both of which are `0`/absent without this change.

Note
`message_start` is emitted before the backend responds, so its `input_tokens` necessarily remains the pre-flight estimate; the authoritative counts are carried in the final `message_delta`, whose `usage` is cumulative per the Messages API streaming contract.
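The guard ordering and the token split described in this PR can be sketched roughly as follows. This is an illustration under stated assumptions, not the proxy's code: the types `BackendUsage`, `Chunk`, and `ClientUsage` and the functions `to_client_usage` and `consume` are hypothetical, JSON deserialization is left out, and the use of `saturating_sub` is a defensive choice made for the sketch.

```rust
/// Usage as reported by an OpenAI-compatible backend. Field names follow
/// the PR description; the struct itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BackendUsage {
    prompt_tokens: u64,
    completion_tokens: u64,
    /// Stands in for `prompt_tokens_details.cached_tokens`; 0 when absent.
    cached_tokens: u64,
}

/// One streamed chunk, reduced to the two fields that matter here.
struct Chunk {
    /// Text deltas, one per choice; empty on the usage-only chunk.
    choices: Vec<String>,
    usage: Option<BackendUsage>,
}

/// Usage in the shape reported to the client.
#[derive(Debug, PartialEq)]
struct ClientUsage {
    input_tokens: u64,
    output_tokens: u64,
    /// `None` models "not emitted"; set only when greater than 0.
    cache_read_input_tokens: Option<u64>,
}

fn to_client_usage(u: BackendUsage) -> ClientUsage {
    ClientUsage {
        // Assumption of this sketch: saturate instead of underflowing if a
        // backend ever reports more cached tokens than prompt tokens.
        input_tokens: u.prompt_tokens.saturating_sub(u.cached_tokens),
        output_tokens: u.completion_tokens,
        cache_read_input_tokens: (u.cached_tokens > 0).then_some(u.cached_tokens),
    }
}

/// Walks the chunks and returns the concatenated text plus the last usage seen.
fn consume(chunks: &[Chunk]) -> (String, Option<BackendUsage>) {
    let mut text = String::new();
    let mut usage = None;
    for chunk in chunks {
        // Read usage first: the usage chunk carries an empty `choices` array
        // and would otherwise be skipped by the guard below.
        if let Some(u) = chunk.usage {
            usage = Some(u);
        }
        if chunk.choices.is_empty() {
            continue;
        }
        for delta in &chunk.choices {
            text.push_str(delta);
        }
    }
    (text, usage)
}
```

With a final chunk reporting 100 prompt tokens, 80 of them cached, this sketch yields `input_tokens = 20` and `cache_read_input_tokens = Some(80)`; with no cached tokens the cache field stays `None`.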