Report exact backend token usage, including prompt-cache reads by TheStreamCode · Pull Request #2 · chutesai/claude-proxy

TheStreamCode · 2026-06-17T19:12:58Z

Summary

In streaming mode the proxy reported estimated token counts instead of the backend's real ones:

output_tokens fell back to a character-based estimate (text.len() / CHARS_PER_TOKEN).
input_tokens was omitted from the final message_delta.
Prompt-cache reads were never surfaced, so input_tokens overstated the billable input on cache hits.

Claude Code uses these counts for its context-window accounting, auto-compaction threshold, and cost reporting.

Root cause

Two issues combined:

The proxy never sent stream_options: {"include_usage": true}, so OpenAI-compatible backends (vLLM/SGLang) did not emit a usage chunk during streaming.
When a usage chunk did arrive, it was a usage-only final chunk of the form {"choices": [], "usage": {...}}. The if chunk.choices.is_empty() guard ran before the usage read, so the chunk was discarded and the code fell back to the estimate. The existing tests did not catch this because their mock emits usage on a chunk with a non-empty choices array.
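The ordering problem in the second point can be illustrated with a minimal sketch. The types and function names here (Usage, Chunk, StreamState, handle_chunk_guard_first, handle_chunk_usage_first) are hypothetical simplifications for illustration, not the proxy's actual code; choices is reduced to a list of text deltas.

```rust
/// Hypothetical, simplified usage block from an OpenAI-compatible chunk.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
}

/// Hypothetical streaming chunk: text deltas plus optional usage.
#[derive(Debug, Default)]
pub struct Chunk {
    pub choices: Vec<String>,
    pub usage: Option<Usage>,
}

/// Accumulated state for one streamed response.
#[derive(Debug, Default)]
pub struct StreamState {
    pub text: String,
    pub usage: Option<Usage>,
}

/// Guard-first ordering (the behavior described as the bug): a usage-only
/// chunk has empty `choices`, so it returns before usage is ever read.
pub fn handle_chunk_guard_first(state: &mut StreamState, chunk: &Chunk) {
    if chunk.choices.is_empty() {
        return;
    }
    if let Some(u) = chunk.usage {
        state.usage = Some(u);
    }
    for delta in &chunk.choices {
        state.text.push_str(delta);
    }
}

/// Usage-first ordering (the shape of the fix): capture usage, then skip
/// the rest of the processing when there are no choices.
pub fn handle_chunk_usage_first(state: &mut StreamState, chunk: &Chunk) {
    if let Some(u) = chunk.usage {
        state.usage = Some(u);
    }
    if chunk.choices.is_empty() {
        return;
    }
    for delta in &chunk.choices {
        state.text.push_str(delta);
    }
}
```

Feeding both handlers a text chunk followed by a usage-only chunk leaves state.usage empty in the guard-first version and populated in the usage-first version, which is the case the existing mock did not exercise.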

Changes

Add stream_options to the outgoing request and set include_usage: true when the client requests streaming.
Read usage before the empty-choices guard, capturing prompt_tokens, completion_tokens, and prompt_tokens_details.cached_tokens.
Map to Anthropic usage semantics in the final message_delta and in the non-streaming response:
- output_tokens from completion_tokens
- input_tokens from prompt_tokens − cached_tokens
- cache_read_input_tokens from prompt_tokens_details.cached_tokens (emitted when greater than 0)
The character estimate remains as a fallback when the backend reports no usage, so behavior is unchanged for backends that do not support stream_options.
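The mapping and the fallback described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: BackendUsage, AnthropicUsage, map_usage, and output_tokens_or_estimate are hypothetical names, the saturating subtraction is an added safety assumption, and the divisor is passed in as a parameter because the value of CHARS_PER_TOKEN is not stated here.

```rust
/// Hypothetical view of the backend's usage block. `cached_tokens` stands
/// for prompt_tokens_details.cached_tokens, taken as 0 when absent.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BackendUsage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub cached_tokens: u64,
}

/// Hypothetical Anthropic-style usage as reported to the client.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AnthropicUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    /// `None` means the field is not emitted.
    pub cache_read_input_tokens: Option<u64>,
}

/// Map backend usage to Anthropic usage semantics.
pub fn map_usage(b: BackendUsage) -> AnthropicUsage {
    AnthropicUsage {
        // Assumption: saturate instead of underflowing if a backend ever
        // reports more cached tokens than prompt tokens.
        input_tokens: b.prompt_tokens.saturating_sub(b.cached_tokens),
        output_tokens: b.completion_tokens,
        // Only emitted when greater than 0.
        cache_read_input_tokens: if b.cached_tokens > 0 {
            Some(b.cached_tokens)
        } else {
            None
        },
    }
}

/// Prefer the backend's completion_tokens; fall back to the character
/// estimate when the backend reported no usage at all.
pub fn output_tokens_or_estimate(
    usage: Option<BackendUsage>,
    text: &str,
    chars_per_token: usize,
) -> u64 {
    match usage {
        Some(u) => u.completion_tokens,
        // `.max(1)` avoids division by zero for a misconfigured divisor.
        None => (text.len() / chars_per_token.max(1)) as u64,
    }
}
```

For example, a backend report of prompt_tokens = 1000, cached_tokens = 800, completion_tokens = 50 maps to input_tokens = 200, cache_read_input_tokens = 800, and output_tokens = 50.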

Testing

cargo test --all-targets passes, including two added unit tests (stream_options serialization; prompt_tokens_details deserialization). cargo fmt --check and cargo clippy --all-targets -- -D warnings are clean for the changed files.
Mock backend: on a cold call, output_tokens changes from a character estimate to the real completion_tokens; on a cache hit, the split is correct (input_tokens = prompt_tokens − cached_tokens, cache_read_input_tokens = cached_tokens).
End-to-end with the claude CLI pointed at the proxy against a live Chutes backend: on a cache-warm turn the client reported cache_read_input_tokens and the reduced input_tokens, both of which are 0 or absent without this change.

Note

message_start is emitted before the backend responds, so its input_tokens necessarily remains the pre-flight estimate; the authoritative counts are carried in the final message_delta, whose usage is cumulative per the Messages API streaming contract.

TheStreamCode force-pushed the fix/streaming-usage-accuracy branch from 9392c87 to 60ca2dc on June 17, 2026 19:46

Report exact backend token usage for streaming and non-streaming

8bbf5af

TheStreamCode force-pushed the fix/streaming-usage-accuracy branch from 60ca2dc to 8bbf5af on June 17, 2026 19:48

TheStreamCode wants to merge 1 commit into chutesai:master from TheStreamCode:fix/streaming-usage-accuracy
