fix(bedrock): avoid double-counting cache tokens in Anthropic Messages streaming usage #25517
Conversation
Greptile Summary

This PR fixes a double-counting bug in cache-token usage for Bedrock Anthropic Messages streaming.

Confidence Score: 5/5. Safe to merge: the fix is correct, minimal, and well-tested; the only remaining finding is a style nit. The bug fix is logically correct (it removes the summation that caused double-counting), the updated test reflects real corrected behavior rather than masking a regression, and the new tests give good coverage, including an end-to-end cost check. The sole remaining finding is a P2 import style issue that does not affect correctness or CI. No files require special attention beyond the minor import style cleanup in the test file.
| Filename | Overview |
|---|---|
| litellm/llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py | Removes the summation of cache tokens into input_tokens; now passes uncached count through unchanged so calculate_usage adds cache once. |
| tests/test_litellm/llms/bedrock/messages/invoke_transformations/test_anthropic_claude3_transformation.py | Updates existing assertion to reflect fixed behavior; adds two new tests, one of which imports proxy code inside the function body (violates module-level import style). |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Bedrock Stream] -->|message_delta with cache fields| B[_promote_message_stop_usage]
    B -->|buffer pending_delta| C{message_stop arrives?}
    C -->|No| D[yield pending_delta as-is]
    C -->|Yes| E[Copy cache_creation and cache_read from stop into delta_usage]
    E --> F[Set delta input_tokens to uncached count from message_stop]
    F --> G[Yield merged message_delta]
    G --> H[calculate_usage]
    H --> I[prompt_tokens equals uncached plus cache_creation plus cache_read]
    I --> J[No double-counting]
```
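The promotion step in the flowchart can be sketched as follows. This is a hypothetical illustration: the function name mirrors `_promote_message_stop_usage` from the PR description, but the dict shapes and helper signature are assumptions, not litellm's actual implementation.

```python
# Hypothetical sketch of the corrected promotion step; names follow the PR
# description, and the event/usage dict shapes are assumed for illustration.

def promote_message_stop_usage(pending_delta: dict, stop_usage: dict) -> dict:
    """Merge usage from message_stop into the buffered message_delta."""
    delta = dict(pending_delta)
    usage = dict(delta.get("usage", {}))

    # Promote the cache breakdown onto the delta for clients that
    # ignore message_stop.
    usage["cache_creation_input_tokens"] = stop_usage.get(
        "cache_creation_input_tokens", 0)
    usage["cache_read_input_tokens"] = stop_usage.get(
        "cache_read_input_tokens", 0)

    # The fix: keep the uncached-only count from message_stop unchanged,
    # instead of summing cache tokens into it (the old double-count source).
    usage["input_tokens"] = stop_usage.get("input_tokens", 0)

    delta["usage"] = usage
    return delta
```

A downstream `calculate_usage`-style step can then add the cache fields to the prompt total exactly once.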
Reviews (1): Last reviewed commit: "fix(bedrock): avoid double-counting cach..."
…hing

Adds `TestBedrockInvokeCacheTokenBilling` covering the Bedrock InvokeModel path:

- baseline: no cache tokens; `prompt_tokens` equals `input_tokens`
- cache_read: `prompt_tokens` inflated by design; `prompt_tokens_details` carries the breakdown
- cache_creation: same pattern for write tokens
- cost_calculation_correct_with_cache_read: core billing regression test
- cost_calculation_correct_with_cache_creation: write-rate billing regression test
- back_to_back_requests_cost: full end-to-end scenario (cache write, then read)

These lock in the fix from PR #25517: cache tokens were being double-counted in `AnthropicConfig.calculate_usage`, causing 10-50x inflated cost on cache reads.
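The billing regression these tests lock in can be sketched with a toy cost model. The per-token rates below are illustrative placeholders, not real Bedrock pricing, and the helper is an assumption for this sketch rather than litellm's cost code.

```python
# Illustrative cache-aware billing model; all rates are made-up placeholders.
UNCACHED_RATE = 3.0e-6      # $/uncached input token (assumed)
CACHE_WRITE_RATE = 3.75e-6  # $/cache-write token (assumed)
CACHE_READ_RATE = 0.3e-6    # $/cache-read token (assumed)

def prompt_cost(uncached: int, cache_write: int, cache_read: int) -> float:
    """Bill each token class exactly once at its own rate."""
    return (uncached * UNCACHED_RATE
            + cache_write * CACHE_WRITE_RATE
            + cache_read * CACHE_READ_RATE)

# Before the fix, cache-read tokens were also folded into the uncached
# count, so a cache hit paid the full input rate on top of the read rate:
buggy_cost = prompt_cost(3 + 32651, 0, 32651)
fixed_cost = prompt_cost(3, 0, 32651)
# fixed_cost charges the 32651 read tokens once, at the cheap read rate.
```

With these placeholder rates the buggy path comes out roughly 11x more expensive, consistent with the 10-50x inflation noted above.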
Problem

For Bedrock Invoke Anthropic Messages streaming (`bedrock_sse_wrapper` → `_promote_message_stop_usage`), cached requests could report roughly 2x the real prompt token total and inflated spend.

What went wrong

- On `message_stop`, Bedrock sends the uncached `input_tokens` (e.g. `3`) plus the cache breakdown (`cache_creation_input_tokens`, `cache_read_input_tokens`) that gets merged onto `message_delta`.
- `_promote_message_stop_usage` was rewriting `message_delta.usage.input_tokens` as `uncached + cache_creation + cache_read` (e.g. `3 + 17 + 32651 = 32671`).
- `AnthropicConfig.calculate_usage` treats `input_tokens` as uncached-only and adds cache read/write again to `prompt_tokens`, so the cache tokens were already counted in `input_tokens` and then added again → double counting (e.g. `prompt_tokens` of `65339` instead of `32671` for the same request).

Fix

Keep `message_delta.usage.input_tokens` as the uncached-only value from `message_stop` (`raw_input`). Still promote `cache_creation_input_tokens` and `cache_read_input_tokens` onto `message_delta` for clients that ignore `message_stop`. `calculate_usage` then adds cache to `prompt_tokens` exactly once.

Tests

- `message_delta` keeps uncached `input_tokens == 3` with the cache fields present.
- `bedrock_sse_wrapper` → passthrough logging rebuild → `prompt_tokens` and `completion_cost` for `us.anthropic.claude-sonnet-4-6`.
- Updated `test_bedrock_sse_wrapper_keeps_usage_in_message_start_and_message_delta` expectations for `input_tokens` after promotion.
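The accounting contract behind the fix reduces to a minimal sketch, assuming (per the PR text) that the usage calculation treats `input_tokens` as uncached-only; the function name and dict layout here are illustrative, not litellm's exact API.

```python
# Minimal sketch of cache-aware prompt-token accounting (illustrative names).

def calculate_prompt_tokens(usage: dict) -> int:
    """Add cache read/write to the uncached input count exactly once."""
    return (usage.get("input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0))

# Fixed pipeline: the promoted delta keeps the uncached-only count,
# so 3 + 17 + 32651 == 32671.
fixed = {"input_tokens": 3,
         "cache_creation_input_tokens": 17,
         "cache_read_input_tokens": 32651}

# Old bug: input_tokens arrived with the cache already summed in (32671),
# so the cache fields were added a second time: 32671 + 17 + 32651 == 65339.
buggy = {"input_tokens": 32671,
         "cache_creation_input_tokens": 17,
         "cache_read_input_tokens": 32651}
```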