
fix(bedrock): avoid double-counting cache tokens in Anthropic Messages streaming usage#25517

Merged
yuneng-berri merged 1 commit into main from litellm_bedrock-messages-cache-prompt-double-count
Apr 10, 2026
Conversation

@Sameerlite
Collaborator

Problem

For Bedrock Invoke Anthropic Messages streaming (bedrock_sse_wrapper → _promote_message_stop_usage), cached requests could report roughly 2× the real prompt token total and inflated spend.

What went wrong

  1. On message_stop, Bedrock sends uncached input_tokens (e.g. 3) plus cache breakdown on the merged message_delta (cache_creation_input_tokens, cache_read_input_tokens).
  2. _promote_message_stop_usage was rewriting message_delta.usage.input_tokens as uncached + cache_creation + cache_read (e.g. 3 + 17 + 32651 = 32671).
  3. Downstream, AnthropicConfig.calculate_usage treats input_tokens as uncached-only and adds cache read/write again to prompt_tokens.
  4. Result: cache tokens were included in input_tokens and added again → double counting (e.g. prompt_tokens 65339 instead of 32671 for the same request).

Fix

Keep message_delta.usage.input_tokens as the uncached-only value from message_stop (raw_input). Still promote cache_creation_input_tokens and cache_read_input_tokens onto message_delta for clients that ignore message_stop. calculate_usage then adds cache to prompt_tokens exactly once.

Tests

  • Assert promoted message_delta keeps uncached input_tokens == 3 with cache fields present.
  • End-to-end: SSE through bedrock_sse_wrapper → passthrough logging rebuild → prompt_tokens and completion_cost for us.anthropic.claude-sonnet-4-6.
  • Updated test_bedrock_sse_wrapper_keeps_usage_in_message_start_and_message_delta expectations for input_tokens after promotion.

@Sameerlite Sameerlite temporarily deployed to integration-postgres April 10, 2026 18:34 — with GitHub Actions Inactive
@vercel

vercel bot commented Apr 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: litellm · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Apr 10, 2026 6:35pm


@codspeed-hq
Contributor

codspeed-hq bot commented Apr 10, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing litellm_bedrock-messages-cache-prompt-double-count (f0d2d26) with main (d0e347a)

Open in CodSpeed

@greptile-apps
Contributor

greptile-apps bot commented Apr 10, 2026

Greptile Summary

This PR fixes a double-counting bug in _promote_message_stop_usage for Bedrock Invoke Anthropic Messages streaming: the old code summed uncached + cache_creation + cache_read into input_tokens, but calculate_usage then added the cache tokens again to prompt_tokens, causing ~2× inflation. The fix keeps input_tokens as the uncached-only count from message_stop, relying on calculate_usage to add cache fields once. The logic change is minimal and correct, and the three new tests (unit, preservation, and end-to-end cost) provide solid coverage.

Confidence Score: 5/5

Safe to merge — the fix is correct, minimal, and well-tested; the only remaining finding is a style nit.

The bug fix is logically correct (remove summation that caused double-counting), the updated test reflects real corrected behavior rather than masking a regression, and three new tests give good coverage including an end-to-end cost check. The sole remaining finding is a P2 import style issue that does not affect correctness or CI.

No files require special attention beyond the minor import style cleanup in the test file.

Important Files Changed

  • litellm/llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py — Removes the summation of cache tokens into input_tokens; now passes the uncached count through unchanged so calculate_usage adds cache once.
  • tests/test_litellm/llms/bedrock/messages/invoke_transformations/test_anthropic_claude3_transformation.py — Updates the existing assertion to reflect the fixed behavior; adds two new tests, one of which imports proxy code inside the function body (violates module-level import style).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Bedrock Stream] -->|message_delta with cache fields| B[_promote_message_stop_usage]
    B -->|buffer pending_delta| C{message_stop arrives?}
    C -->|No| D[yield pending_delta as-is]
    C -->|Yes| E[Copy cache_creation and cache_read from stop into delta_usage]
    E --> F[Set delta input_tokens to uncached count from message_stop]
    F --> G[Yield merged message_delta]
    G --> H[calculate_usage]
    H --> I[prompt_tokens equals uncached plus cache_creation plus cache_read]
    I --> J[No double-counting]

Reviews (1): Last reviewed commit: "fix(bedrock): avoid double-counting cach..."

@yuneng-berri yuneng-berri self-requested a review April 10, 2026 19:43
@yuneng-berri yuneng-berri merged commit 576e6a0 into main Apr 10, 2026
104 of 108 checks passed
@yuneng-berri yuneng-berri deleted the litellm_bedrock-messages-cache-prompt-double-count branch April 10, 2026 19:55
ishaan-berri added a commit that referenced this pull request Apr 14, 2026
…hing

Adds TestBedrockInvokeCacheTokenBilling covering the Bedrock InvokeModel path:
- baseline: no cache tokens, prompt_tokens equals input_tokens
- cache_read: prompt_tokens inflated by design, prompt_tokens_details carries breakdown
- cache_creation: same pattern for write tokens
- cost_calculation_correct_with_cache_read: core billing regression test
- cost_calculation_correct_with_cache_creation: write-rate billing regression test
- back_to_back_requests_cost: full end-to-end scenario (cache write then read)

These lock in the fix from PR #25517 - cache tokens were being double-counted
in AnthropicConfig.calculate_usage causing 10-50x inflated cost on cache reads.
