
fix(caching): add Responses API params to cache key allow-list#25673

Merged
ishaan-berri merged 1 commit into BerriAI:litellm_ishaan_april14 from michelligabriele:fix/responses-api-cache-key
Apr 14, 2026

Conversation

@michelligabriele
Collaborator

Relevant issues

No linked GitHub issue. Reported by a customer via Slack with a full Postman reproduction; I verified the bug against current main before sending this PR.

Pre-Submission checklist

  • I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement): tests/test_litellm/test_model_param_helper.py::test_get_all_llm_api_params_includes_responses_api. A second, behavioral cache-key test is added in tests/local_testing/test_unit_test_caching.py::test_get_cache_key_responses_api, mirroring the existing chat / embedding / text-completion cache-key tests in that file.
  • My PR passes all unit tests on make test-unit — targeted and adjacent suites pass locally (tests/test_litellm/test_model_param_helper.py, tests/local_testing/test_unit_test_caching.py::test_get_cache_key_*, tests/test_litellm/caching/test_caching_handler.py, ruff-clean on litellm/litellm_core_utils/model_param_helper.py).
  • My PR's scope is as isolated as possible; it solves one specific problem — one source file touched (litellm/litellm_core_utils/model_param_helper.py), strictly additive union into the cache-key allow-list. No chat / text / embedding / transcription / rerank behavior changes.
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review — will do immediately after this PR is opened.

CI (LiteLLM team)

  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Reproduced against current main on a local proxy (in-memory cache, gpt-4.1, native OpenAI /v1/responses path, no DB). Two POST /v1/responses calls with identical model / input / temperature, differing only in instructions:

  • Call A — instructions: "summarize the weather on 10th May"
  • Call B — instructions: "summarize the weather on 7th May"

Before fix

Call A-10May: HTTP 200, total=0.046513s
  "On **10th May 2018**, the weather was **chilly** with a temperature of **-2°C (29°F)**."
Call B-7May:  HTTP 200, total=0.027232s
  "On **10th May 2018**, the weather was **chilly** with a temperature of **-2°C (29°F)**."
Diff: IDENTICAL

Both calls return the 10 May body; call B is served from cache in 27 ms. The proxy debug log shows the same pre-hash key and the same SHA-256 on both requests:

Created cache key: model: openai/gpt-4.1input: [{'role': 'user', ...}]temperature: 0.3
Hashed cache key (SHA-256): f2dc610942bd00aaecde317cf7844550d33a464d4af53b94ea6ab8f144ab8af2
Cache Hit!

instructions is absent from the pre-hash key string on both requests — that is the bug.

After fix

Call A-10May -> payload-A.json
HTTP 200  total=2.486610s  ttfb=2.486447s
  On **10th May 2018**, the weather was **chilly** with a temperature of **-2°C (29°F)**.

Call B-7May  -> payload-B.json
HTTP 200  total=1.625355s  ttfb=1.625228s
  On **7th May 2018**, the weather was **bracing** with a temperature of **14°C (57°F)**.

Diff: DIFFERENT

Both calls are now real upstream round trips (>1.6 s) and each returns the body for the correct instructions. Three independent signals of the fix: (1) call-B latency goes from 27 ms to 1.6 s, (2) call-B body flips from the 10 May content to the 7 May content, (3) the normalized-body diff verdict flips from IDENTICAL to DIFFERENT. No regression on call A.

Added test run (local):

tests/test_litellm/test_model_param_helper.py::test_get_all_llm_api_params_includes_responses_api PASSED
tests/local_testing/test_unit_test_caching.py::test_get_cache_key_responses_api PASSED
tests/local_testing/test_unit_test_caching.py::test_get_cache_key_chat_completion PASSED
tests/local_testing/test_unit_test_caching.py::test_get_cache_key_embedding PASSED
tests/local_testing/test_unit_test_caching.py::test_get_cache_key_text_completion PASSED
tests/local_testing/test_unit_test_caching.py::test_get_kwargs_for_cache_key PASSED
tests/test_litellm/test_model_param_helper.py::test_cached_relevant_logging_args_matches_dynamic PASSED
tests/test_litellm/test_model_param_helper.py::test_get_standard_logging_model_parameters_filters PASSED
tests/test_litellm/test_model_param_helper.py::test_get_standard_logging_model_parameters_excludes_prompt_content PASSED

Type

🐛 Bug Fix

Changes

The bug

Cache.get_cache_key() (litellm/caching/caching.py:294) builds the key from ModelParamHelper._get_all_llm_api_params(), which today unions the supported kwargs for chat / text / embedding / transcription / rerank — and nothing else. Native-OpenAI /v1/responses requests pass through the caching handler unchanged, so every Responses-API-only top-level kwarg is silently dropped from the key.

Under openai==2.30.0 (the hard-pinned version in pyproject.toml and requirements.txt), the Responses-only top-level kwargs that are currently being dropped are:

background, context_management, conversation, include, instructions,
max_output_tokens, max_tool_calls, parallel_tool_calls, previous_response_id,
prompt, prompt_cache_key, prompt_cache_retention, reasoning,
safety_identifier, service_tier, store, text, partial_images

model, input, temperature, top_p, stream, tools, tool_choice, user, metadata, top_logprobs, truncation, stream_options happen to survive because they collide with names in the chat / text / embedding TypedDicts — pure name coincidence. That is exactly the pattern the Pfizer customer noticed: changing input invalidates their cache, but changing instructions does not.

The user-facing effect is a silent correctness bug on /v1/responses: two requests that differ only in (e.g.) instructions collapse to the same cache entry and the second is served a stale 200. No error, no warning — just the wrong body.
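A minimal stand-in for the allow-list filtering (not LiteLLM's actual implementation — the real logic lives in Cache.get_cache_key()) shows how dropping instructions from the allowed set collapses the two calls onto one key:

```python
import hashlib

def make_cache_key(kwargs, allowed_params):
    # Keep only allow-listed params, join them into a "param: value" pre-hash
    # string, then hash it -- a simplified sketch of the cache-key flow.
    pre_hash = "".join(
        f"{k}: {kwargs[k]}" for k in sorted(kwargs) if k in allowed_params
    )
    return hashlib.sha256(pre_hash.encode()).hexdigest()

call_a = {"model": "openai/gpt-4.1", "input": "weather?", "temperature": 0.3,
          "instructions": "summarize the weather on 10th May"}
call_b = {**call_a, "instructions": "summarize the weather on 7th May"}

before_fix = {"model", "input", "temperature"}        # instructions dropped
after_fix = before_fix | {"instructions"}             # fix: allow-listed

# Before: both calls hash identically, so call B gets A's cached body.
assert make_cache_key(call_a, before_fix) == make_cache_key(call_b, before_fix)
# After: the keys diverge and each call gets its own cache entry.
assert make_cache_key(call_a, after_fix) != make_cache_key(call_b, after_fix)
```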

The fix

Add a sixth helper, _get_litellm_supported_responses_api_kwargs(), on ModelParamHelper that sources Responses-API kwargs from openai.types.responses.response_create_params.ResponseCreateParamsNonStreaming / Streaming — a one-to-one mirror of the existing _get_litellm_supported_chat_completion_kwargs() helper — and union the returned set into _get_all_llm_api_params().

from openai.types.responses.response_create_params import (
    ResponseCreateParamsNonStreaming,
    ResponseCreateParamsStreaming,
)

# ...

@staticmethod
def _get_litellm_supported_responses_api_kwargs() -> Set[str]:
    \"\"\"
    Get the litellm supported responses API kwargs

    This follows the OpenAI API Spec
    \"\"\"
    non_streaming_params: Set[str] = set(
        getattr(ResponseCreateParamsNonStreaming, \"__annotations__\", {}).keys()
    )
    streaming_params: Set[str] = set(
        getattr(ResponseCreateParamsStreaming, \"__annotations__\", {}).keys()
    )
    return non_streaming_params.union(streaming_params)
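The annotation-union pattern can be checked in isolation with stand-in TypedDicts (the names below are illustrative; the real helper reads ResponseCreateParamsNonStreaming / Streaming from the pinned openai SDK):

```python
from typing import Set, TypedDict

# Stand-ins for the openai SDK's ResponseCreateParamsNonStreaming /
# ResponseCreateParamsStreaming TypedDicts, reduced to a few fields.
class NonStreamingParams(TypedDict, total=False):
    model: str
    input: str
    instructions: str

class StreamingParams(TypedDict, total=False):
    model: str
    input: str
    stream: bool

def union_param_names(*typed_dicts) -> Set[str]:
    # Same shape as the new helper: union the __annotations__ keys
    # of each TypedDict into one set of supported kwarg names.
    names: Set[str] = set()
    for td in typed_dicts:
        names |= set(getattr(td, "__annotations__", {}).keys())
    return names

# The union contains every field from either variant.
assert union_param_names(NonStreamingParams, StreamingParams) == {
    "model", "input", "instructions", "stream",
}
```

Because the set is derived from the SDK's own type annotations, a new field added to either TypedDict in a future SDK release is picked up without any LiteLLM change.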

Why this shape

  • Strictly additive. Chat / text / embedding / transcription / rerank cache keys are byte-identical before and after. Only /v1/responses requests see a change, and that change is the intended correctness fix. One-time cache-miss surge on first run after upgrade as previously-collapsed entries split into real per-kwarg entries; steady state recovers immediately.
  • Sourced from the openai SDK, not duplicated. Same convention used for chat / text / embedding / transcription; no drift between LiteLLM's allow-list and OpenAI's own spec. Future additions to ResponseCreateParamsNonStreaming / Streaming flow through automatically.
  • No try/except import wrapper. openai = "2.30.0" is hard-pinned in pyproject.toml and requirements.txt (no ^/~/>=), so there is no resolver path under which ResponseCreateParamsNonStreaming could be missing. Top-level import matches chat / text / embedding. (The transcription helper uses a try/except only because typed transcription params landed later in the openai SDK timeline — not relevant here.)
  • metadata is still excluded. The existing _get_exclude_kwargs() == {"metadata"} step still runs, so metadata behavior is unchanged for every call type including Responses. Whether metadata should be part of the Responses cache key is a separate discussion.
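The exclude step can be sketched as set subtraction (the param names below are illustrative; the real code subtracts _get_exclude_kwargs() from the full union of all six helpers):

```python
# Union of supported kwargs (illustrative subset of the real allow-list).
responses_kwargs = {"model", "input", "instructions", "metadata"}

# Mirrors _get_exclude_kwargs(): metadata never participates in the key.
exclude_kwargs = {"metadata"}

allowed_params = responses_kwargs - exclude_kwargs
assert "instructions" in allowed_params      # newly cache-key-relevant
assert "metadata" not in allowed_params      # still excluded, as before
```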

Files touched

  • litellm/litellm_core_utils/model_param_helper.py — add the import, add the helper, union the helper's output into _get_all_llm_api_params().
  • tests/test_litellm/test_model_param_helper.py — new test test_get_all_llm_api_params_includes_responses_api: asserts 13 Responses-API-only kwargs are present in the allow-list. Failure message names exactly what regressed.
  • tests/local_testing/test_unit_test_caching.py — new test test_get_cache_key_responses_api: mirrors the existing chat / embedding / text-completion cache-key tests in this file. Asserts instructions yields a distinct cache key from a baseline payload, plus a parametric loop over previous_response_id, reasoning, include, max_output_tokens, background, plus a sanity-check that identical payloads still collide (so cache hits still work).

Three files, 103 net additions, zero deletions.

@vercel

vercel Bot commented Apr 14, 2026

The latest updates on your projects.

Project: litellm
Deployment: Ready (Preview, Comment)
Updated (UTC): Apr 14, 2026 3:51am


@greptile-apps
Contributor

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR fixes a silent correctness bug in Cache.get_cache_key() where Responses API-specific parameters (e.g. instructions, previous_response_id, reasoning) were absent from the cache-key allow-list, causing two requests that differed only in those fields to collapse onto the same cache entry and the second to be served a stale response.

The fix adds _get_litellm_supported_responses_api_kwargs() on ModelParamHelper, sourcing params from openai.types.responses.response_create_params, and unions it into _get_all_llm_api_params() — a one-to-one mirror of the existing chat/text/embedding/transcription/rerank pattern. The change is strictly additive; chat, text, embedding, transcription, and rerank cache keys are byte-identical before and after.

Confidence Score: 5/5

  • Safe to merge — strictly additive fix with clear reproduction proof, no regressions on existing call types, and solid test coverage.
  • All findings are P2 or lower. The fix is a minimal, well-scoped union into an existing allowlist, sourced directly from OpenAI SDK types to avoid drift. Tests are pure unit tests (no real network calls), the pattern is consistent with every other call-type helper in the file, and the one-time cache-miss side effect is intentional and documented.
  • No files require special attention.

Important Files Changed

  • litellm/litellm_core_utils/model_param_helper.py — Adds _get_litellm_supported_responses_api_kwargs() sourced from OpenAI SDK types and unions it into _get_all_llm_api_params(); strictly additive, consistent with the existing chat/text/embedding/transcription/rerank pattern.
  • tests/local_testing/test_unit_test_caching.py — Adds test_get_cache_key_responses_api covering the instructions regression plus parametric spot-checks for 5 other Responses-only params; no real network calls made — uses only Cache() and cache.get_cache_key().
  • tests/test_litellm/test_model_param_helper.py — Adds test_get_all_llm_api_params_includes_responses_api asserting 13 Responses-API-only kwargs are present in the allow-list with a clear failure message; serves as a regression guard for future SDK changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Cache.get_cache_key(kwargs)"] --> B["ModelParamHelper._get_all_llm_api_params()"]
    B --> C["_get_litellm_supported_chat_completion_kwargs()"]
    B --> D["_get_litellm_supported_text_completion_kwargs()"]
    B --> E["_get_litellm_supported_embedding_kwargs()"]
    B --> F["_get_litellm_supported_transcription_kwargs()"]
    B --> G["_get_litellm_supported_rerank_kwargs()"]
    B --> H["_get_litellm_supported_responses_api_kwargs() ✨ NEW"]
    H --> I["ResponseCreateParamsNonStreaming.__annotations__"]
    H --> J["ResponseCreateParamsStreaming.__annotations__"]
    C & D & E & F & G & H --> K["union of all param sets"]
    K --> L["minus _get_exclude_kwargs() = {'metadata'}"]
    L --> M["allowed_params set"]
    M --> N["Filter kwargs → build key string"]
    N --> O["SHA-256 hash → cache key"]


@codspeed-hq
Contributor

codspeed-hq Bot commented Apr 14, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing michelligabriele:fix/responses-api-cache-key (8549774) with main (e64d98f)

Open in CodSpeed

@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@ishaan-berri ishaan-berri changed the base branch from main to litellm_ishaan_april14 April 14, 2026 16:52
@ishaan-berri ishaan-berri merged commit f6058bd into BerriAI:litellm_ishaan_april14 Apr 14, 2026
48 of 51 checks passed