fix(caching): add Responses API params to cache key allow-list #25673
Conversation
Greptile Summary: This PR fixes a silent correctness bug in cache-key generation for `/v1/responses` requests. The fix adds a Responses-API kwargs helper and unions its output into the cache-key allow-list. Confidence Score: 5/5
| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/model_param_helper.py | Adds _get_litellm_supported_responses_api_kwargs() sourced from OpenAI SDK types and unions it into _get_all_llm_api_params(); strictly additive, consistent with existing chat/text/embedding/transcription/rerank pattern. |
| tests/local_testing/test_unit_test_caching.py | Adds test_get_cache_key_responses_api covering the instructions regression plus parametric spot-checks for 5 other Responses-only params; no real network calls made — uses only Cache() and cache.get_cache_key(). |
| tests/test_litellm/test_model_param_helper.py | Adds test_get_all_llm_api_params_includes_responses_api asserting 13 Responses-API-only kwargs are present in the allow-list with a clear failure message; serves as a regression guard for future SDK changes. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Cache.get_cache_key(kwargs)"] --> B["ModelParamHelper._get_all_llm_api_params()"]
B --> C["_get_litellm_supported_chat_completion_kwargs()"]
B --> D["_get_litellm_supported_text_completion_kwargs()"]
B --> E["_get_litellm_supported_embedding_kwargs()"]
B --> F["_get_litellm_supported_transcription_kwargs()"]
B --> G["_get_litellm_supported_rerank_kwargs()"]
B --> H["_get_litellm_supported_responses_api_kwargs() ✨ NEW"]
H --> I["ResponseCreateParamsNonStreaming.__annotations__"]
H --> J["ResponseCreateParamsStreaming.__annotations__"]
C & D & E & F & G & H --> K["union of all param sets"]
K --> L["minus _get_exclude_kwargs() = {'metadata'}"]
L --> M["allowed_params set"]
M --> N["Filter kwargs → build key string"]
N --> O["SHA-256 hash → cache key"]
```
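The pipeline in the flowchart can be sketched in a few lines. This is a hypothetical stand-in, not litellm's real API: the param-set names, `EXCLUDE`, and `get_cache_key` here are illustrative, and the real allow-list is much larger.

```python
import hashlib

# Illustrative stand-ins for the per-call-type param sets in the flowchart.
CHAT_PARAMS = {"model", "messages", "temperature", "top_p", "stream"}
RESPONSES_PARAMS = {"model", "input", "instructions", "previous_response_id", "metadata"}
EXCLUDE = {"metadata"}  # mirrors _get_exclude_kwargs()

# Union of all param sets, minus the exclude set = the allow-list.
ALLOWED = (CHAT_PARAMS | RESPONSES_PARAMS) - EXCLUDE

def get_cache_key(**kwargs) -> str:
    # Keep only allow-listed kwargs, sort for determinism, then SHA-256 the string.
    parts = [f"{k}: {kwargs[k]}" for k in sorted(kwargs) if k in ALLOWED]
    return hashlib.sha256(",".join(parts).encode()).hexdigest()

# With Responses params in the allow-list, changing `instructions` changes the key.
key_a = get_cache_key(model="gpt-4.1", input="hi", instructions="summarize 10th May")
key_b = get_cache_key(model="gpt-4.1", input="hi", instructions="summarize 7th May")
assert key_a != key_b
```

The key point the flowchart encodes: any kwarg absent from `ALLOWED` is invisible to the hash, which is exactly why a missing Responses param set caused collisions.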
Reviews (1): Last reviewed commit: "fix(caching): add Responses API params t..."
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Merged commit f6058bd into BerriAI:litellm_ishaan_april14
Relevant issues
No linked GitHub issue. Reported by a customer via Slack with a full Postman reproduction; I verified the bug against current `main` before sending this PR.

Pre-Submission checklist
- Testing added in the `tests/test_litellm/` directory (adding at least 1 test is a hard requirement): `tests/test_litellm/test_model_param_helper.py::test_get_all_llm_api_params_includes_responses_api`. A second cache-key behavioral test is also added in `tests/local_testing/test_unit_test_caching.py::test_get_cache_key_responses_api`, mirroring the existing chat / embedding / text-completion cache-key tests in that file.
- `make test-unit` — targeted and adjacent suites pass locally (`tests/test_litellm/test_model_param_helper.py`, `tests/local_testing/test_unit_test_caching.py::test_get_cache_key_*`, `tests/test_litellm/caching/test_caching_handler.py`; ruff-clean on `litellm/litellm_core_utils/model_param_helper.py`).
- Production change confined to `litellm/litellm_core_utils/model_param_helper.py`, a strictly additive union into the cache-key allow-list. No chat / text / embedding / transcription / rerank behavior changes.
- I have run `@greptileai` and received a Confidence Score of at least 4/5 before requesting a maintainer review — will do immediately after this PR is open.

CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Reproduced against current `main` on a local proxy (in-memory cache, `gpt-4.1`, native OpenAI `/v1/responses` path, no DB). Two `POST /v1/responses` calls with identical `model`/`input`/`temperature`, differing only in `instructions`:

- Call A: `instructions: "summarize the weather on 10th May"`
- Call B: `instructions: "summarize the weather on 7th May"`

Before fix

Both calls return the 10 May body; call B is served from cache in 27 ms. The proxy debug log shows the same pre-hash key and the same SHA-256 on both requests: `instructions` is absent from the pre-hash key string on both requests — that is the bug.

After fix

Both calls are now real upstream round trips (>1.6 s) and each returns the body for the correct `instructions`. Three independent signals of the fix: (1) call-B latency goes from 27 ms to 1.6 s, (2) call-B body flips from the 10 May content to the 7 May content, (3) the normalized-body diff verdict flips from IDENTICAL to DIFFERENT. No regression on call A.

Added test run (local):
Type
🐛 Bug Fix
Changes
The bug
`Cache.get_cache_key()` (`litellm/caching/caching.py:294`) builds the key from `ModelParamHelper._get_all_llm_api_params()`, which today unions the supported kwargs for chat / text / embedding / transcription / rerank — and nothing else. Native-OpenAI `/v1/responses` requests pass through the caching handler unchanged, so every Responses-API-only top-level kwarg is silently dropped from the key.

Under `openai==2.30.0` (the hard-pinned version in `pyproject.toml` and `requirements.txt`), Responses-only top-level kwargs such as `instructions`, `previous_response_id`, `reasoning`, `include`, `max_output_tokens`, and `background` are currently being dropped. `model`, `input`, `temperature`, `top_p`, `stream`, `tools`, `tool_choice`, `user`, `metadata`, `top_logprobs`, `truncation`, and `stream_options` happen to survive because they collide with names in the chat / text / embedding TypedDicts — pure name coincidence. That is exactly the pattern the Pfizer customer noticed: changing `input` invalidates their cache, but changing `instructions` does not.

The user-facing effect is a silent correctness bug on `/v1/responses`: two requests that differ only in (e.g.) `instructions` collapse to the same cache entry and the second is served a stale 200. No error, no warning — just the wrong body.
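The collision mechanism can be shown in isolation. This is a sketch of the pre-fix failure mode with an illustrative stand-in allow-list (`OLD_ALLOWED` and `old_cache_key` are not litellm names): because `instructions` is not allow-listed, it never reaches the hash.

```python
import hashlib

# Illustrative pre-fix allow-list: no Responses-only kwargs present.
OLD_ALLOWED = {"model", "input", "temperature"}

def old_cache_key(**kwargs) -> str:
    # Same filter-sort-hash scheme; `instructions` is filtered out before hashing.
    parts = [f"{k}: {kwargs[k]}" for k in sorted(kwargs) if k in OLD_ALLOWED]
    return hashlib.sha256(",".join(parts).encode()).hexdigest()

k1 = old_cache_key(model="gpt-4.1", input="weather", instructions="10th May")
k2 = old_cache_key(model="gpt-4.1", input="weather", instructions="7th May")
assert k1 == k2  # silent collision: the second call is served the stale body
```

Note that changing `input` still produces a different key in this sketch, matching the "name coincidence" survival described above.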
Add a sixth helper, `_get_litellm_supported_responses_api_kwargs()`, on `ModelParamHelper` that sources Responses-API kwargs from `openai.types.responses.response_create_params.ResponseCreateParamsNonStreaming`/`Streaming` — a one-to-one mirror of the existing `_get_litellm_supported_chat_completion_kwargs()` helper — and union the returned set into `_get_all_llm_api_params()`.
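A self-contained sketch of the helper's mechanism. The stand-in TypedDicts below play the role of the OpenAI SDK's `ResponseCreateParamsNonStreaming`/`Streaming` (the real helper imports those from `openai.types.responses.response_create_params`), and the field lists here are abbreviated, not the SDK's full set:

```python
from typing import TypedDict

# Abbreviated stand-ins for the SDK's request TypedDicts.
class _NonStreaming(TypedDict, total=False):
    model: str
    input: str
    instructions: str
    previous_response_id: str

class _Streaming(TypedDict, total=False):
    model: str
    input: str
    instructions: str
    stream: bool

def supported_responses_api_kwargs() -> set:
    # Union the annotation keys of both variants, as the new helper does.
    return set(_NonStreaming.__annotations__) | set(_Streaming.__annotations__)

assert {"instructions", "previous_response_id", "stream"} <= supported_responses_api_kwargs()
```

Sourcing the set from `__annotations__` rather than a hand-written list is what makes future SDK additions flow through without further code changes.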
Why this shape

- Only `/v1/responses` requests see a change, and that change is the intended correctness fix. One-time cache-miss surge on first run after upgrade as previously-collapsed entries split into real per-kwarg entries; steady state recovers immediately.
- Future SDK params added to `ResponseCreateParamsNonStreaming`/`Streaming` flow through automatically.
- No `try/except` import wrapper. `openai = "2.30.0"` is hard-pinned in `pyproject.toml` and `requirements.txt` (no `^`/`~`/`>=`), so there is no resolver path under which `ResponseCreateParamsNonStreaming` could be missing. Top-level import matches chat / text / embedding. (The transcription helper uses a `try/except` only because typed transcription params landed later in the openai SDK timeline — not relevant here.)
- `metadata` is still excluded. The existing `_get_exclude_kwargs() == {"metadata"}` step still runs, so `metadata` behavior is unchanged for every call type including Responses. Whether `metadata` should be part of the Responses cache key is a separate discussion.
- `litellm/litellm_core_utils/model_param_helper.py` — add the import, add the helper, union the helper's output into `_get_all_llm_api_params()`.
- `tests/test_litellm/test_model_param_helper.py` — new test `test_get_all_llm_api_params_includes_responses_api`: asserts 13 Responses-API-only kwargs are present in the allow-list. Failure message names exactly what regressed.
- `tests/local_testing/test_unit_test_caching.py` — new test `test_get_cache_key_responses_api`: mirrors the existing chat / embedding / text-completion cache-key tests in this file. Asserts `instructions` yields a distinct cache key from a baseline payload, plus a parametric loop over `previous_response_id`, `reasoning`, `include`, `max_output_tokens`, `background`, plus a sanity check that identical payloads still collide (so cache hits still work).

Three files, 103 net additions, zero deletions.
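The parametric cache-key check can be sketched without litellm installed. The real test calls litellm's `Cache().get_cache_key`; here `key` and `ALLOWED` are self-contained stand-ins, and the per-param values are illustrative:

```python
import hashlib

# Stand-in allow-list containing the Responses-only kwargs exercised by the test.
ALLOWED = {"model", "input", "instructions", "previous_response_id",
           "reasoning", "include", "max_output_tokens", "background"}

def key(payload: dict) -> str:
    # Same filter-sort-hash scheme as the cache-key path under test.
    parts = [f"{k}: {payload[k]}" for k in sorted(payload) if k in ALLOWED]
    return hashlib.sha256(",".join(parts).encode()).hexdigest()

baseline = {"model": "gpt-4.1", "input": "hello"}
for param, value in [
    ("instructions", "summarize"),
    ("previous_response_id", "resp_123"),  # illustrative value
    ("reasoning", {"effort": "low"}),      # illustrative value
    ("include", ["output.logprobs"]),      # illustrative value
    ("max_output_tokens", 64),
    ("background", True),
]:
    # Each Responses-only kwarg must perturb the key relative to the baseline.
    assert key({**baseline, param: value}) != key(baseline)

# Sanity check: identical payloads must still collide so cache hits keep working.
assert key(dict(baseline)) == key(dict(baseline))
```

This mirrors the structure described for `test_get_cache_key_responses_api`: one distinctness assertion per param plus a positive cache-hit check.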