What happened?
When running multiple consecutive tasks in a single session, the prompt cache drops to system-prompt level (~18k tokens) on the first request of a new user task, even though the message prefix is identical to the previous request.
Cache works well within an agentic tool-call loop (90–99% hit rate), and recovers on the subsequent user task (99.9%). Only the first request after a task boundary is affected.
Prerequisite: tested on #2897, which preserves reasoning blocks across turns. Without that branch, follow-up requests always break cache because reasoning blocks are stripped from history.
Reproduction:

1. Run a task that involves several turns and tool calls (e.g., "read package.json and summarize the scripts").
2. Once the task completes, send: `Hi`
3. Once that responds, send: `Hi again`
4. Check the OpenAI logs in `~/.qwen/logs/` — the `Hi` request will show a cache drop to ~18k, while `Hi again` recovers to ~99%.
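For reference, the hit rates quoted below were derived from the usage block of each logged response. A minimal sketch of that calculation, assuming one JSON response per log file with an OpenAI-shaped `usage` object (the exact layout of the files in `~/.qwen/logs/` may differ):

```python
import json
from pathlib import Path


def cache_hit_rate(cached_tokens: int, prompt_tokens: int) -> float:
    """Cached share of the prompt, as a percentage."""
    return 100.0 * cached_tokens / prompt_tokens if prompt_tokens else 0.0


def summarize(log_dir: str) -> list[tuple[int, int, float]]:
    """Report (input, cached, hit%) per logged request.

    Assumes each file holds one JSON response whose usage block follows
    the OpenAI shape (usage.prompt_tokens and
    usage.prompt_tokens_details.cached_tokens); adjust the field access
    if the local log format differs.
    """
    rows = []
    for path in sorted(Path(log_dir).glob("*.json")):
        usage = json.loads(path.read_text())["usage"]
        prompt = usage["prompt_tokens"]
        cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        rows.append((prompt, cached, cache_hit_rate(cached, prompt)))
    return rows
```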
Example output:

| Req | Input | Cached | Cache% | Notes |
|-----|-------|--------|--------|-------|
| 0 | 17,971 | 0 | 0.0% | Task 1 — cold cache |
| 1 | 18,360 | 17,965 | 97.8% | Task 1 — agentic loop |
| 2 | 18,976 | 18,354 | 96.7% | Task 1 — agentic loop |
| 3 | 20,380 | 18,970 | 93.1% | Task 1 — agentic loop |
| 4 | 20,582 | 20,374 | 99.0% | Task 1 — agentic loop |
| 5 | 22,650 | 20,576 | 90.8% | Task 1 — last request |
| 6 | 23,455 | 17,965 | 76.6% | "Hi" — cache breaks |
| 7 | 23,476 | 23,449 | 99.9% | "Hi again" — cache recovers |
Req 6 caches only ~18k tokens (system prompt) instead of the expected ~22k (req 5's full prefix).
What did you expect to happen?
Req 6 should cache ~22,650 tokens (the full prefix from req 5), since the message content is unchanged. Expected cache hit rate would be ~90%+ instead of the observed 76.6%.
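The arithmetic behind those two figures, using the req 6 numbers from the table above:

```python
req6_input = 23_455       # req 6 prompt tokens
observed_cached = 17_965  # what the provider actually reused (system prompt only)
expected_cached = 22_650  # req 5's full prefix, unchanged in req 6

observed_rate = 100 * observed_cached / req6_input  # ~76.6%
expected_rate = 100 * expected_cached / req6_input  # ~96.6%
```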
Investigation so far
- Per-message MD5 hash comparison of the actual OpenAI request payloads (captured at the pipeline level before sending) confirms all shared messages are byte-for-byte identical between req 5 and req 6
- Non-message request fields (model, tools, stream options) are also identical
- The only differences are:
  - `cache_control` annotation placement (non-semantic, used as a cache hint)
  - `metadata.promptId` (changes every request, outside message content)
- Thinking/reasoning blocks are retained in history (no stripping occurs)
- The pattern is reproducible across sessions
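The per-message comparison was along these lines. A minimal sketch, where the two payload dicts stand in for the captured req 5 and req 6 request bodies (the capture mechanism itself is not shown here):

```python
import hashlib
import json


def message_hashes(payload: dict) -> list[str]:
    """MD5 of each message, serialized deterministically (sorted keys)
    so byte-identical messages always produce the same digest."""
    return [
        hashlib.md5(
            json.dumps(msg, sort_keys=True, ensure_ascii=False).encode()
        ).hexdigest()
        for msg in payload["messages"]
    ]


def shared_prefix_identical(a: dict, b: dict) -> bool:
    """True if every message of the shorter request matches the
    corresponding message of the longer one."""
    ha, hb = message_hashes(a), message_hashes(b)
    n = min(len(ha), len(hb))
    return ha[:n] == hb[:n]
```

In practice, non-semantic annotations such as `cache_control` would be stripped from each message before hashing, since their placement is expected to move between requests.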
This appears to be a provider-side cache behavior rather than a client-side prefix mismatch.