
Prompt cache miss at user task boundary despite identical prefix #2986

@tanzhenxin

Description


What happened?

When running multiple consecutive tasks in a single session, the prompt cache drops to system-prompt level (~18k tokens) on the first request of a new user task, even though the message prefix is identical to the previous request.

The cache works well within the agentic tool-call loop (90–99% hit rate) and recovers on the subsequent user task (99.9%). Only the first request after a task boundary is affected.

Prerequisite: tested on #2897, which preserves reasoning blocks across turns. Without that branch, follow-up requests always break cache because reasoning blocks are stripped from history.

Reproduction:

  1. Start a session on the #2897 branch (feat(core): thinking block cross-turn retention with idle cleanup)
  2. Run a task that involves several turns and tool calls (e.g., "read package.json and summarize the scripts")
  3. Once the task completes, send: `Hi`
  4. Once that responds, send: `Hi again`
  5. Check the OpenAI logs in `~/.qwen/logs/` — the `Hi` request will show a cache drop to ~18k cached tokens, while `Hi again` recovers to ~99%
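The per-request figures can be pulled out of the logs with a short script. This is a sketch only: it assumes each request is logged as one JSON object per line carrying OpenAI-style usage fields (`prompt_tokens`, `prompt_tokens_details.cached_tokens`); the actual schema of the files in `~/.qwen/logs/` may differ, in which case the key paths need adjusting.

```python
import json
import pathlib


def cache_rates(log_path):
    """Yield (input_tokens, cached_tokens, hit_rate) per logged request.

    Assumes a JSONL log with OpenAI-style usage fields; adjust the key
    paths below to match the real log schema.
    """
    for line in pathlib.Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        usage = json.loads(line).get("usage", {})
        inp = usage.get("prompt_tokens", 0)
        cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        yield inp, cached, (cached / inp if inp else 0.0)
```

A drop at a task boundary then shows up as a single row whose hit rate falls back to roughly the system-prompt share of the request.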

Example output:

| Req | Input | Cached | Cache % | Notes |
|----:|------:|-------:|--------:|-------|
| 0 | 17,971 | 0 | 0.0% | Task 1 — cold cache |
| 1 | 18,360 | 17,965 | 97.8% | Task 1 — agentic loop |
| 2 | 18,976 | 18,354 | 96.7% | Task 1 — agentic loop |
| 3 | 20,380 | 18,970 | 93.1% | Task 1 — agentic loop |
| 4 | 20,582 | 20,374 | 99.0% | Task 1 — agentic loop |
| 5 | 22,650 | 20,576 | 90.8% | Task 1 — last request |
| 6 | 23,455 | 17,965 | 76.6% | "Hi" — cache breaks |
| 7 | 23,476 | 23,449 | 99.9% | "Hi again" — cache recovers |
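The hit rates above are simply cached / input. A quick check of the three requests around the boundary, using only figures copied from the table:

```python
# Cache hit rate = cached_tokens / input_tokens, figures from the table above.
requests = {
    "req 5 (last request of task 1)": (22_650, 20_576),
    'req 6 ("Hi", cache breaks)': (23_455, 17_965),
    'req 7 ("Hi again", recovers)': (23_476, 23_449),
}

for label, (inp, cached) in requests.items():
    print(f"{label}: {cached / inp:.1%}")  # 90.8%, 76.6%, 99.9%

# Had req 6 reused req 5's full ~22.6k-token prefix, its rate would be about:
print(f"{22_650 / 23_455:.1%}")  # 96.6%
```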

Req 6 caches only ~18k tokens (system prompt) instead of the expected ~22k (req 5's full prefix).

What did you expect to happen?

Req 6 should cache ~22,650 tokens (the full prefix from req 5), since the message content is unchanged. Expected cache hit rate would be ~90%+ instead of the observed 76.6%.

Investigation so far

  • Per-message MD5 hash comparison of the actual OpenAI request payloads (captured at the pipeline level, before sending) confirms that all shared messages are byte-for-byte identical between req 5 and req 6
  • Non-message request fields (model, tools, stream options) are also identical
  • The only differences are:
    • cache_control annotation placement (non-semantic, used as a cache hint)
    • metadata.promptId (changes every request, outside message content)
  • Thinking/reasoning blocks are retained in history (no stripping occurs)
  • The pattern is reproducible across sessions
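The prefix check described above can be sketched as follows. This is a toy illustration, not the actual pipeline code: it hashes each message's canonical JSON after stripping the non-semantic `cache_control` annotation (whose placement varies between requests), so two payloads can be compared message by message. The payload contents here are invented stand-ins for req 5 and req 6.

```python
import hashlib
import json


def message_hashes(messages):
    """MD5 per message, over canonical JSON with the non-semantic
    cache_control annotation stripped (its placement moves per request)."""
    hashes = []
    for msg in messages:
        semantic = {k: v for k, v in msg.items() if k != "cache_control"}
        blob = json.dumps(semantic, sort_keys=True, separators=(",", ":"))
        hashes.append(hashlib.md5(blob.encode()).hexdigest())
    return hashes


# Toy payloads: req 6 extends req 5's history with the new "Hi" turn,
# so every hash in the shared prefix must match exactly.
req5 = [
    {"role": "system", "content": "...", "cache_control": {"type": "ephemeral"}},
    {"role": "user", "content": "read package.json and summarize the scripts"},
]
req6 = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "read package.json and summarize the scripts",
     "cache_control": {"type": "ephemeral"}},
    {"role": "user", "content": "Hi"},
]

h5, h6 = message_hashes(req5), message_hashes(req6)
assert h6[:len(h5)] == h5  # byte-identical shared prefix despite the moved hint
```

A matching prefix under this comparison, combined with the observed cache drop, is what points at provider-side behavior rather than a client-side mismatch.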

This appears to be a provider-side cache behavior rather than a client-side prefix mismatch.

Metadata



Labels

status/needs-triage (Issue needs to be triaged and labeled), type/bug (Something isn't working as expected)
