Description
Hi,
I noticed that when using the standard OpenAI Chat Completions API, prompt caching works effectively: if I modify only the final fragment of the user prompt while keeping the full system prompt and the initial part of the user prompt identical across requests, most tokens are served from the cache.
However, when using AX in a similar pattern (changing only one input field in the signature, typically the last one, plus a slightly longer signature description), the cache is rarely hit. Is this due to how AX constructs the prompt, the ordering of inputs, or something else? Can it be addressed?
I believe this is a significant disadvantage of AX that might be relatively easy to fix.
Context on GPT Prompt Caching:
Token caching in GPT models such as 4o-mini automatically caches the prefix of repeated input prompts (once the shared prefix reaches 1024 tokens, extending in 128-token increments), reducing latency by up to 80% and input costs by 50-90% on cache hits.
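To make the implication concrete, here is a minimal sketch of the caching arithmetic described above (1024-token minimum, 128-token blocks). The helper name and token-list representation are illustrative, not any provider's API; the point is that a cache hit depends entirely on how long the *shared prefix* is between consecutive requests — which is why prompt structure and input ordering matter:

```python
def estimated_cached_tokens(prev_tokens, curr_tokens, min_prefix=1024, block=128):
    """Estimate how many tokens of curr_tokens would be served from cache,
    assuming the prefix-caching rules described above (illustrative only)."""
    # Length of the token prefix shared with the previous request.
    shared = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        shared += 1
    # Caching only kicks in once the shared prefix reaches the minimum,
    # and hits are counted in whole blocks.
    if shared < min_prefix:
        return 0
    return (shared // block) * block

# If only the tail of the prompt changes, most tokens stay cacheable:
prev = list(range(2000))
curr = list(range(1500)) + [-1] * 500   # first 1500 tokens identical
print(estimated_cached_tokens(prev, curr))  # 1408 (11 full 128-token blocks)

# But if a field early in the prompt changes, the shared prefix collapses
# and nothing is cached — which would explain the behavior I see with AX:
curr_early_change = [-1] + list(range(1, 2000))
print(estimated_cached_tokens(prev, curr_early_change))  # 0
```

Under this model, placing the varying input field anywhere before the stable content destroys the cacheable prefix, so ordering stable content first (system prompt, signature description, fixed inputs) and the changing field last should restore cache hits.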
What do you think?
Thanks