We have started trying out the implicit caching behavior on the Gemini 2.5 Pro models. Our system prompt is over 2,048 tokens and is always static. Even when we send two requests back to back (around 10 seconds in total: roughly 4 seconds for the first call, a 2-second gap, then 4 seconds for the second), the cache_read property on usage_metadata is 0. The documentation states that no parameter needs to be set to enable this, but that does not seem to be the case for us. The model we are using is gemini-2.5-pro-05-06.
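A stripped-down sketch of the check we are running looks roughly like this (the prompt text is a placeholder, not our real prompt; we call the model through langchain-google-genai, detailed further down):

```python
# Minimal repro sketch: the same static messages sent twice, a couple of seconds apart.
import time

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro-05-06")

messages = [
    SystemMessage(content="..."),  # static system prompt, well over 2048 tokens
    HumanMessage(content="..."),   # identical user prompt on both requests
]

for attempt in (1, 2):
    response = llm.invoke(messages)  # each call takes roughly 4 seconds
    details = (response.usage_metadata or {}).get("input_token_details", {})
    print(f"attempt {attempt}: cache_read={details.get('cache_read')}")  # stays 0
    time.sleep(2)
```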
Hey Emir,
Thanks for reaching out and detailing your experience with implicit caching on Gemini 2.5 Pro. I understand how frustrating it can be when a feature isn’t behaving as expected, especially with a static, large prompt like yours.
It’s interesting that usage_metadata.cache_read is returning 0 even after multiple sequential requests. You’re absolutely right that the documentation suggests that implicit caching should “just work” without explicit parameter settings for static contexts, which your setup clearly fits.
To help us dig into this a bit more effectively, could you clarify a couple of things?
1. “Static” prompt: You mentioned your prompt is “static.” Does this mean the entire prompt, including the user input within the conversation, remains unchanged across those two requests, or just the system prompt portion?
2. Request content: Are the two requests identical in terms of prompt structure and content, or are there any subtle differences in the user turns or other elements?
3. API client/SDK: Are you using a specific Google-provided client library (e.g., the Python SDK, the Node.js SDK) or making direct REST API calls? And which version?
4. Full prompt structure (anonymized): Without revealing sensitive info, could you give a high-level idea of how your prompt is structured? For example, is it a single system message, or are there multiple parts?
Also, just to confirm, you’re using gemini-2.5-pro-05-06, which is great to know.
We’re actively working to improve the caching mechanisms, and your detailed feedback is invaluable. We appreciate your patience as we look into this!
Thanks.
Hello Deepakishore,
- The prompt is fully static; the entire system prompt and the user prompt are the same between the two requests.
- They are exactly the same.
- We are using langchain-google-genai with the following versions:
- langchain 0.3.24
- langchain-google-genai 2.1.5
- google-genai 1.20.0
- It is a single long system prompt that instructs the agent on how to act. The output is produced with with_structured_output, which acts as a forced tool call on Gemini models in LangChain (sketched below).
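Roughly, the structured-output part of the setup looks like this (placeholder schema and prompt text, not our real ones):

```python
# Sketch of the structured-output setup; include_raw=True keeps the underlying
# AIMessage available so usage_metadata (and cache_read) stays visible.
from pydantic import BaseModel
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI

class AgentOutput(BaseModel):  # stand-in for our real output schema
    answer: str

llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro-05-06")
# On Gemini models, LangChain implements with_structured_output as a forced tool call.
structured_llm = llm.with_structured_output(AgentOutput, include_raw=True)

result = structured_llm.invoke([
    SystemMessage(content="..."),  # the long static system prompt
    HumanMessage(content="..."),
])
# result is a dict: {"raw": AIMessage, "parsed": AgentOutput, "parsing_error": None}
usage = result["raw"].usage_metadata or {}
print(usage.get("input_token_details", {}).get("cache_read"))
```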
Hi @Emir_Arditi,
Thank you so much for the detailed and clear answers to my questions! This information is incredibly helpful in understanding the context of the implicit caching issue you’re experiencing.
To recap, you’ve confirmed that:
- Your entire prompt (both system and user parts) is indeed static and identical across sequential requests.
- You’re specifically using langchain-google-genai, with langchain 0.3.24, langchain-google-genai 2.1.5, and google-genai 1.20.0.
- The prompt itself is a single long system prompt, and you’re leveraging with_structured_output for a forced tool call.
This is a really interesting edge case, especially with the use of with_structured_output and the LangChain integration. While implicit caching is designed to handle static prompts, there might be specific interactions with structured output, or with the way these libraries build the request payload, that could be affecting its behavior.
We’re actively looking into how with_structured_output and tool calling might interact with our caching mechanisms, and your use case is providing valuable data. We really appreciate your patience and cooperation as we debug this.
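In the meantime, one way to narrow this down on your side (purely a sketch with placeholder prompts, not an official diagnostic) would be to run the same two-request test directly against the google-genai SDK, with no tools bound. If cached tokens show up there but not through the LangChain path, the interaction is likely in how the request payload is built; if they don’t, that points at the caching behavior itself.

```python
# Rough isolation test: bypass LangChain and call the API directly, then check
# whether the second request reports any cached prompt tokens.
import time

from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

SYSTEM_PROMPT = "..."  # the same static system prompt (> 2048 tokens)
USER_PROMPT = "..."    # the same user prompt on both requests

for attempt in (1, 2):
    response = client.models.generate_content(
        model="gemini-2.5-pro-05-06",  # model id as given above
        contents=USER_PROMPT,
        config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
    )
    usage = response.usage_metadata
    # At the API level, implicit cache hits show up as cached_content_token_count.
    print(f"attempt {attempt}: cached={usage.cached_content_token_count} "
          f"prompt={usage.prompt_token_count}")
    time.sleep(2)
```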
Thanks again for the excellent detail!