Hi,
Looking at our usage, we realised that nearly none of our API calls to Flash resulted in any cached tokens. After investigating, we found the following:
From our small experiments we can reproducibly see that caching only kicks in above roughly 6k input tokens, while the documentation states that caching for Flash should already start beyond 1k tokens.
I wrote a short reproduction script that tests caching with a 2k-token and an 8k-token prompt, and it reliably shows that there are no cache hits even at 2k tokens:
#!/usr/bin/env python3
"""
Minimal reproduction: Gemini Flash only caches prompts above ~7k tokens.
pip install openai
"""
import asyncio
import os
import random

from openai import AsyncOpenAI


async def test_gemini_caching(word_count: int):
    """Test Gemini Flash with a specific word count - makes 2 calls to check caching"""
    # API key is read from the GOOGLE_AI_API_KEY environment variable.
    client = AsyncOpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.environ["GOOGLE_AI_API_KEY"],
    )

    # Generate prompt with random words ONCE
    words = " ".join(
        random.choice(["apple", "banana", "orange", "grape", "cherry"]) for _ in range(word_count)
    )
    messages = [
        {"role": "system", "content": f"You are a helpful assistant.\n\nIgnore: {words}"},
        {"role": "user", "content": "What time is it?"},
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get current time",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ]

    # First call (never cached)
    await client.chat.completions.create(model="gemini-2.5-flash", messages=messages, tools=tools)

    # Second call with the SAME prompt prefix (may be cached)
    # Add a random number to the user message to avoid response caching
    messages[1]["content"] = f"What time is it? (Request #{random.randint(1, 1000000)})"
    response = await client.chat.completions.create(
        model="gemini-2.5-flash", messages=messages, tools=tools
    )
    return response.usage


async def main():
    print("Testing Gemini Flash caching threshold:\n")

    # Test with 2k words
    usage_2k = await test_gemini_caching(2000)
    print(f"2k words (2nd call): {usage_2k}")

    # Test with 8k words
    usage_8k = await test_gemini_caching(8000)
    print(f"8k words (2nd call): {usage_8k}")

    print(
        f"\nCached tokens in 8k call: {getattr(usage_8k.prompt_tokens_details, 'cached_tokens', 0) if usage_8k.prompt_tokens_details else 0}"
    )


if __name__ == "__main__":
    asyncio.run(main())
Running the script produces the following output:
Testing Gemini Flash caching threshold:
2k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=2050, total_tokens=2095, completion_tokens_details=None, prompt_tokens_details=None)
8k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=8050, total_tokens=8099, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=7790))
Cached tokens in 8k call: 7790
As you can see, even with 2,050 input tokens nothing is cached on the second call.
With ~8k input tokens, caching works as expected on the second call.
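To narrow down where the threshold actually sits (we ended up at roughly 6k), a simple sweep over prompt sizes can be appended to the script above. This is just a sketch reusing the test_gemini_caching helper, and the probe sizes are arbitrary:
async def sweep_caching_threshold():
    # Probe a few prompt sizes around the suspected threshold; word counts are arbitrary.
    for word_count in [1000, 2000, 4000, 6000, 7000, 8000]:
        usage = await test_gemini_caching(word_count)
        details = usage.prompt_tokens_details
        cached = (details.cached_tokens or 0) if details else 0
        print(f"{word_count} words: prompt_tokens={usage.prompt_tokens}, cached_tokens={cached}")

# asyncio.run(sweep_caching_threshold())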
We can also observe this behavior with Flash in our production systems: short prompts simply don't get cached at all.
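For context, this is roughly how we measure it: aggregate the usage objects returned per request and compute what share of prompt tokens was served from the cache (a hypothetical helper, not our actual production code):
def cache_hit_ratio(usages):
    # usages: CompletionUsage objects collected from chat.completions responses.
    total_prompt = 0
    total_cached = 0
    for u in usages:
        total_prompt += u.prompt_tokens or 0
        if u.prompt_tokens_details:
            total_cached += u.prompt_tokens_details.cached_tokens or 0
    return total_cached / total_prompt if total_prompt else 0.0
For our short-prompt Flash traffic this ratio is essentially zero.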
The documentation specifies that implicit cache hits for Flash should occur from 1,024 tokens onwards.
We know that implicit caching is best-effort and can be unreliable, but right now it reliably does not work below roughly 6k input tokens, which makes Flash prohibitively expensive for short tasks (classification, etc.).
I reposted this from Gemini 2.5 Flash implicit caching problem - #4 by Toni_A as this is quite an urgent issue for us. We are spending more than 1k on Flash calls every single day and plan to increase this rapidly. Caching could reduce costs for us (and Google) by 75%, so it would be great if someone could take a look at this!
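For reference, the 75% figure comes from the discount on cached input tokens; here is a quick back-of-the-envelope sketch of the arithmetic, where the prices, the daily volume and the cacheable share are illustrative assumptions rather than exact figures:
# Back-of-the-envelope: why a working implicit cache cuts input cost by up to 75%.
# All numbers are illustrative assumptions, not our real traffic or exact Google prices.
INPUT_PRICE_PER_1M = 0.30     # assumed standard Flash input price (USD per 1M tokens)
CACHED_PRICE_PER_1M = 0.075   # assumed cached-token price (25% of standard -> 75% discount)

tokens_per_day = 4_000_000_000   # placeholder daily input volume
cacheable_share = 0.9            # assumed share of prompt tokens that is a repeated prefix

cost_no_cache = tokens_per_day / 1e6 * INPUT_PRICE_PER_1M
cost_with_cache = tokens_per_day / 1e6 * (
    (1 - cacheable_share) * INPUT_PRICE_PER_1M + cacheable_share * CACHED_PRICE_PER_1M
)
print(f"no cache:   ${cost_no_cache:,.2f}/day")
print(f"with cache: ${cost_with_cache:,.2f}/day ({1 - cost_with_cache / cost_no_cache:.0%} saved)")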