Dear @chunduriv
I encountered the same issue. Below you can see that implicit caching only starts triggering at roughly 6k prompt tokens.
I wrote a short reproduction script that tests caching with a 2k-token and an 8k-token prompt; it reliably shows that the 2k-token prompt gets no cache hits at all, while the 8k-token prompt is cached.
```python
#!/usr/bin/env python3
"""
Minimal reproduction: Gemini Flash only caches prompts above ~7k tokens.
pip install openai
"""
import asyncio
import os
import random

from openai import AsyncOpenAI


async def test_gemini_caching(word_count: int):
    """Test Gemini Flash with a specific word count - makes 2 calls to check caching."""
    client = AsyncOpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.environ["GOOGLE_AI_API_KEY"],
    )

    # Generate prompt with random words ONCE
    words = " ".join(
        random.choice(["apple", "banana", "orange", "grape", "cherry"]) for _ in range(word_count)
    )
    messages = [
        {"role": "system", "content": f"You are a helpful assistant.\n\nIgnore: {words}"},
        {"role": "user", "content": "What time is it?"},
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get current time",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ]

    # First call (never cached)
    await client.chat.completions.create(model="gemini-2.5-flash", messages=messages, tools=tools)

    # Second call with SAME prompt (may be cached)
    # Add random number to user message to avoid response caching
    messages[1]["content"] = f"What time is it? (Request #{random.randint(1, 1000000)})"
    response = await client.chat.completions.create(
        model="gemini-2.5-flash", messages=messages, tools=tools
    )
    return response.usage


async def main():
    print("Testing Gemini Flash caching threshold:\n")

    # Test with 2k words
    usage_2k = await test_gemini_caching(2000)
    print(f"2k words (2nd call): {usage_2k}")

    # Test with 8k words
    usage_8k = await test_gemini_caching(8000)
    print(f"8k words (2nd call): {usage_8k}")

    print(
        f"\n→ Cached tokens in 8k call: "
        f"{getattr(usage_8k.prompt_tokens_details, 'cached_tokens', 0) if usage_8k.prompt_tokens_details else 0}"
    )


if __name__ == "__main__":
    asyncio.run(main())
```
Running the script produces the following output:

```
Testing Gemini Flash caching threshold:

2k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=2050, total_tokens=2095, completion_tokens_details=None, prompt_tokens_details=None)
8k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=8050, total_tokens=8099, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=7790))

→ Cached tokens in 8k call: 7790
```
As you can see, even with 2,050 input tokens nothing is cached on the second call.
With 8k tokens, caching works as expected on the second call (7,790 prompt tokens are reported as cached).
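For reference, this is how I read the numbers out of the usage objects above (a minimal helper; it only assumes the OpenAI-compatible CompletionUsage / PromptTokensDetails shape shown in the output):

```python
def cached_token_stats(usage):
    """Return (cached_tokens, cached_fraction) from an OpenAI-style CompletionUsage."""
    details = usage.prompt_tokens_details
    cached = (getattr(details, "cached_tokens", None) if details else None) or 0
    fraction = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
    return cached, fraction

# Applied to the two runs above:
#   2k words -> (0, 0.00)      no implicit cache hit at ~2,050 prompt tokens
#   8k words -> (7790, ~0.97)  almost the whole prompt is served from the cache
```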
We also observe this behavior with Flash in our production systems: short prompts simply don't get cached at all.
The documentation states that implicit cache hits for Flash should occur from 1,024 tokens onward.
This adds up to significant extra cost for us every day, so I would be glad if you could confirm that this will be looked into :)
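To make the cost point concrete, here is a rough back-of-the-envelope sketch of what the missing cache hits cost. Every number in it is an illustrative placeholder (including the input price and the cached-token discount factor); substitute your own traffic and the values from the current pricing page:

```python
# Rough estimate of extra daily spend when short prompts never hit the implicit cache.
# All constants below are hypothetical placeholders, not official numbers.
REQUESTS_PER_DAY = 100_000      # example daily request volume
PROMPT_TOKENS = 2_000           # typical short prompt size (example)
CACHEABLE_FRACTION = 0.9        # share of the prompt that is a repeated prefix (example)
INPUT_PRICE_PER_MTOK = 0.30     # assumed USD per 1M input tokens - check the pricing page
CACHED_PRICE_FACTOR = 0.25      # assumed billing factor for cached tokens - check the pricing page

full_cost = REQUESTS_PER_DAY * PROMPT_TOKENS * INPUT_PRICE_PER_MTOK / 1e6
discounted = full_cost * ((1 - CACHEABLE_FRACTION) + CACHEABLE_FRACTION * CACHED_PRICE_FACTOR)

print(f"input cost without implicit caching: ${full_cost:.2f}/day")
print(f"input cost with implicit caching:    ${discounted:.2f}/day")
print(f"difference:                          ${full_cost - discounted:.2f}/day")
```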