Dear @chunduriv
I encountered the same issue. Below you can see that implicit caching only starts triggering at roughly 6k prompt tokens.
I wrote a short reproduction script that tests caching with a 2k-token and an 8k-token prompt; it reliably shows that the 2k-token prompt gets no cache hits at all, while the 8k-token prompt is cached.
```python
#!/usr/bin/env python3
"""
Minimal reproduction: Gemini Flash only caches prompts above ~7k tokens.
pip install openai
"""
import asyncio
import os
import random

from openai import AsyncOpenAI


async def test_gemini_caching(word_count: int):
    """Test Gemini Flash with a specific word count - makes 2 calls to check caching."""
    client = AsyncOpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.environ["GOOGLE_AI_API_KEY"],
    )

    # Generate prompt with random words ONCE
    words = " ".join(
        random.choice(["apple", "banana", "orange", "grape", "cherry"]) for _ in range(word_count)
    )
    messages = [
        {"role": "system", "content": f"You are a helpful assistant.\n\nIgnore: {words}"},
        {"role": "user", "content": "What time is it?"},
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get current time",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ]

    # First call (never cached)
    await client.chat.completions.create(model="gemini-2.5-flash", messages=messages, tools=tools)

    # Second call with SAME prompt (may be cached)
    # Add random number to user message to avoid response caching
    messages[1]["content"] = f"What time is it? (Request #{random.randint(1, 1000000)})"
    response = await client.chat.completions.create(
        model="gemini-2.5-flash", messages=messages, tools=tools
    )
    return response.usage


async def main():
    print("Testing Gemini Flash caching threshold:\n")

    # Test with 2k words
    usage_2k = await test_gemini_caching(2000)
    print(f"2k words (2nd call): {usage_2k}")

    # Test with 8k words
    usage_8k = await test_gemini_caching(8000)
    print(f"8k words (2nd call): {usage_8k}")

    print(
        f"\n→ Cached tokens in 8k call: "
        f"{getattr(usage_8k.prompt_tokens_details, 'cached_tokens', 0) if usage_8k.prompt_tokens_details else 0}"
    )


if __name__ == "__main__":
    asyncio.run(main())
```
Running the script produces the following output:

```
Testing Gemini Flash caching threshold:

2k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=2050, total_tokens=2095, completion_tokens_details=None, prompt_tokens_details=None)
8k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=8050, total_tokens=8099, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=7790))

→ Cached tokens in 8k call: 7790
```
As you can see, even with 2,050 input tokens nothing is cached on the second call.
With 8k tokens, caching works as expected on the second call (7,790 prompt tokens are reported as cached).
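For reference, this is how I read the numbers out of the usage objects above (a minimal helper; it only assumes the OpenAI-compatible CompletionUsage / PromptTokensDetails shape shown in the output):

```python
def cached_token_stats(usage):
    """Return (cached_tokens, cached_fraction) from an OpenAI-style CompletionUsage."""
    details = usage.prompt_tokens_details
    cached = (getattr(details, "cached_tokens", None) if details else None) or 0
    fraction = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
    return cached, fraction

# Applied to the two runs above:
#   2k words -> (0, 0.00)      no implicit cache hit at ~2,050 prompt tokens
#   8k words -> (7790, ~0.97)  almost the whole prompt is served from the cache
```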
We also observe this behavior with Flash in our production systems: short prompts simply don't get cached at all.
The documentation states that implicit cache hits for Flash should occur from 1,024 tokens onward.
This adds up to significant extra cost for us every day, so I would be glad if you could confirm that this will be looked into :)
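To make the cost point concrete, here is a rough back-of-the-envelope sketch of what the missing cache hits cost. Every number in it is an illustrative placeholder (including the input price and the cached-token discount factor); substitute your own traffic and the values from the current pricing page:

```python
# Rough estimate of extra daily spend when short prompts never hit the implicit cache.
# All constants below are hypothetical placeholders, not official numbers.
REQUESTS_PER_DAY = 100_000      # example daily request volume
PROMPT_TOKENS = 2_000           # typical short prompt size (example)
CACHEABLE_FRACTION = 0.9        # share of the prompt that is a repeated prefix (example)
INPUT_PRICE_PER_MTOK = 0.30     # assumed USD per 1M input tokens - check the pricing page
CACHED_PRICE_FACTOR = 0.25      # assumed billing factor for cached tokens - check the pricing page

full_cost = REQUESTS_PER_DAY * PROMPT_TOKENS * INPUT_PRICE_PER_MTOK / 1e6
discounted = full_cost * ((1 - CACHEABLE_FRACTION) + CACHEABLE_FRACTION * CACHED_PRICE_FACTOR)

print(f"input cost without implicit caching: ${full_cost:.2f}/day")
print(f"input cost with implicit caching:    ${discounted:.2f}/day")
print(f"difference:                          ${full_cost - discounted:.2f}/day")
```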