Hi,
Looking at our usage, we realised that nearly none of our API calls to Flash resulted in any cached tokens. After investigating, we found the following:
From our small experiments we can reproducibly see that caching only kicks in above roughly 6k input tokens, while the documentation states that caching for Flash should already start beyond 1k tokens.
I wrote a short reproduction script that tests caching with a 2k-token and an 8k-token prompt, and it reliably shows that there are no cache hits even at 2k tokens:
#!/usr/bin/env python3
"""
Minimal reproduction: Gemini Flash only caches prompts above ~7k tokens.
pip install openai
"""
import asyncio
import os
import random

from openai import AsyncOpenAI


async def test_gemini_caching(word_count: int):
    """Test Gemini Flash with a specific word count - makes 2 calls to check caching"""
    # API key is read from the GOOGLE_AI_API_KEY environment variable.
    client = AsyncOpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.environ["GOOGLE_AI_API_KEY"],
    )

    # Generate prompt with random words ONCE
    words = " ".join(
        random.choice(["apple", "banana", "orange", "grape", "cherry"]) for _ in range(word_count)
    )
    messages = [
        {"role": "system", "content": f"You are a helpful assistant.\n\nIgnore: {words}"},
        {"role": "user", "content": "What time is it?"},
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get current time",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ]

    # First call (never cached)
    await client.chat.completions.create(model="gemini-2.5-flash", messages=messages, tools=tools)

    # Second call with the SAME prompt prefix (may be cached)
    # Add a random number to the user message to avoid response caching
    messages[1]["content"] = f"What time is it? (Request #{random.randint(1, 1000000)})"
    response = await client.chat.completions.create(
        model="gemini-2.5-flash", messages=messages, tools=tools
    )
    return response.usage


async def main():
    print("Testing Gemini Flash caching threshold:\n")

    # Test with 2k words
    usage_2k = await test_gemini_caching(2000)
    print(f"2k words (2nd call): {usage_2k}")

    # Test with 8k words
    usage_8k = await test_gemini_caching(8000)
    print(f"8k words (2nd call): {usage_8k}")

    print(
        f"\nCached tokens in 8k call: {getattr(usage_8k.prompt_tokens_details, 'cached_tokens', 0) if usage_8k.prompt_tokens_details else 0}"
    )


if __name__ == "__main__":
    asyncio.run(main())
Running the script produces the following output:
Testing Gemini Flash caching threshold:
2k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=2050, total_tokens=2095, completion_tokens_details=None, prompt_tokens_details=None)
8k words (2nd call): CompletionUsage(completion_tokens=10, prompt_tokens=8050, total_tokens=8099, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=7790))
Cached tokens in 8k call: 7790
As you can see, even with 2,050 input tokens nothing is cached on the second call.
With ~8k input tokens, caching works as expected on the second call.
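To narrow down where the threshold actually sits (we ended up at roughly 6k), a simple sweep over prompt sizes can be appended to the script above. This is just a sketch reusing the test_gemini_caching helper, and the probe sizes are arbitrary:
async def sweep_caching_threshold():
    # Probe a few prompt sizes around the suspected threshold; word counts are arbitrary.
    for word_count in [1000, 2000, 4000, 6000, 7000, 8000]:
        usage = await test_gemini_caching(word_count)
        details = usage.prompt_tokens_details
        cached = (details.cached_tokens or 0) if details else 0
        print(f"{word_count} words: prompt_tokens={usage.prompt_tokens}, cached_tokens={cached}")

# asyncio.run(sweep_caching_threshold())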
We can also observe this behavior with Flash in our production systems: short prompts simply don't get cached at all.
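For context, this is roughly how we measure it: aggregate the usage objects returned per request and compute what share of prompt tokens was served from the cache (a hypothetical helper, not our actual production code):
def cache_hit_ratio(usages):
    # usages: CompletionUsage objects collected from chat.completions responses.
    total_prompt = 0
    total_cached = 0
    for u in usages:
        total_prompt += u.prompt_tokens or 0
        if u.prompt_tokens_details:
            total_cached += u.prompt_tokens_details.cached_tokens or 0
    return total_cached / total_prompt if total_prompt else 0.0
For our short-prompt Flash traffic this ratio is essentially zero.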
The documentation specifies that implicit cache hits for Flash should occur from 1,024 tokens onwards.
We know that implicit caching is best-effort and can be unreliable, but right now it reliably does not work below roughly 6k input tokens, which makes Flash prohibitively expensive for short tasks (classification, etc.).
I reposted this from Gemini 2.5 Flash implicit caching problem - #4 by Toni_A as this is quite an urgent issue for us. We are spending more than 1k on Flash calls every single day and plan to increase this rapidly. Caching could reduce costs for us (and Google) by 75%, so it would be great if someone could take a look at this!
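For reference, the 75% figure comes from the discount on cached input tokens; here is a quick back-of-the-envelope sketch of the arithmetic, where the prices, the daily volume and the cacheable share are illustrative assumptions rather than exact figures:
# Back-of-the-envelope: why a working implicit cache cuts input cost by up to 75%.
# All numbers are illustrative assumptions, not our real traffic or exact Google prices.
INPUT_PRICE_PER_1M = 0.30     # assumed standard Flash input price (USD per 1M tokens)
CACHED_PRICE_PER_1M = 0.075   # assumed cached-token price (25% of standard -> 75% discount)

tokens_per_day = 4_000_000_000   # placeholder daily input volume
cacheable_share = 0.9            # assumed share of prompt tokens that is a repeated prefix

cost_no_cache = tokens_per_day / 1e6 * INPUT_PRICE_PER_1M
cost_with_cache = tokens_per_day / 1e6 * (
    (1 - cacheable_share) * INPUT_PRICE_PER_1M + cacheable_share * CACHED_PRICE_PER_1M
)
print(f"no cache:   ${cost_no_cache:,.2f}/day")
print(f"with cache: ${cost_with_cache:,.2f}/day ({1 - cost_with_cache / cost_no_cache:.0%} saved)")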