Gemini 2.5 Flash Lite: Implicit Caching Not Working Despite Meeting Documented Requirements

Hello Gemini team,

I’ve been testing implicit caching with Gemini 2.5 Flash Lite Latest and encountered unexpected behavior that
differs from the documentation. I’d appreciate clarification on whether Flash Lite has different caching
mechanisms compared to standard Flash.

Issue Summary

Despite meeting the documented 1024-token threshold for implicit caching, Flash Lite consistently returns
cached_content_token_count = 0 in production. In my controlled tests, cache hits appeared at ~3177 input tokens
but never at ~1500 tokens.

Test Evidence

Production Environment (No Cache Hits)

  • Model: gemini-flash-lite-latest
  • System prompt: ~1301 tokens (fixed across requests)
  • Total input: ~1551 tokens per request
  • Results: 377 API calls, 0 cache hits (cached_content_token_count always 0)
  • Configuration: Used system_instruction parameter with identical content

Controlled Test (Partial Cache Hits)

I wrote a test script to systematically verify caching behavior:

Test Scenario 1 (~3177 tokens):
Request 1: cached_content_token_count = 0
Request 2: cached_content_token_count = 0
Request 3: cached_content_token_count = 3060 ✅

Test Scenario 2 (~1500 tokens):
All requests: cached_content_token_count = 0 ❌
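
For reference, here is a minimal sketch of how such a probe could be structured. It is not the exact script behind the numbers above. The names probe_implicit_cache, first_hit and make_sender are illustrative, and make_sender assumes the google-genai client picks up the API key from the environment.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    request_index: int
    prompt_tokens: int
    cached_tokens: int


def probe_implicit_cache(send: Callable[[], object], n_requests: int = 3,
                         delay_s: float = 0.0) -> List[ProbeResult]:
    """Call `send` n_requests times and record the usage metadata of each response."""
    results = []
    for i in range(1, n_requests + 1):
        usage = send().usage_metadata
        results.append(ProbeResult(
            request_index=i,
            # Both fields are treated as optional here; None is counted as 0.
            prompt_tokens=usage.prompt_token_count or 0,
            cached_tokens=usage.cached_content_token_count or 0,
        ))
        if delay_s and i < n_requests:
            time.sleep(delay_s)
    return results


def first_hit(results: List[ProbeResult]) -> Optional[int]:
    """Return the 1-based index of the first request with cached tokens, if any."""
    for r in results:
        if r.cached_tokens > 0:
            return r.request_index
    return None


def make_sender(fixed_prompt: str, user_text: str,
                model: str = "gemini-flash-lite-latest") -> Callable[[], object]:
    """Build a `send` callable backed by the real API (illustrative wiring)."""
    # Imported lazily so the probe logic can be exercised without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumption: API key is provided via the environment
    config = types.GenerateContentConfig(system_instruction=fixed_prompt)
    return lambda: client.models.generate_content(
        model=model, contents=user_text, config=config)
```

Calling probe_implicit_cache(make_sender(prompt, text), n_requests=3) and then first_hit(...) on the result gives the request index at which caching first registered, which is how "Request 3" in Scenario 1 would show up.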

Documentation Gaps

  1. Official documentation states implicit caching triggers at 1024 tokens for Flash models
  2. Reality: My tests saw no cache hits at ~1500 tokens and a first hit at ~3177 tokens, so Flash Lite appears
    to require significantly more than 1024 tokens (possibly ~3000-6000+)
  3. Unclear differentiation: The docs mention “Gemini 2.5 series” but don’t explicitly clarify if Flash Lite has
    different caching thresholds or mechanisms
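
To narrow the gap between "no hit at ~1500" and "hit at ~3177", the threshold could be bracketed with a simple bisection. This is only a sketch: bracket_threshold and the hits callable are hypothetical names, and it assumes caching behaves monotonically with prompt size (every size above the threshold hits, every size below does not), which I have not verified.

```python
from typing import Callable, Tuple


def bracket_threshold(hits: Callable[[int], bool], lo: int, hi: int,
                      step: int = 128) -> Tuple[int, int]:
    """Bisect [lo, hi] until the gap is at most `step` tokens.

    Returns (largest size seen without a cache hit, smallest size seen with one).
    `hits(n)` is expected to send a prompt of roughly n tokens several times
    and report whether any response had cached tokens.
    """
    if hits(lo):
        raise ValueError("lower bound already produces a cache hit")
    if not hits(hi):
        raise ValueError("upper bound never produces a cache hit")
    while hi - lo > step:
        mid = (lo + hi) // 2
        if hits(mid):
            hi = mid   # threshold is at or below mid
        else:
            lo = mid   # threshold is above mid
    return lo, hi
```

In practice each hits(n) call should probably use a prompt whose opening text is unique to that size, so that a shorter prompt is not served from a prefix cached by an earlier, longer one.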

Related Community Reports

I found a community bug report ("Gemini 2.5 Flash implicit caching problem")
where Google engineers confirmed the actual threshold is much higher (~6000-8000 tokens) for standard Flash,
despite documentation stating 1024 tokens.

Questions for Google Team

  1. Does Flash Lite have a different implicit caching threshold than standard Flash?
  2. What is the actual token threshold for Flash Lite to trigger implicit caching?
  3. Is the system_instruction parameter content included in the caching prefix matching?
  4. When will the documentation be updated to reflect actual caching behavior?

Requests

  1. Official clarification on Gemini 2.5 Flash Lite’s caching trigger conditions
  2. Documentation update to explicitly state Flash Lite requirements vs standard Flash
  3. Bug fix or clear guidance on achieving consistent caching behavior at lower token counts (as originally
    documented)

Technical Details

API Configuration:
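    from google import genai
    from google.genai import types

    client = genai.Client()  # assumed setup: API key read from the environment

    response = client.models.generate_content(
        model="models/gemini-flash-lite-latest",
        contents=[…],
        config=types.GenerateContentConfig(
            system_instruction=fixed_prompt,  # ~1301 tokens, identical on every request
            temperature=0.8,
            top_p=0.95,
        ),
    )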

Cache Detection:
    cached_tokens = response.usage_metadata.cached_content_token_count
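
One caveat on detection: as far as I can tell the usage fields are optional in the SDK, so a request without a cache hit may report None rather than 0. A small defensive sketch (the helper names cached_token_count and cache_hit_ratio are mine, not part of the SDK):

```python
def cached_token_count(response) -> int:
    """Return cached tokens for a response, treating a missing or None field as 0."""
    usage = getattr(response, "usage_metadata", None)
    if usage is None:
        return 0
    return getattr(usage, "cached_content_token_count", None) or 0


def cache_hit_ratio(response) -> float:
    """Fraction of prompt tokens served from cache; 0.0 if prompt size is unknown."""
    usage = getattr(response, "usage_metadata", None)
    prompt = (getattr(usage, "prompt_token_count", None) or 0) if usage else 0
    return cached_token_count(response) / prompt if prompt else 0.0
```

Logging cache_hit_ratio per request alongside the raw count makes it easier to see partial hits such as 3060 cached out of ~3177 input tokens.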

Environment:

  • SDK: google-genai (latest)
  • Deployment: Production service with high request volume
  • Use case: Translation service with consistent system prompts

Impact

This discrepancy significantly affects cost estimation and system design for production applications.
Understanding the exact caching behavior is crucial for our service planning.

Thank you for your attention to this issue. Happy to provide additional test data or collaborate on reproducing
the behavior.
