Critical bug: Vertex API with context cache leaks prompt state between generateContent calls

We have found what we believe is a serious bug when an explicit context cache is used together with prompts containing images on the Vertex API. We have a complex system with moderately sized (~5000-token) prompts that we are migrating from the Gemini API to the Vertex API, and found that post-migration, generate_content calls were producing ‘nonsensical’ responses that appeared to answer prompts we had supplied earlier.

We have put together a reproducer in Python that is as simple as possible (~100 lines of code) and reproduces the bug 100% of the time. It first creates an explicit context cache containing ~2500 tokens (‘lorem ipsum’ repeated). It then makes a generate_content call with an image of a cat in the prompt, asking the model to identify it, followed by a second generate_content call with an image of a dog in the prompt, again asking the model to identify it. When the Vertex API is used with a context cache (the contents of the cache don’t matter, it just has to be present), the second generate_content call produces what looks like a response to the earlier prompt (i.e. it says ‘cat’ even though the prompt contains only a dog). Correct responses are produced by the Vertex API when no context cache is used, and by the Gemini API both with and without a context cache.
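For reference, here is a condensed sketch of the flow reproducer.py follows, written against the google-genai Python SDK. It is not the attached script verbatim; SDK call names (caches.create, CreateCachedContentConfig, cached_content, Part.from_bytes) are from the current SDK and the exact shapes may differ slightly from the attachment:

```python
import os

# ~2500 words of filler for the cache. The cache contents are irrelevant to
# the bug; the cache just has to be attached to the request.
LOREM = "lorem ipsum dolor sit amet " * 500

def identify_animals(client, cache_name=None):
    """Two independent generate_content calls: a cat image, then a dog image."""
    from google.genai import types  # deferred so the sketch reads standalone
    for image_path in ("cat.jpg", "dog.jpg"):
        with open(image_path, "rb") as f:
            image = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=[image, "Identify the animal in one word."],
            config=types.GenerateContentConfig(cached_content=cache_name),
        )
        # Expected: Cat, then Dog. With the Vertex client plus a cache we
        # instead see Cat printed twice.
        print(response.text)

def main():
    from google import genai
    from google.genai import types
    # Vertex client; the reproducer also builds a Gemini API client with
    # genai.Client(api_key=os.environ["MY_GEMINI_API_KEY"]) for comparison.
    client = genai.Client(
        vertexai=True,
        project=os.environ["MY_VERTEX_PROJECT_ID"],
        location="us-central1",  # fails every time here; only sometimes on ‘global’
    )
    cache = client.caches.create(
        model="gemini-2.5-flash-lite",
        config=types.CreateCachedContentConfig(contents=[LOREM]),
    )
    identify_animals(client, cache_name=cache.name)

# Only attempt live calls when credentials are configured.
if os.environ.get("MY_VERTEX_PROJECT_ID"):
    main()
```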

Observations:

  • The issue seems to be somewhat region-dependent: it fails every time on us-central1 but only sometimes on ‘global’.
  • The issue only seems to happen with prompts containing images (and only for images above a certain size).
  • The issue is not model-specific: the reproducer uses gemini-2.5-flash-lite, but it occurs with other models too.
  • We have examined the REST API calls being made and they contain the correct data, so this looks like a backend issue, not an SDK issue.

I would appreciate you looking into this. It is a blocker for us because it means we can’t use the Vertex API and context caching together.

Steps to reproduce

The reproducer demonstrates things working on the Gemini API and breaking on the Vertex API, so to run it you need both a Gemini API key and a project/service account with the Vertex AI API enabled. Set these environment variables:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-keys.json
MY_VERTEX_PROJECT_ID=project-id
MY_GEMINI_API_KEY=AIza...

Then unpack and run the attachment (contains reproducer.py, cat.jpg and dog.jpg):

tar xvfz reproducer.tar.gz
python reproducer.py

The reproducer prints:

Model is prompted with a cat image only and asked to identify, then separately prompted with a dog image only and asked to identify.
Each test should print 'Cat' followed by 'Dog'.

Testing with Gemini API (no context cache):
Cat
Dog

Testing with Vertex API (no context cache):
Cat
Dog

Testing with Gemini API (using context cache):
Cat
Dog

Testing with Vertex API (using context cache):
Cat
Cat

The output should be Cat/Dog every time. No reference to a cat is made in the second generate_content call, so in the Vertex API + context cache case, state from the first generate_content call is somehow leaking into the second, which sounds pretty serious.

reproducer.tar.gz

Thanks so much in advance for looking into this!

-Adrian


+1. We have also been experiencing this recently. We see the problem even when images are not present in the prompt but other file/MIME types, such as PDFs, are.

This seems like a critical issue?!