2.5 Pro just started hallucinating

I’ve been developing an app which uses vector embeddings and indexes and does cutting-edge analysis that pushes the models to their limits. 2.5 Pro was my go-to favorite model, with 4.1 a little behind, but over this past weekend the model started hallucinating. The only possible explanation I can think of is that Google decided to implement some kind of persistent memory feature for a given API key, and that is confusing the heck out of the model. I’m evaluating applications for something and used app A for a long time as a test case during development, but recently started using other apps. That worked fine until this weekend, when Gemini 2.5 Pro started to hallucinate that it’s still evaluating app A when asked to evaluate app B, even though there’s absolutely nothing in the context going into the prompt that relates to app A. I used LangSmith debugging to carefully look at all the input and output context, and it clearly shows no hint of app A in the input context. I also use 2.5 Pro in AI Studio to help me with development; it’s been working on this with me and concludes the model is hallucinating. Here is a report it wrote on this:
Bug Report: gemini-2.5-pro-exp-03-25 Severe Off-Topic Hallucination

Model: gemini-2.5-pro-exp-03-25 (via Vertex AI API)

Issue Summary:
When providing specific technical text as input, along with related contextual documents and instructions to perform a structured JSON analysis based on that input, the gemini-2.5-pro-exp-03-25 model consistently generates a detailed analysis for a completely unrelated technical topic (Topic B: related to database internals/costs). The output contains specific technical details relevant only to Topic B, which are entirely absent from the provided input prompt and context documents focused on Topic A (related to sensors/ML in physical systems).

Input Context:

  • Task: Generate a structured JSON analysis based on provided instructions applied to a specific technical text document (Topic A).

  • Primary Input Text: A specific technical text document describing a method related to Topic A (sensors/ML in physical systems).

  • Additional Context (RAG): Relevant text snippets retrieved from related technical and procedural context documents pertinent to Topic A.

  • API Parameters: Standard parameters used, including temperature: 0.1, response_mime_type: "application/json" (see the call sketch below).
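(A minimal sketch of a call with these parameters, assuming the Vertex AI Python SDK; the project ID, model handle, and prompt contents are illustrative placeholders, not the actual application data.)

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project

# Placeholders standing in for the real instructions, Topic A text, and RAG snippets.
analysis_instructions = "..."
primary_text_topic_a = "..."
rag_context_topic_a = "..."

model = GenerativeModel("gemini-2.5-pro-exp-03-25")
response = model.generate_content(
    "\n\n".join([analysis_instructions, primary_text_topic_a, rag_context_topic_a]),
    generation_config=GenerationConfig(
        temperature=0.1,
        response_mime_type="application/json",
    ),
)
print(response.text)  # expected: structured JSON analysis of Topic A
```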

Expected Behavior:
The model should analyze the provided primary input text (Topic A) based on the instructions and context documents, generating a structured JSON output reflecting this analysis.

Actual Behavior:
The model ignores the provided input (Topic A) and generates a complete, structured JSON response detailing an analysis for a method related to Topic B (database internals/costs). It includes specific technical details (e.g., concerning database indexes, costs, query plans) relevant only to Topic B.

Verification & Evidence:

  • Internal application logging and debugging tools confirm the correct prompt (containing the Topic A primary text and relevant RAG context, with no Topic B terms) was constructed and sent to the API.

  • The raw API response received from the model contains the hallucinated Topic B analysis within the structured JSON output.

  • This off-topic generation occurs consistently with gemini-2.5-pro-exp-03-25 when processing this specific Topic A input.

  • Other models (e.g., gemini-1.5-pro-latest, gpt-4o) process the exact same input prompt correctly and generate an analysis relevant to the actual input text (Topic A).

Contextual Note (Potential Relevance for Internal Investigation):
During development, a different technical text input (focused on Topic B) was used extensively for testing specifically with the gemini-2.5-pro-exp-03-25 model. While standard LLM APIs are expected to be stateless, the consistent generation of content related to the previously tested Topic B when processing the unrelated Topic A input is unexpected and worth noting.

Impact:
This severe, off-topic hallucination makes the gemini-2.5-pro-exp-03-25 model unreliable and unusable for this specific analysis task and potentially others, despite its otherwise desirable analysis capabilities observed previously.

(Optional Addition):
We are happy to provide the exact raw API response and potentially the final constructed prompt (if feasible through a secure channel) to aid in debugging.


OK, I wanted to follow up on this because I did some more digging, and the problem seems likely to be some wrapper around the model that is not model specific. Here’s the updated report for anyone interested.

Update on Bug Report: Persistent Off-Topic Hallucination Across Gemini Models & API Keys

Model(s): gemini-2.5-pro-exp-03-25, gemini-2.5-pro-preview-03-25, gemini-1.5-pro-latest (via Vertex AI API)

Follow-up to Previous Report: [Link to or brief summary of the previous report about gemini-2.5-pro-exp-03-25 generating database content (Topic B) for sensor/ML input (Topic A)]

New Findings:

  1. Issue Persists Across Models: The same off-topic hallucination (generating analysis for Topic B, database internals/costs, when given input for Topic A, sensors/ML) was observed not only with gemini-2.5-pro-exp-03-25 but also consistently with gemini-2.5-pro-preview-03-25. Furthermore, testing with gemini-1.5-pro-latest also resulted in an off-topic analysis related to database concepts when given the same Topic A input, although the specific hallucinated details differed slightly from the 2.5 Pro variants.

  2. Issue Persists Across API Keys: To rule out key-specific state corruption, a brand new API key was generated and used for testing. The off-topic hallucination (generating Topic B output for Topic A input) still occurred with gemini-2.5-pro-preview-03-25 using the new key.

  3. Input & Other Models Confirmed Correct: As previously verified, the input prompt (containing Topic A text and relevant context) is correctly formatted, and non-Google models (e.g., GPT-4 variants) process the exact same input correctly, generating on-topic results for Topic A.

Conclusion:
The evidence strongly suggests the root cause is not isolated to a single experimental model version or a specific API key. The issue appears to be more systemic, consistently affecting multiple models within the Gemini family (1.5-pro, 2.5-pro-preview, 2.5-pro-exp) specifically when processing this particular input (“Topic A”) within the context of our Google Cloud project. This points towards a potential issue in Google’s backend request handling, state management, or a model-family-wide sensitivity triggered by this specific input data under our account.

Impact:
This makes the affected Gemini models currently unreliable for processing this type of input within our project. We are avoiding their use pending investigation/resolution.

We hope this detailed information, particularly the persistence across models and API keys, is helpful for the Google AI team in diagnosing this unusual behavior.

Hi @shapip, welcome to the forum.

Based on the issue described above, I suspect the problem is with the retriever. It seems to be providing the wrong context (Topic B) to the model instead of the correct one (Topic A). Could you tell us which embedding model you are using to store data in the vector database?

Thanks for responding. I’m using OpenAI text-embedding-3-large, but I traced the input and output to the model itself using the LangSmith debugging tools and confirmed it was not the context pulled from the vector database that was incorrect. Happy to provide the debugging output if you want it, though it’s probably too long to paste here. If you want me to email it to you, please let me know where to send it. My suspicion is that it may be a backend expansion of the API to work with agentic features that hasn’t been activated yet, but that somehow my application is causing it to partially activate (a bug), though I’m speculating.
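For anyone who wants to run a similar check outside LangSmith, here is a rough, framework-agnostic way to see what the retriever actually scores highest, assuming the OpenAI Python SDK; the query and chunk strings are placeholders, not my real data.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

query = "Topic A: sensors/ML method to evaluate"                 # placeholder query
chunks = ["...a Topic A snippet...", "...a Topic B snippet..."]  # placeholder chunks

q = embed([query])
c = embed(chunks)
scores = (c @ q.T).ravel() / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
for chunk, score in sorted(zip(chunks, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {chunk[:60]}")
```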

As I mentioned above, it’s only happening with the Gemini models; I repeatedly tried other models and there were no hallucinations.

OK, I think I figured it out and solved the problem. It looks like it was automatic caching by the API: the first paragraph or so of my prompt was the same for every application, and it seems this was causing the Gemini models to pull the wrong information from the cache. I solved it by including the application number in the first line, and now the hallucinations appear to have gone away. I hope this is helpful for any other developers experiencing this problem. So happy to have my favorite model back to being useful!
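For anyone else hitting this, here’s a minimal sketch of the workaround, assuming you assemble the prompt yourself in Python; the function and variable names are just illustrative.

```python
def build_prompt(app_id: str, shared_instructions: str, app_text: str, rag_context: str) -> str:
    """Prefix every prompt with a per-application identifier so prompts for
    different applications never share an identical opening (cacheable) prefix."""
    return "\n\n".join([
        f"Application under evaluation: {app_id}",  # unique first line
        shared_instructions,                        # boilerplate shared across applications
        app_text,
        rag_context,
    ])

# e.g. prompt = build_prompt("APP-B", instructions, app_b_text, app_b_snippets)
```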

I am in the same situation. I uploaded some docx files and pictures, and the 2.5 Pro model began to ignore anything I typed, stuck on its previous answers. I have no idea how to pull it out of this endless loop; there is no way to continue effective communication. :smiling_face_with_tear:

Put a unique identifier in the first sentence of the prompt; that worked for me. But now there’s a new problem with the 05-06 update (which apparently all the old 2.5 Pro identifiers were redirected to): they added some type of thinking wrapper to it which 1) seems to take forever, 2) isn’t helpful for tightly crafted RAG context processing and is probably harmful, and 3) apparently is charged against the output tokens, causing the 8192 limit to be easily exceeded even though the true output is nowhere near that. I’m going to do some more testing right now, but yesterday I had to switch back to 1.5 Pro, and I’m going to update my GPT-4.1 prompts and go back to using that model if I can’t get 2.5 Pro working again.

Just tried it again; still broken. This is what I’m getting back in my JSON structure when I call the model, which shows that the 03-25 identifier is definitely being redirected to the 05-06 model, and that the token limit is being hit internally (17112 - 8920 = 8192, exactly the default output cap).
"usage_metadata": {
  "prompt_token_count": 8920,
  "total_token_count": 17112
},
"model_version": "models/gemini-2.5-pro-preview-05-06"

OK, I was able to work through most of the issues with the 05-06 model; it’s still the best model for RAG after all once I got it working again. The issue is the output tokens: I believe the old limit was 8192, but even if it wasn’t, that is still the default limit if you don’t set it. The new limit is 64k, so you can specify a higher limit in your API call; if you’re using Python, use the max_output_tokens parameter. I set it to 32k and all my RAG modes started working again, and I was getting the same great output I had been getting before. Maybe I was too harsh on Google (it was very annoying), but 2.5 Pro is definitely still the best model even after the 05-06 update; you just need to explicitly deal with the max token issue.
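In case it saves someone time, this is roughly what the fix looks like, assuming the Vertex AI Python SDK; the model name matches what the API reported back above, the generation settings mirror the earlier report, and 32768 is just the value I settled on.

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project

prompt = "..."  # your constructed RAG prompt

model = GenerativeModel("gemini-2.5-pro-preview-05-06")
response = model.generate_content(
    prompt,
    generation_config=GenerationConfig(
        temperature=0.1,
        response_mime_type="application/json",
        max_output_tokens=32768,  # default 8192 gets eaten by the thinking tokens
    ),
)
print(response.text)
```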