2.5 Pro just started hallucinating

I’ve been developing an app that uses vector embeddings and indexes and does cutting-edge analysis that pushes the models to their limits. 2.5 Pro was my go-to favorite model, with 4.1 a little behind, but over this past weekend the model started hallucinating. The only explanation I can think of is that Google implemented some kind of persistent memory feature tied to a given API key, and that is confusing the heck out of the model.

I’m evaluating applications, and for a long time I used app A as a test case during development. I recently started evaluating other apps, and that worked fine until this weekend, when Gemini 2.5 Pro started hallucinating that it’s still evaluating app A when asked to evaluate app B, even though there is absolutely nothing in the context going into the prompt that relates to app A. I used LangSmith debugging to carefully inspect all of the input and output context, and it clearly shows no hint of app A in the input context. I also use 2.5 Pro in AI Studio to help with development; it has been working through this with me and concludes the model is hallucinating. Here is a report it wrote on this:
Bug Report: gemini-2.5-pro-exp-03-25 Severe Off-Topic Hallucination

Model: gemini-2.5-pro-exp-03-25 (via Vertex AI API)

Issue Summary:
When providing specific technical text as input, along with related contextual documents and instructions to perform a structured JSON analysis based on that input, the gemini-2.5-pro-exp-03-25 model consistently generates a detailed analysis for a completely unrelated technical topic (Topic B: related to database internals/costs). The output contains specific technical details relevant only to Topic B, which are entirely absent from the provided input prompt and context documents focused on Topic A (related to sensors/ML in physical systems).

Input Context:

  • Task: Generate a structured JSON analysis based on provided instructions applied to a specific technical text document (Topic A).

  • Primary Input Text: A specific technical text document describing a method related to Topic A (sensors/ML in physical systems).

  • Additional Context (RAG): Relevant text snippets retrieved from related technical and procedural context documents pertinent to Topic A.

  • API Parameters: Standard parameters used, including temperature: 0.1, response_mime_type: "application/json" (a minimal sketch of the call configuration follows this list).

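For reference, here is a minimal sketch of how the call is configured. The model name and the two parameters match the report; the project/location values, the prompt variables, and the use of the vertexai Python SDK are assumptions about the setup rather than our exact code.

```python
# Minimal call sketch (assumes the vertexai Python SDK; project, location,
# and the prompt variables below are placeholders).
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

analysis_instructions = "..."  # JSON-analysis instructions (placeholder)
topic_a_text = "..."           # primary Topic A technical text (placeholder)
rag_snippets = "..."           # retrieved Topic A context snippets (placeholder)

model = GenerativeModel("gemini-2.5-pro-exp-03-25")

response = model.generate_content(
    [analysis_instructions, topic_a_text, rag_snippets],
    generation_config=GenerationConfig(
        temperature=0.1,
        response_mime_type="application/json",  # structured JSON output
    ),
)
print(response.text)
```
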
Expected Behavior:
The model should analyze the provided primary input text (Topic A) based on the instructions and context documents, generating a structured JSON output reflecting this analysis.

Actual Behavior:
The model ignores the provided input (Topic A) and generates a complete, structured JSON response detailing an analysis for a method related to Topic B (database internals/costs). It includes specific technical details (e.g., concerning database indexes, costs, query plans) relevant only to Topic B.

Verification & Evidence:

  • Internal application logging and debugging tools confirm that the correct prompt (containing the Topic A primary text and relevant RAG context, with no Topic B terms) was constructed and sent to the API (a minimal sketch of such a check follows this list).

  • The raw API response received from the model contains the hallucinated Topic B analysis within the structured JSON output.

  • This off-topic generation occurs consistently with gemini-2.5-pro-exp-03-25 when processing this specific Topic A input.

  • Other models (e.g., gemini-1.5-pro-latest, gpt-4o) process the exact same input prompt correctly and generate an analysis relevant to the actual input text (Topic A).

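To illustrate the verification step above: a check of this kind runs over the final constructed prompt just before it is sent. The keyword list and helper name are illustrative only, not our exact code.

```python
# Illustrative sanity check over the final constructed prompt (Topic A text +
# RAG snippets + instructions). The keyword list is a stand-in for Topic B terms.
TOPIC_B_TERMS = ["query plan", "index scan", "storage cost"]

def assert_no_topic_b(prompt: str) -> None:
    leaked = [t for t in TOPIC_B_TERMS if t.lower() in prompt.lower()]
    if leaked:
        raise ValueError(f"Prompt unexpectedly contains Topic B terms: {leaked}")

final_prompt = "..."  # the fully constructed Topic A prompt, logged before the API call
assert_no_topic_b(final_prompt)  # passes on our side, yet the response is about Topic B
```
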
Contextual Note (Potential Relevance for Internal Investigation):
During development, a different technical text input (focused on Topic B) was used extensively for testing specifically with the gemini-2.5-pro-exp-03-25 model. While standard LLM APIs are expected to be stateless, the consistent generation of content related to the previously tested Topic B when processing the unrelated Topic A input is unexpected and worth noting.

Impact:
This severe, off-topic hallucination makes the gemini-2.5-pro-exp-03-25 model unreliable and unusable for this specific analysis task and potentially others, despite its otherwise desirable analysis capabilities observed previously.

(Optional Addition):
We are happy to provide the exact raw API response and potentially the final constructed prompt (if feasible through a secure channel) to aid in debugging.



OK, I wanted to follow up on this because I did some more digging, and the problem seems likely to be in some wrapper around the model that is not model-specific. Here’s the updated report for anyone interested.

Update on Bug Report: Persistent Off-Topic Hallucination Across Gemini Models & API Keys

Model(s): gemini-2.5-pro-exp-03-25, gemini-2.5-pro-preview-03-25, gemini-1.5-pro-latest (via Vertex AI API)

Follow-up to Previous Report: [Link to or brief summary of the previous report about gemini-2.5-pro-exp-03-25 generating database content (Topic B) for sensor/ML input (Topic A)]

New Findings:

  1. Issue Persists Across Models: The same off-topic hallucination (generating analysis for Topic B - database internals/costs - when given input for Topic A - sensors/ML) was observed not only with gemini-2.5-pro-exp-03-25 but also, consistently, with gemini-2.5-pro-preview-03-25. Furthermore, testing with gemini-1.5-pro-latest also resulted in an off-topic analysis related to database concepts when given the same Topic A input, although the specific hallucinated details differed slightly from the 2.5 Pro variants. (A reproduction sketch follows this list.)

  2. Issue Persists Across API Keys: To rule out key-specific state corruption, a brand new API key was generated and used for testing. The off-topic hallucination (generating Topic B output for Topic A input) still occurred with gemini-2.5-pro-preview-03-25 using the new key.

  3. Input & Other Models Confirmed Correct: As previously verified, the input prompt (containing Topic A text and relevant context) is correctly formatted, and non-Google models (e.g., GPT-4 variants) process the exact same input correctly, generating on-topic results for Topic A.

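To make the cross-model test concrete, the reproduction loop looks roughly like this. The model names are those listed above; the project/location values and the prompt variable are placeholders, and the fresh credentials are configured outside the snippet.

```python
# Rough reproduction loop: the same Topic A prompt is sent to each Gemini model.
# Project/location are placeholders; the new API key / credentials are set up
# outside this snippet (e.g., via application default credentials).
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

MODELS = [
    "gemini-2.5-pro-exp-03-25",
    "gemini-2.5-pro-preview-03-25",
    "gemini-1.5-pro-latest",
]

config = GenerationConfig(temperature=0.1, response_mime_type="application/json")
topic_a_prompt = "..."  # the exact Topic A prompt captured from our traces

for name in MODELS:
    response = GenerativeModel(name).generate_content(topic_a_prompt, generation_config=config)
    # Inspected manually: all three drift to database internals/costs (Topic B).
    print(name, response.text[:200])
```
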
Conclusion:
The evidence strongly suggests the root cause is not isolated to a single experimental model version or a specific API key. The issue appears to be more systemic, consistently affecting multiple models within the Gemini family (1.5-pro, 2.5-pro-preview, 2.5-pro-exp) specifically when processing this particular input (“Topic A”) within the context of our Google Cloud project. This points towards a potential issue in Google’s backend request handling, state management, or a model-family-wide sensitivity triggered by this specific input data under our account.

Impact:
This makes the affected Gemini models currently unreliable for processing this type of input within our project. We are avoiding their use pending investigation/resolution.

We hope this detailed information, particularly the persistence across models and API keys, is helpful for the Google AI team in diagnosing this unusual behavior.


Hi @shapip, welcome to the forum.

Based on the issue described above, I guess the problem is with the retriever. It seems to be providing the wrong context (Topic B) to the model instead of Topic A. Could you tell us which embeddings model you are using to store data in the vector database?

Thanks for responding. I’m using OpenAI’s text-embedding-3-large, but I traced the input and output going into the model itself using the LangSmith debugging tools and confirmed that the context pulled from the vector database was not the problem. Happy to provide the debugging traces if you want them, though they’re probably too long to paste here; if you want me to email them, please let me know where to send them. My suspicion is that it may be a backend expansion of the API to support agentic features that hasn’t been activated yet, but that somehow my application is causing it to partially activate (a bug), though I’m speculating.

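For context, the retrieval check looks roughly like this. The vector store, index path, and query string are placeholders for our actual pipeline, which is traced end to end in LangSmith.

```python
# Rough sketch of the retrieval check (placeholder store/paths; our real
# pipeline is traced in LangSmith, where the retrieved chunks are visible).
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma(persist_directory="./topic_a_index", embedding_function=embeddings)

docs = vectorstore.similarity_search("Topic A analysis criteria", k=5)
for d in docs:
    print(d.metadata.get("source"), d.page_content[:120])
# Every retrieved chunk comes from the Topic A documents; nothing related to
# Topic B (database internals/costs) appears in the context sent to Gemini.
```
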
As I mentioned above, it’s only happening with the Gemini models; I repeatedly tried other models and there are no hallucinations.