I’ve been developing an app that uses vector embeddings and indexes and does cutting-edge analysis that pushes models to their limits. 2.5 Pro was my go-to favorite model, with 4.1 a little behind, but over this past weekend the model started hallucinating. The only explanation I can think of is that Google implemented some kind of persistent memory feature tied to an API key, and that is confusing the heck out of the model.

I’m evaluating applications for something and used app A for a long time as a test case during development, but recently started using other apps. That worked fine until this weekend, when Gemini 2.5 Pro started hallucinating that it’s still evaluating app A when asked to evaluate app B, even though there’s absolutely nothing in the context going into the prompt that relates to app A. I used LangSmith debugging to carefully inspect all the input and output context, and it clearly shows no hint of app A in the input context. I also use 2.5 Pro in AI Studio to help with development; it has been working through this with me and concludes the model is hallucinating. Here is a report it wrote on this:
Bug Report: gemini-2.5-pro-exp-03-25 Severe Off-Topic Hallucination
Model: gemini-2.5-pro-exp-03-25 (via Vertex AI API)
Issue Summary:
When providing specific technical text as input, along with related contextual documents and instructions to perform a structured JSON analysis based on that input, the gemini-2.5-pro-exp-03-25 model consistently generates a detailed analysis for a completely unrelated technical topic (Topic B: related to database internals/costs). The output contains specific technical details relevant only to Topic B, which are entirely absent from the provided input prompt and context documents focused on Topic A (related to sensors/ML in physical systems).
Input Context:
- Task: Generate a structured JSON analysis based on provided instructions applied to a specific technical text document (Topic A).
- Primary Input Text: A specific technical text document describing a method related to Topic A (sensors/ML in physical systems).
- Additional Context (RAG): Relevant text snippets retrieved from related technical and procedural context documents pertinent to Topic A.
- API Parameters: Standard parameters used, including temperature: 0.1 and response_mime_type: "application/json".
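For reference, the request described above can be sketched as a plain generateContent payload. This is a minimal illustration, not the application's actual code: the prompt wording, variable names, and helper function are placeholders, while the field names (`contents`, `generationConfig`, `temperature`, `responseMimeType`) follow the public Gemini REST API.

```python
def build_request(primary_text: str, rag_snippets: list[str]) -> dict:
    """Assemble a generateContent request body for the analysis task.

    The prompt structure here is illustrative; only the payload field
    names match the public Gemini REST API.
    """
    context_block = "\n\n".join(rag_snippets)
    prompt = (
        "Return a structured JSON analysis of the document below, "
        "following the provided instructions.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Document:\n{primary_text}"
    )
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "temperature": 0.1,
            "responseMimeType": "application/json",
        },
    }

payload = build_request("Topic A text ...", ["snippet 1", "snippet 2"])
```

With a payload assembled this way, every token the model sees is visible in one place, which is what makes the logging claims below checkable.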
Expected Behavior:
The model should analyze the provided primary input text (Topic A) based on the instructions and context documents, generating a structured JSON output reflecting this analysis.
Actual Behavior:
The model ignores the provided input (Topic A) and generates a complete, structured JSON response detailing an analysis for a method related to Topic B (database internals/costs). It includes specific technical details (e.g., concerning database indexes, costs, query plans) relevant only to Topic B.
Verification & Evidence:
- Internal application logging and debugging tools confirm the correct prompt (containing the Topic A primary text and relevant RAG context, with no Topic B terms) was constructed and sent to the API.
- The raw API response received from the model contains the hallucinated Topic B analysis within the structured JSON output.
- This off-topic generation occurs consistently with gemini-2.5-pro-exp-03-25 when processing this specific Topic A input.
- Other models (e.g., gemini-1.5-pro-latest, gpt-4o) process the exact same input prompt correctly and generate an analysis relevant to the actual input text (Topic A).
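The logging check described above can be reduced to a small guard run on the final constructed prompt just before it is sent. This is a sketch, not the application's actual verification code; the term list is an illustrative stand-in for Topic B vocabulary.

```python
# Illustrative markers for Topic B (database internals/costs);
# the real list would come from the actual test-case document.
TOPIC_B_TERMS = ["index scan", "query plan", "cost estimate"]

def find_leaked_terms(prompt: str, forbidden_terms: list[str]) -> list[str]:
    """Return any forbidden terms present in the outgoing prompt
    (case-insensitive substring match)."""
    lowered = prompt.lower()
    return [t for t in forbidden_terms if t.lower() in lowered]

# A Topic A prompt should trip none of the Topic B markers.
leaks = find_leaked_terms("Sensor calibration via ML in physical systems ...",
                          TOPIC_B_TERMS)
assert leaks == []
```

If this guard passes on every request yet the response still discusses Topic B, the contamination cannot be coming from the client-side prompt, which is the point the evidence list above is making.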
Contextual Note (Potential Relevance for Internal Investigation):
During development, a different technical text input (focused on Topic B) was used extensively for testing specifically with the gemini-2.5-pro-exp-03-25 model. While standard LLM APIs are expected to be stateless, the consistent generation of content related to the previously tested Topic B when processing the unrelated Topic A input is unexpected and worth noting.
Impact:
This severe off-topic hallucination makes the gemini-2.5-pro-exp-03-25 model unreliable and unusable for this specific analysis task, and potentially others, despite the strong analysis capabilities it demonstrated previously.
(Optional Addition):
We are happy to provide the exact raw API response and potentially the final constructed prompt (if feasible through a secure channel) to aid in debugging.