Gemini 2.5 Pro accessed over https://generativelanguage.googleapis.com/v1beta/openai/ has a dramatic latency increase

I’m using Gemini 2.5 Pro extensively in an agentic workflow that accesses the model through the OpenAI compatibility endpoint (so that the model can be used with the OpenAI Agents SDK).
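
For context, the compatibility approach means pointing the standard OpenAI client at the Gemini endpoint, roughly like the sketch below (the API key handling and the prompt are placeholders, not my actual workflow):

from openai import OpenAI

# Standard OpenAI client pointed at Google's OpenAI-compatible endpoint
client = OpenAI(
    api_key="GEMINI_API_KEY",  # placeholder, use your actual Gemini API key
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Explain how AI works"}],
)
print(response.choices[0].message.content)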

Since yesterday (26.06), model latency has increased significantly: requests that previously took 10-15 seconds now take multiple minutes, leading to timeouts and retries, and the retries also fail in more than half of the cases.

The timing of the issue lines up with Logan Kilpatrick’s message on X about redirecting requests for older 2.5 Pro model versions to the current one: https://x.com/OfficialLoganK/status/1937286302614929409

Switching to any of the preview versions of Gemini 2.5 Pro didn’t help.

gemini-2.5-flash doesn’t seem to be affected and works as expected under the same circumstances, but it is not an option for my particular agentic workflow because of its lower capabilities.

I haven’t tried accessing the Gemini API directly instead of using the compatibility approach, so I cannot say whether the Pro model has acceptable latency in that case.

Anyone else facing the same?

These are the actual latency numbers for 2.5-pro
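
For anyone who wants to reproduce such measurements, here is a minimal timing sketch against the same compatibility endpoint (the prompt is just a placeholder):

import time

from openai import OpenAI

client = OpenAI(
    api_key="GEMINI_API_KEY",  # placeholder
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Explain how AI works"}],
)
elapsed = time.perf_counter() - start

# Wall-clock latency of the full (non-streaming) request, plus token usage
print(f"latency: {elapsed:.1f}s")
print(f"usage: {response.usage}")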

The token counts the model is working with are very modest.

[screenshot: Bildschirmfoto 2025-06-27 um 08.48.12]

I’m having the same issue, but with both 2.5 Pro and 2.5 Flash.

When running the model and calling it with the API key, I also encountered this problem. You’re going to have to run pip install rust and pip install cryptography to fix the error.

@Clintin_Brummer, how would that fix affect latency that is most likely caused by a new model deployment on the API side?

I’m encountering the same issue across all versions of gemini-2.5-pro
[screenshot: Screenshot 2025-07-01 at 15.27.51]

@Khachik_Smbatyan, what I’ve just discovered to mitigate this is to switch the Runner in the OpenAI Agents SDK to streaming mode – that magically fixed the issue.

See Streaming - OpenAI Agents SDK for details.
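
For reference, a minimal sketch of a streamed run, assuming the SDK is wired to the Gemini compatibility endpoint through an AsyncOpenAI client (the agent name, instructions, and prompt below are placeholders, not my actual workflow):

import asyncio

from openai import AsyncOpenAI
from openai.types.responses import ResponseTextDeltaEvent
from agents import Agent, OpenAIChatCompletionsModel, Runner

# OpenAI-compatible client pointed at the Gemini endpoint
gemini_client = AsyncOpenAI(
    api_key="GEMINI_API_KEY",  # placeholder
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
    model=OpenAIChatCompletionsModel(model="gemini-2.5-pro", openai_client=gemini_client),
)

async def main():
    # run_streamed starts the run immediately and returns a streaming result
    result = Runner.run_streamed(agent, input="Explain how AI works")
    async for event in result.stream_events():
        # Raw response events carry the incremental text deltas
        if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
            print(event.data.delta, end="", flush=True)

asyncio.run(main())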

Hi Viktor, thank you, I’ll give it a try. The main challenge is that we currently interact with the API using the genai client, in a manner similar to the example below. I’ll look into updating this to support streaming.

from google import genai
from google.genai import types

# Picks up the API key from the environment (GEMINI_API_KEY)
client = genai.Client()

# Single blocking (non-streaming) request
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Explain how AI works"],
    config=types.GenerateContentConfig(
        temperature=0.1
    )
)
print(response.text)
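
Based on the docs, the streaming variant of that call should look something like this sketch (using the google-genai client’s generate_content_stream, with the same placeholder prompt; untested here):

from google import genai
from google.genai import types

client = genai.Client()

# Iterate over partial responses instead of waiting for the full completion
for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents=["Explain how AI works"],
    config=types.GenerateContentConfig(temperature=0.1),
):
    print(chunk.text, end="")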

Good morning, I do apologize for the late reply. I’m currently using the Gemini CLI setup and calling the model with an API key. As an example, here is my launch prompt:

llm -m gemini-2.5-pro -c | tee >(espeak -v en-208 -p 0 -s 168 -g2 -k1)

Please keep in mind that before the pipe (the straight vertical line), you have to put your message to the model in quotation marks (e.g. llm -m gemini-2.5-pro -c "your message here" before the | tee part).
In terms of launching a new model, I experience very little added latency compared to running the model in an external environment.
The error comes from the fact that the call was made via the beta version. I suggest using the Gemini CLI approach instead, especially when you’re running a device with limited performance and/or hardware capabilities.