A few days ago, I began experiencing significant delays in Gemini-1.5-Flash API responses. Requests that previously took 2-3 seconds are now taking 60-90 seconds. These extended response times make my application unusable.
In contrast, similar requests on Google AI Studio continue to receive responses at their previous speed.
I would appreciate your assistance in identifying the cause of these delays to help me debug and resolve this critical issue.
I found a workaround that restores normal API response times for my app:
Switch to a different Gemini 1.5 Flash variant: either 1.5 Flash 002 or 1.5 Flash-8B. You can do this manually by changing the model_name parameter from gemini-1.5-flash to gemini-1.5-flash-002 or gemini-1.5-flash-8b, or via Google AI Studio by selecting gemini-1.5-flash-002 or gemini-1.5-flash-8b as your model and using the Get code button to copy the updated code.
Both alternative models cap the top_k value at 40 instead of 64. Make sure you change this parameter, or you'll receive the following error:
google.api_core.exceptions.InvalidArgument: 400 Unable to submit request because it has a topK value of 64 but the supported range is from 1 (inclusive) to 41 (exclusive). Update the value and try again.
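To avoid that error when switching models, you can clamp top_k before building the request. A minimal sketch in plain Python; the ceiling of 40 comes from the error message above, while the helper name and config keys are my own illustration, not part of the SDK:

```python
# The 400 error says the supported top_k range for the alternative models
# is [1, 41), i.e. at most 40; gemini-1.5-flash defaulted to 64.
MAX_TOP_K = 40

def adjust_generation_config(config: dict) -> dict:
    """Return a copy of a generation_config dict with top_k clamped
    to the range the -002 and -8b models accept."""
    adjusted = dict(config)
    if adjusted.get("top_k", 0) > MAX_TOP_K:
        adjusted["top_k"] = MAX_TOP_K
    return adjusted

# A config originally written for gemini-1.5-flash (top_k=64) ...
old_config = {"temperature": 1, "top_p": 0.95, "top_k": 64}
# ... becomes safe for gemini-1.5-flash-002 / gemini-1.5-flash-8b:
new_config = adjust_generation_config(old_config)
print(new_config["top_k"])  # → 40
```

The adjusted dict can then be passed as the generation_config alongside the new model_name, so the rest of the calling code stays unchanged.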
Test the results with both alternative models. For my app, both had similar API response times, but gemini-1.5-flash-002 followed the system instruction better than gemini-1.5-flash-8b.
I wish we were notified about such issues in advance. I don't even know whether this slowdown in gemini-1.5-flash is a bug or an intentional move to steer developers toward newer models. I'd appreciate it if someone from the Google/Gemini team could respond and consider adding such notifications.