I am currently using the Gemini 2.5 Pro API on a paid basis.
Since it is still pre-release, only a few test users are using it, but I frequently encounter:
{"code":503,"message":"The model is overloaded. Please try again later.","status":"UNAVAILABLE"}
Reviewing Sentry logs shows this error occurred over 10 times within a single week.
My API request volume and token usage are not excessive, and I am not exceeding tier-specific limits.
I’ve noticed many others seem to be experiencing this issue. Has anyone found a solution?
We need to launch soon, and this error is occurring frequently enough to be a serious blocker.
Why it Works
Exponential back-off is a crucial part of this strategy. Instead of retrying immediately, it waits for a short period before the next attempt, and this waiting period increases exponentially with each failed attempt. This prevents your code from overwhelming the API with repeated requests during a service outage, which could make the problem worse. The backoff_factor of 2 you’ve used is a common and effective choice.
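For reference, here is a minimal sketch of that pattern in Python. `call_model` is a hypothetical zero-argument callable standing in for your actual Gemini request, and the broad `except Exception` is a placeholder for the SDK's specific 503/UNAVAILABLE error:

```python
import random
import time

def call_with_backoff(call_model, max_retries=5, backoff_factor=2):
    """Retry a transient-failure-prone API call with exponential back-off.

    `call_model` is a hypothetical zero-argument callable that performs
    the actual Gemini request and raises an exception on failure.
    """
    for attempt in range(max_retries):
        try:
            return call_model()
        except Exception:  # in real code, catch the SDK's 503/UNAVAILABLE error
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Wait 1s, 2s, 4s, 8s, ... plus random jitter so that
            # concurrent clients don't all retry at the same instant.
            time.sleep(backoff_factor ** attempt + random.uniform(0, 1))
```

The jitter is worth keeping: without it, many clients that failed together will retry together, reproducing the overload on every attempt.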
Token limits: You're also right to mention these. While retries help your client tolerate transient unavailability, they can't fix fundamental issues like exceeding the model's maximum input or output token count. If a prompt is too long, the API will reject it consistently, regardless of how many times you retry. The fix for that specific problem is to truncate the prompt, summarize it, or split it into smaller chunks before making the API call.
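As a rough illustration of the chunking option, here is a naive character-based splitter. The 12,000-character budget and the helper names are illustrative assumptions, and characters only approximate tokens, so a real implementation should measure with an actual token counter before sending each chunk:

```python
def split_prompt(text: str, max_chars: int = 12000) -> list[str]:
    """Naively split an over-long prompt into fixed-size character chunks.

    NOTE: max_chars is an illustrative assumption; characters only
    approximate tokens, so measure with a real token counter in practice.
    """
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Hypothetical usage: call the model once per chunk instead of once
# with the full document ('generate' stands in for your request function).
# for chunk in split_prompt(long_document):
#     response = call_with_backoff(lambda: generate(chunk))
```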
In short, your implementation is a well-engineered way to make your application more resilient to common API-related problems. Happy coding indeed!
Thanks, this is actually a smart approach, and I've integrated it. Hopefully it works, but it's not easy to apply when API calls go through an intermediary (an AI orchestration solution such as flowise or n8n); in those cases the retries should probably be handled on the API provider's side. IMO.