When the Gemini API returns a 503 (“service unavailable”), the official guidance is to retry with exponential backoff. I follow that guidance.
The problem: failed 503 attempts are still counted toward my request quota (RPM and RPD, i.e. requests per minute and requests per day). This creates a compounding failure mode:
- The API returns 503 at a high rate
- Per the recommended retry strategy, I retry
- Each retry counts as a quota request, even though no useful work was done
- I hit the 429 quota limit almost immediately
- My effective quota is a small fraction of what the dashboard shows
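To make the loop above concrete, here is a minimal sketch of retry-on-503 with exponential backoff and jitter. It is an illustration, not my production code and not an official client: `send_request` is a hypothetical stand-in for whatever issues the API call and returns an HTTP status code, and the delay parameters are arbitrary. The point is the second return value: if every attempt is billed against quota, as it appears to be in my case, that number is what one logical request actually costs.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send_request: Callable[[], int],  # hypothetical: performs one API call, returns the HTTP status
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Retry on 503 with exponential backoff and full jitter.

    Returns (final_status, attempts_made). Assumption: each attempt
    counts as one request against the RPM/RPD quota.
    """
    status = 503
    for attempt in range(1, max_attempts + 1):
        status = send_request()
        if status != 503:
            # Success or a non-retryable error: stop retrying.
            return status, attempt
        if attempt < max_attempts:
            # Delay cap doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    return status, max_attempts


# Usage: three 503s followed by a success costs four requests, not one.
responses = iter([503, 503, 503, 200])
status, attempts = call_with_backoff(lambda: next(responses), sleep=lambda s: None)
print(status, attempts)  # 200 4
```

With a failure rate as high as the one I am seeing, most logical requests use up several attempts, and many exhaust `max_attempts` without ever succeeding.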
Here is my dashboard data from the past three days:
| Date | Total Requests | Success Rate | Useful Requests |
|---|---|---|---|
| May 5 | 382 | 16.2% | ~62 |
| May 6 | 152 | 1.3% | ~2 |
| May 7 | 1,426 | 9.5% | ~136 |
On May 7, I made over 1,400 requests, but roughly 90% of them were 503 failures and their retries. Those retries depleted my quota, which is why the request count spiked that day while useful work barely increased.
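For anyone who wants to check the arithmetic, the short sketch below totals the table. The useful-request counts are the approximate figures from the table (marked with "~"), so the results are approximate too, and it assumes every request shown on the dashboard was counted against quota.

```python
# (total requests, approximate useful requests) per day, copied from the table above.
days = {
    "May 5": (382, 62),
    "May 6": (152, 2),
    "May 7": (1426, 136),
}

total = sum(t for t, _ in days.values())
useful = sum(u for _, u in days.values())
wasted = total - useful

for day, (t, u) in days.items():
    print(f"{day}: {t - u} of {t} requests did no useful work")

print(f"Total: {total}, useful: {useful}, wasted: {wasted}")
print(f"Overall success rate: {useful / total:.1%}")        # 10.2%
print(f"Requests spent per useful response: {total / useful:.1f}")  # 9.8
```

Across the three days, that works out to roughly ten quota-counted requests for every useful response.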
This is a policy problem, not just a capacity problem. If Google’s infrastructure is temporarily overloaded, that is understandable. But counting failed capacity errors against the user’s quota means the user is penalized twice: once by the outage, and again by having their quota consumed by retries that Google’s own guidance recommends.
What I’d expect: 503 responses should not count toward RPM or RPD quotas. Only requests that are successfully processed should consume quota.
One additional detail that may be relevant: I am using the Google Search tool (grounding). I don’t know whether this contributes to a higher 503 rate compared to standard generation requests, but given the failure rates above, it seems worth flagging. If this tool is known to have lower reliability or different quota behavior, that would be important context for users relying on it.
@Jon_Matthews - is there an official position on this behavior? Is there a recommended approach for users who need to retry 503s without burning their request quota in the process?