Gemini 503 issue

Despite being on Tier 3, we are consistently hitting 503 UNAVAILABLE errors that persist for hours, not minutes. Our retry strategy (4 retries with equal-jitter exponential backoff up to 120s) is being fully exhausted, and the google-genai SDK’s internal retry (~31s) is also failing.
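
For reference, a minimal sketch of an equal-jitter backoff schedule like the one described above. Only the 4 retries and the 120s cap come from the post; the 1-second base delay and the function names `equal_jitter_delay` and `worst_case_wait` are assumptions for illustration, not the actual configuration.

```python
import random


def equal_jitter_delay(attempt, base=1.0, cap=120.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Equal jitter: half of the capped exponential delay is fixed,
    the other half is drawn uniformly at random.
    `base` is an assumed value; the post only states the 120s cap.
    """
    ceiling = min(cap, base * 2 ** attempt)
    half = ceiling / 2
    return half + rng.uniform(0, half)


def worst_case_wait(retries=4, base=1.0, cap=120.0):
    """Upper bound on the total time spent sleeping across all retries."""
    return sum(min(cap, base * 2 ** attempt) for attempt in range(retries))
```

With the assumed 1-second base, the four waits add up to at most 15 seconds; even with a 15-second base they add up to at most 225 seconds. Either way, a schedule like this covers seconds to minutes, so it cannot outlast an outage that lasts hours.
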
Questions:

  1. Is 503 UNAVAILABLE subject to different handling than 429 RESOURCE_EXHAUSTED? Our understanding is that 429 is a true rate limit and 503 is server-side capacity, but the Tier 3 docs don’t distinguish between them. Should we expect 503s even within documented rate limits?
  2. Is there a recommended maximum concurrent request count for Tier 3 that isn’t reflected in the RPM/TPM limits? We’re within RPM but may be hitting a per-second or per-connection limit that isn’t documented.
  3. For batch workloads (100-500 documents), is the Vertex AI Batch API the recommended path? We see that it’s “not subject to real-time rate limiting”. Would this eliminate the 503s entirely?
  4. Are there any request headers or parameters we can set to signal batch/low-priority traffic that would be handled more gracefully during capacity spikes?

Hello @Dhruv_Kabra,
503 (“Service Unavailable”) errors are unrelated to your quota. Instead, they indicate that our services are temporarily at capacity and cannot process your request at that moment. You might notice that these errors are more common during certain times of the day.

  1. A 429 error means you have hit your account quota (RPM/TPM), whereas a 503 error means the servers are at capacity. Because Tier 3 relies on shared (rather than dedicated) hardware, you can still encounter 503 errors within your rate limits during massive regional demand spikes.
  2. While there is no hard limit on concurrent requests, sending large, sudden bursts can overwhelm the API and cause it to return 503 errors. Pacing your outbound requests is generally safer than relying solely on retries.
  3. The Batch API is an excellent choice for large workloads, as it is designed to process high volumes of requests asynchronously at a reduced cost. Because the Batch API queues your jobs, it helps avoid 503 errors entirely. Please note that while the target turnaround time is 24 hours, most jobs complete much sooner.
  4. Ultimately, the Batch API is our recommended method for handling bulk, low-priority, or asynchronous work.
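
The pacing advice in point 2 can be sketched as a small client-side limiter that spaces out request starts and caps the number of requests in flight. This is an illustrative sketch, not an official client feature: the `Pacer` class, its parameters, and the `make_call` argument (any zero-argument coroutine function that wraps your actual API call) are hypothetical names, and the right limits for a given workload are not documented in this thread.

```python
import asyncio
import time


class Pacer:
    """Spaces out request starts and caps the number of in-flight requests.

    Create the instance inside a running event loop (for example, inside
    the coroutine passed to asyncio.run) so its locks bind to that loop.
    """

    def __init__(self, max_per_second, max_in_flight):
        self._interval = 1.0 / max_per_second
        self._sem = asyncio.Semaphore(max_in_flight)
        self._lock = asyncio.Lock()
        self._next_start = 0.0  # monotonic time of the next allowed start

    async def run(self, make_call):
        # Limit concurrency first, then wait for the next start slot.
        async with self._sem:
            async with self._lock:
                now = time.monotonic()
                wait = self._next_start - now
                if wait > 0:
                    await asyncio.sleep(wait)
                # Reserve the following slot one interval later.
                self._next_start = max(now, self._next_start) + self._interval
            return await make_call()
```

A batch of documents could then be submitted with `await asyncio.gather(*(pacer.run(call) for call in calls))`, where each `call` wraps one request. The burst is flattened into a steady stream on the client side, and retries with backoff remain as a fallback for the 503s that still occur.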