Gemini-2.5-flash-image: Frequent 429 RESOURCE_EXHAUSTED during sequential image generation - seeking clarity on rate limits

Context

I’m building a children’s storybook app that generates 4 comic panel images per story using gemini-2.5-flash-image via Vertex AI. The app flow is:

  1. User captures a drawing
  2. AI interprets the drawing (text-only, works fine)
  3. Backend generates 4 comic panels sequentially (one after another, not parallel)

The sequential generation is intentional - each panel includes the previous 1-2 images for visual consistency (character appearance, art style).
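
For reference, the generation loop looks roughly like this (a minimal sketch using the google-genai SDK; "my-project" and the PANEL_PROMPTS strings are placeholders, and the response-parsing details may differ slightly from my real code):

```python
from google import genai
from google.genai import types

# Assumed setup: google-genai SDK pointed at Vertex AI with the global endpoint.
client = genai.Client(vertexai=True, project="my-project", location="global")

MODEL = "gemini-2.5-flash-image"
PANEL_PROMPTS = ["panel 1 ...", "panel 2 ...", "panel 3 ...", "panel 4 ..."]  # placeholders

panels = []  # raw PNG bytes of each generated panel

for prompt in PANEL_PROMPTS:
    # Include up to the 2 most recent panels as image context for visual consistency.
    contents = [
        types.Part.from_bytes(data=img, mime_type="image/png")
        for img in panels[-2:]
    ]
    contents.append(types.Part.from_text(text=prompt))

    response = client.models.generate_content(
        model=MODEL,
        contents=contents,
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    # Grab the first inline image part from the response.
    image_bytes = next(
        part.inline_data.data
        for part in response.candidates[0].content.parts
        if part.inline_data is not None
    )
    panels.append(image_bytes)
```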

The Problem

Even with sequential requests and exponential backoff retries, I consistently hit 429 RESOURCE_EXHAUSTED errors, typically on the 3rd or 4th panel of a story. The timing between successful requests is 10-40+ seconds depending on image generation time plus any backoff delays.

Configuration:

  • Model: gemini-2.5-flash-image
  • Endpoint: Global (location="global")
  • Project: Vertex AI on GCP (pay-as-you-go billing)
  • Retry strategy: 3 attempts with exponential backoff (5s, 10s, 20s); retry wrapper sketched below
  • Request pattern: Strictly sequential, one image at a time
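
The retry wrapper is roughly the following (a sketch; generate_panel stands in for the generate_content call above, and treating errors.APIError.code as the HTTP status is my assumption about the SDK's error surface):

```python
import time
from google.genai import errors

BACKOFF_DELAYS = [5, 10, 20]  # seconds to wait before each retry

def generate_with_retry(generate_panel):
    """Call generate_panel(), retrying on 429 with exponential backoff."""
    for attempt, delay in enumerate([0] + BACKOFF_DELAYS):
        if delay:
            # No Retry-After header comes back on 429s, so these delays are guesses.
            time.sleep(delay)
        try:
            return generate_panel()
        except errors.APIError as err:
            # Assumption: the SDK exposes the HTTP status code as err.code.
            if getattr(err, "code", None) != 429 or attempt == len(BACKOFF_DELAYS):
                raise  # non-429 error, or out of retries
```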

Actual test run (2026-01-26):

10:02:50 - Panel 1: SUCCESS in 6.5s
10:02:56 - Panel 2: SUCCESS in 10.4s
10:03:07 - Panel 3: 429 RESOURCE_EXHAUSTED
           Retry after 5s backoff: 429 again
           Retry after 10s backoff: 429 again
           All 3 retries failed (32.8s total)
10:04:10 - Panel 4: SUCCESS in 9.7s (after 30s wait from panel 3 failure)

Observations:

  • Hit 429 on the 3rd request, made only ~11 seconds after the 2nd request started (about 1 second after it completed)
  • Rate limit persisted for 32.8s of retry attempts
  • After waiting 30s, the next request succeeded immediately
  • Effective throughput: ~2 requests per minute before hitting limits
  • No Retry-After header in 429 responses (would be very helpful for backoff tuning)

Questions

  1. Is this rate limiting expected? At an average of only ~2 requests per minute (one image every 15-40 seconds), I expected to be well within limits. The quota dashboard shows minimal usage.

  2. Do tier thresholds affect image generation? The Standard PayGo documentation states that “usage tiers don’t apply” to image generation models. However, I’m currently below Tier 1 spend thresholds (~$0 rolling 30-day spend). Could being in this low-spend state still result in more aggressive throttling for image generation, even if not documented as part of the tier system?

  3. Are there known capacity constraints? I’ve seen other threads mentioning traffic-related 429s for Gemini image models. Is gemini-2.5-flash-image currently experiencing capacity constraints?

  4. Recommendations? Besides Provisioned Throughput (which seems like significant overkill for 4 images/story during development), are there strategies I should try:

    • Different regional endpoints?
    • Specific time-of-day patterns?
    • Request modifications (smaller prompts, no multi-image context)?

  5. Retry-After header? Is there a plan to include Retry-After or similar headers in 429 responses? This would help clients implement smarter backoff without guessing.

What I’ve Tried

  • Using global endpoint (as recommended)
  • Tested regional endpoint (us-central1) - same rate limiting behavior (see the snippet after this list)
  • Exponential backoff with 3 retries (5s, 10s, 20s delays)
  • Sequential requests only (no parallelism)
  • Verified billing is active and linked correctly
  • Confirmed quota dashboard shows very low usage
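
For the endpoint comparison, the only change was the client's location (same assumed google-genai setup as in the sketch above; "my-project" is a placeholder):

```python
from google import genai

# Global endpoint (what I normally use).
client_global = genai.Client(vertexai=True, project="my-project", location="global")

# Regional endpoint test - produced the same 429 behavior.
client_regional = genai.Client(vertexai=True, project="my-project", location="us-central1")
```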

I started hitting the same thing in the last few days with gemini-2.5-flash-image after months of no issue. Also seems to happen on the third sequential request. It seems like something changed on Google’s side.

I'm having the same problem too, but not with stories; mine is with image editing. It will edit an image and modify it twice, but on the 3rd change attempt it gives the message: "Retry-After" and "You've reached your rate limit. Please try again later."
It doesn't show a wait time. I've tried testing in other windows and clearing the cache, but the problem continues. This isn't new; it started the day before yesterday, on the 24th, though it happened less frequently then.

That’s interesting, thanks for sharing. Yes, it seems strange that it’s failing so consistently. That doesn’t quite match my understanding of what a 429 should indicate (i.e., actual exhaustion of a global quota shared between all Vertex AI pay-as-you-go users).

I agree it seems odd. I found some other recent threads that appear to be related.

Likewise, this has been hitting us pretty hard since ~1/22, when it spiked significantly. I’ve tried the approaches listed here as well. We’re also generating a batch of several images per workflow. Our absolute daily volume is relatively low, though, so it’s not like we’re spamming the model.

If I exclude exponential backoff retries, I’d say 75% of our requests are failing with 429s. Even with very long, generous backoffs, we’re still seeing ~35% of our requests ultimately failing. Our image generation times have gone from 30s to 3 minutes, and half are still broken. Our customers are not very happy with us.

I understand capacity is an issue, and that Provisioned Throughput is the obvious solution, but I have to admit I’m frustrated that the availability is so abysmal and that this isn’t communicated more clearly. If this can’t be used for more than a toy project, and the guarantees are this weak, we need to know that upfront so we can evaluate our providers properly.