Context
I’m building a children’s storybook app that generates 4 comic panel images per story using gemini-2.5-flash-image via Vertex AI. The app flow is:
- User captures a drawing
- AI interprets the drawing (text-only, works fine)
- Backend generates 4 comic panels sequentially (one after another, not parallel)
The sequential generation is intentional - each panel includes the previous 1-2 images for visual consistency (character appearance, art style).
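For reference, the sequential flow looks roughly like the sketch below. It is a simplified illustration, not the actual backend code: `generate_image` is a hypothetical stand-in for the Vertex AI call, and `context_size=2` reflects the "previous 1-2 images" described above.

```python
from collections import deque
from typing import Callable, List, Sequence


def generate_panels(
    prompts: Sequence[str],
    generate_image: Callable[[str, List[bytes]], bytes],  # hypothetical wrapper around the model call
    context_size: int = 2,
) -> List[bytes]:
    """Generate panels one at a time, passing the most recent images as context."""
    context: deque = deque(maxlen=context_size)  # keeps only the last `context_size` images
    panels: List[bytes] = []
    for prompt in prompts:
        # Each request blocks until the previous one has finished (no parallelism).
        image = generate_image(prompt, list(context))
        panels.append(image)
        context.append(image)
    return panels
```

With four prompts, the requests carry 0, 1, 2, and 2 prior images respectively, so the later panels are the heaviest requests.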
The Problem
Even with sequential requests and exponential backoff retries, I consistently hit 429 RESOURCE_EXHAUSTED errors, typically on the 3rd or 4th panel of a story. The timing between successful requests is 10-40+ seconds depending on image generation time plus any backoff delays.
Configuration:
- Model: gemini-2.5-flash-image
- Endpoint: Global (location="global")
- Project: Vertex AI on GCP (pay-as-you-go billing)
- Retry strategy: 3 attempts with exponential backoff (5s, 10s, 20s)
- Request pattern: Strictly sequential, one image at a time
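The retry strategy is roughly equivalent to the sketch below. It is an approximation under stated assumptions: `RateLimited` is a hypothetical stand-in for whatever exception the client raises on a 429, the schedule is read as one attempt before each listed delay plus a final attempt, and the optional `jitter` parameter is an addition for illustration, not something the current setup uses.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the client's 429 RESOURCE_EXHAUSTED error."""


def call_with_backoff(
    call: Callable[[], T],
    delays: Sequence[float] = (5.0, 10.0, 20.0),
    jitter: float = 0.0,  # fraction of each delay added as random extra wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `call`, waiting through `delays` between rate-limited attempts."""
    for delay in delays:
        try:
            return call()
        except RateLimited:
            # Wait the scheduled delay, plus up to `jitter * delay` seconds.
            sleep(delay + random.uniform(0.0, jitter * delay))
    # Final attempt: a RateLimited raised here propagates to the caller.
    return call()
```

Passing `sleep` as a parameter makes the schedule easy to check without actually waiting.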
Actual test run (2026-01-26):
10:02:50 - Panel 1: SUCCESS in 6.5s
10:02:56 - Panel 2: SUCCESS in 10.4s
10:03:07 - Panel 3: 429 RESOURCE_EXHAUSTED
  Retry after 5s backoff: 429 again
  Retry after 10s backoff: 429 again
  All 3 retries failed (32.8s total)
10:04:10 - Panel 4: SUCCESS in 9.7s (after 30s wait from panel 3 failure)
Observations:
- Hit 429 on the 3rd request, sent ~10 seconds after the 2nd request started (right after it succeeded)
- Rate limit persisted for 32.8s of retry attempts
- After waiting 30s, the next request succeeded immediately
- Effective throughput: ~2 requests per minute before hitting limits
- No Retry-After header in 429 responses (would be very helpful for backoff tuning)
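To make the timing concrete, here is the arithmetic behind those observations, computed from the test run above. It assumes the logged timestamps are request start times, which matches the reported durations (10:02:50 + 6.5s ≈ 10:02:56, and so on).

```python
from datetime import datetime


def t(stamp: str) -> datetime:
    """Parse an HH:MM:SS timestamp from the log."""
    return datetime.strptime(stamp, "%H:%M:%S")


# Request start times from the test run.
panel1, panel2, panel3, panel4 = (
    t("10:02:50"), t("10:02:56"), t("10:03:07"), t("10:04:10"),
)

gap_2_to_3 = (panel3 - panel2).total_seconds()  # start of panel 2 -> first 429
gap_2_to_4 = (panel4 - panel2).total_seconds()  # last success before the 429s -> next success
run_length = (panel4 - panel1).total_seconds() + 9.7  # through the end of panel 4
images_per_minute = 3 / run_length * 60  # 3 successful images in the run

print(gap_2_to_3)                  # 11.0
print(gap_2_to_4)                  # 74.0
print(f"{run_length:.1f}")         # 89.7
print(f"{images_per_minute:.1f}")  # 2.0
```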
Questions
- Is this rate limiting expected? At roughly 1.5-4 requests per minute (one image every 15-40 seconds), I expected to be well within limits. The quota dashboard shows minimal usage.
- Do tier thresholds affect image generation? The Standard PayGo documentation states that “usage tiers don’t apply” to image generation models. However, I’m currently below Tier 1 spend thresholds (~$0 rolling 30-day spend). Could being in this low-spend state still result in more aggressive throttling for image generation, even if not documented as part of the tier system?
- Are there known capacity constraints? I’ve seen other threads mentioning traffic-related 429s for Gemini image models. Is gemini-2.5-flash-image currently experiencing capacity constraints?
- Recommendations? Besides Provisioned Throughput (which seems like significant overkill for 4 images/story during development), are there strategies I should try:
  - Different regional endpoints?
  - Specific time-of-day patterns?
  - Request modifications (smaller prompts, no multi-image context)?
- Retry-After header? Is there a plan to include Retry-After or similar headers in 429 responses? This would help clients implement smarter backoff without guessing.
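If such a header were available, the client side could be as simple as the sketch below. This is a hypothetical illustration: the 429 responses currently carry no Retry-After header, so today this would always return the fallback. It handles the two standard HTTP forms of the value (a number of seconds or an HTTP date); `retry_delay` and its parameters are names made up for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay(
    headers: Mapping[str, str],
    fallback: float,
    now: Optional[datetime] = None,
) -> float:
    """Return seconds to wait: from Retry-After if present and valid, else `fallback`."""
    # Header names are case-insensitive, so compare in lowercase.
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return fallback
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "30"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return fallback  # unparseable value: fall back to the fixed schedule
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

The fallback would be the next delay in the existing 5s/10s/20s schedule, so behavior stays the same as long as the header is absent.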
What I’ve Tried
- Using global endpoint (as recommended)
- Tested regional endpoint (us-central1) - same rate limiting behavior
- Exponential backoff with 3 retries (5s, 10s, 20s delays)
- Sequential requests only (no parallelism)
- Verified billing is active and linked correctly
- Confirmed quota dashboard shows very low usage