Hi all — we’re seeing a persistent increase in 429 RESOURCE_EXHAUSTED errors in production when calling Gemini via Vertex AI streaming.
What changed
- Previously our stack was using `google.cloud.aiplatform.v1.PredictionService.StreamGenerateContent`. This was noticeably more stable (rare 429s).
- Recently it switched to `google.cloud.aiplatform.v1beta1.PredictionService.StreamGenerateContent`. Since then we’re seeing lots of `RESOURCE_EXHAUSTED` failures.
We’re calling from google-adk, using streaming responses. The ADK library appears to hardcode the Vertex client to v1beta1, so it’s difficult to test v1 without patching.
Symptoms
- Error: `google.adk.models.google_llm._ResourceExhaustedError` (maps to 429 RESOURCE_EXHAUSTED)
- Model: `gemini-3-flash-preview`
- Location: `global` (Gemini 3 Flash preview seems to be global-only on Vertex)
- Happens despite quotas looking within limits in the console.
- These are not huge bursts; it occurs during normal interactive chat traffic.
Questions
- Is anyone else seeing an uptick in 429 RESOURCE_EXHAUSTED specifically with `v1beta1` `StreamGenerateContent` in the last ~1–2 weeks?
- Are there additional or more granular limits (per-model, per-project, per-stream concurrency, per-minute token limits) that don’t show up clearly in the standard quota charts?
- Does `RESOURCE_EXHAUSTED` reliably distinguish between:
  - quota exceeded, vs.
  - backend capacity / shared contention
  …and is there a recommended way to tell which one we’re hitting (e.g., specific metric strings, headers, gRPC details like RetryInfo)?
- Any best practices for stability here beyond:
  - client-side concurrency limiting
  - exponential backoff with jitter
  - token/context reduction
  - purchasing Provisioned Throughput for `gemini-3-flash-preview`?
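For concreteness, the client-side limiting and backoff we mean above is roughly this stdlib-only sketch (`call_model` and the constants are placeholders, not ADK APIs; `ResourceExhausted` stands in for the real 429 exception):

```python
import asyncio
import random

MAX_CONCURRENT_STREAMS = 4   # placeholder: tune to your observed quota
MAX_RETRIES = 5

# Cap how many streaming requests are in flight at once.
_semaphore = asyncio.Semaphore(MAX_CONCURRENT_STREAMS)

class ResourceExhausted(Exception):
    """Stand-in for the client's 429 RESOURCE_EXHAUSTED error."""

async def call_with_backoff(call_model, *args):
    """Run call_model under a concurrency cap, retrying 429s with
    exponential backoff plus full jitter."""
    async with _semaphore:
        for attempt in range(MAX_RETRIES):
            try:
                return await call_model(*args)
            except ResourceExhausted:
                if attempt == MAX_RETRIES - 1:
                    raise
                # Full jitter: sleep uniformly in [0, 2**attempt) seconds.
                await asyncio.sleep(random.uniform(0, 2 ** attempt))
```

We already do essentially this, which is why I’m wondering whether the remaining 429s point at capacity rather than anything we can tune client-side.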
Happy to share anonymized logs if helpful (timestamps, approximate request rates, concurrent stream counts, input/output token estimates, etc.). Mainly trying to understand whether this is a known issue with current global capacity / v1beta1 routing, and what the recommended mitigation is.
Thanks!