We are intermittently experiencing 429 RESOURCE_EXHAUSTED errors from the Gemini / Vertex AI API in a paid project, even though all visible quotas are well within limits.
Visible quota usage in the Cloud Console is consistently <1%
Errors occur a few times per hour under normal, steady production load (no large bursts)
Retries are implemented with backoff and jitter (base ~3s + 2–5s jitter)
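For reference, a retry policy like the one described above can be sketched roughly as follows. This is a minimal illustration, not the code running in production: RateLimitError is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and the doubling of the base delay and the 60 s cap are assumptions. The post only states a ~3 s base and 2 to 5 s of jitter.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def backoff_delay(attempt, base=3.0, jitter=(2.0, 5.0), cap=60.0, rng=random):
    """Delay before retry number `attempt` (0-based).

    The base delay doubles on each attempt (an assumption) and is capped,
    then a uniform random jitter in the `jitter` range is added on top.
    """
    return min(cap, base * (2 ** attempt)) + rng.uniform(*jitter)


def call_with_retry(fn, max_attempts=5, sleep=time.sleep):
    """Call fn(); on RateLimitError wait and retry, re-raising after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Passing your own sleep function (for example a list's append method) makes it easy to check which delays would have been used without actually waiting.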
Impact:
This is affecting a production application.
Request:
We suspect an undocumented internal quota, concurrency limit, token-throughput limit, or shared capacity throttling that is not exposed in the quota dashboard.
Could you please:
Confirm which internal quota or limit is being exceeded
Verify that the project is not being held to free-tier or otherwise incorrect backend limits
Advise whether concurrency or capacity limits can be adjusted
We have the same issue. We are on Tier 1 and get error 429, but cannot track down the quota that caused it, because nothing shows up under “current usage percentage”.
The 429 RESOURCE_EXHAUSTED error in this context typically stems from one of three causes that are not visible in the “Requests Per Minute” (RPM) view you shared.
Recommended Troubleshooting Steps
Step 1: Verify Token Usage (TPM)
Go back to the IAM & Admin > Quotas page:
Filter by the exact model causing errors (e.g., gemini-2.5-flash).
Look for “base_model_id_and_resolution: gemini-2.5-flash…-tokens-per-minute”.
Check: Is this bar spiking during your error windows? If so, you need to request a quota increase specifically for TPM, not RPM.
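If the console graph is too coarse to line up with individual error windows, one option is to track token throughput on the client side as well. The sketch below keeps a trailing 60-second total. TokenWindow is a hypothetical helper, and it assumes you already have a per-request token count (for example from the usage metadata in each response, or from your own logs). It only approximates what the service itself counts toward TPM.

```python
from collections import deque


class TokenWindow:
    """Tracks tokens sent in a trailing time window, from client-side counts."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp_seconds, tokens), oldest first
        self._total = 0

    def record(self, timestamp, tokens):
        """Register one request's token count at the given timestamp."""
        self._events.append((timestamp, tokens))
        self._total += tokens
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the trailing window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, tokens = self._events.popleft()
            self._total -= tokens

    def tokens_per_minute(self, now):
        """Tokens in the trailing window, scaled to a per-minute rate."""
        self._evict(now)
        return self._total * 60.0 / self.window_seconds
```

Logging tokens_per_minute(now) whenever a 429 arrives shows whether the errors coincide with a token spike or occur at low throughput.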
Step 2: Test a GA Model vs. Preview
If the errors are coming from gemini-3-flash-preview:
Action: Temporarily switch that traffic to gemini-1.5-pro-002 or gemini-1.5-flash-002 (Stable/GA versions).
Why: GA models have Service Level Agreements (SLAs) and reserved capacity. If the errors stop, the issue was “Preview” capacity throttling, which you cannot fix other than by waiting or switching models.
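To judge whether the errors actually stop after the switch, it helps to compare the 429 rate per model over the same period. A minimal sketch, assuming your request logs can be reduced to (model name, HTTP status) pairs; error_rate_by_model is a hypothetical helper, not part of any SDK:

```python
from collections import defaultdict


def error_rate_by_model(records):
    """records: iterable of (model_name, http_status) pairs.

    Returns a dict mapping each model to the fraction of its requests
    that came back with HTTP 429.
    """
    totals = defaultdict(int)
    throttled = defaultdict(int)
    for model, status in records:
        totals[model] += 1
        if status == 429:
            throttled[model] += 1
    return {model: throttled[model] / totals[model] for model in totals}
```

A clearly higher rate on the preview model under comparable load would support the capacity-throttling explanation. A similar rate on both would point elsewhere.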
Step 3: Regional Redundancy
If europe-west4 is genuinely experiencing “Shared Capacity” issues (which does happen):
Action: Configure your client to fail over to a different region (e.g., us-central1 or europe-west1) upon receiving a 429.
Note: This requires your data residency requirements to allow processing in other regions.
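The failover idea can be sketched independently of any particular SDK. In the example below, send is a hypothetical callable that you supply. It is assumed to issue the request against the given region and to raise RateLimitError (also a stand-in name) on a 429. How you build a region-specific client is left out on purpose.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def call_with_region_failover(send, regions):
    """Try send(region) for each region in order.

    Moves on to the next region only when the current one raises
    RateLimitError; any other exception propagates immediately.
    Returns (region_used, result), or re-raises the last 429 if
    every region was throttled.
    """
    last_error = None
    for region in regions:
        try:
            return region, send(region)
        except RateLimitError as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("regions must not be empty")
    raise last_error
```

Returning the region that served the request makes it possible to log how often the fallback is actually used.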
Summary for your Engineering Team
Hypothesis: The application is likely hitting a Token Throughput (TPM) limit, which is distinct from the Request (RPM) limit shown, or it is suffering from service health/capacity throttling on the gemini-3-flash-preview model.
We are seeing 429 RESOURCE_EXHAUSTED errors on gemini-3-flash-preview even when token usage is well below 50k TPM and the quota dashboard shows “Unlimited” for this model.
The error responses contain no quota name or dimension.
This strongly suggests shared capacity or preview-model admission control rather than a visible quota limit.
Could you please confirm whether our project is being throttled due to preview model capacity and whether a GA alternative or capacity adjustment is recommended?
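For anyone who wants to run the same check on their own 429 responses, here is a rough sketch of how to look for a quota name or dimension in the error body. It assumes the body follows the standard Google API JSON error format, where extra information appears under error.details as google.rpc.QuotaFailure or google.rpc.ErrorInfo entries. quota_hints is a hypothetical helper, and which fields (if any) Vertex AI fills in for this error is not something this thread confirms.

```python
QUOTA_FAILURE = "type.googleapis.com/google.rpc.QuotaFailure"
ERROR_INFO = "type.googleapis.com/google.rpc.ErrorInfo"


def quota_hints(error_body):
    """Collect quota-identifying fields from a JSON error body parsed into a dict.

    Returns a list of strings; an empty list means the response named
    no quota subject and carried no ErrorInfo metadata.
    """
    hints = []
    for detail in error_body.get("error", {}).get("details", []):
        kind = detail.get("@type")
        if kind == QUOTA_FAILURE:
            for violation in detail.get("violations", []):
                hints.append(violation.get("subject") or violation.get("description"))
        elif kind == ERROR_INFO:
            hints.extend(f"{k}={v}" for k, v in detail.get("metadata", {}).items())
    return [h for h in hints if h]
```

An empty result on every 429 matches what is described above: the response gives no indication of which limit was hit.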
Have been experiencing the same thing with other models. I thought Vertex AI was supposed to be the reliable one with a couple steps to set up, but that does not seem to be the case.