Spike in 429 RESOURCE_EXHAUSTED with v1beta1 StreamGenerateContent (Gemini 3 Flash Preview / Vertex global) — quotas look fine

Hi all — we’re seeing a persistent increase in 429 RESOURCE_EXHAUSTED errors in production when calling Gemini via Vertex AI streaming.

What changed

  • Previously our stack was using:

    • google.cloud.aiplatform.v1.PredictionService.StreamGenerateContent

    • This was noticeably more stable (rare 429s).

  • Recently it switched to:

    • google.cloud.aiplatform.v1beta1.PredictionService.StreamGenerateContent

    • Since then we’re seeing lots of RESOURCE_EXHAUSTED failures.

We’re calling from google-adk, using streaming responses. The ADK library appears to hardcode the Vertex client to v1beta1, so it’s difficult to test v1 without patching.

Symptoms

  • Error: google.adk.models.google_llm._ResourceExhaustedError (maps to 429 RESOURCE_EXHAUSTED)

  • Model: gemini-3-flash-preview

  • Location: global (Gemini 3 Flash preview seems global-only on Vertex)

  • Happens despite quotas looking within limits in the console.

  • These are not huge bursts — it occurs during normal interactive chat traffic.

Questions

  1. Is anyone else seeing an uptick in 429 RESOURCE_EXHAUSTED specifically with v1beta1 StreamGenerateContent in the last ~1–2 weeks?

  2. Are there additional/granular limits (per-model / per-project / per-stream concurrency / per-minute token limits) that don’t show up clearly in the standard quota charts?

  3. Does RESOURCE_EXHAUSTED reliably distinguish between:

    • quota exceeded vs

    • backend capacity / shared contention
      …and is there a recommended way to tell which one we’re hitting (e.g., specific metric strings, headers, gRPC details like RetryInfo)?

  4. Any best practices for stability here beyond:

    • client-side concurrency limiting

    • exponential backoff with jitter

    • token/context reduction

    • or purchasing Provisioned Throughput for gemini-3-flash-preview?

Happy to share anonymized logs if helpful (timestamps, approximate request rates, concurrent stream counts, input/output token estimates, etc.). Mainly trying to understand whether this is a known issue with current global capacity / v1beta1 routing, and what the recommended mitigation is.

Thanks!

The 429 RESOURCE_EXHAUSTED with v1beta1 StreamGenerateContent often isn’t just about visible quotas. It can reflect per-model or per-stream concurrency limits, backend capacity, or token-rate limits that aren’t shown in the console. RESOURCE_EXHAUSTED doesn’t always distinguish between quota vs. backend contention, but gRPC details like RetryInfo, or monitoring serving.googleapis.com/quota_exceeded metrics, can help identify the cause.

For stability, keep client-side concurrency low, use exponential backoff with jitter, reduce token/context size, and consider Provisioned Throughput if consistent capacity is needed. The uptick with v1beta1 likely relates to routing changes and global shared load.
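The two client-side mitigations above can be sketched together: cap concurrent streams with a semaphore and retry 429s with full-jitter exponential backoff. The `call_model` stub, the exception class, and the limits are illustrative stand-ins, not ADK or Vertex SDK APIs.

```python
# Sketch: concurrency gate + full-jitter exponential backoff for 429s.
import random
import threading
import time

MAX_CONCURRENT_STREAMS = 4  # tune to stay under observed limits
stream_gate = threading.BoundedSemaphore(MAX_CONCURRENT_STREAMS)


class ResourceExhausted(Exception):
    """Stand-in for the SDK's 429 RESOURCE_EXHAUSTED error."""


def with_backoff(call, max_attempts=5, base=0.5, cap=30.0):
    """Run `call` under the concurrency gate, retrying 429s with jitter."""
    for attempt in range(max_attempts):
        with stream_gate:  # never hold more than N streams open at once
            try:
                return call()
            except ResourceExhausted:
                if attempt == max_attempts - 1:
                    raise
        # Full jitter: sleep a random amount up to the exponential ceiling.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


# Illustrative stub that fails twice with 429, then succeeds.
attempts = {"n": 0}

def call_model():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ResourceExhausted()
    return "ok"

print(with_backoff(call_model))  # ok
```

If the server does return a RetryInfo delay, prefer it over the computed backoff for that attempt.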


We are seeing this exact same issue too, and only with `gemini-3-pro-image-preview` through Vertex Studio, which routes through google.cloud.aiplatform.ui.PredictionService.StreamGenerateContent and quite possibly hits v1beta1 under the hood.

It’s not intermittent either: it’s been like this for the past several days, and we get nothing back but 429s.

There are multiple reports of this across Vertex and Gemini API users, some going back months, but no resolution.

It’s definitely not about quotas for us. This really feels like an infrastructure-level problem: general platform instability with this model in particular, and with how shared throughput is being managed (or not).