FYI, we were experiencing periodic service outages (429 or 503 errors) on a “pay as you go” plan; it looks like the cause is Google’s new Dynamic Shared Quota (DSQ) “feature”.
DSQ affects the latest versions of Gemini 1.5 Flash (gemini-1.5-flash-002) and Gemini 1.5 Pro (gemini-1.5-pro-002). It distributes on-demand capacity across all queries processed by Google Cloud services for the 1.5-002 and 2.0 models, which means heavy demand within the same region can produce 429 “resource exhausted” errors even for pay-as-you-go accounts (see “Error code 429 | Generative AI on Vertex AI | Google Cloud”).
While exponential backoff helps mitigate this, it’s not a foolproof solution. Currently, the only reliable option for production workloads seems to be migrating to the Vertex API directly and switching regions.
It’s worth noting that reducing input token count also improves success rates. Keeping input under 20k tokens and using exponential backoff has given me a consistent 70-80% success rate in recent tests; inputs over 20k tokens become significantly less reliable (at current region utilization).
I see a couple of other threads about 429s appearing as if you’re on the free tier even when you’re on a paid one, and I’m running into the same issue: 429 Too Many Requests on a paid account, with no signs of going anywhere near quota in the dashboard, while making fewer than 5 RPM, each request only a few hundred tokens. Clearly some people are not having this issue, but I haven’t been able to narrow down anything that changed or that I’m doing wrong.
I have found some improvement by sticking strictly to the Gemini conversation format (Text generation | Gemini API | Google AI for Developers), which uses the multipart shape {"role": "user", "parts": [{"text": "text"}]} instead of the {"role": "user", "content": "text"} shape I was using before.
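For reference, a minimal sketch of that translation: converting OpenAI-style `{"role", "content"}` messages into the Gemini `contents` shape with a `parts` list. The helper name is my own, and this only handles plain-text parts.

```python
def to_gemini_contents(messages):
    """Convert [{"role": ..., "content": ...}] messages into the
    Gemini request shape [{"role": ..., "parts": [{"text": ...}]}]."""
    return [
        {"role": m["role"], "parts": [{"text": m["content"]}]}
        for m in messages
    ]


# Example: a single user turn in each shape.
openai_style = [{"role": "user", "content": "Hello!"}]
gemini_style = to_gemini_contents(openai_style)
# gemini_style == [{"role": "user", "parts": [{"text": "Hello!"}]}]
```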
Got a reply from Google about the 429 error with Gemini 2.0 Flash in Vertex AI
I got the chance to talk to our business partner at Google this week and told him about the 429 quota exceeded error in Vertex AI, even though we are on paid Tier 1 (don’t ask me what the difference is or how to change it) with 2,000 requests per minute. The error appeared after 5 requests…
tl;dr: The quota is not guaranteed, so you should consider purchasing “Provisioned Throughput”, BUT Provisioned Throughput is not supported at launch for Gemini 2.0 Flash (nor are Fine-Tuning, Context Caching, and the Batch API). So we need to wait A COUPLE OF WEEKS for this to be resolved…
My hope is that it’s currently a resource problem that might resolve itself in the next few days, and that they’ll allocate more resources to us. It’s a real bummer, as we were really looking forward to using 2.0 Flash and the results look promising.
Why isn’t anyone from Google replying to these? There are thousands of us dealing with this complete BS issue of “[429 too many requests] resource has been exhausted” after a minuscule number of requests, even on a paid tier. One of the largest companies in the world, and they have the worst console UI: multiple seemingly disconnected portals that are all strangely linked together with no clear explanation of what’s what, and every time I use the site it leaks memory and crashes after 6 GB+ of usage in my browser.
LMAO. PAID users OPENING CASES on Twitter for these 429 errors, while no one from GCP at Google is replying to these “changes” (pay as you go with Provisioned Throughput).
I AGREE WITH Dylan’s opinion from the Lex Fridman podcast earlier this Feb:
Dylan Patel (04:04:27) … if there’s no revenue for AI stuff or not enough revenue, then obviously, it’s going to blow up. People won’t continue to spend on GPUs forever. And NVIDIA is trying to move up the stack with software that they’re trying to sell and licensed and stuff. But Google has never had that DNA of like, “This is a product we should sell.” The Google Cloud, which is a separate organization from the TPU team, which is a separate organization from the DeepMind team, which is a separate organization from the Search team. There’s a lot of bureaucracy here.
Lex Fridman (04:04:52) Wait. Google Cloud is a separate team than the TPU team?
Dylan Patel (04:04:55) Technically, TPU sits under infrastructure, which sits under Google Cloud. But Google Cloud, for renting stuff-
Dylan Patel (04:05:00) … But Google cloud for renting stuff and TPU architecture are very different goals, and hardware and software, all of this, right? The Jax XLA teams do not serve Google’s customers externally. Whereas NVIDIA’s various CUDA teams for things like NCCL serve external customers. The internal teams like Jax and XLA and stuff, they more so serve DeepMind and Search, right? And so their customer is different. They’re not building a product for them.
Lex Fridman (04:05:27) Do you understand why AWS keeps winning versus Azure for cloud versus Google Cloud?
Dylan Patel (04:05:34) Yeah, there’s-
Lex Fridman (04:05:35) Google Cloud is tiny, isn’t it, relative to AWS?
Dylan Patel (04:05:37) Google Cloud is third. Yeah. Microsoft is the second biggest, but Amazon is the biggest, right?
Lex Fridman (04:05:37) Yeah.
from: https://lexfridman.com/deepseek-dylan-patel-nathan-lambert-transcript#chapter17_ai_megaclusters
I’m a paying customer of GCP and I’m getting rate limited after 2 requests to Gemini Flash and other models on Vertex AI. This has been going on for over 2 weeks. I specifically bought a support tier to handle it. After 2 weeks of pointless back and forth, during which I spent many hours testing and documenting, I was informed that I should buy Provisioned Throughput. Yeah, the one that doesn’t exist.
Great, let me build my startup with 2 requests per minute to a small model.
Did I mention that these are ~20-token requests/responses?
It’s outrageous. I just tested the free tier of the public Gemini API and got 7 times more throughput.
Hey everyone, sorry for the trouble. If you’re experiencing issues, please DM me your GCP project number (9-13 digits) and whether you’re using the Developer API or Vertex AI for debugging purposes.
This started happening today: on a paid account, we’re being limited to almost 1 request per minute… And there is no GCP project to report against. GCP customer service is requesting a meeting… This is a production service…
@Vishal We’re also constantly getting 429 errors despite being on a paid plan through the Developer API (which we’ve been using fine for months). I believe this is a new issue on Gemini’s side due to the new Dynamic Shared Quota allocation.
I don’t see any way to DM you on this discuss.ai.google.dev portal. Do you have an email that I can contact? Thanks!