This error happens all the time. Too few retries are allowed.
It can also happen at any step of a multi-step agentic loop.
If the model is reading files and making decisions, or doing a multi-part edit, it fails halfway through its agentic loop. Nudging it breaks the proper order of the context messages, so the model may misbehave even on the rare occasion it does manage to work.
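To make the retry point concrete, here is a minimal sketch of what I would expect the client to do when a step fails: retry the same step against the same, untouched message history, with backoff, instead of making the user nudge the model. Everything here is an assumption for illustration: `OverloadedError`, `call_model`, and the attempt and delay values are hypothetical names and numbers, not anything from the actual product.

```python
import random
import time


class OverloadedError(Exception):
    """Hypothetical error raised when the inference backend is overloaded."""


def run_step_with_retry(call_model, messages, max_attempts=6, base_delay=1.0,
                        max_delay=60.0, sleep=time.sleep):
    """Retry one agentic step against the same, unmodified message history."""
    snapshot = list(messages)  # copy so a failed attempt cannot alter context
    for attempt in range(max_attempts):
        try:
            return call_model(snapshot)
        except OverloadedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The important part is that no extra "please continue" message is ever appended, so if the step eventually succeeds, the model sees exactly the context it would have seen without the failure.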
You need to implement some sort of queuing and longer “timeouts” on the server side. Maybe some “priority quota”. It has been impossible to run any inference at all on Claude models for the past week. Others are reporting problems with Gemini too. I don’t really know; I rarely have tasks I trust Gemini to complete, so I use other tools instead.
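A rough sketch of the queuing idea, assuming a single queue where each request carries a priority tier and a maximum wait time. The class name `RequestQueue`, the tier numbers, and the 300-second default are all made up for illustration; I have no idea how the real backend is built.

```python
import heapq
import itertools
import time


class RequestQueue:
    """Toy priority queue: lower priority number is served first, FIFO within a tier."""

    def __init__(self, max_wait_seconds=300.0, clock=time.monotonic):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order
        self._max_wait = max_wait_seconds
        self._clock = clock

    def submit(self, request_id, priority):
        """Enqueue a request with a deadline instead of failing it immediately."""
        deadline = self._clock() + self._max_wait
        heapq.heappush(self._heap, (priority, next(self._counter), deadline, request_id))

    def next_request(self):
        """Return the next request that has not timed out, or None if none remain."""
        while self._heap:
            _, _, deadline, request_id = heapq.heappop(self._heap)
            if self._clock() <= deadline:
                return request_id
            # Expired while waiting: drop it (a real server would report a timeout).
        return None
```

With something like this, an overloaded backend makes requests wait their turn rather than erroring out mid-loop, and a “priority quota” is just a matter of which tier a request is submitted under.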
This is not an auth or quota error. This is a server-side error: the inference layer is overloaded. Perhaps you should reduce users’ quotas if you are unable to honor them?
And no, using smaller models does not help most of the time. I used to match the model to the task at hand, but you successfully cured me of that. If it is going to fail 90% of the time anyway, I might as well try Opus each time, in case it actually works. It’s not really real Opus anyway, since you quantize and re-route it. But if the service has a failure rate that high, why would I bother trying to use my quota “responsibly”? Clearly I would just handicap myself with no upside.
