**Update: April 10 — model capacity rotation confirmed, gemini-2.5-flash now affected**
Following up on our earlier diagnostics in this issue (#2249, googleapis/python-genai). We now have 48h of data showing a new pattern: **capacity rotation between models**.
### Production workload (Apr 10, 00:00-03:00 UTC)
Application running `gemini-2.5-flash` as primary + `gemini-2.5-flash-lite` as fallback (switched from `gemini-3-flash-preview` after Apr 9 failures). SDK 1.67.0, 18 parallel tasks each making multiple `generate_content` calls with `google_search` tool.
| Metric | Value |
|--------|-------|
| Tasks completed | **18/18 (100%)** |
| Search grounding retries | 114 |
| Search grounding primary model success | 79% |
| Reasoning call retries | **0** |
| Reasoning primary model success | 75% |
| Error breakdown | 298× `503 UNAVAILABLE`, 7× `504 DEADLINE_EXCEEDED`, 2× `429 RATE_LIMIT` |
All tasks completed, but with significant retry overhead: 87% of retries were concentrated in the first 15 minutes, during the burst of concurrent requests at startup.
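For context, our fallback mechanism is roughly the following sketch. This is illustrative, not our exact code: `call_model` stands in for the real `generate_content()` wrapper, and the retry limits shown here are placeholders. In the real code only 503/504-style transient errors trigger a retry.

```python
import random
import time

# Hypothetical sketch of the primary/fallback retry loop. `call_model` is a
# stand-in for the real generate_content() call; here any exception is treated
# as a transient capacity error (503/504) for simplicity.
def call_with_fallback(call_model, models, max_retries=3, base_delay=1.0):
    last_err = None
    for model in models:                      # primary first, then fallback
        for attempt in range(max_retries):
            try:
                return model, call_model(model)
            except Exception as err:          # real code inspects the status code
                last_err = err
                # exponential backoff with jitter, to spread out retry bursts
                time.sleep(base_delay * (2 ** attempt) * random.random())
    raise last_err
```

The jitter matters for exactly the startup-burst pattern above: without it, 18 parallel tasks all retry on the same schedule and re-collide.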
### Isolated model test (Apr 10, ~08:00 UTC)
Same diagnostic script as before, now on SDK **1.71.0** with a 60s timeout. Six models, each tested sequentially (2 calls) and concurrently (6 parallel calls). Each call: `generate_content()` with `Tool(google_search=GoogleSearch())`.
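For anyone who wants to reproduce, the harness has this shape (a sketch, not the full script: `probe(model)` is a placeholder for one `generate_content()` call with the `google_search` tool that returns True on success, False on error):

```python
from concurrent.futures import ThreadPoolExecutor

# Sequential + concurrent probe matrix, as used in our diagnostic script.
# `probe(model)` is an assumed user callable wrapping one generate_content()
# call; it returns True on success and False on 503/504/429.
def run_matrix(probe, models, seq_calls=2, conc_calls=6):
    results = {}
    for model in models:
        seq_ok = sum(probe(model) for _ in range(seq_calls))
        with ThreadPoolExecutor(max_workers=conc_calls) as pool:
            conc_ok = sum(pool.map(probe, [model] * conc_calls))
        results[model] = (f"{seq_ok}/{seq_calls}", f"{conc_ok}/{conc_calls}")
    return results
```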
| Model | Sequential w/ search | Concurrent w/ search (6x) | Sequential w/o search | Concurrent w/o search (6x) |
|-------|---------------------|---------------------------|----------------------|----------------------------|
| gemini-2.5-flash-lite | **2/2 OK** (3s) | **6/6 OK** (2-5s) | 2/2 OK | 6/6 OK |
| gemini-2.5-flash | **0/2 ERROR** (503) | **0/6 ERROR** (503) | **0/2 ERROR** (503) | **0/6 ERROR** (503) |
| gemini-2.5-pro | 1/2 (503) | **0/6 ERROR** (503) | 0/2 (503) | 1/6 (503) |
| gemini-3-flash-preview | **1/2** (44s + 504) | **5/6 OK** (14-35s) | **2/2 OK** (7-37s) | **6/6 OK** (8-31s) |
| gemini-3.1-flash-lite-preview | **2/2 OK** (5-9s) | **6/6 OK** (6-10s) | 2/2 OK | 6/6 OK |
| gemini-3.1-pro-preview | **2/2 OK** (19-38s) | **6/6 OK** (19-32s) | 2/2 OK | 6/6 OK |
### Key finding: capacity rotation
Comparing our Apr 9 and Apr 10 isolated tests (same script, same API key, same prompts):
| Model | Apr 9 sequential search | Apr 10 sequential search | Change |
|-------|------------------------|--------------------------|--------|
| gemini-2.5-flash | **2/2 OK** | **0/2 ERROR** | **degraded** |
| gemini-3-flash-preview | **0/2 ERROR** | **1/2 OK** | **recovered** |
And the timeline across all our data points:
| Time (UTC) | gemini-2.5-flash | gemini-3-flash-preview |
|------------|-------------------|------------------------|
| Apr 9 13:25 | **OK** (2/2 seq) | **DEAD** (0/2 seq, 3/6 conc) |
| Apr 10 00:00 | **79% primary** (prod session) | not tested |
| Apr 10 08:00 | **DEAD** (0/16 total) | **RECOVERED** (8/10 total) |
This is **not a binary outage** — it’s dynamic GPU capacity reallocation between models. A model working at 00:00 UTC can be completely unavailable at 08:00 UTC, with no warning or status page update.
### Models unaffected by rotation
Three models showed **100% success rate** across both days, both sequential and concurrent:
- `gemini-2.5-flash-lite`
- `gemini-3.1-flash-lite-preview`
- `gemini-3.1-pro-preview`
These appear to have dedicated/stable capacity, while `gemini-2.5-flash` and `gemini-3-flash-preview` share a contested pool.
### Impact
For production workloads that depend on Google Search grounding, this capacity rotation makes it impossible to pick one reliably available model. Our application's fallback mechanism masks the failures from end users, but at the cost of ~114 wasted retries per session and 2-3x longer processing times.
**Question for the team:** Is there a way to query model availability/health before sending requests? An endpoint returning current capacity status would let us route to available models proactively instead of discovering failures through retries.
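In the meantime we are considering a client-side workaround along these lines: run a cheap canary probe per model, cache the result for a few minutes, and route to the first model that answered recently. To be clear, this is purely a sketch of our own idea, not an SDK feature; the `probe` callable, TTL, and model list are all assumptions.

```python
import time

# Hypothetical client-side health cache. The SDK exposes no capacity/health
# endpoint today, so `probe` is an assumed user callable that makes a cheap
# canary request and returns True if the model responds.
class ModelHealthCache:
    def __init__(self, probe, ttl=300.0, clock=time.monotonic):
        self.probe = probe      # returns True if the model currently responds
        self.ttl = ttl          # seconds before a cached result goes stale
        self.clock = clock      # injectable time source, for testing
        self._cache = {}        # model -> (healthy, checked_at)

    def is_healthy(self, model):
        entry = self._cache.get(model)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]     # fresh cached result: skip the probe
        healthy = self.probe(model)
        self._cache[model] = (healthy, self.clock())
        return healthy

    def pick(self, models):
        # Route to the first model that looks healthy; None if all are down.
        for model in models:
            if self.is_healthy(model):
                return model
        return None
```

A server-side capacity endpoint would still be strictly better: the canary itself burns quota and can hit the same 503s it is trying to predict.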