Has anyone noticed degraded performance of Gemini 3 Flash on visual grounding tasks over the last week? I’m using it with the reasoning level set to minimum to locate UI elements on the screen (bounding box detection).
The same use case that worked without issues two weeks ago started failing last week. The model now clearly struggles to identify elements that are small on the screen, even though the content part still uses MEDIA_RESOLUTION_HIGH, just as before. Could it be due to possible quantization of the model?
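For reference, here is roughly the request configuration I mean, sketched as a raw `generateContent` request body. The field names follow the Gemini REST API as I understand it; the exact `thinkingLevel` value and the placeholder content are my assumptions, not verbatim from my setup:

```python
import json

# Sketch of a generateContent request body for UI grounding.
# "thinkingLevel": "LOW" stands in for "reasoning level set to minimum";
# the base64 payload and prompt text are placeholders.
request_body = {
    "contents": [{
        "role": "user",
        "parts": [
            {"inlineData": {"mimeType": "image/png", "data": "<base64 screenshot>"}},
            {"text": "The target element: ..."},
        ],
    }],
    "generationConfig": {
        "mediaResolution": "MEDIA_RESOLUTION_HIGH",
        "thinkingConfig": {"thinkingLevel": "LOW"},
    },
}

print(json.dumps(request_body, indent=2))
```

Nothing else about the request changed between the period when it worked and when it started failing; only the model's behavior did.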
Absolutely! It is not usable for me anymore, at least via the Vertex gateway. It got stuck in a loop multiple times. Sadly, I ended up switching models. I was really excited about the initial Gemini 3 Flash release.
Hi @Sonali_Kumari1, the website I used for my PoC is austriaviennarentalcar com . On this website you can choose, e.g., the Pick-up time hour. With Gemini 3 Flash at minimal thinking level and MEDIA_RESOLUTION_HIGH, I could previously click the option ‘10’ in that drop-down. After the issue appeared, the model started choosing options other than the target, e.g. 7, 11, etc. Previously it worked without issues.
The system prompt I’m using for that search:
You are an expert with advanced image analysis capabilities, specifically in understanding and interpreting graphical user interface (GUI) elements and detecting their accurate location on the screen.
You are provided with a screenshot and detailed information about the target UI element.
Your task is, step-by-step:
Analyze the screenshot provided to you.
Analyze all the information provided to you about the target UI element.
Identify all UI elements on the screenshot that fully correspond to the target UI element. If the target UI element is data-dependent, use the information about the element’s attributes and the available test data for this identification.
For each identified UI element, detect its bounding box, using the available information about the visual appearance of the target UI element and its location (including its position relative to other elements on the screenshot) to identify the most accurate bounding box coordinates.
Return all identified bounding boxes, and never output anything else.
If you can’t identify any bounding boxes, return an empty list.
The user message I’m using for that prompt:
The target element:
“Option in the ‘Pick-up Time’ hour dropdown list. A 2-digit number icon containing the specified hour. The target element is located among other hour icons in the dropdown list below. ‘Pick-up Time’ drop-down input.”
This element is data-dependent.
Available specific data for this element: “Pick up time is 10:30”
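To act on the response, I map the returned boxes back to screen pixels before clicking. A minimal sketch, assuming Gemini’s documented bounding-box convention of `[ymin, xmin, ymax, xmax]` normalized to a 0–1000 scale (the helper names are mine):

```python
def box_to_pixels(box, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box on Gemini's 0-1000
    normalized scale to (left, top, right, bottom) pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * width),
        int(ymin / 1000 * height),
        int(xmax / 1000 * width),
        int(ymax / 1000 * height),
    )

def click_point(box, width, height):
    """Center of the box in pixels -- the point passed to the clicker."""
    left, top, right, bottom = box_to_pixels(box, width, height)
    return ((left + right) // 2, (top + bottom) // 2)

# Example on a 3840x2160 (4K) screenshot:
print(box_to_pixels([100, 200, 150, 300], 3840, 2160))  # (768, 216, 1152, 324)
print(click_point([100, 200, 150, 300], 3840, 2160))    # (960, 270)
```

When the model picks the wrong dropdown option (7 or 11 instead of 10), the box itself is well-formed; it is simply anchored on the wrong element, so the click lands on the wrong hour.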
An important note: using MEDIA_RESOLUTION_ULTRA_HIGH got me past some of the issues, but it’s still not stable for high-resolution screenshots (4K etc.). Given that this initially worked with MEDIA_RESOLUTION_HIGH, this clearly points to a regression.