I am looking to adapt Gemini 3 Pro Image (gemini-3-pro-image-preview) for a UI automation use case.
My Situation:
I have a large dataset of application screenshots. Each screenshot is paired with image-level labels identifying the components present (e.g., ‘Sidebar’, ‘Search Bar’, ‘Primary Button’), but I do not currently have the specific bounding box coordinates for these elements.
My Requirement:
I want to fine-tune a Gemini model so that it can transition from ‘recognition’ (knowing the button exists) to ‘localization’ (identifying the component and generating its bounding box).
Questions for Support:
- Does Gemini 3 support weakly supervised fine-tuning, where the model learns to localize objects based solely on image-level labels and descriptions?
- If not, does Google Cloud offer a managed labeling service, or a tool within Vertex AI, that can automatically generate ‘silver-standard’ bounding boxes from my labels using Gemini 3’s zero-shot reasoning?
- What is the recommended JSONL structure for a multimodal tuning dataset that includes images and labels but omits spatial coordinates?
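For context on that last question, here is roughly how I am picturing one record of my current data. This is only a sketch: the `contents`/`role`/`parts` layout follows the Vertex AI Gemini supervised-tuning JSONL format as I understand it, and the bucket URI, prompt text, and labels are placeholders from my dataset, not anything official.

```python
import json

# Hypothetical example of one training record as my data exists today:
# an image reference plus image-level labels, with no coordinates.
# Field names assume the Vertex AI "contents"/"parts" tuning schema.
record = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"fileData": {"mimeType": "image/png",
                              "fileUri": "gs://my-bucket/screens/checkout_01.png"}},
                {"text": "List the UI components visible in this screenshot."},
            ],
        },
        {
            "role": "model",
            "parts": [{"text": "Sidebar, Search Bar, Primary Button"}],
        },
    ]
}

# One JSON object per line, as JSONL expects.
line = json.dumps(record)
print(line)
```

If this shape is wrong for a coordinate-free dataset, a pointer to the correct schema would answer my third question.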
My goal is to eventually have the model output structured JSON with bounding boxes for any new screen I provide. Please let me know the best path forward within the Gemini 3 ecosystem.
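To make the target concrete, this is the kind of structured output I would want the tuned model to emit for a new screen. It is a sketch of my desired schema, not an existing API response: I have assumed the normalized 0–1000 `[ymin, xmin, ymax, xmax]` box convention that Gemini's object-detection examples use, and the labels and numbers are invented for illustration.

```python
import json

# Hypothetical target output for a single new screenshot. Assumes the
# 0-1000 normalized [ymin, xmin, ymax, xmax] convention; I would adopt
# whatever convention Vertex AI recommends instead.
desired_output = {
    "components": [
        {"label": "Sidebar",        "box_2d": [0, 0, 1000, 180]},
        {"label": "Search Bar",     "box_2d": [40, 220, 110, 960]},
        {"label": "Primary Button", "box_2d": [860, 700, 940, 960]},
    ]
}

def boxes_valid(payload):
    """Check that every box is [ymin, xmin, ymax, xmax] within 0-1000."""
    for comp in payload["components"]:
        ymin, xmin, ymax, xmax = comp["box_2d"]
        if not (0 <= ymin < ymax <= 1000 and 0 <= xmin < xmax <= 1000):
            return False
    return True

print(json.dumps(desired_output, indent=2))
```

A validator like `boxes_valid` is how I would plan to sanity-check any ‘silver-standard’ boxes generated automatically before using them for tuning.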
Best regards,