I am looking to adapt Gemini 3 Pro Image (gemini-3-pro-image-preview) for a UI automation use case.
My Situation:
I have a large dataset of application screenshots. Each screenshot is paired with image-level labels identifying the components present (e.g., ‘Sidebar’, ‘Search Bar’, ‘Primary Button’), but I do not currently have the specific bounding box coordinates for these elements.
My Requirement:
I want to fine-tune a Gemini model so that it can transition from ‘recognition’ (knowing the button exists) to ‘localization’ (identifying the component and generating its bounding box).
Questions for Support:
- Does Gemini 3 support weakly supervised fine-tuning, where the model learns to localize objects based solely on image-level labels and descriptions?
- If not, does Google Cloud offer a managed labeling service, or a tool within Vertex AI, that can automatically generate ‘silver-standard’ bounding boxes from my labels using Gemini 3’s zero-shot reasoning?
- What is the recommended JSONL structure for a multimodal tuning dataset that includes images and labels but omits spatial coordinates?
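For context on that last question, here is roughly how I am picturing one record of my current data. This is only a sketch: the `contents`/`role`/`parts` layout follows the Vertex AI Gemini supervised-tuning JSONL format as I understand it, and the bucket URI, prompt text, and labels are placeholders from my dataset, not anything official.

```python
import json

# Hypothetical example of one training record as my data exists today:
# an image reference plus image-level labels, with no coordinates.
# Field names assume the Vertex AI "contents"/"parts" tuning schema.
record = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"fileData": {"mimeType": "image/png",
                              "fileUri": "gs://my-bucket/screens/checkout_01.png"}},
                {"text": "List the UI components visible in this screenshot."},
            ],
        },
        {
            "role": "model",
            "parts": [{"text": "Sidebar, Search Bar, Primary Button"}],
        },
    ]
}

# One JSON object per line, as JSONL expects.
line = json.dumps(record)
print(line)
```

If this shape is wrong for a coordinate-free dataset, a pointer to the correct schema would answer my third question.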
My goal is to eventually have the model output structured JSON with bounding boxes for any new screen I provide. Please let me know the best path forward within the Gemini 3 ecosystem.
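To make the target concrete, this is the kind of structured output I would want the tuned model to emit for a new screen. It is a sketch of my desired schema, not an existing API response: I have assumed the normalized 0–1000 `[ymin, xmin, ymax, xmax]` box convention that Gemini's object-detection examples use, and the labels and numbers are invented for illustration.

```python
import json

# Hypothetical target output for a single new screenshot. Assumes the
# 0-1000 normalized [ymin, xmin, ymax, xmax] convention; I would adopt
# whatever convention Vertex AI recommends instead.
desired_output = {
    "components": [
        {"label": "Sidebar",        "box_2d": [0, 0, 1000, 180]},
        {"label": "Search Bar",     "box_2d": [40, 220, 110, 960]},
        {"label": "Primary Button", "box_2d": [860, 700, 940, 960]},
    ]
}

def boxes_valid(payload):
    """Check that every box is [ymin, xmin, ymax, xmax] within 0-1000."""
    for comp in payload["components"]:
        ymin, xmin, ymax, xmax = comp["box_2d"]
        if not (0 <= ymin < ymax <= 1000 and 0 <= xmin < xmax <= 1000):
            return False
    return True

print(json.dumps(desired_output, indent=2))
```

A validator like `boxes_valid` is how I would plan to sanity-check any ‘silver-standard’ boxes generated automatically before using them for tuning.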
Best regards,