I have a task that takes video/image input and produces text output, and Gemini does not handle it very well. I want to fine-tune the model on my own data, but it seems that Gemini currently does not support fine-tuning with multimodal (image/video) data. Since the new Gemini 1.5 Pro and Flash models do have multimodal understanding capabilities, I was wondering: when will Gemini support multimodal fine-tuning?