I have a task that takes video/images as input and produces text as output, which Gemini does not handle very well. I want to fine-tune the model on my own data, but it seems that Gemini currently does not support fine-tuning with multimodal image/video data. Since the new Gemini 1.5 Pro and Flash models do have multimodal understanding capabilities, when will Gemini support multimodal fine-tuning?
Is there any update on this?