I’ve been building a prototype app that uses the Gemini Multimodal Live API. The use case involves streaming video input and then asking text-based questions about the video.
This was working fine earlier, but since the model changed from gemini-2.0-flash-exp to gemini-2.0-flash-live-001, the API no longer responds to text questions about the video. It only seems to respond to audio-based queries now.
Is this a configuration issue on my end, or has something changed with the API behavior? I’d really appreciate your help.
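For reference, here is a minimal sketch of the pattern I’m using. To be clear, this is a simplified reproduction rather than my exact app: it assumes the google-genai Python SDK, and get_jpeg_frames() is a placeholder for my real frame capture.

```python
import asyncio
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

MODEL = "gemini-2.0-flash-live-001"
CONFIG = {"response_modalities": ["TEXT"]}


def get_jpeg_frames():
    # Placeholder frame source; in the real app frames come from a live capture.
    for path in sorted(Path("frames").glob("*.jpg")):
        yield path.read_bytes()


async def main():
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        # Stream video frames (JPEG bytes) as realtime input, roughly 1 per second.
        for frame_bytes in get_jpeg_frames():
            await session.send_realtime_input(
                video=types.Blob(data=frame_bytes, mime_type="image/jpeg")
            )
            await asyncio.sleep(1.0)

        # Ask a text question about the streamed video.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What is happening in the video?"}]},
            turn_complete=True,
        )

        # With gemini-2.0-flash-live-001, this is where the model now replies
        # that it doesn't have access to the video.
        async for response in session.receive():
            if response.text:
                print(response.text, end="")


asyncio.run(main())
```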
Hi @Govind_Keshari. Yes, it does support video input. However, there’s an issue I’ve noticed. When I stream a video as input and then ask questions about it with text queries, the model responds that it doesn’t have access to the video. But if I ask the same question using audio input instead of text, it gives the correct answer.
So the video does seem to be processed, but the model only responds properly to audio-based queries, not text-based ones. (This exact flow was working perfectly some time back.)
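In case it helps with reproduction: the only difference between my failing and working cases is the input modality. Sending the spoken question as realtime audio, roughly like the sketch below, gets a correct answer. (This again assumes the google-genai Python SDK; the file name is a placeholder, and the audio is raw 16-bit, 16 kHz mono PCM, which is the Live API’s expected input format.)

```python
from pathlib import Path

from google.genai import types


async def ask_by_audio(session, pcm_path="question.pcm"):
    # Send a spoken question as realtime audio input on an already-open
    # Live API session. Expects raw 16-bit, 16 kHz mono PCM audio.
    pcm_bytes = Path(pcm_path).read_bytes()
    await session.send_realtime_input(
        audio=types.Blob(data=pcm_bytes, mime_type="audio/pcm;rate=16000")
    )
```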
I was able to follow the steps you provided, and I can reproduce the issue you’re facing. I’ve also noticed that after a voice input has been sent once, the model can answer text questions about the video as well, but with text alone it doesn’t respond.
Thank you for reporting this issue. I’ll update you as soon as there is any progress on it.