Gemini Live Not Responding Correctly to Text

I’ve been building a prototype app that uses the Gemini multimodal live API. The use case involves feeding in video input and then asking text-based questions about the video.

It was working perfectly fine earlier, but now I’ve noticed an issue. Since the model changed from gemini-2.0-flash-experimental to gemini-2.0-flash-live-001, the API no longer responds to text questions about the video. It only seems to respond to audio-based queries now.

Is this a configuration issue on my end, or has something changed with the API behavior? I’d really appreciate your help.


Hi @Parivesh, welcome to the forum!

What error is it throwing? I don’t think anything has changed with the model; it still supports video input, as mentioned in the docs.


Hi @Govind_Keshari. Yes, it does support video input. However, there’s an issue I’ve noticed. When I stream a video as input and then ask questions about it using text queries, it responds saying it doesn’t have access to the video. But if I ask the same question using audio input instead of text, it provides the correct answer.

So it seems the video is being processed, but the system only responds properly to audio-based queries, not text-based ones. (This exact flow was working perfectly some time back.)

Refer to this image:

Are you following any doc, or can you share your code if possible so I can repro the issue on my side? Either would be helpful.

Hi @Govind_Keshari. Here are the steps. Please let me know if something else is needed.

:link: Reference Project

Repository: google-gemini/live-api-web-console


:hammer_and_wrench: Setup Instructions

  1. Install Node.js
    Make sure Node.js is installed. If not, download and install it.

  2. Clone the Repository and Install Dependencies

    git clone https://github.com/google-gemini/live-api-web-console.git
    cd live-api-web-console
    npm install
    
  3. Set API Key
    In the .env file in the project directory, add your Gemini API key:

    REACT_APP_GEMINI_API_KEY="your-gemini-api-key"
    
  4. Start the Application

    npm start
    

:test_tube: Steps to Reproduce

  1. Open your browser and go to:
    http://localhost:3000

  2. Click the “Stream” button.

  3. Enable both Video and Audio.

  4. On the left side of the screen, in the message input box, type a question related to what is visible in the video.

    • The model will respond: “I don’t know”.

  5. Now, ask the same question aloud using your voice (with audio enabled).

    • The model will respond correctly this time.
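For anyone digging into this: the symptom lines up with the split between the two client message types in the Live API’s WebSocket protocol. A typed question is sent as a complete `clientContent` turn, while microphone audio streams out as `realtimeInput` media chunks, so the two input paths can be handled differently server-side. Here is a minimal sketch of the two payload shapes; the field names reflect my reading of the Live API docs and should be treated as assumptions, not a tested client:

```python
import json

def text_turn(question: str) -> str:
    """Build a clientContent message: one complete text turn from the user."""
    return json.dumps({
        "clientContent": {
            "turns": [{"role": "user", "parts": [{"text": question}]}],
            "turnComplete": True,  # marks the turn as finished so the model replies
        }
    })

def audio_chunk(pcm_base64: str) -> str:
    """Build a realtimeInput message: one streamed chunk of base64 PCM audio."""
    return json.dumps({
        "realtimeInput": {
            "mediaChunks": [
                {"mimeType": "audio/pcm;rate=16000", "data": pcm_base64}
            ],
        }
    })

print(text_turn("What is visible in the video right now?"))
```

If the regression is server-side, the interesting question is why a `clientContent` turn loses access to the streamed video context while a `realtimeInput` audio turn keeps it.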

Same experience here using Get_started_LiveAPI.py and the Daily integration. Definitely something new.

Hi @Parivesh,

I was able to follow the steps you provided, and I can reproduce the issue you are facing. I also noticed that after giving a voice input once, the model can answer text questions about the video as well, but with text alone it does not take the input.

Thank you for raising this issue. I’ll update you as soon as there is any news on it.
