Multimodal Live API - first interaction issues, no response, hallucination, etc.

Hey guys,

I’ve built an iOS app with the Gemini Multimodal Live API that uses video + audio input. I’ve been having a persistent issue for about a month that I can’t solve. When the user connects to the bot, the first interaction is off about 50% of the time. Sometimes the bot never responds; sometimes it interrupts itself or hallucinates a user response. It may be an internal Gemini VAD issue, but I’m not sure.

Here’s an example from the server logs of it interrupting itself:


2025-03-08T15:44:10.004 app[d891152c70d638] ord [info] 2025-03-08 15:44:10.003 | DEBUG | pipecat.transports.base_input:_handle_interruptions:124 - User started speaking

2025-03-08T15:44:11.645 app[d891152c70d638] ord [info] 2025-03-08 15:44:11.644 | DEBUG | pipecat.transports.base_input:_handle_interruptions:131 - User stopped speaking

2025-03-08T15:44:12.290 app[d891152c70d638] ord [info] 2025-03-08 15:44:12.289 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_user_audio:270 - [Transcription:user] Hey, what's in this mug?

2025-03-08T15:44:13.291 app[d891152c70d638] ord [info] 2025-03-08 15:44:13.291 | DEBUG | pipecat.transports.base_output:_bot_started_speaking:203 - Bot started speaking

2025-03-08T15:44:14.584 app[d891152c70d638] ord [info] 2025-03-08 15:44:14.583 | DEBUG | pipecat.transports.base_input:_handle_interruptions:124 - User started speaking

2025-03-08T15:44:14.585 app[d891152c70d638] ord [info] 2025-03-08 15:44:14.584 | DEBUG | pipecat.transports.base_output:_bot_stopped_speaking:210 - Bot stopped speaking

2025-03-08T15:44:15.224 app[d891152c70d638] ord [info] 2025-03-08 15:44:15.223 | DEBUG | pipecat.transports.base_input:_handle_interruptions:131 - User stopped speaking

2025-03-08T15:44:15.664 app[d891152c70d638] ord [info] 2025-03-08 15:44:15.664 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_user_audio:270 - [Transcription:user] Earl Grey tea.

2025-03-08T15:44:19.688 app[d891152c70d638] ord [info] 2025-03-08 15:44:19.686 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_model_audio:278 - [Transcription:model] Based on the color, it looks like there is coffee in the mug. Would you like me to search for any coffee recipes?


In that example, the bot started speaking, interrupted itself, and then went silent. However, the output and transcript still came through, twice, with different outputs. Neither was spoken.
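For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that flags a "User started speaking" event logged shortly after "Bot started speaking". It assumes the log line layout shown in the excerpt above. The function name `find_self_interruptions` and the 2-second window are hypothetical choices of mine, not anything from pipecat or Gemini. It only flags candidates and cannot tell real user speech from a false VAD trigger.

```python
import re
from datetime import datetime

# Assumed layout (taken from the excerpt above):
#   ... [info] <YYYY-MM-DD HH:MM:SS.mmm> | <LEVEL> | <module:function:line> - <message>
LINE_RE = re.compile(
    r"\[info\] (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\| \w+\s*\| \S+ - (?P<msg>.*)$"
)


def find_self_interruptions(lines, window_s=2.0):
    """Return (bot_start, user_start) datetime pairs where the user is
    reported as starting to speak within window_s seconds of the bot
    starting to speak, before any 'Bot stopped speaking' line.

    window_s is an arbitrary threshold chosen for illustration.
    """
    hits = []
    bot_started = None  # timestamp of the current bot utterance, if any
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        msg = match["msg"].strip()
        if msg == "Bot started speaking":
            bot_started = ts
        elif msg == "Bot stopped speaking":
            bot_started = None
        elif msg == "User started speaking" and bot_started is not None:
            if (ts - bot_started).total_seconds() <= window_s:
                hits.append((bot_started, ts))
    return hits
```

Run against the excerpt above, this should flag the "User started speaking" line at 15:44:14.583, roughly 1.3 seconds after the bot started. Counting hits across many sessions might help show whether the problem really clusters in the first interaction.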

Anybody have a similar problem? Still not solved.