"Something went wrong." [ANSWER] Stream Realtime Multimodal Live API with Gemini 2.0 Constant Bug

TLDR is below. :smiling_face:

Hi there. This is my first post on the Developer Forum, so I’d like to start off by saying that I’m really happy to be here… yet I’m sad my first post needs to be about an issue that a lot of people have been dealing with.


“Something went wrong.” seems to be happening to a lot of people using the new visual multimodal chat with Gemini in AI Studio after chatting for a while (and it inevitably happens)… and honestly, Google still has not addressed it, yet I’m pretty sure I have diagnosed the issue.

If you open the “create prompt” section of the studio instead and talk to one of the 2.0 models long enough that the ~1 million token context cap is reached, it outright refuses to let you keep messaging the model. The cap is reached much more quickly when you input things like audio and video for the model to analyze and process, and the only way to free up tokens and keep messaging the model is to go back through the conversation history and delete both the AI’s responses and your own messages.
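For anyone who wants to watch this happen through the API rather than the studio UI, here is a minimal sketch using the google-generativeai Python SDK (the API key is a placeholder, and treating ~1M tokens as the window for 2.0 Flash is taken from its model docs):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash-exp")

chat = model.start_chat()
chat.send_message("Hello!")

# count_tokens accepts the accumulated chat history, so you can see how
# close the conversation is to the ~1M-token window before the next turn.
usage = model.count_tokens(chat.history)
print(usage.total_tokens)
```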

The obvious issue with this for most users is that the AI will no longer remember the deleted messages, since they are no longer stored within its token context window… but it does solve the problem of not being able to message the model.

What I’ve deduced by observing videos and conducting tests myself is that the Stream Realtime model tends to reach these caps fairly quickly (it uses 2.0 Flash, which only has a ~1 million token context window), but Google has not yet implemented a memory system that lets it keep the necessary contextual information while dropping older parts of the conversation history to free up tokens.
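Until Google builds something like that in, a client-side sliding window is the kind of mitigation I mean. A rough sketch, again with the google-generativeai SDK (the TOKEN_BUDGET figure and the trimmed helper are hypothetical choices of mine, not part of the SDK):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash-exp")

TOKEN_BUDGET = 900_000  # hypothetical safety margin under the ~1M window

def trimmed(history):
    """Drop the oldest user/model turn pair until the history fits the budget."""
    history = list(history)
    # Each count_tokens call is a network round trip; fine for a sketch.
    while len(history) > 2 and model.count_tokens(history).total_tokens > TOKEN_BUDGET:
        del history[:2]  # one user message + one model response
    return history

chat = model.start_chat()
# ... chat normally, then after each exchange rebuild the session with the
# pruned history so the next request stays under the cap:
chat = model.start_chat(history=trimmed(chat.history))
```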

Because there is no handling system for this, the limit gets hit in one of two ways: either the model generates a response that fills up the remaining tokens, or the user progressively does the same. In the first case, the stream runs into the error as soon as the model sends its final message; in the second, the stream errors out in the middle of the user audibly speaking their next prompt (since tokens are added progressively with every word, not just when the user sends the message). Either way, there is no error handling at the limit, so the stream just ends abruptly, with no feature to relieve the chat of tokens.
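If this diagnosis is right, developers driving the Live API directly can at least catch the drop instead of watching it surface as a generic crash. A rough sketch, assuming the google-genai SDK’s Live API interface from its launch docs (the blanket except and the logging are my own workaround, not an official pattern):

```python
import asyncio
from google import genai  # the newer google-genai SDK that exposes the Live API

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder

async def main():
    config = {"response_modalities": ["TEXT"]}
    try:
        async with client.aio.live.connect(
            model="gemini-2.0-flash-exp", config=config
        ) as session:
            await session.send(input="Hello!", end_of_turn=True)
            async for response in session.receive():
                if response.text:
                    print(response.text, end="")
    except Exception as exc:
        # The studio UI surfaces this moment as "Something went wrong." with
        # no detail; over the raw connection you can at least log the reason
        # and decide to reconnect with a trimmed context.
        print(f"\nStream ended abruptly: {exc}")

asyncio.run(main())
```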

If possible, this should be pushed to Google to get fixed, so that if this IS the cause, they can at least be transparent about the reason; then it would feel less like a bug and more like an intentional limiter. It would be less frustrating, especially if we could at least resolve the issue temporarily by deleting old conversation history… or if THEY implemented a change like that.

I know Project Astra has features like that (roughly 10 minutes of accurate in-session memory, plus the long-term space allocation I just mentioned for keeping core details when there isn’t room to remember everything), but we’re not there yet… and it would be really nice for developers if this were just a default rather than a hassle of a bug, since most users are casually chatting with the bot and aren’t trying to program an entire application just to mediate the issue themselves.

Hope this helps! Please fix it, Google; it’s a really nice feature.


TLDR: Stream Realtime crashes because the model runs into its token limit, and Google has not implemented an internal system to manage the context, so instead of running continuously the stream abruptly forces the user to end the conversation.



Hi @Echo, welcome to the forum!!!

Thanks for the detailed analysis. The Multimodal Live API has some limitations, which you can refer to in this doc.
Since 2.0-flash-exp is an experimental model, and experimental models exist to gather feedback, constructive feedback like this helps our team release new versions.

Thanks.