Hi everyone!
I'm running into some trouble with the Gemini Live models (specifically gemini-2.5-flash-live) due to the way they handle their built-in context window. Basically, my setup is the following: I'm building an in-call AI assistant that uses Gemini Live to produce realtime answers to the questions the client asks. It could be the only AI engine in the system, but the problem appears when the Gemini Live model produces erratic answers, or outputs that are not aligned with the instructions it received in its system prompt.
To mitigate that, I have implemented a two-websocket approach. My microservice uses one websocket for the assistant and one for a second Gemini Live session acting as an evaluator. The evaluator takes the client's audio input and the assistant's output transcription (text) and has to judge whether the assistant's response was good or not. Note that the first websocket's RESPONSE_MODALITIES is audio and the second websocket's RESPONSE_MODALITIES is text (because the evaluator has to respond with JSON following a schema).
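In case it helps, here is a rough sketch of the two session configs. I'm using plain dicts here to stand in for the LiveConnectConfig I actually pass to the google-genai SDK, and the system instructions are shortened, so treat the exact shape as illustrative:

```python
# Simplified shape of the two configs for client.aio.live.connect()
# (google-genai SDK). Only the fields relevant to my problem are shown.

ASSISTANT_CONFIG = {
    "response_modalities": ["AUDIO"],  # the assistant speaks to the caller
    "system_instruction": "You are an in-call assistant. Help the client...",
}

EVALUATOR_CONFIG = {
    "response_modalities": ["TEXT"],  # the evaluator must answer with JSON text
    "system_instruction": (
        "You are an evaluator. Given the client's audio and the assistant's "
        'transcribed answer, return JSON: {"correct": <bool>, "reason": <str>}.'
    ),
}

# Both sessions are opened roughly like this (async, one websocket each):
#
#   async with client.aio.live.connect(model="gemini-2.5-flash-live",
#                                      config=ASSISTANT_CONFIG) as assistant:
#       ...
#   async with client.aio.live.connect(model="gemini-2.5-flash-live",
#                                      config=EVALUATOR_CONFIG) as evaluator:
#       ...
```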
The problem I'm getting is the following. The assistant Gemini Live model receives the system instruction (text) plus the client's audio input, and it responds properly (not always, but mostly). The evaluator Gemini Live session, however, is not performing as expected. It receives the system instruction (text), the client's audio input, and the assistant's output (text), but it produces JSON evaluations for previous turns. For example, when the conversation starts I can say "Hello, help me find a laptop on the internet"; the assistant gives me step-by-step instructions to achieve my goal (correct), and the evaluator evaluates it properly, maybe outputting {"correct": true, "reason": "The client asked for a laptop on the internet and the assistant correctly guided them step by step..."}. But some iterations later, when the client says "Now, I want you to tell me who Lionel Messi is.", the assistant produces "Yes! Lionel Messi is a really famous footballer..." and the evaluator responds incorrectly with {"correct": true, "reason": "As the client requested a laptop on the internet, the assistant did what it had to and correctly guided them step by step..."}.
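To make the failure concrete, this is a small heuristic I hacked together on my side (my own workaround code, nothing from the SDK) to flag evaluations whose "reason" shares no content words with the current client request, which is a hint the evaluation refers to an earlier turn:

```python
import json


def evaluation_is_stale(eval_text: str, current_request: str) -> bool:
    """Heuristic: flag an evaluation whose 'reason' shares no content words
    (longer than 3 chars) with the current client request."""
    data = json.loads(eval_text)
    reason_words = set(data["reason"].lower().split())
    request_words = {w for w in current_request.lower().split() if len(w) > 3}
    return not (reason_words & request_words)


# The stale evaluation from my example above is caught:
stale = evaluation_is_stale(
    '{"correct": true, "reason": "As the client requested a laptop..."}',
    "Now, I want you to tell me who Lionel Messi is.",
)  # stale == True
```

Of course this only detects the symptom after the fact; it doesn't explain why the evaluator is stuck on the old turn.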
As you can see, it seems the evaluator model is latching onto earlier context, and I don't know whether the problem comes from (1) the way I'm sending the inputs to the evaluator or (2) the way Gemini Live manages its context window. One approach I think would work is clearing the model's context, but as far as I can tell that cannot be done with this kind of model, can it? I'd really appreciate any questions that help you understand my use case better, as well as proposed solutions.
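Since I haven't found an explicit "clear context" call in the Live API, one workaround I'm considering is reopening the evaluator session per turn so it only ever sees the current exchange, and packaging that exchange as a single explicitly labeled turn. A sketch of the message builder (the payload shape loosely follows what I pass to send_client_content; the field names and labels are my assumption, not an official schema):

```python
def build_evaluator_turn(client_transcript: str, assistant_transcript: str) -> dict:
    """Package ONLY the current exchange as one user turn for the evaluator,
    so a freshly opened session has nothing older to latch onto."""
    return {
        "turns": [{
            "role": "user",
            "parts": [{
                "text": (
                    "CLIENT SAID: " + client_transcript + "\n"
                    "ASSISTANT ANSWERED: " + assistant_transcript + "\n"
                    "Evaluate ONLY this exchange."
                )
            }],
        }],
        "turn_complete": True,
    }


# Per turn, roughly:
#   async with client.aio.live.connect(model=..., config=EVALUATOR_CONFIG) as ev:
#       await ev.send_client_content(**build_evaluator_turn(client_txt, asst_txt))
```

Reconnecting per turn obviously adds latency, so I'd prefer a way to scope or reset the context within one session if that exists.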
Thanks in advance!