For this, you have two options:
- Use client-side voice activity detection (VAD) on the audio you send, so that you can track the start and end points of speech yourself.
- Rely on the transcription input: measure the time between when the agent speaks and when you receive the transcription input from Gemini. You can also identify the data sent to Gemini.
Client-side VAD is viable and not very complex.
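To give a sense of how little code the first option needs, here is a minimal sketch of an energy-threshold VAD that records speech start and end points. It is only an illustration, not what Gemini does internally: the `EnergyVad` class, the `frame_rms` helper, the RMS threshold of 500, the 20 ms frame size, and the 10-frame hangover are all assumptions, and the input is assumed to be 16-bit little-endian mono PCM. A real application would probably want a trained VAD model instead of a fixed threshold.

```python
import math
import struct
from dataclasses import dataclass, field


def frame_rms(frame: bytes) -> float:
    """Return the RMS amplitude of a frame of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


@dataclass
class EnergyVad:
    """Hypothetical energy-threshold VAD; all defaults are illustrative guesses."""

    threshold: float = 500.0   # RMS level treated as speech (assumption)
    frame_ms: int = 20         # duration of each frame passed to process()
    hangover_frames: int = 10  # consecutive quiet frames required to end speech
    events: list = field(default_factory=list)
    _speaking: bool = False
    _silent_run: int = 0
    _frames_seen: int = 0

    def process(self, frame: bytes) -> None:
        """Feed one frame; append ("start", ms) or ("end", ms) to events."""
        t_ms = self._frames_seen * self.frame_ms
        self._frames_seen += 1

        if frame_rms(frame) >= self.threshold:
            self._silent_run = 0
            if not self._speaking:
                self._speaking = True
                self.events.append(("start", t_ms))
        elif self._speaking:
            self._silent_run += 1
            if self._silent_run >= self.hangover_frames:
                self._speaking = False
                # Report the end at the first quiet frame, not at the
                # frame where the hangover finally expired.
                end_ms = t_ms - (self.hangover_frames - 1) * self.frame_ms
                self.events.append(("end", end_ms))
```

You would call `process()` on each audio frame just before sending it to Gemini, then compare the recorded `"end"` timestamp with the time at which the transcription or the agent's response arrives.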