I am currently working with the https://github.com/google-gemini/multimodal-live-api-web-console repository.
The ClientContentMessage type allows sending text data to the API. Setting its turnComplete flag to false signals to the API that the client's turn is not finished, so the data can be sent without the model generating a response to that message (useful for sending contextual data).
After sending this ClientContentMessage, I would like to continue the voice conversation using RealtimeInputMessage messages for the audio data as usual.
I just want to have the voice conversation and pipe some contextual data in between, without the model explicitly responding to the contextual data.
Expected behaviour: After a ClientContentMessage with turnComplete set to false, the model does NOT respond, but I can send follow-up RealtimeInputMessages to which the model will respond.
Actual behaviour: After a ClientContentMessage with turnComplete set to false, the model does NOT respond, and it also does NOT respond to follow-up RealtimeInputMessages carrying subsequent voice data.
I suspect the server needs a signal that the turn is complete after it was set to false, but RealtimeInputMessage currently does not provide such a field, so the only way to end the turn is another ClientContentMessage with turnComplete: true. That is not compatible with a subsequent voice conversation, because the voice data is sent as a stream over the websocket and the server itself needs to determine the end of speech.
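For reference, here is a minimal sketch of the two raw wire messages involved, based on the message shapes in the repo's type definitions (the base64 audio payload is a placeholder):

```typescript
// Sketch of the two raw JSON messages sent over the websocket, based on the
// message shapes in the repo's type definitions (ClientContentMessage and
// RealtimeInputMessage). The base64 audio payload is a placeholder.

// 1) Contextual text with the turn left open, so the model should not respond yet.
const contextMessage = {
  clientContent: {
    turns: [{ role: "user", parts: [{ text: "some contextual data" }] }],
    turnComplete: false, // signal: the client is not finished
  },
};

// 2) Follow-up voice data. Note that RealtimeInputMessage carries only media
// chunks; there is no field here that could close the still-open turn.
const audioMessage = {
  realtimeInput: {
    mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: "<base64 chunk>" }],
  },
};

console.log(JSON.stringify(contextMessage));
console.log(JSON.stringify(audioMessage));
```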
Here’s one possible approach for handling mixed text and voice input:
// First send context without completing the turn
client.send([{ text: contextualData }], false);

// Then send a special marker message, still without completing the turn
client.send([{ text: "[CONTEXT_COMPLETE]" }], false);

// Now stream voice data as usual; sendRealtimeInput wraps the chunks
// in a RealtimeInputMessage
client.sendRealtimeInput([
  { mimeType: "audio/pcm;rate=16000", data: audioData },
]);
Consider submitting a feature request to add a turnComplete flag to RealtimeInputMessage. This would enable seamless transitions between text context and voice streaming.
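To make the feature request concrete, a hypothetical shape for the extended message could look like this (the turnComplete field does NOT exist in the current RealtimeInputMessage; everything below is illustrative only):

```typescript
// Hypothetical only: what RealtimeInputMessage could look like if the feature
// request were accepted. The optional turnComplete field does NOT exist in the
// current API; the type below is purely an illustration for the request.
type GenerativeContentBlob = { mimeType: string; data: string };

type RealtimeInputMessageProposed = {
  realtimeInput: {
    mediaChunks: GenerativeContentBlob[];
    turnComplete?: boolean; // proposed addition
  };
};

// With such a flag, the client could close the dangling text turn together
// with a chunk of audio (though deciding *which* chunk ends the turn would
// require client-side end-of-speech detection, as noted below):
const chunk: RealtimeInputMessageProposed = {
  realtimeInput: {
    mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: "<base64>" }],
    turnComplete: true,
  },
};

console.log(JSON.stringify(chunk));
```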
I fear a problem with adding a turnComplete flag to RealtimeInputMessage is that it would require VAD to be handled on the client side instead of the server side, in order to pinpoint the last RealtimeInputMessage on which to set the flag.
/**
 * send normal content parts such as { text }
 */
send(parts: Part | Part[], turnComplete: boolean = true) {
  parts = Array.isArray(parts) ? parts : [parts];
  const content: Content = {
    role: "user",
    parts,
  };

  const clientContentRequest: ClientContentMessage = {
    clientContent: {
      turns: [content],
      turnComplete,
    },
  };

  this._sendDirect(clientContentRequest);
  this.log(`client.send`, clientContentRequest);
}
/**
 * Sends a context completion marker message
 */
sendContextComplete() {
  const contextMarker: Part = {
    text: "[CONTEXT_COMPLETE]",
  };
  const content: Content = {
    role: "user",
    parts: [contextMarker],
  };

  const clientContentRequest: ClientContentMessage = {
    clientContent: {
      turns: [content],
      turnComplete: false,
    },
  };

  this._sendDirect(clientContentRequest);
  this.log(`client.contextComplete`, clientContentRequest);
}
// Send the text input to Gemini with full context
client.send([{
  text: `context....`
}], false);

client.sendContextComplete();
I tried implementing it like this, matching the format of the existing ClientContentMessage send() function, but it did not make subsequent RealtimeInputMessages trigger a response for me.
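As an untested stopgap idea of my own (not something from the repo): instead of leaving the turn open, complete the context turn normally with turnComplete: true and rely on a system instruction to keep the model from answering context messages out loud. How reliably the model obeys such an instruction is not guaranteed:

```typescript
// Untested workaround sketch: send context as a normal, completed turn, but
// instruct the model up front to ignore bracketed context messages. Whether
// the model actually stays silent depends on instruction following, not on
// the protocol, so this is only a fallback.

// Somewhere in the session setup (the exact config shape depends on the
// repo's setup message):
const systemInstruction = {
  parts: [
    {
      text:
        "Messages wrapped in [CONTEXT]...[/CONTEXT] are background " +
        "information only. Never respond to them; wait for the user's voice.",
    },
  ],
};

// Context would then be sent as a completed turn, e.g.:
//   client.send([{ text: "[CONTEXT] ...contextual data... [/CONTEXT]" }], true);

console.log(systemInstruction.parts[0].text);
```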