`turnComplete` flag set to `false` in `ClientContentMessage` of Multimodal Live API prevents processing of subsequent `RealtimeInputMessage`

I am currently working with the https://github.com/google-gemini/multimodal-live-api-web-console repository.

The ClientContentMessage type allows sending text data to the API. Setting its turnComplete flag to false signals to the API that the client's turn is not finished, so the data can be sent without the model generating a response to that message (useful for sending contextual data).
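For reference, a sketch of the JSON payload such a message produces. The field names follow the ClientContentMessage type in the web-console repo; the text content is purely illustrative:

```typescript
// Sketch of a context-only ClientContentMessage payload (field names as in
// the web-console repo's ClientContentMessage type; text is illustrative).
const contextMessage = {
  clientContent: {
    turns: [
      {
        role: "user",
        parts: [{ text: "Background context the model should know." }],
      },
    ],
    // false = the client's turn is not finished; the model should not
    // start generating a response yet.
    turnComplete: false,
  },
};

console.log(JSON.stringify(contextMessage));
```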

After sending this ClientContentMessage I would like to continue the voice conversation using the RealtimeInputMessage type message for the audio data as usual.

I just want to have the voice conversation and pipe some contextual data in between without the model explicitly responding to the contextual data.

Expected behaviour: After a ClientContentMessage with turnComplete set to false the model does NOT respond but I can send follow-up RealtimeInputMessages to which the model will respond.

Actual behaviour: After a ClientContentMessage with turnComplete set to false the model does NOT respond and it does NOT respond to follow-up RealtimeInputMessages with subsequent voice data.

I suspect the server needs a signal that the turn is complete after it was set to false, but RealtimeInputMessage currently does not provide such a field, so I can only end the turn with another ClientContentMessage with turnComplete: true. But this is not compatible with a subsequent voice conversation, because the voice data is sent as a stream over the websocket and the server needs to determine the end of speech itself.
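To make the dilemma concrete: the only turn-closing signal available today is a second ClientContentMessage with turnComplete set to true, which unfortunately also prompts the model to respond to everything buffered so far. A sketch (whether the API accepts an empty text part is untested; a short placeholder may be needed):

```typescript
// A second ClientContentMessage is currently the only way to close the
// open turn. Sending it causes the model to respond to the buffered
// content, including the contextual data we wanted it to ignore.
const closeTurn = {
  clientContent: {
    turns: [{ role: "user", parts: [{ text: "" }] }],
    turnComplete: true,
  },
};

console.log(JSON.stringify(closeTurn));
```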

Here's a practical approach for handling mixed text and voice input. Note that the client in the repo does not take a raw message object; its send(parts, turnComplete) method wraps the parts into a ClientContentMessage, and audio goes through sendRealtimeInput(chunks):

// First send context without completing the turn
client.send([{ text: contextualData }], false);

// Then send a special marker message, still leaving the turn open
client.send([{ text: "[CONTEXT_COMPLETE]" }], false);

// Now stream voice data as usual (base64-encoded PCM chunks)
client.sendRealtimeInput([
  { mimeType: "audio/pcm;rate=16000", data: base64AudioChunk },
]);

Consider submitting a feature request to add a turnComplete flag to RealtimeInputMessage. This would enable seamless transitions between text context and voice streaming.


I fear a problem with adding a turnComplete flag to RealtimeInputMessage: it would require voice activity detection (VAD) to be handled on the client side instead of the server side, in order to pinpoint the last RealtimeInputMessage on which to set the flag.
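For what client-side VAD could look like, here is a minimal sketch assuming a simple RMS-energy threshold; isSilence and the threshold value are my own illustrative names, not part of the repo, and a real implementation would need hangover frames and noise-floor adaptation:

```typescript
// Minimal client-side VAD sketch (assumption: plain RMS-energy gating).
// Returns true when a chunk's RMS falls below a silence threshold, so the
// caller could flag the next RealtimeInputMessage as the last of the turn.
function isSilence(samples: Float32Array, threshold = 0.01): boolean {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms < threshold;
}

// Example: a near-silent chunk vs. a loud chunk.
const quiet = new Float32Array(1024).fill(0.001);
const loud = new Float32Array(1024).fill(0.5);
console.log(isSilence(quiet)); // true
console.log(isSilence(loud));  // false
```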

Thanks for the response

  /**
   * send normal content parts such as { text }
   */
  send(parts: Part | Part[], turnComplete: boolean = true) {
    parts = Array.isArray(parts) ? parts : [parts];
    const content: Content = {
      role: "user",
      parts,
    };

    const clientContentRequest: ClientContentMessage = {
      clientContent: {
        turns: [content],
        turnComplete: turnComplete,
      },
    };

    this._sendDirect(clientContentRequest);
    this.log(`client.send`, clientContentRequest);
  }

  /**
   * Sends a context completion marker message
   */
  sendContextComplete() {
    const contextMarker: Part = {
      text: "[CONTEXT_COMPLETE]"
    };

    const content: Content = {
      role: "user",
      parts: [contextMarker]
    };

    const clientContentRequest: ClientContentMessage = {
      clientContent: {
        turns: [content],
        turnComplete: false
      }
    };

    this._sendDirect(clientContentRequest);
    this.log(`client.contextComplete`, clientContentRequest);
  }


    // Send the text input to Gemini with full context
    client.send([{ 
      text: `context....`
    }], false);

    client.sendContextComplete();
      

I tried implementing it like this to fit the ClientContentMessage send() function format, but it did not make subsequent RealtimeInputMessages trigger a response for me.