Gemini Live API: acoustic echo cancellation needed for SIP telephony deployments

Description:

When using the Gemini Live API in SIP telephony integrations (e.g. over Zadarma, Twilio, or other B2BUA carriers), the carrier reflects the model’s outbound audio back into the inbound RTP stream. Gemini Live’s VAD treats this echo as user speech and fires inputTranscription and serverContent: interrupted events, so the model interrupts itself mid-sentence.

Observed behavior:

  • inputTranscription fires with the model’s own words (e.g. “I’m sorry,” / “How can”)

  • serverContent: {"interrupted": true} immediately follows

  • The model re-generates from where it was interrupted, creating split conversation turns

  • Conversation transcripts show the model’s words attributed to the user

Current workaround: setting prefixPaddingMs: 200 in the session config filters out echoes shorter than 200 ms. However, multi-syllable phrases (longer than 200 ms) still pass through.
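
For reference, a minimal sketch of where this setting might sit in the session config. The nesting under realtimeInputConfig.automaticActivityDetection is an assumption based on the public Live API documentation, since the post only says "in the session config". buildLiveConfig is a hypothetical helper name.

```typescript
// Hypothetical helper that builds the VAD-related part of a Live session config.
// Assumption: prefixPaddingMs lives under
// realtimeInputConfig.automaticActivityDetection.
function buildLiveConfig(prefixPaddingMs: number) {
  return {
    realtimeInputConfig: {
      automaticActivityDetection: {
        // Speech must last at least this long before start-of-speech is committed.
        prefixPaddingMs,
      },
    },
  };
}

// The workaround value described above.
const liveConfig = buildLiveConfig(200);
console.log(JSON.stringify(liveConfig));
```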

Requested feature: Acoustic echo cancellation (AEC) support in the Gemini Live API input audio pipeline, so that audio frames matching recently sent output are suppressed before VAD processing. Alternatively, a higher prefixPaddingMs value (e.g. 500 ms) or a separate echoSuppressionMs parameter would help.
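
Until something like this exists server-side, a rough approximation of what echoSuppressionMs could mean can be sketched on the Node.js side as a half-duplex gate: inbound frames are dropped while model audio is estimated to be playing, plus a tail window. This is not real AEC (no signal matching is done), and EchoGate, ulawDurationMs, and tailMs are hypothetical names. The duration math assumes G.711 at 8 kHz, i.e. one byte per sample and 8 bytes per millisecond.

```typescript
// G.711 at 8 kHz carries one byte per sample, so 8 bytes equal 1 ms of audio.
function ulawDurationMs(byteLength: number): number {
  return byteLength / 8;
}

// Hypothetical half-duplex gate. It tracks when queued model audio should
// finish playing and blocks inbound audio until tailMs after that point.
class EchoGate {
  private playbackEndMs = 0;

  constructor(
    private readonly tailMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Call for every model audio chunk sent toward the carrier.
  // Chunks queue up behind each other, so durations accumulate.
  noteOutbound(durationMs: number): void {
    this.playbackEndMs = Math.max(this.now(), this.playbackEndMs) + durationMs;
  }

  // Call when queued model audio is flushed (e.g. after a real interruption).
  reset(): void {
    this.playbackEndMs = 0;
  }

  // True when an inbound RTP frame should be forwarded to the Live session.
  shouldForward(): boolean {
    return this.now() >= this.playbackEndMs + this.tailMs;
  }
}

// Usage with a fake clock so the behavior is easy to follow.
let clockMs = 1000;
const gate = new EchoGate(300, () => clockMs);
gate.noteOutbound(ulawDurationMs(160)); // one 20 ms frame
clockMs = 1100;
console.log(gate.shouldForward()); // still inside the tail window
clockMs = 1320;
console.log(gate.shouldForward()); // tail window has passed
```

The obvious cost is that the caller cannot barge in while the model is speaking, which is why suppression inside the API, where the sent and received audio can actually be compared, would be preferable.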

Environment: Node.js server, G.711 μ-law 8kHz, SIP via Zadarma B2BUA, gemini-3.1-flash-live-preview
