Great advice from @icapora above. We do exactly that (manual VAD with automaticActivityDetection: { disabled: true }). Here are a few additional things we learned shipping this to production that might help with your Turn 2 issue:
Don’t send audio between activityEnd and the next activityStart.
This was our biggest lesson. After sending activityEnd, we suppress all outbound audio until the next activityStart. We observed 1007 (precondition failed) disconnects when audio leaked through in that gap. In our implementation, the mic hardware stays running but we short-circuit the send callback with a flag.
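Here is a minimal Python sketch of that gating idea, in case it helps. It is not our production code: `GatedAudioSender`, the `send` callback, and the `(kind, payload)` tuples are hypothetical stand-ins for whatever your client uses to write to the socket. The point is only that the flag flips in the same critical section that sends the activity markers, so a chunk arriving from the mic thread cannot slip out after activityEnd.

```python
import threading
from typing import Callable, Tuple

Message = Tuple[str, bytes]  # hypothetical (kind, payload) shape, not the real wire format


class GatedAudioSender:
    """Forwards mic chunks only while a user turn is open."""

    def __init__(self, send: Callable[[Message], None]) -> None:
        self._send = send              # hypothetical transport callback
        self._lock = threading.Lock()  # the mic callback may run on another thread
        self._turn_open = False

    def start_turn(self) -> None:
        with self._lock:
            self._send(("activity_start", b""))
            self._turn_open = True

    def end_turn(self) -> None:
        with self._lock:
            if not self._turn_open:
                return
            # Close the gate before signalling the end of the turn.
            self._turn_open = False
            self._send(("activity_end", b""))

    def on_mic_chunk(self, pcm: bytes) -> bool:
        """Called for every captured chunk; returns True if it was sent."""
        with self._lock:
            if not self._turn_open:
                return False  # dropped: we are between turns
            self._send(("audio", pcm))
            return True
```

The mic stream keeps running the whole time; chunks that arrive between turns are simply dropped instead of being sent.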
Echo cancellation matters more than you’d expect.
The default Live API behavior is START_OF_ACTIVITY_INTERRUPTS: if the model’s own audio output leaks back through the mic, it can trigger a barge-in and confuse the session state. We enable echoCancellation: true, noiseSuppression: true, and autoGainControl: true in our audio capture constraints. With earphones this is less of an issue, but it’s worth checking that your PyAudio setup isn’t feeding playback audio back into the input stream.
Session resumption for handling 1011s.
We still see occasional 1011 disconnects on 3.1. They are less frequent than on 2.5, but not zero. Our strategy is to auto-reconnect using session resumption tokens (delivered in the sessionResumptionUpdate messages that Gemini sends during the session), retrying with exponential backoff: 1s, 2s, 4s. For your Python client, store the newHandle from these messages and reconnect with it if the session drops.
See: https://ai.google.dev/gemini-api/docs/live-api/session-management
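A rough sketch of the reconnect loop, under some assumptions: `run_session` is a hypothetical function you would write that opens the connection (passing the stored handle if there is one), calls `on_new_handle` whenever a resumption update arrives, and raises `ConnectionError` when the socket drops. Resetting the retry counter whenever a fresh handle arrives is my own simplification for the example. None of the names below come from the SDK.

```python
import time
from typing import Callable, Optional

BACKOFF_SECONDS = (1.0, 2.0, 4.0)  # the delays mentioned above

# Hypothetical signature: run_session(handle, on_new_handle) -> None
RunSession = Callable[[Optional[str], Callable[[str], None]], None]


class ResumableSession:
    def __init__(self, run_session: RunSession,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._run_session = run_session
        self._sleep = sleep
        self._failures = 0
        self.handle: Optional[str] = None  # latest resumption handle seen

    def remember_handle(self, new_handle: str) -> None:
        """Store the newest handle; a fresh handle means the session is healthy."""
        if new_handle:
            self.handle = new_handle
            self._failures = 0

    def run(self) -> bool:
        """Returns True on a clean end, False after exhausting the retries."""
        while True:
            try:
                self._run_session(self.handle, self.remember_handle)
                return True
            except ConnectionError:
                if self._failures >= len(BACKOFF_SECONDS):
                    return False
                self._sleep(BACKOFF_SECONDS[self._failures])
                self._failures += 1
```

Injecting `sleep` is just there so the loop can be exercised in a test without actually waiting.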
One 3.1-specific gotcha: don’t use periodic flush.
If you’re thinking of sending periodic activityEnd/activityStart to keep the session alive, don’t do it on 3.1. The model interprets activityEnd as “user is done speaking” and responds prematurely, cutting off longer speech. This worked fine on 2.5 but breaks conversations on 3.1.
For your specific Turn 2 issue: I’d check whether your receive loop is truly running concurrently (as icapora mentioned) and whether any audio is being sent in the gap between turns. Those two things together cause the exact pattern you’re describing.
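To illustrate what “truly concurrent” means here, this is a small asyncio sketch, assuming your client can be reduced to an async `send` coroutine and an async-generator `receive`. Both are hypothetical placeholders, as are `run_duplex` and the `None` sentinel. The receive loop runs as its own task for the whole session rather than being awaited once per turn.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional


async def send_loop(outbound: "asyncio.Queue[Optional[bytes]]",
                    send: Callable[[bytes], Awaitable[None]]) -> None:
    """Drains queued outbound chunks until a None sentinel arrives."""
    while True:
        chunk = await outbound.get()
        if chunk is None:
            return
        await send(chunk)


async def receive_loop(receive: Callable[[], AsyncIterator[bytes]],
                       handle: Callable[[bytes], None]) -> None:
    """Consumes server messages for the whole session, not just one turn."""
    async for message in receive():
        handle(message)


async def run_duplex(outbound: "asyncio.Queue[Optional[bytes]]",
                     send: Callable[[bytes], Awaitable[None]],
                     receive: Callable[[], AsyncIterator[bytes]],
                     handle: Callable[[bytes], None]) -> None:
    # Both loops are scheduled together, so neither one blocks the other.
    await asyncio.gather(send_loop(outbound, send),
                         receive_loop(receive, handle))
```

If your current code does something like “send a turn, then read until turn complete, then send the next turn” in a single coroutine, that is the pattern this replaces.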