Live API Hangs When Using System Prompt with Audio-Only Response Modality

Hi everyone,

I’m running into an issue with the Live API (using the gemini-2.0-flash-exp model) where it hangs when I include a system prompt, but works fine without one. I’m hoping someone can shed light on whether this is expected behavior, a bug, or if I’m configuring something incorrectly.

What I’m Trying to Do

I’m building an audio-to-audio translation service that takes English audio input and returns Egyptian Arabic audio output. My goal is to set a system instruction like “You are a translator” to guide the model’s behavior.

Setup

  • Model: gemini-2.0-flash-exp
  • Config: LiveConnectConfig with response_modalities=["AUDIO"] and a speech_config for output voice.
  • Input: Mono, 16kHz, 16-bit PCM audio (verified to work without the prompt).
  • Code: Using the Python async client (client.aio.live.connect).
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
        system_instruction=types.Content(
            parts=[types.Part.from_text(
                text="You are a translation engine. Your sole purpose is to translate between English and Egyptian Arabic (Egyptian dialect). Do not add any explanations or conversation."
            )],
            role="user"
        )
    )

    async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
        await session.send(input={"data": raw_audio, "mime_type": "audio/pcm"}, end_of_turn=True)
        async for response in session.receive():
            ...  # response handling omitted
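
To rule out the input format as a factor, the mono / 16 kHz / 16-bit claim from the Setup list can be checked with the standard-library wave module. This is only a sketch and assumes the audio starts out as a WAV file before the raw PCM is extracted; describe_wav, is_expected_input, and make_silent_wav are hypothetical helper names, not part of any SDK.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Return channel count, sample rate, and bit depth of a WAV payload."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            # Size of the raw PCM payload that would be sent to the API.
            "pcm_bytes": wav.getnframes() * wav.getnchannels() * wav.getsampwidth(),
        }


def is_expected_input(info: dict) -> bool:
    """True when the audio is mono, 16 kHz, 16-bit, as listed under Setup."""
    return (
        info["channels"] == 1
        and info["sample_rate"] == 16000
        and info["bits_per_sample"] == 16
    )


def make_silent_wav(frames: int = 1600) -> bytes:
    """Build a short silent mono 16 kHz 16-bit WAV (demonstration data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * frames)
    return buffer.getvalue()


if __name__ == "__main__":
    info = describe_wav(make_silent_wav())
    print(info, is_expected_input(info))
```

If the check passes both with and without the system prompt, the input audio is unlikely to be what changes between the two runs.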

The Issue

  • With System Prompt: The code sends the audio successfully (logged as "Sending input audio data: X bytes"), but then hangs indefinitely at "Waiting for audio response…". No chunks are received, and it never progresses.
  • Without System Prompt: If I remove the system_instruction (or a similar turns message), it works perfectly: audio is sent, and I get a response (though it's not translated, just echoed or processed differently).
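
To make the hang easier to diagnose, the receive loop can be wrapped so that a missing chunk raises a timeout instead of blocking forever. The sketch below uses only asyncio and works on any async iterator; with_chunk_timeout, fake_stream, and collect are hypothetical names, and fake_stream merely stands in for session.receive() so the example runs without the API.

```python
import asyncio
from typing import AsyncIterator, List, Tuple, TypeVar

T = TypeVar("T")


async def with_chunk_timeout(stream: AsyncIterator[T], timeout_s: float) -> AsyncIterator[T]:
    """Yield items from `stream`, raising asyncio.TimeoutError if the
    next item does not arrive within `timeout_s` seconds."""
    iterator = stream.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout=timeout_s)
        except StopAsyncIteration:
            return  # the underlying stream finished normally
        yield item


async def fake_stream(delays: List[float]) -> AsyncIterator[int]:
    """Stand-in for a response stream: yields an index after each delay."""
    for index, delay in enumerate(delays):
        await asyncio.sleep(delay)
        yield index


async def collect(delays: List[float], timeout_s: float) -> Tuple[List[int], bool]:
    """Return the chunks received and whether the stream timed out."""
    received: List[int] = []
    try:
        async for chunk in with_chunk_timeout(fake_stream(delays), timeout_s):
            received.append(chunk)
    except asyncio.TimeoutError:
        return received, True
    return received, False


if __name__ == "__main__":
    print(asyncio.run(collect([0, 0, 0], 0.5)))  # stream completes
    print(asyncio.run(collect([0, 5], 0.1)))     # second chunk stalls
```

In the code above this would presumably become `async for response in with_chunk_timeout(session.receive(), 10.0):`, where the 10-second limit is an arbitrary choice. It does not fix the hang, but it shows whether zero chunks arrive or the stream stalls partway through.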