Audio Input Cannot Trigger History Recall in Gemini Live API (Only Text Input Works)

Issue Description

When using the Gemini Live API (including the streaming “native audio” mode), prior conversation history loaded into the session via a context/history mechanism is only recalled when the user sends text input.
Audio/voice input never triggers the model to recall the preloaded history, even though the same context works correctly when queried via text.

This issue appears across both the LiveKit agent integration and direct Gemini Live API calls, indicating that the problem originates from the model or API behavior itself rather than the client implementation.


Reproduction Steps

1. Prepare a History Context

Example context loaded via `load_context(history)` or a `history` / `messages` payload:

User: Where does XX work?
Assistant: XX works at YYY company.
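With the google-genai SDK, this kind of history is typically preloaded by sending it as client content at the start of the Live session. The `build_turns` helper below is illustrative (not part of the SDK); it builds the `turns` payload that `session.send_client_content(turns=...)` accepts:

```python
# Sketch of preloading conversation history into a Gemini Live session.
# build_turns() is an illustrative helper that converts (role, text) pairs
# into the content/part shape used by session.send_client_content().

def build_turns(history):
    """Convert (role, text) pairs into Live API client-content turns."""
    return [
        {"role": role, "parts": [{"text": text}]}
        for role, text in history
    ]

history = [
    ("user", "Where does XX work?"),
    ("model", "XX works at YYY company."),
]
turns = build_turns(history)
# Inside an active Live session (connection code omitted), the history
# would be sent before the first user turn, e.g.:
#   await session.send_client_content(turns=turns, turn_complete=True)
```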

2. Start a new Live session

Model tested:

  • gemini-2.5-flash-native-audio-preview-09-2025

  • gemini-2.0-flash-live-001

3. Ask the same question using audio input

(Audio) “Where does XX work?”

→ Model responds: “I don’t know.”

4. Without resetting the session, send the same question as text input

(text) "Where does XX work?"

→ Model responds correctly: “XX works at YYY company.”

5. Repeat the same test using:

  • LiveKit Agent (audio)

  • LiveKit Agent Playground (audio → text)

  • Gemini official Live API sample code (audio → text)

All environments reproduce the same behavior:

  • Audio question → history not recalled

  • Text question → history recalled correctly
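For step 3, the Live API expects the audio question as raw 16-bit PCM at 16 kHz, mono, streamed in small chunks. The sketch below chunks a PCM buffer for streaming; `chunk_pcm` is an illustrative helper, and the synthetic silence stands in for the spoken question:

```python
# Sketch of streaming an audio question into the Live session.
# chunk_pcm() is an illustrative helper; the audio here is synthetic
# silence standing in for the recorded question.

CHUNK_BYTES = 3200  # 100 ms of 16-bit mono PCM at 16 kHz

def chunk_pcm(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split a PCM byte string into fixed-size chunks for streaming."""
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

one_second_silence = b"\x00\x00" * 16000  # 16000 samples, 2 bytes each
chunks = chunk_pcm(one_second_silence)
# Each chunk would then be sent over an open session, e.g.:
#   await session.send_realtime_input(
#       audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000"))
```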


Expected Behavior

Audio input should behave the same as text input:
When prior conversation history is loaded into the session, both audio and text queries should equally be able to access and recall that history.


Actual Behavior

  • Text queries can successfully retrieve information from loaded history.

  • Audio queries consistently fail to recall any historical information, responding as if no history exists.


Environment

Tested across:

  1. Gemini Live API — official sample code

  2. Gemini 2.5 Flash Native Audio Preview — streaming mode

  3. Gemini 2.0 Flash Live

  4. LiveKit Agent (same behavior reproduced)

  5. LiveKit Agent Playground (audio → fail, text → success)

The issue is consistent and model-independent.


Additional Notes

  • This behavior strongly suggests that audio inputs are not currently integrated into the history/context attention path, or the audio encoder does not consider preloaded history.

  • The issue is reproducible across all environments, which eliminates LiveKit or client-side problems.

  • A temporary workaround is to inject the important history into the system prompt, but this is only a partial solution.
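The workaround above can be sketched as folding the history into the system instruction before connecting. The `inject_history` helper and prompt wording are illustrative, not an official API pattern:

```python
# Sketch of the workaround: fold important history into the system
# instruction so audio turns can still "see" it. inject_history() and
# the prompt wording are illustrative.

def inject_history(base_instruction: str, history: list[tuple[str, str]]) -> str:
    """Append a transcript of prior turns to the system instruction."""
    lines = [f"{role}: {text}" for role, text in history]
    return (
        base_instruction
        + "\n\nPrior conversation (treat as already said):\n"
        + "\n".join(lines)
    )

system_instruction = inject_history(
    "You are a helpful voice assistant.",
    [("user", "Where does XX work?"), ("model", "XX works at YYY company.")],
)
# The result is then passed as config["system_instruction"] when opening
# the Live session; as noted above, this is only a partial solution.
```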

Hi @Linming, thank you for bringing this to our attention.

Apologies for the delayed response. Could you please confirm if you are still facing the same issue?

I’m facing the same problem:

```python
import asyncio

from google import genai
from google.genai import types

# pg, socket, CLIENTS, logger, VOICE_MODE, client, websocket, session_id,
# mode and valid_session are app-specific objects defined elsewhere;
# client is a genai.Client instance and this snippet runs inside an
# async handler.

model = "gemini-2.5-flash-native-audio-preview-12-2025"
config = {
    "response_modalities": ["AUDIO"],
    "system_instruction": VOICE_MODE,
    "output_audio_transcription": {},
    "input_audio_transcription": {},
    "thinking_config": {"thinking_budget": 0},
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    },
    "tools": [{"google_search": {}}],
}

if mode == "auto":
    config["realtime_input_config"]["automatic_activity_detection"] = {
        "disabled": False,
        "start_of_speech_sensitivity": types.StartSensitivity.START_SENSITIVITY_HIGH,
        "end_of_speech_sensitivity": types.EndSensitivity.END_SENSITIVITY_LOW,
        "prefix_padding_ms": 20,
        "silence_duration_ms": 200,
    }

try:
    history = await pg.fetch(
        "SELECT role, content FROM messages "
        "WHERE session_id = $1 AND content_type = 'text' "
        "ORDER BY id ASC LIMIT 20;",
        session_id,
    )

    async with client.aio.live.connect(model=model, config=config) as session:
        # Preload the stored text history as client content.
        turns = [
            {"role": turn["role"], "parts": [{"text": turn["content"]}]}
            for turn in history
        ]
        if turns:
            # Leave the turn open when the last stored message is from the model.
            await session.send_client_content(
                turns=turns,
                turn_complete=turns[-1]["role"] != "model",
            )

        CLIENTS.pop(session_id, None)

        await socket.connect(websocket)
        receiver_task = asyncio.create_task(socket.receive(websocket, session, mode))
        sender_task = asyncio.create_task(
            socket.send_live(websocket, session, session_id, valid_session["name"])
        )

        await asyncio.wait(
            [sender_task, receiver_task], return_when=asyncio.FIRST_COMPLETED
        )

except Exception as e:
    socket.disconnect(websocket)
    logger.exception(e)
```

When asked anything (via audio) related to the previous content, the model responds with something like “this is the first time” or “the start of the conversation”…

I am fairly certain this was not an issue before.

Hi @Linming, we are looking into this. Are you still facing this issue?