Issue Description
When using the Gemini Live API (including the streaming “native audio” mode), loading prior conversation history via context or history only works when the user sends text input.
Audio/voice input never triggers the model to recall previously provided history, even though the same context works correctly when queried via text.
This issue appears across both the LiveKit agent integration and direct Gemini Live API calls, indicating that the problem originates from the model or API behavior itself rather than the client implementation.
Reproduction Steps
1. Prepare a History Context
Example context loaded via load_context(history) or history / messages:
User: Where does XX work?
Assistant: XX works at YYY company.
2. Start a new Live session
Model tested:
-
gemini-2.5-flash-native-audio-preview-09-2025 -
gemini-2.0-flash-live-001
**3. Ask the same question using audio input
(Audio) “Where does XX work?”
→ Model responds: “I don’t know.”
**4. Without resetting the session, send the same question as text input
(text) "Where does XX work?"
→ Model responds correctly: “XX works at YYY company.”
5. Repeat the same test using:
-
LiveKit Agent (audio)
-
LiveKit Agent Playground (audio → text)
-
Gemini official Live API sample code (audio → text)
All environments reproduce the same behavior:
-
Audio question → history not recalled
-
Text question → history recalled correctly
Expected Behavior
Audio input should behave the same as text input:
When prior conversation history is loaded into the session, both audio and text queries should equally be able to access and recall that history.
Actual Behavior
-
Text queries can successfully retrieve information from loaded history.
-
Audio queries consistently fail to recall any historical information, responding as if no history exists.
Environment
Tested across:
-
Gemini Live API — official sample code
-
Gemini 2.5 Flash Native Audio Preview — streaming mode
-
Gemini 2.0 Flash Live
-
LiveKit Agent (same behavior reproduced)
-
LiveKit Agent Playground (audio → fail, text → success)
The issue is consistent and model-independent.
Additional Notes
-
This behavior strongly suggests that audio inputs are not currently integrated into the history/context attention path, or the audio encoder does not consider preloaded history.
-
The issue is reproducible across all environments, which eliminates LiveKit or client-side problems.
-
A temporary workaround is to inject important history into the system prompt, but this is only a partial solution.