Static Audio Output from Gemini Live API (google-genai SDK) on iOS with AVAudioEngine

Hi everyone,

I’m encountering a persistent issue with the Gemini Live API (google-genai SDK) where the audio output received from the API results in loud static during playback on iOS (tested on both Simulator and a physical iPhone), even though the connection and basic interaction seem to work. I’m hoping someone might have insights or suggestions.

Goal:
Implement real-time, bidirectional voice conversation between a SwiftUI iOS app and a Python (FastAPI) backend using the Gemini Live API for STT/LLM/TTS.

Setup:

  • Backend: Python 3.11, FastAPI, Uvicorn, google-genai SDK v1.12.1 (using AI Studio API Key).

  • Frontend: SwiftUI, iOS [Your Target iOS Version, e.g., 17.5+], Xcode 16.0 (16A242d). Using AVAudioEngine for recording (AudioManager) and playback (AudioPlayer), and URLSessionWebSocketTask for communication (NetworkManager).

  • API Call: Backend uses client.aio.live.connect with model="gemini-2.0-flash-live-001" (also tried gemini-1.5-pro-latest) and a minimal LiveConnectConfig(response_modalities=["AUDIO"]). The system prompt is sent via send_client_content; user audio chunks are sent via send_realtime_input(audio=types.Blob(…)). A sketch of this call pattern follows the format list below.

  • Audio Formats:

    • Frontend Mic → Backend: 16kHz, 16-bit Mono PCM

    • Backend → Google API: 16kHz, 16-bit Mono PCM (via Blob)

    • Google API → Backend (Expected): 24kHz, 16-bit Mono PCM (raw bytes in response.data)

    • Backend → Frontend: Raw bytes received from Google API via WebSocket (send_bytes).

    • Frontend Playback Target: Configured for 24kHz, 16-bit Mono PCM input.
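
For reference, a minimal sketch of the call pattern described above (the system prompt text and the mic-chunk source are placeholders, not the exact production code):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="AI_STUDIO_API_KEY")  # AI Studio key, not Vertex AI

config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def run_session(mic_chunks):
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        # System prompt, sent once up front via send_client_content
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="You are a helpful voice assistant.")],
            )
        )
        # User audio: 16kHz, 16-bit mono PCM chunks relayed from the iOS app
        async for chunk in mic_chunks:
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
```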

Problem:
When the backend receives response.data from the Gemini API stream and forwards these bytes to the iOS app, playing these bytes using AVAudioEngine / AVAudioPlayerNode results in loud static noise, not clear speech. This happens consistently on both the iOS Simulator and a physical iPhone.

What Works:

  • WebSocket connection between frontend and backend is stable.

  • Backend successfully connects to the Gemini Live API (the “:white_check_mark: Entered Gemini Live session.” log line appears).

  • System prompt is sent successfully via send_client_content.

  • User audio (16kHz Int16 PCM) is successfully captured by AudioManager, sent to the backend, and sent to Google via send_realtime_input without backend errors.

  • Backend receives binary data in response.data from the Gemini stream after user speaks.

  • Backend correctly sends audio_start, audio_end, and is_ai_speaking state updates to the frontend via WebSocket (a sketch of this forwarding shape follows this list).

  • Frontend receives these state updates and the binary data chunks.

  • Frontend AudioPlayer setup does not crash with the latest configurations tried.
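
A rough sketch of that forwarding shape, assuming JSON state messages and binary audio frames over the same FastAPI WebSocket (the exact message payloads here are placeholders):

```python
from fastapi import FastAPI, WebSocket
from google import genai
from google.genai import types

app = FastAPI()
client = genai.Client(api_key="AI_STUDIO_API_KEY")

@app.websocket("/ws")
async def websocket_endpoint(ws: WebSocket):
    await ws.accept()
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=types.LiveConnectConfig(response_modalities=["AUDIO"]),
    ) as session:
        # (mic upload task omitted; see the earlier sketch)
        await ws.send_json({"type": "is_ai_speaking", "value": True})
        await ws.send_json({"type": "audio_start"})
        async for response in session.receive():
            if response.data:
                await ws.send_bytes(response.data)  # raw PCM bytes, unmodified
        await ws.send_json({"type": "audio_end"})
        await ws.send_json({"type": "is_ai_speaking", "value": False})
```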

Debugging Steps Taken & Key Finding:

  1. SDK Migration: Confirmed we are using the current google-genai SDK (v1.12.1), not the deprecated google-generativeai. Resolved initial import errors.

  2. API Connection: Resolved various AttributeErrors and TypeErrors related to LiveConnectConfig and connection methods by simplifying the config, managing the context manually (__aenter__/__aexit__), and eventually using send_realtime_input for audio. The connection is now stable.

  3. Model Name: Confirmed gemini-1.5-flash-latest is rejected by the API for bidiGenerateContent. Switched to gemini-2.0-flash-live-001 (also briefly tested gemini-1.5-pro-latest; it still produced static).

  4. iOS AudioPlayer Implementation (Extensive Debugging):

  • Tried various AVAudioEngine graph setups (direct connection, intermediate mixer).

  • Tried multiple buffer creation/scheduling methods (scheduling Int16 buffers directly, using AVAudioConverter to create Float32 buffers matching processing format, using AVAudioConverter to create Float32 buffers resampled to hardware rate).

  • Ensured careful AVAudioSession configuration (.playAndRecord, .voiceChat, .mixWithOthers) and activation management.

  • Addressed multiple -10868 (FormatNotSupported) crashes related to node connections.

  • The final stable AudioPlayer uses AVAudioConverter to convert received Int16@24k data to Float32@HardwareRate buffers before scheduling. This eliminated crashes but the static remained.

  5. Data Verification (CRITICAL FINDINGS):

  • Modified NetworkManager.swift to save the raw Data bytes received from the WebSocket directly to a .rawpcm file, bypassing AudioPlayer.

  • Imported this received_audio.rawpcm file into Audacity using the expected format parameters (Signed 16-bit PCM, Little-endian, 1 Channel (Mono), 24000 Hz Sample Rate).

  • Result: The audio played back in Audacity directly from the saved raw bytes is also static noise.

  6. Backend Save (CRITICAL FINDING): Modified the backend main.py (receive_from_google function) to save the response.data received directly from the Gemini API stream to a .wav file on the server (using Python’s wave module, set to 1 channel, 16-bit, 24kHz) before sending anything over the WebSocket.

  • Result: Playing this backend-saved .wav file directly also produced static noise.
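
A sketch of that save step, assuming the session object from the first sketch (the output path is arbitrary):

```python
import wave

async def receive_from_google(session):
    # Dump everything Gemini returns for one turn into a file we can inspect.
    with wave.open("gemini_output_check.wav", "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit samples
        wf.setframerate(24000)  # documented 24kHz output rate
        async for response in session.receive():
            if response.data:
                wf.writeframes(response.data)
                # (the same bytes are then forwarded over the WebSocket)
```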

Conclusion:
Since the raw bytes produce static both when saved on the backend immediately after arriving in response.data and when the frontend interprets them with the documented format settings (before any AVAudioEngine processing), the problem appears to originate from the Gemini Live API itself for the tested models (gemini-2.0-flash-live-001, and briefly gemini-1.5-pro-latest) when accessed via an AI Studio API key. The data being returned does not appear to be clean 24kHz, 16-bit PCM; the issue is not in the iOS audio playback code or the WebSocket transmission.

Questions:

  1. Has anyone else successfully received clear 24kHz, 16-bit PCM audio output from the Gemini Live API using the google-genai SDK (v1.x) with an AI Studio API Key (not Vertex AI)?

  2. Is there a different, known-working model name compatible with AI Studio keys for the Live API audio output?

  3. Could the audio data format being returned be different from the documented 24kHz, 16-bit, signed, little-endian PCM (e.g., different encoding, endianness, headers)? A byte-level sanity check like the sketch after this list could rule some of these out.

  4. Are there any specific configurations or flags needed in LiveConnectConfig (even if using a dictionary) or the initial connection for this specific model/API key combination that might affect audio output quality?
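
On question 3 specifically, a hypothetical byte-level check on the received_audio.rawpcm dump from step 5 can rule out an endianness or container-header mix-up:

```python
import numpy as np

raw = open("received_audio.rawpcm", "rb").read()

# An unexpected container header (e.g. b'RIFF' for WAV) would show up here:
print(raw[:16])

# Decode as 16-bit PCM under both byte orders; speech should show a clear
# peak/RMS gap, while static tends to look like full-scale noise in both.
for label, dtype in [("little-endian", "<i2"), ("big-endian", ">i2")]:
    samples = np.frombuffer(raw[: len(raw) // 2 * 2], dtype=dtype)
    peak = int(np.abs(samples.astype(np.int32)).max())
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    print(f"{label}: peak={peak}, rms={rms:.0f}")
```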

Any help or pointers would be greatly appreciated! We’ve hit a wall after resolving the connection and client-side playback issues.

Thanks!

Addie Design

Hi,
In a similar situation here as well. I have tried similar sampling rates and the pyaudio Int16 format, but the output appears to be static noise. Earlier I tried an approach with Speech-to-Text and then sending the user prompt to the API endpoint, which worked pretty well for en-US, but in order to support diverse languages dynamically I went for gemini-2.0-flash-live-001, which resulted in this. If you come across any solutions please share; it would be really helpful.
thank you
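
A minimal version of that pyaudio check, assuming the documented 24kHz, 16-bit mono output format (the dump file name is just an example):

```python
import pyaudio

pcm_bytes = open("received_audio.rawpcm", "rb").read()  # raw API output dump

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
stream.write(pcm_bytes)  # static here as well points at the data itself
stream.stop_stream()
stream.close()
pa.terminate()
```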

I’ve got the same issue

@Siva_Sravana_Kumar_N @GUNAND_MAYANGLAMBAM

Any resolution, folks? Would really appreciate it if this gets fixed :grinning_face_with_smiling_eyes:

Hi, I am following up with the team regarding this issue.

Thanks

By the way, did you check the Get_started_LiveAPI cookbook? Just wondering whether this is an API related issue or a compatibility problem with AVAudioEngine on iOS.

Hi,

Thanks for the suggestion about the cookbook – I’ll double-check it for any differences.

To clarify whether it might be an AVAudioEngine issue, we performed a test directly on the Python backend:

  1. Inside the async for response in gemini_session.receive(): loop, we took the response.data bytes received directly from the Gemini Live API stream.

  2. Before sending these bytes over the WebSocket to the iOS client, we saved them directly into a .wav file on the server using Python’s standard wave module (configured for 1 channel, 16-bit samples, 24000 Hz rate).

  3. Playing this backend-saved .wav file directly on the server machine resulted in the same static noise.

This seems to indicate the issue lies with the raw audio data stream coming from the API itself (using models like gemini-2.0-flash-live-001 with an AI Studio key) rather than being an iOS playback problem.

Are there known issues with audio output quality for these models via the standard Live API endpoint/AI Studio keys, or is there a different recommended model known to provide clean 24kHz 16-bit PCM output?

Thanks again!

Currently gemini-2.0-flash-live-001 is the only model that supports audio output.
I will follow up with you regarding the issue with model quality.
