Gemini Flash Live API docs chaos sorted out:

Gemini 2.0 Flash on Vertex AI: Configuration for Real-Time Audio Interaction via BidiGenerateContent API

I. Introduction

This report details the specific technical configurations required to implement real-time, bidirectional voice interactions using a Gemini 2.0 Flash model variant hosted on Google Cloud’s Vertex AI platform. The focus is on leveraging the Multimodal Live API, accessed via the BidiGenerateContent service, to enable applications that stream audio and text inputs and generate audio (Text-to-Speech, TTS) output dynamically.

The demand for more natural and interactive AI experiences, particularly those involving voice, is rapidly increasing. Technologies such as Google’s Multimodal Live API are pivotal in enabling these low-latency, conversational applications. This API facilitates seamless exchanges of text, audio, and potentially video between a client application and the Gemini model, supporting use cases like real-time transcription, translation, and voice-driven assistance. This document specifically addresses the implementation details within the Vertex AI ecosystem, providing the necessary model identification, regional availability confirmation, and precise API endpoint specifications.

The subsequent sections will cover:

  1. Identification of the correct Gemini 2.0 Flash model ID on Vertex AI supporting the required features.
  2. Verification of the model’s availability in representative US and European Google Cloud regions.
  3. Specification of the exact WebSocket API endpoint used by Vertex AI for the BidiGenerateContent service.
  4. A synthesis of these findings into actionable implementation details.
  5. Concluding remarks summarizing the key specifications and considerations.

II. Identifying the Target Gemini 2.0 Flash Model on Vertex AI

Successfully implementing the desired real-time audio interaction requires pinpointing the exact Gemini 2.0 Flash model variant on Vertex AI that supports the Multimodal Live API and its Text-to-Speech (TTS) output capability.

A. Model ID Confirmation

The Gemini family includes several models and versions, necessitating precise identification. Various Gemini 2.0 Flash variants have been mentioned in documentation, including gemini-2.0-flash-lite-001, gemini-2.0-flash-live-preview-04-09, gemini-2.0-flash-exp, and gemini-2.0-flash-thinking-exp-01-21. However, recent documentation and announcements concerning the Multimodal Live API capabilities on the Vertex AI platform consistently associate these features with the primary model identifier gemini-2.0-flash.

While earlier preview or experimental phases might have used specific identifiers like gemini-2.0-flash-live-preview-04-09 or gemini-2.0-flash-exp to denote access to the Live API, the current documentation integrates this capability under the main gemini-2.0-flash model ID. A reference to gemini-1.5-pro-preview-0409 in a specific code example appears outdated or context-specific to that particular notebook and does not pertain to the requested Gemini 2.0 Flash variant.

Therefore, the recommended and currently documented model ID for leveraging the Multimodal Live API with its associated features on Vertex AI is gemini-2.0-flash.

B. Confirmation of Multimodal Live API (BidiGenerateContent) Support

The core requirement for real-time interaction is support for the Multimodal Live API. This API is explicitly designed for low-latency, bidirectional communication using WebSockets, enabling simultaneous sending and receiving of data streams including text, audio, and video. Within the underlying Google Cloud API structure, this functionality is implemented via the BidiGenerateContent method.

The gemini-2.0-flash model on Vertex AI is documented as supporting the Live API. Its release was highlighted alongside the introduction of the Multimodal Live API, emphasizing its role in powering dynamic and interactive applications. The API protocol involves specific message types for session setup (BidiGenerateContentSetup), client input (BidiGenerateContentClientContent, BidiGenerateContentRealtimeInput), server output (BidiGenerateContentServerContent), and other control messages exchanged over the WebSocket connection. Notably, using the Live API with gemini-2.0-flash might involve a different maximum input token limit (32,768 tokens) compared to its standard usage (1,048,576 tokens).
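For orientation, the sketch below shows the general shape of these messages as JSON payloads sent over the WebSocket. The field names follow the message types listed above but are assumptions about exact casing and structure; the project ID is hypothetical, and the current API reference should be consulted before relying on this layout.

```python
import base64
import json

# Illustrative message shapes for the BidiGenerateContent WebSocket protocol.
# Field names and casing are assumed; verify against current Vertex AI docs.

# 1. BidiGenerateContentSetup: the first message after the connection opens.
setup_msg = {
    "setup": {
        "model": "projects/my-project-id/locations/us-central1"
                 "/publishers/google/models/gemini-2.0-flash",
        "generationConfig": {"responseModalities": ["AUDIO"]},
    }
}

# 2. BidiGenerateContentClientContent: a complete (non-streaming) user turn.
client_content_msg = {
    "clientContent": {
        "turns": [{"role": "user", "parts": [{"text": "Hello there"}]}],
        "turnComplete": True,
    }
}

# 3. BidiGenerateContentRealtimeInput: a chunk of streamed microphone audio.
realtime_input_msg = {
    "realtimeInput": {
        "mediaChunks": [{
            "mimeType": "audio/pcm;rate=16000",
            "data": base64.b64encode(b"\x00\x00" * 160).decode(),  # dummy PCM
        }]
    }
}

# Each message is serialized and sent as a single WebSocket text frame.
print(json.dumps(setup_msg, indent=2))
```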

The evidence confirms that the gemini-2.0-flash model on Vertex AI supports the Multimodal Live API (via the BidiGenerateContent service) necessary for the requested real-time streaming interactions.

C. Confirmation of Audio Output (TTS) Capability

A specific requirement is the model’s ability to generate audio output (Text-to-Speech) through the Live API. Documentation confirms that the Multimodal Live API supports audio output alongside text output. Examples include “Text-to-audio generation” and “Text-to-audio conversation”. The API reference details that the server can send audio responses, and the gemini-2.0-flash model specifically lists “Audio” as a supported output type, albeit potentially in a preview stage.

The introduction of Gemini 2.0 emphasized native audio output as a key advancement, enabled through the Live API. The API supports specific voices (such as Puck, Charon, Kore, Fenrir, Aoede) and delivers audio in a defined format (raw 16-bit little-endian PCM at 24 kHz).
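As a concrete illustration, the fragment below requests one of the named voices in the session setup and wraps the returned raw PCM in a WAV container so it can be played back. The speechConfig field names are assumptions based on the Live API reference and the project ID is hypothetical; the PCM parameters (16-bit, 24 kHz, mono) follow the format described above.

```python
import wave

# Illustrative voice selection inside the setup message (field names assumed
# from the Live API reference; confirm casing against current docs).
setup_with_voice = {
    "setup": {
        "model": "projects/my-project-id/locations/us-central1"
                 "/publishers/google/models/gemini-2.0-flash",
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Puck"}}
            },
        },
    }
}

def pcm_to_wav(pcm_bytes: bytes, path: str = "reply.wav") -> None:
    """Wrap raw 16-bit little-endian PCM at 24 kHz (mono) in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples = 2 bytes
        wav.setframerate(24000)  # 24 kHz output sample rate described above
        wav.writeframes(pcm_bytes)
```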

Thus, gemini-2.0-flash on Vertex AI, when utilized with the Multimodal Live API, supports audio output (TTS). However, it is important to note that this capability has been frequently described as being in “private preview” or initially available to “early-access partners”, indicating it might not yet have reached General Availability (GA) status.

D. Implications of Preview/Experimental Status

The recurring designation of the Multimodal Live API and its specific features like TTS output for gemini-2.0-flash as “preview,” “experimental,” or initially for “early-access partners” carries significant implications for implementation.

Google often employs a phased rollout strategy for advanced cloud services, particularly those involving complex real-time interactions like the Live API. Starting with experimental or preview stages allows for iterative development based on user feedback, ensures infrastructure stability and scalability, and manages risks before a GA release. This means developers using these features should anticipate potential changes to the API surface, adherence to specific “Pre-GA Offerings Terms” which might include limited support, and potentially the need for project allowlisting or specific configurations during this phase. The stability and feature set may evolve before reaching GA.

Furthermore, the apparent evolution in model naming conventions is noteworthy. The initial use of distinct identifiers like gemini-2.0-flash-live-preview-04-09 or gemini-2.0-flash-exp for accessing preview features, followed by the integration of Live API support directly under the base gemini-2.0-flash model ID, suggests a consolidation pathway as features mature. While early previews often necessitate separate identifiers to isolate experimental functionality, associating the feature with the primary model name as it stabilizes simplifies long-term development and indicates the Live API is becoming a core, albeit potentially still preview, component of the gemini-2.0-flash offering on Vertex AI.

Developers must remain cognizant of this preview status. Implementations should be designed with flexibility to accommodate potential API adjustments, and teams should consult the latest Google Cloud documentation for updates on feature maturity and GA timelines.

III. Regional Availability on Vertex AI

Deploying AI solutions requires confirmation that the chosen model and its features are available in the target geographic regions. Vertex AI services are offered in specific Google Cloud regions.

A. Confirmed US Region

Vertex AI models and features are widely available across multiple regions in the United States. The us-central1 region (Iowa) is a frequently supported location for core Google Cloud services and AI models. Documentation confirms that us-central1 is an available region for Vertex AI and specifically lists gemini-2.0-flash as supported in this region, including for features like dynamic shared quota and Provisioned Throughput. Furthermore, examples and tutorials related to the Live API on Vertex AI often utilize us-central1 as the reference location, and it is commonly cited as a valid region code for Vertex AI client initialization.

Therefore, the gemini-2.0-flash model, including its Multimodal Live API capabilities (subject to preview status considerations), is confirmed to be available in the us-central1 region in the United States.

B. Confirmed European Region

Similarly, Vertex AI provides services across several European regions. The europe-west4 region (Netherlands) is one such location. Documentation lists europe-west4 as a supported Vertex AI region. Crucially, gemini-2.0-flash is explicitly documented as available in europe-west4, supporting dynamic shared quota and Provisioned Throughput. It is also mentioned as a valid region code example for Vertex AI configuration.

Consequently, the gemini-2.0-flash model, along with its Multimodal Live API features (subject to preview status), is confirmed to be available in the europe-west4 region in Europe.

C. Considerations for Regional Feature Rollout & Processing

While the gemini-2.0-flash model is available in us-central1 and europe-west4, several nuances related to regional deployment warrant attention.

Firstly, the availability of specific features, especially those in preview like the Multimodal Live API with TTS, should always be verified for the target region, as rollouts might not be simultaneous across all locations. Explicit location configuration is mandatory when using Vertex AI SDKs or APIs; the platform does not infer the region automatically. This explicit setting is crucial for ensuring data processing complies with regional regulations (e.g., GDPR in Europe) and for optimizing latency by using resources geographically closer to the application or end-users.

Secondly, the location of the underlying Machine Learning (ML) processing can be a factor. While some older models or specific configurations might have processing routed to central US or EU locations regardless of the API endpoint used, the documentation for gemini-2.0-flash indicates that ML processing generally occurs within the broad geographic area (US or Europe) corresponding to the chosen regional endpoint. Vertex AI also offers a global endpoint option for models like gemini-2.0-flash. Using the global endpoint can enhance availability but sacrifices control over the specific processing region and is incompatible with certain Vertex AI features like tuning, batch prediction, and VPC Service Controls. Given the stateful nature of the Live API and potential data sensitivity, utilizing specific regional endpoints (us-central1, europe-west4) is generally recommended over the global endpoint to maintain control over data locality and ensure compatibility with the full suite of Vertex AI features.

Thirdly, resource allocation and quotas can differ between regions. Documentation for gemini-2.0-flash reveals significantly different Tokens Per Minute (TPM) limits for processing media inputs (images, documents, video, audio) between US/Asia regions and European regions, with EU limits being notably lower. For instance, the audio processing TPM is listed as 1.7M in US/Asia versus 0.4M in the EU. This suggests potential variations in the underlying infrastructure capacity supporting these intensive multimodal features across continents. Applications with substantial multimodal workloads, particularly those deployed in Europe, may need to account for these lower regional quotas through careful design, potentially implementing strategies like request throttling or considering cross-regional architectures if limits are frequently encountered. Quotas are typically applied per model, per region, per minute.
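To make the quota gap concrete, a rough back-of-the-envelope estimate is sketched below. It assumes roughly 32 audio tokens per second (a figure commonly quoted for the Live API and discussed later in this thread) and the TPM limits cited above; both may change, so treat the numbers as indicative only.

```python
# Rough capacity estimate: how many continuous audio streams fit inside the
# audio-processing TPM quotas cited above. Assumes ~32 audio tokens/second.
TOKENS_PER_SEC_AUDIO = 32
tokens_per_min_per_stream = TOKENS_PER_SEC_AUDIO * 60  # 1,920 tokens/minute

for region, tpm_limit in {"US/Asia": 1_700_000, "EU": 400_000}.items():
    streams = tpm_limit / tokens_per_min_per_stream
    print(f"{region}: ~{streams:.0f} concurrent audio streams per project/region")
# US/Asia: ~885 streams, EU: ~208 streams
```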

In summary, while us-central1 and europe-west4 are confirmed regions for gemini-2.0-flash, developers should validate the Live API’s specific functional status within these regions due to its preview nature. Using regional endpoints is the advised approach for maintaining control and feature compatibility. Awareness of potentially lower resource quotas in European regions compared to US regions is also important for applications with high multimodal throughput requirements.

D. Gemini 2.0 Flash Live API Availability Summary

The following table synthesizes the availability findings for the key components required:

Feature Element          Specification/Status                         Confirmed US Region   Confirmed EU Region
Model ID                 gemini-2.0-flash                             us-central1           europe-west4
API / Service            Multimodal Live API / BidiGenerateContent    us-central1           europe-west4
Audio Output (TTS)       Supported (Preview)                          us-central1           europe-west4
WebSocket Endpoint       Vertex AI Regional                           us-central1           europe-west4
ML Processing Location   US Multi-region / EU Multi-region            us-central1           europe-west4

Note: Preview status implies potential changes and adherence to Pre-GA terms.

IV. Vertex AI WebSocket Endpoint for BidiGenerateContent

Accessing the Multimodal Live API requires connecting to the correct WebSocket endpoint specific to the Vertex AI platform and the chosen region.

A. Standard Vertex AI Endpoint Structure

Vertex AI services are typically accessed via regional API endpoints. The standard pattern for REST API hostnames is https://{LOCATION}-aiplatform.googleapis.com, where {LOCATION} represents the specific Google Cloud region, such as us-central1 or europe-west4. Client libraries and SDKs are configured to target these regional endpoints, often through initialization parameters or environment variables specifying the project and location.

B. Proposed WebSocket Endpoint Format

Since the Multimodal Live API utilizes the WebSocket Secure (wss) protocol for real-time, bidirectional communication, its endpoint must adapt the standard Vertex AI regional hostname. The endpoint needs to specify the wss scheme and include a path identifying the WebSocket service (/ws/...) and the specific BidiGenerateContent method.

Based on Vertex AI’s standard regional hostname pattern and the typical structure for WebSocket services, the most probable endpoint format is:

wss://{LOCATION}-aiplatform.googleapis.com/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent

Here, {LOCATION} must be replaced with the desired regional identifier (e.g., us-central1, europe-west4).

It is important to highlight that while the message protocol and functionality are detailed, none of the provided documentation excerpts explicitly confirms this exact WebSocket URL path for the Vertex AI implementation of the Live API. This absence might reflect the feature’s preview status, where documentation sometimes lags behind the initial release, or it might indicate an expectation that developers will primarily use SDKs which abstract this detail. For developers requiring direct WebSocket connection without an SDK, this lack of explicit confirmation presents a potential point of friction. Therefore, while the proposed format is logically consistent, verification through SDK source code inspection or consultation of the latest official Vertex AI API reference documentation is advisable.
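For teams that do need a direct connection, the sketch below shows one way to open the WebSocket with application-default credentials and send the initial setup message. It assumes the proposed (unconfirmed) endpoint path above and uses the third-party websockets package; treat it as a starting point to validate against the latest documentation rather than a reference implementation.

```python
import asyncio
import json

import google.auth
import google.auth.transport.requests
import websockets  # pip install websockets

LOCATION = "us-central1"  # or "europe-west4"
# Proposed (unconfirmed) Vertex AI Live API endpoint, as discussed above.
URI = (f"wss://{LOCATION}-aiplatform.googleapis.com/ws/"
       "google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent")

async def main() -> None:
    # Obtain an OAuth2 access token from application-default credentials.
    creds, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    creds.refresh(google.auth.transport.requests.Request())
    headers = {"Authorization": f"Bearer {creds.token}"}

    model = (f"projects/{project_id}/locations/{LOCATION}"
             "/publishers/google/models/gemini-2.0-flash")

    # NOTE: older releases of the websockets package name this parameter
    # `extra_headers` instead of `additional_headers`.
    async with websockets.connect(URI, additional_headers=headers) as ws:
        await ws.send(json.dumps({"setup": {"model": model}}))
        # The server is expected to reply with a setup-complete message.
        print(await ws.recv())

if __name__ == "__main__":
    asyncio.run(main())
```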

C. Distinction from Gemini API Endpoint

It is crucial to differentiate the Vertex AI WebSocket endpoint from the endpoint used for the standalone Google AI Gemini API. The Gemini API provides access to Gemini models outside the full Vertex AI MLOps platform and uses a distinct endpoint: wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent. This endpoint uses a different hostname (generativelanguage.googleapis.com) and service path.

This distinction reflects Google’s strategy of offering core AI capabilities through multiple channels. The Gemini API provides broader, potentially simpler access for developers, while Vertex AI integrates the same core capabilities (like the Live API) within an enterprise-grade platform offering enhanced governance, security, MLOps tooling, and integration with other Google Cloud services. Although the underlying Live API protocol and message structures are likely very similar across both platforms, the access points, authentication mechanisms, quotas, and surrounding platform features differ. Implementations targeting the Vertex AI platform must use the Vertex AI-specific regional endpoint (wss://{LOCATION}-aiplatform.googleapis.com/...).

D. Configuration via SDKs/Client Libraries

For developers utilizing official Google Cloud client libraries or SDKs (e.g., the Generative AI SDK for Python configured for Vertex AI, the Vertex AI Python SDK, Spring AI integrations), direct manipulation of the WebSocket URL is typically unnecessary. These libraries are designed to construct the correct regional endpoint based on the project ID and location provided during initialization or configuration. Properly configuring the SDK to use Vertex AI (GOOGLE_GENAI_USE_VERTEXAI=True) and specifying the correct location (vertexai.init(location='us-central1')) ensures that the library targets the appropriate regional Vertex AI endpoint, abstracting the underlying URL details.
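A minimal configuration sketch, assuming the google-genai SDK and a hypothetical project ID; the SDK resolves the regional endpoint from these values, so the wss:// URL never has to be written out by hand.

```python
from google import genai

# Configure the SDK for Vertex AI rather than the standalone Gemini API.
client = genai.Client(
    vertexai=True,            # route requests through Vertex AI
    project="my-project-id",  # hypothetical project ID
    location="us-central1",   # or "europe-west4"
)

# Equivalent environment-variable configuration:
#   GOOGLE_GENAI_USE_VERTEXAI=True
#   GOOGLE_CLOUD_PROJECT=my-project-id
#   GOOGLE_CLOUD_LOCATION=us-central1
#
# Live sessions are then opened via client.aio.live.connect(...), which
# targets the regional Vertex AI WebSocket endpoint internally.
```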

V. Synthesized Implementation Details

Based on the preceding analysis, the following specific details are provided for implementing real-time audio interactions using Gemini 2.0 Flash on Vertex AI via the Multimodal Live API (BidiGenerateContent).

A. Full Model Resource Name

When interacting with the Vertex AI API, models are often referenced using their full resource name. The standard format is: projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}

For the identified model (gemini-2.0-flash) in the confirmed regions, the resource names are:

  • US Region (us-central1): projects/{PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-2.0-flash
  • European Region (europe-west4): projects/{PROJECT_ID}/locations/europe-west4/publishers/google/models/gemini-2.0-flash

Replace {PROJECT_ID} with the specific Google Cloud project ID. This format aligns with standard Vertex AI resource naming conventions.
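A tiny helper illustrating the construction (the project ID shown is hypothetical):

```python
def model_resource_name(project_id: str, location: str,
                        model_id: str = "gemini-2.0-flash") -> str:
    """Build the full Vertex AI model resource name."""
    return (f"projects/{project_id}/locations/{location}"
            f"/publishers/google/models/{model_id}")

print(model_resource_name("my-project-id", "europe-west4"))
# projects/my-project-id/locations/europe-west4/publishers/google/models/gemini-2.0-flash
```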

B. WebSocket Endpoint URL

The required WebSocket endpoint for establishing a connection to the Multimodal Live API (BidiGenerateContent service) on Vertex AI follows this structure:

wss://{LOCATION}-aiplatform.googleapis.com/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent

Replace {LOCATION} with the target Google Cloud region identifier:

  • For US Region: us-central1
  • For European Region: europe-west4

As noted previously, direct confirmation of this exact path was not found in the provided documentation excerpts, and verification is recommended if not using an SDK that handles endpoint resolution.

VI. Conclusion and Key Takeaways

This report has identified the necessary technical specifications for utilizing the Gemini 2.0 Flash model on Google Cloud Vertex AI to build applications featuring real-time, bidirectional voice interactions with Text-to-Speech (TTS) output.

Summary of Findings:

  1. Model: The appropriate model on Vertex AI is gemini-2.0-flash.
  2. API: This model supports the Multimodal Live API, accessed via the BidiGenerateContent service over WebSockets, enabling real-time streaming.
  3. TTS Capability: Audio output (TTS) is a supported feature of this model when used with the Live API, although it is currently designated as being in a preview stage. Available voices include Puck, Charon, Kore, Fenrir, and Aoede.
  4. Regional Availability: The gemini-2.0-flash model and its Live API capabilities are confirmed to be available in the us-central1 (US) and europe-west4 (Europe) regions.
  5. WebSocket Endpoint: The required Vertex AI WebSocket endpoint is wss://{LOCATION}-aiplatform.googleapis.com/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent, where {LOCATION} should be replaced with the target region (e.g., us-central1 or europe-west4).

Key Considerations for Implementation:

  • Preview Status: The Multimodal Live API and its TTS functionality for gemini-2.0-flash are in preview. Developers must be prepared for potential API changes and adhere to Google Cloud’s Pre-GA terms.
  • Regional Configuration: Explicitly configuring the target region (us-central1 or europe-west4) is essential when using Vertex AI clients or SDKs to ensure requests are routed correctly and comply with data residency requirements. Using regional endpoints is recommended over the global endpoint for this use case.
  • Resource Quotas: Be mindful of potentially lower Tokens Per Minute (TPM) quotas for media processing in European regions compared to US regions, particularly for applications with high multimodal throughput.
  • Endpoint Specificity: Ensure the Vertex AI-specific WebSocket endpoint is used, distinguishing it from the standalone Gemini API endpoint.

Final Recommendation:

The synthesized details provide the core technical information needed for implementation. Developers should proceed using the gemini-2.0-flash model ID and the specified Vertex AI WebSocket endpoint structure for the us-central1 and europe-west4 regions. Given the preview status of the Live API and the lack of explicit documentation confirming the exact WebSocket path in the reviewed materials, it is prudent to verify this path through SDK inspection or the latest official API documentation if implementing a direct WebSocket connection without relying on SDK abstractions. Utilizing official Google Cloud SDKs, correctly configured for the target project and region, remains the recommended approach as it typically handles endpoint resolution automatically.

It’s weird, no region works for me if I set the model to “gemini-2.0-flash”.
gemini-2.0-flash-live-preview-04-09 – only works in us-central1
gemini-2.0-flash-live-001 – works in europe-west4

Is there a pricing difference between these models? I’m wondering because yesterday I tested a 1 h convo and it billed me $4.90, and given the 32 tokens/sec of audio it should have been way less.

First of all, thank you, and I am sorry, but there is no delete or edit button for my post, or I have not found that button yet. The info in the post is partially outdated. Last time I tested, about a week ago, gemini-2.0-flash-live-preview-04-09 was working in us-central1, and I did not test Europe. Today (28.4.2025) the “Gemini models | Gemini API | Google AI for Developers” page looks different and even more confusing. As for the pricing, I spent a full 2 hours on research and calculations and it’s not clear either. I will need to time the test and read out the key costs. The dev support from Google on this is beyond terrible. The best results I got were by using Gemini and GPT to run deep search/research and then trying things. At least the bots read all the discussions and examples of what worked for whom and when. I will need to figure all this out in the next few days anyhow; I will post results here.


Hey, I will post results too as I’m going to test and see the cost.

I’ve also used Gemini and o3 for deep research :joy:

What I understood is that 1 sec of audio is 32 tokens.
1M tokens in is $2.10 (that would mean 1 hour of audio tokens in is about 24 cents).
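Quick sanity check of that math in code (using the figures quoted above, which may not match current official pricing):

```python
# Quick check of the estimate above, using the quoted figures
# (~32 audio tokens/sec, $2.10 per 1M input tokens); not official pricing.
audio_tokens_per_hour = 32 * 3600                               # 115,200 tokens
input_cost_per_hour = audio_tokens_per_hour / 1_000_000 * 2.10
print(f"~${input_cost_per_hour:.2f} per hour of input audio")   # ≈ $0.24
```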

What I wrote above is using Gemini directly.
Tonight I will test a 1 h session on Vertex AI.

It sucks because I’m launching an app this week and idk how many hours of audio streaming I can give my users. I’d complain more, but the OpenAI prices are even higher :joy:

I will create a table with the tests too.

A 1 hour conversation on Vertex AI was $3.35: continuous audio stream in (manual, no VAD, cutting out silent audio chunks in my app), text responses out (avg 150 tokens/response), with responses triggered every 20-30 seconds.

So it’s much more expensive than the pricing would suggest.

Coming back with an update: apparently the main cost was something called output predictions.
You can’t find anything about what that might be.
But my guess is that, due to the real-time nature of the API, the model starts predicting responses so it has them readily available when the user input ends.


I have built a benchmark but it is not fully completed yet; it’s in the format of a dual live voice chatbot, Gemini vs OpenAI, so everything can be measured in a long conversation (when I complete that). I must sort this out. As for research: no data on Gemini, a lot of data on OpenAI, and higher than expected costs. As for tokens, the paper says 25 per second in or out for Gemini = fewer dollars for 1 hour.