Pricing and usages for S2S (speech to speech) models

Saif_Kharouf · November 26, 2025, 1:06pm

I want to ask a couple of questions about the pricing and usage of the live models (aka speech-to-speech models). I am using Pipecat.
1 - How are we billed for the different modalities (audio to audio / audio to text)? Are text and audio tokens billed together or separately, depending on the modality?

2 - I want to better understand the usage reported by the model, like this one for audio to text:

prompt_token_count=3384 cached_content_token_count=None response_token_count=10 tool_use_prompt_token_count=None thoughts_token_count=None total_token_count=3394 prompt_tokens_details=[ModalityTokenCount(
modality=<MediaModality.TEXT: 'TEXT'>,
token_count=3380
), ModalityTokenCount(
modality=<MediaModality.AUDIO: 'AUDIO'>,
token_count=4
)] cache_tokens_details=None response_tokens_details=[ModalityTokenCount(
modality=<MediaModality.TEXT: 'TEXT'>,
token_count=10
)] tool_use_prompt_tokens_details=None traffic_type=None

And this one for audio to audio:

prompt_token_count=3872 cached_content_token_count=None response_token_count=135 tool_use_prompt_token_count=None thoughts_token_count=None total_token_count=4007 prompt_tokens_details=[ModalityTokenCount(
modality=<MediaModality.TEXT: 'TEXT'>,
token_count=3870
), ModalityTokenCount(
modality=<MediaModality.AUDIO: 'AUDIO'>,
token_count=2
)] cache_tokens_details=None response_tokens_details=[ModalityTokenCount(
modality=<MediaModality.TEXT: 'TEXT'>,
token_count=135
)] tool_use_prompt_tokens_details=None traffic_type=None

The two look the same to me, and I do not see any difference between them. The audio token counts are also low; I don't know whether this is normal.

Thanks

Shivam_Singh2 · November 27, 2025, 6:16am

Hi @Saif_Kharouf,
Welcome to the AI Forum!

Thank you for reaching out to us.
The gemini-2.0-flash-live-001 model applies a rate of $0.35 per 1 million tokens for text inputs and $2.10 per 1 million tokens for audio, image, or video inputs. Output generation is charged at $1.50 per 1 million tokens for text and $8.50 per 1 million tokens for audio.
These rates are the same as those for the gemini-2.5-flash-native-audio-preview-09-2025 native audio model.

If you need more detailed information, please refer to this documentation.

Saif_Kharouf · November 27, 2025, 6:29am

Thank you for your response.

But I do not understand how to calculate the cost. Do I extract the text tokens and audio tokens from the input, for example, and calculate them separately?

Shivam_Singh2 · November 27, 2025, 9:03am

Hello,

To calculate Gemini API costs, you first need to use the API to count your input and output tokens, which include both text and audio. Then, multiply the total number of tokens by the price per token for your specific model. Please refer to this document for more clarification.
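As a rough illustration of the arithmetic, the sketch below prices each modality separately, since the rates quoted earlier in this thread differ for text and audio. It assumes those rates are per 1 million tokens, and it takes the token counts from the audio-to-text usage posted above. The names `INPUT_RATES`, `OUTPUT_RATES`, and `estimate_cost` are made up for this example and are not part of any SDK; check the current pricing page before relying on the numbers.

```python
# Rates in USD, copied from the reply earlier in this thread for
# gemini-2.0-flash-live-001. ASSUMPTION: the rates are per 1 million tokens.
INPUT_RATES = {"TEXT": 0.35, "AUDIO": 2.10, "IMAGE": 2.10, "VIDEO": 2.10}
OUTPUT_RATES = {"TEXT": 1.50, "AUDIO": 8.50}

TOKENS_PER_RATE_UNIT = 1_000_000


def estimate_cost(prompt_details, response_details):
    """Estimate the cost in USD of one usage report.

    Both arguments are lists of (modality, token_count) pairs, i.e. the
    values found in prompt_tokens_details and response_tokens_details.
    """
    input_cost = sum(INPUT_RATES[m] * n for m, n in prompt_details)
    output_cost = sum(OUTPUT_RATES[m] * n for m, n in response_details)
    return (input_cost + output_cost) / TOKENS_PER_RATE_UNIT


# Token counts from the audio-to-text usage posted in this thread.
audio_to_text_cost = estimate_cost(
    prompt_details=[("TEXT", 3380), ("AUDIO", 4)],
    response_details=[("TEXT", 10)],
)
print(f"Estimated cost: ${audio_to_text_cost:.7f}")
```

With these inputs almost all of the estimate comes from the 3380 text input tokens; the 4 audio input tokens contribute very little.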

Saif_Kharouf · November 27, 2025, 9:23am

Thanks for the documentation.

I have another question. In the usage I showed in the main post, the input audio token count seems low for roughly 1 second of audio per turn, and there are no output audio tokens for some reason, although I have set the Gemini response modality to audio.

Thanks for the help.

Shivam_Singh2 · November 28, 2025, 9:56am

Hello,

Yes, this happens because 1-second audio clips are too short. Try using slightly longer clips, such as 3 to 5 seconds.
