Pricing and usage for S2S (speech-to-speech) models

I want to ask a couple of questions about the pricing and usage of the live models (aka speech-to-speech models). I am using pipecat.

1 - How are we billed for the different modalities (audio to audio) / (audio to text)? Are text and audio tokens billed together, or separately depending on the modality?

2 - I want to better understand the usage metadata the model returns, like this one for audio to text:

prompt_token_count=3384
cached_content_token_count=None
response_token_count=10
tool_use_prompt_token_count=None
thoughts_token_count=None
total_token_count=3394
prompt_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=3380),
  ModalityTokenCount(modality=<MediaModality.AUDIO: 'AUDIO'>, token_count=4)
]
cache_tokens_details=None
response_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=10)
]
tool_use_prompt_tokens_details=None
traffic_type=None

And this one for audio to audio:

prompt_token_count=3872
cached_content_token_count=None
response_token_count=135
tool_use_prompt_token_count=None
thoughts_token_count=None
total_token_count=4007
prompt_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=3870),
  ModalityTokenCount(modality=<MediaModality.AUDIO: 'AUDIO'>, token_count=2)
]
cache_tokens_details=None
response_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=135)
]
tool_use_prompt_tokens_details=None
traffic_type=None

The two look the same to me: both report only text tokens in the response, and I do not see any difference between the modalities. The audio token counts are also very low, and I don't know whether that is normal.

Thanks :slight_smile:

Hi @Saif_Kharouf
Welcome to the AI Forum!!!

Thank you for reaching out to us.
The gemini-2.0-flash-live-001 model is billed per 1 million tokens, with a separate rate for each modality. Input is charged at $0.35 for text and $2.10 for audio, image, or video. Output is charged at $1.50 for text and $8.50 for audio.
These rates are the same as those for the gemini-2.5-flash-native-audio-preview-09-2025 native audio model.

If you need more detailed information, please refer to this documentation.

Thank you for your response,

But I do not understand how to calculate the cost. Do I extract the text tokens and audio tokens from the input, for example, and calculate them separately?

Hello,

To calculate Gemini API costs, first use the usage metadata returned by the API to get your input and output token counts, broken down by modality (text and audio). Then multiply the token count for each modality by the price per token for that modality on your specific model, and add the results together. Please refer to this document for more clarification.
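As a rough illustration of that calculation, here is a minimal Python sketch using the token counts posted earlier in this thread and the gemini-2.0-flash-live-001 rates quoted above. It assumes the rates are in USD per 1 million tokens and ignores cached, tool-use, and thinking tokens. The `estimate_cost` helper and the plain dictionaries are hypothetical and not part of any SDK, so please verify the numbers against the current pricing page.

```python
# Rates in USD, taken from the reply earlier in this thread.
# Assumption: they are quoted per 1,000,000 tokens.
INPUT_RATES = {"TEXT": 0.35, "AUDIO": 2.10, "IMAGE": 2.10, "VIDEO": 2.10}
OUTPUT_RATES = {"TEXT": 1.50, "AUDIO": 8.50}
TOKENS_PER_RATE_UNIT = 1_000_000


def estimate_cost(prompt_tokens, response_tokens):
    """Estimate the USD cost of one turn.

    Both arguments map a modality name ("TEXT", "AUDIO", ...) to a token
    count, mirroring prompt_tokens_details and response_tokens_details.
    Hypothetical helper: cached and thinking tokens are not handled.
    """
    input_cost = sum(INPUT_RATES[m] * n for m, n in prompt_tokens.items())
    output_cost = sum(OUTPUT_RATES[m] * n for m, n in response_tokens.items())
    return (input_cost + output_cost) / TOKENS_PER_RATE_UNIT


# Token counts copied from the two usage dumps in the original post.
audio_to_text = estimate_cost({"TEXT": 3380, "AUDIO": 4}, {"TEXT": 10})
audio_to_audio = estimate_cost({"TEXT": 3870, "AUDIO": 2}, {"TEXT": 135})

print(f"audio to text:  ${audio_to_text:.7f}")
print(f"audio to audio: ${audio_to_audio:.7f}")
```

Because each modality has its own rate, the same total token count can cost more or less depending on how it splits between text and audio.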

Thanks for the documentation.

I have another question. In the usage I posted at the top of the thread, the input audio token count looks low for roughly 1 second of audio per turn, and there are no output audio tokens at all, although I have set the Gemini response modality to audio.

Thanks for the help.

Hello,

Yes, this happens because 1-second audio clips are too short. Try using slightly longer clips, such as 3 to 5 seconds.