I want to ask a couple of questions about the pricing and usage reporting of the live models (aka speech-to-speech models).
I am using pipecat.
1 - How are we billed for the different modalities (audio to audio) / (audio to text)? Are text and audio tokens billed together or separately, depending on the modality?
2 - I want to understand the usage metadata of the model better. For example, this is what I get for audio to text:
prompt_token_count=3384
cached_content_token_count=None
response_token_count=10
tool_use_prompt_token_count=None
thoughts_token_count=None
total_token_count=3394
prompt_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=3380),
  ModalityTokenCount(modality=<MediaModality.AUDIO: 'AUDIO'>, token_count=4)
]
cache_tokens_details=None
response_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=10)
]
tool_use_prompt_tokens_details=None
traffic_type=None
And this is what I get for audio to audio:
prompt_token_count=3872
cached_content_token_count=None
response_token_count=135
tool_use_prompt_token_count=None
thoughts_token_count=None
total_token_count=4007
prompt_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=3870),
  ModalityTokenCount(modality=<MediaModality.AUDIO: 'AUDIO'>, token_count=2)
]
cache_tokens_details=None
response_tokens_details=[
  ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=135)
]
tool_use_prompt_tokens_details=None
traffic_type=None
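The two look the same to me: in both cases the response tokens are reported only as TEXT, so I do not see any difference between the modalities. The audio token counts in the prompt are also very low (4 and 2), and I don't know whether this is normal or not.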
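To make the comparison concrete, here is a minimal sketch of how the per-modality counts could be tallied. The ModalityCount class and the tokens_by_modality helper are hypothetical stand-ins written for illustration, not part of pipecat or the SDK; the numbers are copied from the two usage outputs in this post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModalityCount:
    """Hypothetical stand-in for one ModalityTokenCount entry."""
    modality: str
    token_count: int


def tokens_by_modality(details):
    """Sum token counts per modality; a None details list is treated as empty."""
    totals = defaultdict(int)
    for item in details or []:
        totals[item.modality] += item.token_count
    return dict(totals)


# Numbers copied from the usage outputs in this post.
audio_to_text_prompt = [ModalityCount("TEXT", 3380), ModalityCount("AUDIO", 4)]
audio_to_text_response = [ModalityCount("TEXT", 10)]
audio_to_audio_prompt = [ModalityCount("TEXT", 3870), ModalityCount("AUDIO", 2)]
audio_to_audio_response = [ModalityCount("TEXT", 135)]

runs = [
    ("audio to text", audio_to_text_prompt, audio_to_text_response),
    ("audio to audio", audio_to_audio_prompt, audio_to_audio_response),
]
for label, prompt, response in runs:
    print(label)
    print("  prompt:  ", tokens_by_modality(prompt))
    print("  response:", tokens_by_modality(response))
```

Tallied this way, the per-modality prompt counts add up to the reported prompt_token_count in both runs (3384 and 3872), and the response side contains only a TEXT entry in both runs, including the audio to audio one.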
Thanks