Hi,
As other users have previously reported, audio timestamp accuracy in Gemini 2.0 models has been unreliable since the models moved from preview to GA.
Through testing, I have found that if the same audio clip (tested with MP3 files) is converted into a video format (tested with MP4 files with a solid background), the timestamps are accurate. This suggests the issue may be specific to how the model processes standalone audio files. The conversion works as a workaround, but it is not ideal because it comes with a 10x increase in input tokens.
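For anyone who wants to try the same conversion, here is a rough sketch of how it could be scripted with ffmpeg from Python. The function names, the 320x240 resolution, the 1 fps frame rate, and the codec choices are my own assumptions for illustration, not necessarily the exact settings used in the tests above, and ffmpeg needs to be installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(audio_path: str, video_path: str) -> list[str]:
    """Build an ffmpeg command that pairs the audio with a solid black video track."""
    return [
        "ffmpeg",
        "-y",                                  # overwrite the output if it exists
        "-f", "lavfi",
        "-i", "color=c=black:s=320x240:r=1",   # synthetic solid background (assumed size and fps)
        "-i", audio_path,                      # the original audio clip
        "-shortest",                           # stop encoding when the audio ends
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        video_path,
    ]


def audio_to_video(audio_path: str) -> str:
    """Convert an audio file to an MP4 next to it and return the new path."""
    video_path = str(Path(audio_path).with_suffix(".mp4"))
    subprocess.run(build_ffmpeg_command(audio_path, video_path), check=True)
    return video_path
```

Calling `audio_to_video("clip.mp3")` would write `clip.mp4` alongside the original, and that file can then be uploaded in place of the MP3.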
Currently, the gemini-2.0-flash-thinking-exp-01-21 model provides accurate audio timestamps, but I am concerned that this functionality might break again when moving to GA.
Is anyone at Google aware of this issue, and are there any plans to address it in future updates?