I noticed an interesting issue: 2.0 Flash always produces timestamps that are at least a few seconds off, while 2.0 Flash Lite, 2.0 Flash Thinking, and 2.0 Pro produce near-perfect timestamps.
Here is an example audio to test this on:
The issue is not unique to this prompt (JSON output has the same problem), but here is the prompt I used:
Transcribe this audio into english texts. Break the text into small logical segments. Include punctuation where appropriate. Timestamps should have milli-second level accuracy.
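For anyone who wants to put a number on "a few seconds off", here is a rough sketch of how the drift could be measured against hand-checked reference timestamps. The helper names (`to_seconds`, `drift_report`) and the sample values are made up for illustration; it does not call the API, and it assumes you already have matching lists of segment start times as strings in `MM:SS.mmm`, `HH:MM:SS.mmm`, or SRT-style `HH:MM:SS,mmm` form.

```python
import re
from statistics import mean

# Accepts "MM:SS", "MM:SS.mmm", "HH:MM:SS.mmm", or SRT-style "HH:MM:SS,mmm".
_TS = re.compile(r"^(?:(\d+):)?(\d{1,2}):(\d{1,2})(?:[.,](\d{1,3}))?$")


def to_seconds(ts: str) -> float:
    """Convert a timestamp string to seconds (hypothetical helper)."""
    m = _TS.match(ts.strip())
    if m is None:
        raise ValueError(f"unrecognized timestamp: {ts!r}")
    hours, minutes, seconds, frac = m.groups()
    # Pad the fraction so that ".5" is read as 500 ms, not 5 ms.
    millis = int(frac.ljust(3, "0")) if frac else 0
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + millis / 1000


def drift_report(predicted: list[str], reference: list[str]) -> dict:
    """Compare model timestamps with reference timestamps, segment by segment."""
    if len(predicted) != len(reference):
        raise ValueError("both lists must contain the same number of segments")
    # Positive offset: the model's timestamp is later than the reference.
    offsets = [to_seconds(p) - to_seconds(r) for p, r in zip(predicted, reference)]
    abs_offsets = [abs(o) for o in offsets]
    return {
        "offsets": offsets,
        "mean_abs": mean(abs_offsets),
        "max_abs": max(abs_offsets),
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the audio in this thread.
    model_starts = ["00:05.000", "00:12.500"]
    reference_starts = ["00:02.000", "00:10.000"]
    print(drift_report(model_starts, reference_starts))
```

Running the same comparison on the output of each model for the same audio should make it easier to see whether the error is a constant offset or grows over the length of the file.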
Hi @Joe1, thanks for reporting this issue. While reproducing it with the given audio file and prompt, we observed the same behavior. We will escalate this to the team. Thank you.
Hi @Kiran_Sai_Ramineni, any updates regarding this issue? It seems to have gotten worse with the GA of 2.0 Flash Lite and 2.0 Pro, which now produce timestamps that are way off.
The issue is not resolved, nor has it gotten any better. I got an interesting reply from the API (gemini-2.0-flash):
You are absolutely correct. My apologies; the timecodes provided are not accurate. The inaccuracies stem from limitations in my current abilities. I don’t directly “listen” to audio; instead, I process it through an intermediary model. The precision of that model’s timestamping is not yet fine-grained enough to capture the exact beginning and end of phonemes as required for perfectly accurate SRT subtitle creation. The generated timecodes represent an approximation of the speech segments, rather than a phoneme-level transcription.