[Bug] Gemini 3 Flash and 3.1 Pro: Progressive timestamp drift in audio transcription

Hi Gemini team,

I’m the CTO of an edtech startup building a language learning platform for Arabic learners. We use Gemini’s audio transcription capabilities as a core part of our pipeline: we transcribe spoken Arabic content and present it to learners with synchronized text, so timestamp accuracy is critical for us.

We recently benchmarked 4 Gemini models on the same 11:49 Arabic audio clip (a lecture in Moroccan Arabic/MSA code-switching) and discovered serious timestamp drift issues in Gemini 3 Flash and Gemini 3.1 Pro that make them unusable for any timing-dependent application. We used Gemini 2.5 Flash as our ground truth reference (160 segments, centisecond precision).

The problem

We traced 11 identifiable phrases across the full audio duration and compared their segment start times to the 2.5 Flash baseline.

Gemini 3 Flash (gemini-3-flash, thinking: minimal) - CATASTROPHIC

The model compresses the 11:49 audio into timestamps that span only 0:00 to 9:14. The drift is progressive and grows roughly linearly:

| Audio position | Expected time | G3 Flash time | Drift |
| --- | --- | --- | --- |
| Opening | 0:00 | 0:00 | 0s |
| ~2 min mark | 2:24 | ~1:57 | -27s |
| ~4 min mark | 4:40 | ~3:52 | -48s |
| ~6 min mark | 5:57 | ~4:55 | -62s |
| ~8 min mark | 7:35 | ~6:07 | -88s |
| ~10 min mark | 9:51 | ~7:46 | -125s |
| Closing | 11:44 | 9:07 | -157s |

The model’s internal clock appears to run approximately 22% slow: its timestamps cover only about 78% of the real elapsed time (9:14 of model time for 11:49 of audio). The text quality is actually excellent (best Darija handling, proper punctuation, context understanding), but the timestamps are fundamentally broken. We get 63 beautifully written segments that point to the wrong places in the audio.
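To back up the clock-rate claim, here is a quick least-squares fit of drift against expected audio position using the seven traced phrases above (plain Python, no dependencies; a sanity check, not part of the benchmark itself):

```python
# (expected_seconds, drift_seconds) taken from the G3 Flash table above
points = [
    (0, 0),       # Opening, 0:00
    (144, -27),   # ~2 min mark, 2:24
    (280, -48),   # ~4 min mark, 4:40
    (357, -62),   # ~6 min mark, 5:57
    (455, -88),   # ~8 min mark, 7:35
    (591, -125),  # ~10 min mark, 9:51
    (704, -157),  # Closing, 11:44
]

# Ordinary least-squares slope of drift vs. expected time.
# A slope near -0.22 means timestamps advance ~22% slower than real time.
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in points)
    / sum((x - mean_x) ** 2 for x, _ in points)
)
print(f"drift slope: {slope:.3f} s per second of audio")  # ~ -0.22
```

The fit is very tight, which is what makes this look like a systematic clock-rate bug rather than random segmentation noise.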

Gemini 3.1 Pro (gemini-3.1-pro, thinking: low) - SIGNIFICANT

Less severe but still problematic. Timestamps start accurate but develop a progressive backward drift that peaks at about -17 seconds around the 8-minute mark and sits at -16 seconds by the end:

| Audio position | Expected time | G3.1 Pro time | Drift |
| --- | --- | --- | --- |
| Opening | 0:00 | 0:00 | 0s |
| ~1.5 min mark | 1:33 | 1:33 | -0.5s |
| ~3.5 min mark | 3:28 | 3:28 | -0.6s |
| ~6 min mark | 5:38 | 5:38 | -0.7s |
| ~8 min mark | 7:35 | 7:18 | -17.4s |
| ~10 min mark | 9:51 | 9:48 | -3.2s |
| Closing | 11:44 | 11:28 | -16.0s |

The drift is non-linear and seems to accelerate around the middle of the audio. Timestamps are also rounded to whole seconds (no centisecond precision).

For comparison: Flash Lite is fine

All three Gemini 3.1 Flash Lite runs (high/medium/minimal thinking) stay within a few seconds of the baseline throughout, with no progressive drift:

| Model | Mean abs. drift | Max abs. drift | Progressive? |
| --- | --- | --- | --- |
| G3.1 Flash Lite (high) | 1.25s | 4.75s | No |
| G3.1 Flash Lite (medium) | 1.80s | 8.57s | No |
| G3.1 Flash Lite (minimal) | 2.09s | 8.46s | No |
| G3.1 Pro (low) | 6.26s | 17.42s | Yes |
| G3 Flash (minimal) | 78.67s | 157.0s | Yes (catastrophic) |

Impact on our product

For language learners, we highlight the current phrase as the audio plays (karaoke-style). With G3 Flash, by the 6-minute mark the highlighted text is already a full minute behind the audio. That’s a completely broken experience. With G3.1 Pro, the desynchronization becomes noticeable around the 5-minute mark and distracting by 8 minutes.

This forces us into an awkward position: the models with the best text quality (G3.1 Pro, G3 Flash) are the ones with broken timestamps. We currently have to either use Flash Lite (weaker text, good timestamps) or run a two-pass pipeline (Flash Lite for timing, Pro for text), which doubles our API cost.
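As a stopgap for G3 Flash specifically, we have experimented with rescaling its timestamps by the inverse of the observed clock rate. This only works because that model's drift looks linear, and the rate would have to be re-estimated per clip; a minimal sketch, assuming the ~22% compression we measured holds across the recording:

```python
# Stopgap: undo G3 Flash's linear timestamp compression by rescaling.
# CLOCK_RATE is the observed ratio of model time to real time for our
# clip (9:14 = 554 s of model time over 11:49 = 709 s of real audio).
# It is NOT a universal constant and must be estimated per recording.
CLOCK_RATE = 554 / 709

def rescale(model_seconds: float) -> float:
    """Map a G3 Flash timestamp back to approximate real audio time."""
    return model_seconds / CLOCK_RATE

# The closing segment: model says 9:07 (547 s), audio says 11:44 (704 s)
corrected = rescale(547)
print(f"{corrected:.0f} s")  # ~700 s, within a few seconds of 704
```

This recovers usable alignment for us on this clip, but it is obviously fragile; we would much rather have correct timestamps from the model.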

Reproduction

  • Audio: Any Arabic audio clip longer than 5 minutes should show the drift (we tested with an 11:49 clip)
  • Models: gemini-3-flash and gemini-3.1-pro
  • Method: Transcribe with segment timestamps, then compare segment boundaries against gemini-2.5-flash output or manual annotation
  • The drift is progressive, so it becomes more obvious with longer audio
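The comparison step itself is trivial once both transcripts are in hand; roughly the helper we used (pure Python, no SDK calls), shown here so the measurement method is unambiguous:

```python
def to_seconds(ts: str) -> float:
    """Parse 'm:ss', 'mm:ss.cc', or 'h:mm:ss' timestamps into seconds."""
    parts = [float(p) for p in ts.split(":")]
    sec = 0.0
    for p in parts:
        sec = sec * 60 + p
    return sec

def drift(expected: str, observed: str) -> float:
    """Observed minus expected start time, in seconds (negative = behind)."""
    return to_seconds(observed) - to_seconds(expected)

# e.g. the G3 Flash closing segment from the table above:
print(drift("11:44", "9:07"))  # -157.0
```

We match segments by locating the same identifiable phrase in both transcripts, then compare start times with this helper.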

Questions for the team

  1. Is the G3 Flash timestamp compression a known issue? It looks like a clock-rate bug rather than a random error.
  2. Is the G3.1 Pro drift related to how it handles segment boundaries with its 10-second chunking? The drift seems to correlate with the consolidated segment approach.
  3. Are there any API parameters we might be missing that could improve timestamp accuracy for these models?
  4. Is there a timeline for fixes? We’d love to use G3.1P or G3F in production given their superior text quality, but we can’t ship broken timestamps to learners.

Happy to share the full benchmark data (6 JSON files with all segments and timestamps) if that would help the team investigate.

Thanks for your attention to this. The text quality improvements in the newer models are genuinely impressive - if the timestamps matched, these would be a clear upgrade for our use case.


Hi, any update? Thanks.