[Bug] Gemini 3 Flash and 3.1 Pro: Progressive timestamp drift in audio transcription

Hi Gemini team,

I’m the CTO of an edtech startup building a language learning platform for Arabic learners. We use Gemini’s audio transcription capabilities as a core part of our pipeline: we transcribe spoken Arabic content and present it to learners with synchronized text, so timestamp accuracy is critical for us.

We recently benchmarked 4 Gemini models on the same 11:49 Arabic audio clip (a lecture in Moroccan Arabic/MSA code-switching) and discovered serious timestamp drift issues in Gemini 3 Flash and Gemini 3.1 Pro that make them unusable for any timing-dependent application. We used Gemini 2.5 Flash as our ground truth reference (160 segments, centisecond precision).

The problem

We traced 11 identifiable phrases across the full audio duration and compared their segment start times to the 2.5 Flash baseline.

Gemini 3 Flash (gemini-3-flash, thinking: minimal) - CATASTROPHIC

The model compresses the 11:49 audio into timestamps that span only 0:00 to 9:14. The drift is progressive and grows roughly linearly:

| Audio position | Expected time | G3 Flash time | Drift |
| --- | --- | --- | --- |
| Opening | 0:00 | 0:00 | 0s |
| ~2 min mark | 2:24 | ~1:57 | -27s |
| ~4 min mark | 4:40 | ~3:52 | -48s |
| ~6 min mark | 5:57 | ~4:55 | -62s |
| ~8 min mark | 7:35 | ~6:07 | -88s |
| ~10 min mark | 9:51 | ~7:46 | -125s |
| Closing | 11:44 | 9:07 | -157s |

The model’s internal clock appears to run approximately 22% slow: its timestamps cover only about 78% of the real elapsed time (9:14 of model time for 11:49 of audio). The text quality is actually excellent (best Darija handling, proper punctuation, context understanding), but the timestamps are fundamentally broken. We get 63 beautifully written segments that point to the wrong places in the audio.
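To back up the clock-rate claim, here is a quick least-squares fit of drift against expected audio position using the seven traced phrases above (plain Python, no dependencies; a sanity check, not part of the benchmark itself):

```python
# (expected_seconds, drift_seconds) taken from the G3 Flash table above
points = [
    (0, 0),       # Opening, 0:00
    (144, -27),   # ~2 min mark, 2:24
    (280, -48),   # ~4 min mark, 4:40
    (357, -62),   # ~6 min mark, 5:57
    (455, -88),   # ~8 min mark, 7:35
    (591, -125),  # ~10 min mark, 9:51
    (704, -157),  # Closing, 11:44
]

# Ordinary least-squares slope of drift vs. expected time.
# A slope near -0.22 means timestamps advance ~22% slower than real time.
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in points)
    / sum((x - mean_x) ** 2 for x, _ in points)
)
print(f"drift slope: {slope:.3f} s per second of audio")  # ~ -0.22
```

The fit is very tight, which is what makes this look like a systematic clock-rate bug rather than random segmentation noise.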

Gemini 3.1 Pro (gemini-3.1-pro, thinking: low) - SIGNIFICANT

Less severe but still problematic. Timestamps start accurate but develop a progressive backward drift that peaks at about -17 seconds around the 8-minute mark and sits at -16 seconds by the end:

| Audio position | Expected time | G3.1 Pro time | Drift |
| --- | --- | --- | --- |
| Opening | 0:00 | 0:00 | 0s |
| ~1.5 min mark | 1:33 | 1:33 | -0.5s |
| ~3.5 min mark | 3:28 | 3:28 | -0.6s |
| ~6 min mark | 5:38 | 5:38 | -0.7s |
| ~8 min mark | 7:35 | 7:18 | -17.4s |
| ~10 min mark | 9:51 | 9:48 | -3.2s |
| Closing | 11:44 | 11:28 | -16.0s |

The drift is non-linear and seems to accelerate around the middle of the audio. Timestamps are also rounded to whole seconds (no centisecond precision).

For comparison: Flash Lite is fine

All three Gemini 3.1 Flash Lite runs (high/medium/minimal thinking) stay within a few seconds of the baseline throughout, with no progressive drift:

| Model | Mean abs. drift | Max abs. drift | Progressive? |
| --- | --- | --- | --- |
| G3.1 Flash Lite (high) | 1.25s | 4.75s | No |
| G3.1 Flash Lite (medium) | 1.80s | 8.57s | No |
| G3.1 Flash Lite (minimal) | 2.09s | 8.46s | No |
| G3.1 Pro (low) | 6.26s | 17.42s | Yes |
| G3 Flash (minimal) | 78.67s | 157.0s | Yes (catastrophic) |

Impact on our product

For language learners, we highlight the current phrase as the audio plays (karaoke-style). With G3 Flash, by the 6-minute mark the highlighted text is already a full minute behind the audio. That’s a completely broken experience. With G3.1 Pro, the desynchronization becomes noticeable around the 5-minute mark and distracting by 8 minutes.

This forces us into an awkward position: the models with the best text quality (G3.1 Pro, G3 Flash) are the ones with broken timestamps. We currently have to either use Flash Lite (weaker text, good timestamps) or run a two-pass pipeline (Flash Lite for timing, Pro for text), which doubles our API cost.
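As a stopgap for G3 Flash specifically, we have experimented with rescaling its timestamps by the inverse of the observed clock rate. This only works because that model's drift looks linear, and the rate would have to be re-estimated per clip; a minimal sketch, assuming the ~22% compression we measured holds across the recording:

```python
# Stopgap: undo G3 Flash's linear timestamp compression by rescaling.
# CLOCK_RATE is the observed ratio of model time to real time for our
# clip (9:14 = 554 s of model time over 11:49 = 709 s of real audio).
# It is NOT a universal constant and must be estimated per recording.
CLOCK_RATE = 554 / 709

def rescale(model_seconds: float) -> float:
    """Map a G3 Flash timestamp back to approximate real audio time."""
    return model_seconds / CLOCK_RATE

# The closing segment: model says 9:07 (547 s), audio says 11:44 (704 s)
corrected = rescale(547)
print(f"{corrected:.0f} s")  # ~700 s, within a few seconds of 704
```

This recovers usable alignment for us on this clip, but it is obviously fragile; we would much rather have correct timestamps from the model.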

Reproduction

  • Audio: Any Arabic audio clip longer than 5 minutes should show the drift (we tested with an 11:49 clip)
  • Models: gemini-3-flash and gemini-3.1-pro
  • Method: Transcribe with segment timestamps, then compare segment boundaries against gemini-2.5-flash output or manual annotation
  • The drift is progressive, so it becomes more obvious with longer audio
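The comparison step itself is trivial once both transcripts are in hand; roughly the helper we used (pure Python, no SDK calls), shown here so the measurement method is unambiguous:

```python
def to_seconds(ts: str) -> float:
    """Parse 'm:ss', 'mm:ss.cc', or 'h:mm:ss' timestamps into seconds."""
    parts = [float(p) for p in ts.split(":")]
    sec = 0.0
    for p in parts:
        sec = sec * 60 + p
    return sec

def drift(expected: str, observed: str) -> float:
    """Observed minus expected start time, in seconds (negative = behind)."""
    return to_seconds(observed) - to_seconds(expected)

# e.g. the G3 Flash closing segment from the table above:
print(drift("11:44", "9:07"))  # -157.0
```

We match segments by locating the same identifiable phrase in both transcripts, then compare start times with this helper.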

Questions for the team

  1. Is the G3 Flash timestamp compression a known issue? It looks like a clock-rate bug rather than a random error.
  2. Is the G3.1 Pro drift related to how it handles segment boundaries with its 10-second chunking? The drift seems to correlate with the consolidated segment approach.
  3. Are there any API parameters we might be missing that could improve timestamp accuracy for these models?
  4. Is there a timeline for fixes? We’d love to use G3.1P or G3F in production given their superior text quality, but we can’t ship broken timestamps to learners.

Happy to share the full benchmark data (6 JSON files with all segments and timestamps) if that would help the team investigate.

Thanks for your attention to this. The text quality improvements in the newer models are genuinely impressive - if the timestamps matched, these would be a clear upgrade for our use case.


Hi, any update? Thanks.