How to get consistent Multi-Speaker Transcription output from Gemini 2.5 Pro?

I’m struggling with inconsistent output from Gemini 2.5 Pro when transcribing long multi-speaker audio files.

I am trying to transcribe a long audio file (4 h) using Gemini 2.5 Pro. The language is fairly uncommon, but Gemini handles it while Whisper does not. My approach is the following:
(1) Remove non-speech segments using SpeechBrain’s VAD model
(2) Identify speaker segments using pyannote/speaker-diarization-3.1
(3) Merge consecutive speaker segments into groups of at most 30 minutes
(4) Pass each 30-minute audio file to Gemini 2.5 Pro
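
A minimal sketch of one way to read step (3), assuming the diarization output has already been converted into (speaker, start, end) tuples in seconds. The `Segment` class and both function names are hypothetical, not part of pyannote or SpeechBrain.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def merge_same_speaker(segments):
    """Merge back-to-back segments that belong to the same speaker."""
    merged = []
    for seg in segments:
        if merged and merged[-1].speaker == seg.speaker:
            # Extend the previous turn instead of starting a new one.
            merged[-1] = Segment(seg.speaker, merged[-1].start, seg.end)
        else:
            merged.append(seg)
    return merged


def group_segments(segments, max_seconds=30 * 60):
    """Pack speaker turns, in order, into groups of at most max_seconds of speech."""
    groups, current, total = [], [], 0.0
    for seg in segments:
        duration = seg.end - seg.start
        if current and total + duration > max_seconds:
            groups.append(current)
            current, total = [], 0.0
        current.append(seg)
        total += duration
    if current:
        groups.append(current)
    return groups
```

A single turn longer than `max_seconds` still ends up in a group of its own, so it would need to be split separately.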

As each 30-minute audio file has multiple speakers, and we have pre-computed timings for each speaker, I need Gemini's output to match these timings.
To achieve this, I have tried two things:
(1) Pass the whole 30-minute audio along with the speaker timestamps (a list of start_time=MM:SS and end_time=MM:SS) to the Gemini model and ask it to transcribe each subsegment.
(2) Split the 30-minute audio into individual files, one per speaker segment, and ask the model to transcribe each file individually.
In both scenarios I ask the model to respond with a JSON array of transcribed_text whose length must equal the number of speaker segments.
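
For approach (1), the prompt could be assembled roughly like this. This is only a sketch: the wording of the instruction and the helper names `to_mmss` and `build_prompt` are illustrative assumptions, and segment times are assumed to be in seconds relative to the start of the 30-minute file.

```python
def to_mmss(seconds):
    """Format a time in seconds as MM:SS (minutes may exceed 59)."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(segments):
    """Build a transcription prompt from a list of (start_seconds, end_seconds)."""
    count = len(segments)
    lines = [
        f"{i}. start_time={to_mmss(start)} end_time={to_mmss(end)}"
        for i, (start, end) in enumerate(segments, 1)
    ]
    return (
        f"Transcribe each of the {count} segments of the audio listed below.\n"
        f"Respond with a JSON array of exactly {count} strings "
        "(the transcribed_text of each segment, in order).\n"
        + "\n".join(lines)
    )


print(build_prompt([(0, 65), (65, 130)]))
```

Stating the expected count twice and numbering the segments is meant to make a missing entry easier to spot, not to guarantee that the model complies.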

I am facing the following problems:

  • The model often returns fewer transcriptions than there are speaker segments in the audio
  • The model returns incomplete JSON (it starts hallucinating after a while)
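
Both failure modes can at least be detected before the output is used. The sketch below validates a response and retries on failure; `call_model` is a hypothetical zero-argument callable that returns the raw response text, standing in for whatever Gemini call is actually made.

```python
import json


def parse_transcripts(raw_text, expected_count):
    """Return the list of transcripts, or raise ValueError if the response is unusable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("response is not a JSON array")
    if len(data) != expected_count:
        raise ValueError(f"expected {expected_count} items, got {len(data)}")
    if not all(isinstance(item, str) for item in data):
        raise ValueError("every item must be a string")
    return data


def transcribe_with_retry(call_model, expected_count, max_attempts=3):
    """Call the model until it returns a valid array of the expected length."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_transcripts(call_model(), expected_count)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid response after {max_attempts} attempts") from last_error
```

This only catches truncated JSON and wrong counts. It does not detect hallucinated text inside an otherwise well-formed array.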

What is the best approach to take here?

@DEDI
Welcome to the community,

Have you tried reducing the temperature in the Gemini call?
Also, in your case, could you specify the number of speakers in the audio in the prompt and ask Gemini to return an individual transcription for each?
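
A rough sketch of both suggestions, assuming the `google-genai` Python SDK and an API key in the environment. The prompt wording and the function names `build_instruction` and `transcribe` are illustrative, and the SDK calls should be checked against the current documentation before use.

```python
def build_instruction(num_speakers):
    """Tell the model up front how many transcriptions are expected."""
    return (
        f"The audio contains {num_speakers} speakers. "
        f"Return a JSON array of exactly {num_speakers} strings, "
        "one transcription per speaker, in order of first appearance."
    )


def transcribe(audio_path, num_speakers, model="gemini-2.5-pro"):
    # Imported lazily so build_instruction works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumes the API key is set in the environment
    audio = client.files.upload(file=audio_path)
    response = client.models.generate_content(
        model=model,
        contents=[build_instruction(num_speakers), audio],
        config=types.GenerateContentConfig(
            temperature=0.0,  # lower temperature for more repeatable output
            response_mime_type="application/json",
        ),
    )
    return response.text
```

A temperature of 0 tends to make the output more repeatable, but it does not by itself prevent truncated or shortened arrays.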

An alternative approach is to split the audio into smaller chunks, i.e. 6-7 sub-samples of about 5 minutes each (with some overlap), ask Gemini to transcribe each one into text, and then do a second pass over the combined text to correct the transcriptions and apply any final formatting (such as JSON).

I think this might give you better results.
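
The chunk boundaries for such a split could be computed as follows. The 10-second overlap is an assumed value chosen for illustration, and `chunk_bounds` is a hypothetical helper that only computes times; cutting the audio itself is left to whatever audio tool is already in the pipeline.

```python
def chunk_bounds(total_seconds, chunk_seconds=300.0, overlap_seconds=10.0):
    """Return (start, end) pairs in seconds covering the audio with overlapping chunks."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    step = chunk_seconds - overlap_seconds
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 30-minute file with 5-minute chunks and 10 s overlap yields 7 chunks.
for start, end in chunk_bounds(30 * 60):
    print(f"{start:7.1f} -> {end:7.1f}")
```

Because neighbouring chunks share the overlap region, the second pass also has to remove the duplicated words at each boundary.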

I recently noticed that if the audio is stereo, the length Gemini calculates is somewhat longer than the real length (by roughly 10%).
So your 30 minutes would be interpreted as 33 minutes of audio, and part of it would be cut off if you specify 00:00 to 30:00.

Also, in my experience, reducing the length of the audio I send greatly reduces hallucination. I get much better results with parts of 5-10 minutes. To increase consistency, simply have each part checked at the end by a text-to-text model that receives the previous and the next segment along with it.
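
One way to prepare that final check is to pair every segment with its neighbours before sending it to the text model. This is a sketch under assumptions: the prompt wording and both helper names are hypothetical, and the call to the text-to-text model itself is left out.

```python
def context_windows(transcripts):
    """Pair each transcript with the previous and next one (empty string at the edges)."""
    windows = []
    for i, current in enumerate(transcripts):
        previous = transcripts[i - 1] if i > 0 else ""
        following = transcripts[i + 1] if i + 1 < len(transcripts) else ""
        windows.append({"previous": previous, "current": current, "next": following})
    return windows


def build_review_prompt(window):
    """Ask the text model to correct only the current segment, using neighbours as context."""
    return (
        "Correct transcription errors in CURRENT only. "
        "PREVIOUS and NEXT are context and must not be rewritten.\n"
        f"PREVIOUS: {window['previous']}\n"
        f"CURRENT: {window['current']}\n"
        f"NEXT: {window['next']}"
    )


for window in context_windows(["first part", "second part", "third part"]):
    print(build_review_prompt(window))
    print("---")
```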