Speaker Diarized and Timestamped Transcription with Gemini

Hi folks!

I'm doing a lab for school and was wondering whether anyone has had luck getting Gemini 2.5 to produce diarized, timestamped transcriptions of long-form audio and video (in the 1 to 3 hour range).

I'm fairly unfamiliar with Gemini 2.5, but when I tried the March (03-25) model a couple of months ago it seemed very promising. Has anyone had particular luck with certain prompts or system instructions? Also, does enabling “Thinking” for the flash models help at all?

I'm also wondering how folks are handling long-form media within Gemini’s context window. Chunking is the first thing that comes to mind, but I’m not sure the model can keep speaker labels consistent across the whole recording if I split it into chunks.
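For anyone picturing the chunking idea, here is a minimal sketch of how overlapping chunk windows could be planned. It is only an illustration: the function name `plan_chunks`, the 10-minute chunk length, and the 30-second overlap are assumptions of mine, not values recommended anywhere for Gemini. The overlap is there so that the same stretch of speech appears in two neighbouring chunks, which might make it easier to match speaker labels across chunk boundaries.

```python
def plan_chunks(total_seconds, chunk_seconds=600, overlap_seconds=30):
    """Return (start, end) windows in seconds that cover the whole recording.

    Each window after the first starts `overlap_seconds` before the previous
    window ended. The default sizes are illustrative assumptions only.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be >= 0 and < chunk_seconds")

    windows = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end == total_seconds:
            break
        # Step back by the overlap so neighbouring chunks share some audio.
        start = end - overlap_seconds
    return windows


# A 25-minute recording split into 10-minute chunks with 30 s of overlap.
print(plan_chunks(1500))
```

Each window's start value is also the offset that would need to be added to any chunk-relative timestamps to get back to positions in the full recording.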

Any help would be greatly appreciated!

Hello,

Welcome to the forum.

Could you please share a bit more detail about your goals, what exactly you are expecting to achieve, and which model (flash/pro) you are currently using?

Hello Lalit,

I’m trying to generate a diarized and timestamped transcript for audio clips that are approximately 60 to 180 minutes long. Utterance-level granularity is enough: I need a speaker label and a timestamp for each utterance, not word-level timing.
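To make "utterance level" concrete, here is a rough sketch of the kind of record and rendered line I have in mind. The `Utterance` fields, the helper names, and the `[HH:MM:SS - HH:MM:SS] Speaker: text` layout are all hypothetical choices for illustration, not a format that Gemini is known to return.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # e.g. "Speaker 1"
    text: str


def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    whole = int(seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def shift(utterance, offset_seconds):
    """Move chunk-relative times onto the full recording's timeline."""
    return replace(
        utterance,
        start=utterance.start + offset_seconds,
        end=utterance.end + offset_seconds,
    )


def render(utterance):
    """Render one utterance as a single transcript line."""
    span = f"{format_timestamp(utterance.start)} - {format_timestamp(utterance.end)}"
    return f"[{span}] {utterance.speaker}: {utterance.text}"


# An utterance found 125 s into a chunk that itself starts at 3600 s.
local = Utterance(start=125.0, end=130.5, speaker="Speaker 1", text="Hello")
print(render(shift(local, 3600)))
```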

Ideally I'd like to use the flash model, but so far I’ve only gotten sub-par output from it.

I would recommend going through the audio understanding doc and the video understanding doc in the Gemini API documentation.