I’m experimenting with the Flash 2.0 model for long video understanding, and I’m finding that the timecodes it provides for longer videos (more than a few minutes) are completely made up. It seems to provide correct timecodes for the first few minutes, and then it starts hallucinating the rest.
An example prompt: “Given this video, list all scenes and their timecode (format MM:SS).”
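To put a rough number on the problem, the response can be sanity-checked against the known video length. The following is only a minimal sketch: the function name `suspicious_timecodes` and the sample output are hypothetical, and it assumes the model replies with plain-text MM:SS timecodes in scene order. It flags timecodes that fall past the end of the video or that go backwards; it cannot detect a timecode that is plausible but wrong.

```python
import re

# Matches MM:SS, allowing more than two minute digits for long videos.
TIMECODE = re.compile(r"\b(\d{1,3}):([0-5]\d)\b")


def suspicious_timecodes(model_output: str, video_seconds: int) -> list[str]:
    """Return timecodes that are past the end of the video or out of order."""
    flagged = []
    previous = -1  # last accepted timecode, in seconds
    for match in TIMECODE.finditer(model_output):
        seconds = int(match.group(1)) * 60 + int(match.group(2))
        if seconds > video_seconds or seconds < previous:
            flagged.append(match.group(0))
        else:
            previous = seconds
    return flagged


# Hypothetical model output for a 10-minute (600 s) video.
sample = "00:05 intro\n02:10 kitchen\n01:30 street\n14:00 credits"
print(suspicious_timecodes(sample, 600))
```

A check like this only gives a lower bound on the hallucination rate, but it is cheap enough to run on every response when comparing prompts or output modes.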
I’ve tried raw chat, function calling, and structured outputs, but none of them makes a significant difference.
I’ve also experimented with supplying the timecodes of some scenes I already know, hoping it would steer the model towards the right timecodes.
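As an illustration of the anchoring idea, here is a minimal sketch of how known scenes could be appended to the prompt. The helper `build_prompt`, the wording of the extra lines, and the scene descriptions are all made up for the example; it only builds the prompt text and does not call any API.

```python
def build_prompt(known_scenes: dict[str, str]) -> str:
    """Build the scene-listing prompt, optionally followed by known anchor scenes.

    known_scenes maps an MM:SS timecode to a short scene description.
    """
    lines = ["Given this video, list all scenes and their timecode (format MM:SS)."]
    if known_scenes:
        lines.append("Scenes with known timecodes, for reference:")
        for timecode, description in known_scenes.items():
            lines.append(f"- {timecode} {description}")
    return "\n".join(lines)


# Hypothetical anchors for illustration only.
print(build_prompt({"00:00": "opening titles", "07:42": "interview starts"}))
```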
Has anyone found tricks for reducing timecode hallucinations?