I was recently trying out Gemini 1.5 to see how its multi-modal support worked, and it was very impressive.
However, it was able to tell me things about a video clip I took that I did not think were available from the video frames alone, such as where I shot the video, even though there were no identifying signs. When I asked the model how it determined the video was taken in Norway, it responded: “There are a few clues in the video that suggest it was taken in Norway. First, the scenery is very mountainous, and Norway is known for its mountains. Second, the lake in the video is frozen, and Norway has a cold climate that would allow for lakes to freeze. Finally, the small town at the end of the video has a very Scandinavian look to it, with wooden houses and a simple design”. This is very impressive, but is it only using the video image data, or is it also looking at video metadata that might tag the location?
I plan to inspect the file and strip out the metadata to test this, but thought I would first ask if anyone knows.
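For the inspection step, here is a minimal sketch of one way to check for embedded GPS tags. It assumes the clip is a QuickTime/MP4 file from a phone, which typically stores location as an ISO 6709 string (e.g. `+59.9139+010.7522/`) under the `com.apple.quicktime.location.ISO6709` key; the scan is a byte-level heuristic rather than a proper atom parser, so it is not guaranteed to catch every container format:

```python
import re

# ISO 6709 coordinate strings, e.g. b"+59.9139+010.7522/" or with an
# optional altitude component, as commonly embedded by phone cameras.
ISO6709_RE = re.compile(rb"[+-]\d{1,2}\.\d+[+-]\d{1,3}\.\d+(?:[+-]\d+\.?\d*)?/")

def find_location_tags(path):
    """Scan a video file's raw bytes for ISO 6709 GPS strings.

    Returns a list of matched coordinate strings (empty if none found).
    This is a heuristic sketch: it reads the whole file into memory and
    does not walk the MP4 box structure.
    """
    with open(path, "rb") as f:
        data = f.read()
    return [m.decode("ascii") for m in ISO6709_RE.findall(data)]
```

If a tag turns up, stripping all metadata without re-encoding can be done with ffmpeg: `ffmpeg -i in.mp4 -map_metadata -1 -c copy out.mp4`. Re-running the model on the stripped file would show whether the Norway answer really came from the frames alone.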