When processing video, is the metadata used?

I was recently trying out Gemini 1.5. I wanted to see how the multi-modal support worked. It was very impressive.

However, it was able to tell things about a video clip I took that I did not think were available from the video frames alone, such as where I shot the video, since there were no identifying signs. When I queried the model about how it determined the video was taken in Norway, the response was: “There are a few clues in the video that suggest it was taken in Norway. First, the scenery is very mountainous, and Norway is known for its mountains. Second, the lake in the video is frozen, and Norway has a cold climate that would allow for lakes to freeze. Finally, the small town at the end of the video has a very Scandinavian look to it, with wooden houses and a simple design”. This is very impressive, but is it only using the video image data, or is it also looking at some video metadata that might tag the location?

I plan on inspecting and stripping the metadata before a future test, but thought I would see if anyone knew. A rough sketch of what I have in mind is below.
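In case it helps anyone trying the same experiment, here is a minimal sketch of how the inspection and stripping could be done with ffprobe/ffmpeg, assuming both tools are installed and on PATH. The file names are just placeholders, and this is only one way to do it, not an official recommendation:

```python
# Minimal sketch: inspect container metadata, then re-mux the clip with
# global metadata removed before uploading it to the model.
# Assumes ffprobe and ffmpeg are installed; "clip.mp4" is a placeholder name.
import json
import subprocess

def show_metadata(path: str) -> dict:
    """Dump container- and stream-level metadata (location tags, creation time, etc.) as JSON."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def strip_metadata(src: str, dst: str) -> None:
    """Copy the streams unchanged but drop all global metadata from the container."""
    subprocess.run(
        ["ffmpeg", "-i", src, "-map_metadata", "-1", "-c", "copy", dst],
        check=True,
    )

if __name__ == "__main__":
    info = show_metadata("clip.mp4")
    # Print only the container-level tags, which is where location tags usually live.
    print(json.dumps(info.get("format", {}).get("tags", {}), indent=2))
    strip_metadata("clip.mp4", "clip_clean.mp4")
```

Re-uploading the cleaned copy and asking the same location question would show whether the answer really comes from the frames alone.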


Thanks for the question and for sharing your experience with the model!
I’m not fully sure about the implementation details, but I believe it’s just image data at the moment. Of course, things may change in the future.

Thanks for the question! Confirming that the model only uses video (vision) data, and there is no additional metadata that the model uses to identify location.