Videos are sequences of images. Modelling video clips has traditionally required an image representation model (a CNN) and a sequence model (an RNN, LSTM, etc.) working together. The approach is intuitive, but what about a single model that handles both the spatial and temporal dimensions?
In our latest keras.io example (with @ayush_thakur), we minimally implement ViViT: A Video Vision Transformer by Arnab et al., a pure-Transformer model for video classification. The authors propose a novel embedding scheme, tubelet embedding, and several Transformer variants to model video clips.
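To give a flavour of the embedding scheme, here is a minimal sketch of tubelet embedding in Keras. The class name and the `embed_dim` / `patch_size` values are illustrative, not the exact code from the example; the core idea is that a single Conv3D whose stride equals its kernel size both extracts non-overlapping spatio-temporal patches (tubelets) and linearly projects them.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


class TubeletEmbedding(layers.Layer):
    """Maps a video to a sequence of tubelet embeddings."""

    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        # A Conv3D with stride == kernel size extracts non-overlapping
        # spatio-temporal patches (tubelets) and linearly projects each
        # one to `embed_dim` in a single operation.
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,
            padding="valid",
        )
        # Collapse the (frames, height, width) grid of tubelets into a
        # single token axis: (batch, num_tokens, embed_dim).
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        projected = self.projection(videos)  # (batch, t', h', w', embed_dim)
        return self.flatten(projected)       # (batch, t'*h'*w', embed_dim)


# Illustrative shapes: a batch of 4 videos of 28 frames of 28x28 grayscale
# images, split into 8x8x8 tubelets -> 3*3*3 = 27 tokens per video.
videos = np.random.rand(4, 28, 28, 28, 1).astype("float32")
tokens = TubeletEmbedding(embed_dim=128, patch_size=(8, 8, 8))(videos)
print(tokens.shape)  # (4, 27, 128)
```

The resulting token sequence is then fed to a standard Transformer encoder, just like patch tokens in ViT.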
arXiv: ViViT: A Video Vision Transformer, https://arxiv.org/abs/2103.15691
Tutorial: Video Vision Transformer, https://keras.io/examples/vision/vivit/