Videos are sequences of images. Modelling video clips has traditionally required an image representation model (a CNN) and a sequence model (an RNN, LSTM, etc.) working together. The approach is intuitive, but what about a single model that handles both the spatial and temporal dimensions?
In our latest keras.io example (with @ayush_thakur), we minimally implement ViViT: A Video Vision Transformer by Arnab et al., a pure-Transformer model for video classification. The authors propose a novel embedding scheme, tubelet embedding, and several Transformer variants to model video clips.
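To give a flavour of the embedding scheme, here is a minimal sketch of tubelet embedding in Keras. The class name and the `embed_dim` / `patch_size` values are illustrative, not the exact code from the example; the core idea is that a single Conv3D whose stride equals its kernel size both extracts non-overlapping spatio-temporal patches (tubelets) and linearly projects them.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


class TubeletEmbedding(layers.Layer):
    """Maps a video to a sequence of tubelet embeddings."""

    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        # A Conv3D with stride == kernel size extracts non-overlapping
        # spatio-temporal patches (tubelets) and linearly projects each
        # one to `embed_dim` in a single operation.
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,
            padding="valid",
        )
        # Collapse the (frames, height, width) grid of tubelets into a
        # single token axis: (batch, num_tokens, embed_dim).
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        projected = self.projection(videos)  # (batch, t', h', w', embed_dim)
        return self.flatten(projected)       # (batch, t'*h'*w', embed_dim)


# Illustrative shapes: a batch of 4 videos of 28 frames of 28x28 grayscale
# images, split into 8x8x8 tubelets -> 3*3*3 = 27 tokens per video.
videos = np.random.rand(4, 28, 28, 28, 1).astype("float32")
tokens = TubeletEmbedding(embed_dim=128, patch_size=(8, 8, 8))(videos)
print(tokens.shape)  # (4, 27, 128)
```

The resulting token sequence is then fed to a standard Transformer encoder, just like patch tokens in ViT.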
arXiv: ViViT: A Video Vision Transformer, https://arxiv.org/abs/2103.15691
Tutorial: Video Vision Transformer, https://keras.io/examples/vision/vivit/