We’ve reimplemented #VideoFocalNet, a strong video action recognition network presented at #ICCV 2023. Notably, #VideoFocalNet is a #CNN-based network, and one of its primary contributions is the Spatio-Temporal Focal Modulation (STFM) layer, which lets it compete with state-of-the-art #transformer-based video networks, e.g. #uniformer and #video-swin. The visualizations below are from the first and last STFM layers in the model.