Combining the benefits of convolutions (for spatial relationships) and transformers (for global relationships) is an emerging research trend in computer vision. In my latest example, I present the MobileViT architecture (Mehta et al.), which offers a simple yet unique way to reap the benefits of both.
With about a million parameters, it achieves a top-1 accuracy of ~86% on the tf_flowers dataset at 256x256 resolution. Furthermore, the training recipes are simple, and the model runs efficiently on mobile devices (which is atypical for transformer-based models).