I implemented the recent NeurIPS 2021 paper Transformer in Transformer (TNT) in TensorFlow. TNT applies attention inside local patches, pairing pixel-level attention within each patch with the usual patch-level attention across the image. The paper reports state-of-the-art performance on image classification, beating ViT and DeiT at similar computational cost.
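To make the pixel/patch pairing concrete, here is a minimal sketch of one TNT block in TensorFlow. This is not taken from the linked implementation: the `TNTBlock` class, its default dimensions, and the two-input `call` signature are assumptions for illustration, and the class token and positional encodings from the paper are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers


class TNTBlock(layers.Layer):
    """One Transformer-in-Transformer block (illustrative sketch): an inner
    transformer attends over pixel embeddings within each patch, an outer
    transformer attends over patch embeddings, and the inner output is
    projected and added to the patch embeddings before the outer attention."""

    def __init__(self, pixel_dim=24, patch_dim=384, num_pixels=16,
                 inner_heads=4, outer_heads=6, mlp_ratio=4, **kwargs):
        super().__init__(**kwargs)
        self.num_pixels, self.pixel_dim = num_pixels, pixel_dim
        # Inner (pixel-level) transformer sub-layers
        self.inner_norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.inner_attn = layers.MultiHeadAttention(
            num_heads=inner_heads, key_dim=pixel_dim // inner_heads)
        self.inner_norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.inner_mlp = tf.keras.Sequential([
            layers.Dense(pixel_dim * mlp_ratio, activation="gelu"),
            layers.Dense(pixel_dim)])
        # Projection that folds pixel embeddings back into the patch embedding
        self.proj_norm = layers.LayerNormalization(epsilon=1e-6)
        self.proj = layers.Dense(patch_dim)
        # Outer (patch-level) transformer sub-layers
        self.outer_norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.outer_attn = layers.MultiHeadAttention(
            num_heads=outer_heads, key_dim=patch_dim // outer_heads)
        self.outer_norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.outer_mlp = tf.keras.Sequential([
            layers.Dense(patch_dim * mlp_ratio, activation="gelu"),
            layers.Dense(patch_dim)])

    def call(self, pixel_embed, patch_embed):
        # pixel_embed: (batch * num_patches, num_pixels, pixel_dim)
        # patch_embed: (batch, num_patches, patch_dim)
        # Inner transformer: pixels attend to each other within one patch
        x = self.inner_norm1(pixel_embed)
        pixel_embed = pixel_embed + self.inner_attn(x, x)
        pixel_embed = pixel_embed + self.inner_mlp(self.inner_norm2(pixel_embed))
        # Fuse the updated pixel embeddings into the patch embeddings
        batch = tf.shape(patch_embed)[0]
        num_patches = tf.shape(patch_embed)[1]
        flat = tf.reshape(pixel_embed,
                          (batch, num_patches, self.num_pixels * self.pixel_dim))
        patch_embed = patch_embed + self.proj(self.proj_norm(flat))
        # Outer transformer: patches attend to each other across the image
        y = self.outer_norm1(patch_embed)
        patch_embed = patch_embed + self.outer_attn(y, y)
        patch_embed = patch_embed + self.outer_mlp(self.outer_norm2(patch_embed))
        return pixel_embed, patch_embed


# Hypothetical sizes: 196 patches of 16 sub-patch "pixels" each, batch of 2
block = TNTBlock()
pixels = tf.random.normal((2 * 196, 16, 24))
patches = tf.random.normal((2, 196, 384))
pixels, patches = block(pixels, patches)
print(pixels.shape, patches.shape)  # (392, 16, 24) (2, 196, 384)
```

The key design point is that the two streams stay separate: pixel embeddings are only folded into the patch stream through the linear projection, so patch-level attention cost stays comparable to a plain ViT block.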
This is a nice implementation as well; thanks for sharing!