What happens when we apply similar, purely convolutional blocks to patches of images? We can train a network with about 0.8 million parameters for 10 epochs on CIFAR-10 and reach ~83% top-1 test accuracy without any fancy regularization. That is the idea behind ConvMixer, the architecture recently discussed on Twitter.
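To get a feel for where the parameters of such a network sit, here is a rough sketch that counts them layer by layer. It assumes the usual ConvMixer layout: a strided-convolution patch embedding, then repeated blocks of a depthwise convolution (spatial mixing) and a pointwise convolution (channel mixing), each followed by batch normalization, and finally a linear classifier. The function names and the hyperparameter values are placeholders chosen for illustration, not the configuration behind the numbers quoted above, so the total need not match the 0.8 million figure.

```python
def conv2d_params(in_ch, out_ch, kernel_size, groups=1):
    """Weights plus biases of a square-kernel 2D convolution."""
    weights = kernel_size * kernel_size * (in_ch // groups) * out_ch
    return weights + out_ch


def batchnorm_params(channels):
    """Trainable scale and shift only (running statistics are not counted)."""
    return 2 * channels


def convmixer_params(dim, depth, kernel_size, patch_size, in_ch=3, n_classes=10):
    # Patch embedding: a convolution whose stride equals its kernel size,
    # so each non-overlapping patch is projected to `dim` channels.
    stem = conv2d_params(in_ch, dim, patch_size) + batchnorm_params(dim)

    # One block = depthwise conv (one filter per channel, mixes space)
    # + pointwise 1x1 conv (mixes channels), each followed by batch norm.
    depthwise = conv2d_params(dim, dim, kernel_size, groups=dim)
    pointwise = conv2d_params(dim, dim, 1)
    block = depthwise + pointwise + 2 * batchnorm_params(dim)

    # Classifier: global average pooling (no parameters) + linear layer.
    head = dim * n_classes + n_classes
    return stem + depth * block + head


# Hypothetical configuration for illustration; not taken from the text above.
total = convmixer_params(dim=256, depth=8, kernel_size=5, patch_size=2)
print(f"{total:,} trainable parameters")
```

Almost all of the budget goes to the pointwise convolutions, which grow with the square of the channel width, while the depthwise kernels stay cheap even at larger kernel sizes.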
There are a few visualizations of the internals of ConvMixer that might be useful for the community.
Learned patch embeddings:
Convolution kernel from the middle of the network showing varying locality spans:
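Kernel grids like this can be prepared in a few lines. The sketch below is one possible approach, not necessarily how the figures here were produced: it assumes the depthwise kernels of a layer have already been extracted into an array of shape `(n, k, k)` (one 2D filter per channel), and the helper name `kernels_to_grid` is made up for this example. Random values stand in for trained weights.

```python
import numpy as np


def kernels_to_grid(kernels, cols, pad=1):
    """Tile a stack of 2D kernels, shape (n, k, k), into one 2D image.

    Each kernel is min-max scaled to [0, 1] on its own so that its spatial
    pattern stays visible regardless of its absolute magnitude.
    """
    kernels = np.asarray(kernels, dtype=np.float64)
    n, kh, kw = kernels.shape
    rows = -(-n // cols)  # ceiling division

    lo = kernels.min(axis=(1, 2), keepdims=True)
    hi = kernels.max(axis=(1, 2), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero for flat kernels
    scaled = (kernels - lo) / span

    # Zero-valued canvas; kernels are separated by `pad` pixels of border.
    grid = np.zeros((rows * (kh + pad) + pad, cols * (kw + pad) + pad))
    for i, kernel in enumerate(scaled):
        r, c = divmod(i, cols)
        top = pad + r * (kh + pad)
        left = pad + c * (kw + pad)
        grid[top:top + kh, left:left + kw] = kernel
    return grid


# Random stand-in for the depthwise kernels of one layer (hypothetical data).
rng = np.random.default_rng(0)
fake_kernels = rng.normal(size=(16, 5, 5))
grid = kernels_to_grid(fake_kernels, cols=4)
print(grid.shape)  # (25, 25)
```

The resulting array can be passed to an image viewer such as `matplotlib.pyplot.imshow`. Scaling each kernel separately makes it easier to compare how far from the center each filter places its weight, at the cost of hiding differences in overall magnitude between filters.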