This year, Google introduced MLP-Mixer, an architecture built from multilayer perceptrons (MLPs) arranged in Mixer layers. Each Mixer layer consists of two MLPs: one for token-mixing (mixing spatial information across locations) and another for channel-mixing (mixing per-location features).
This architecture achieves results competitive with convolutional models and vision transformers.
Using a similar approach, I’ve tried MLP-Mixers for text classification, applying them to embeddings of shape max_length * embedding_dims.
The architecture is similar to the one described in the paper, except for some changes in how the text sequences are fed to the Mixer layers.
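As a rough illustration of that input shape, the sketch below turns one sentence into a max_length * embedding_dims matrix, where each word plays the role an image patch plays in the paper. The whitespace tokenizer, the tiny `vocab`, the `encode` helper, and the random embedding table are all assumptions made for this example; the actual preprocessing in the notebooks may differ.

```python
import numpy as np

def encode(text, vocab, max_length):
    # Hypothetical whitespace tokenizer: unknown words map to 1, padding is 0
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:max_length]
    return np.array(ids + [0] * (max_length - len(ids)))

vocab = {"<pad>": 0, "<unk>": 1, "forest": 2, "fire": 3, "near": 4}
max_length, embedding_dims = 6, 4

rng = np.random.default_rng(0)
# Stand-in for a learned embedding table of shape (vocab_size, embedding_dims)
embedding_table = rng.normal(size=(len(vocab), embedding_dims))

ids = encode("Forest fire near town", vocab, max_length)
x = embedding_table[ids]  # one row per token position

print(ids)      # [2 3 4 1 0 0]
print(x.shape)  # (6, 4), i.e. max_length * embedding_dims
```

With this layout, token-mixing operates across word positions and channel-mixing operates across the embedding dimensions of each word.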
I’ve used this model in the Kaggle competition "Natural Language Processing with Disaster Tweets", where it achieves an accuracy of 73.95%, which is comparable to that of a model using 1D convolutions.
The Kaggle notebooks:
https://www.kaggle.com/shubham0204/tweet-classification-with-1d-convolutions