I am glad to present my implementation of the “Fastformer: Additive Attention Can Be All You Need” paper.
This is a Transformer variant based on additive attention that can handle long sequences efficiently with linear complexity. Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long-text modeling performance.
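
To give a flavor of the idea, here is a minimal single-head PyTorch sketch of the additive attention mechanism as I read it from the paper. This is not the exact code in this repository; the module and parameter names (`AdditiveAttention`, `to_query`, `w_q`, `w_k`, `to_out`) are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Single-head sketch of Fastformer-style additive attention (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_query = nn.Linear(dim, dim)
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        # learnable vectors that score each position (additive attention)
        self.w_q = nn.Parameter(torch.randn(dim))
        self.w_k = nn.Parameter(torch.randn(dim))
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.to_query(x)  # (B, N, D)
        k = self.to_key(x)
        v = self.to_value(x)

        # summarize all queries into one global query vector -- O(N), not O(N^2)
        alpha = F.softmax(q @ self.w_q * self.scale, dim=-1)  # (B, N)
        global_q = torch.einsum('bn,bnd->bd', alpha, q)       # (B, D)

        # element-wise interaction between the global query and each key
        p = k * global_q.unsqueeze(1)                          # (B, N, D)
        beta = F.softmax(p @ self.w_k * self.scale, dim=-1)    # (B, N)
        global_k = torch.einsum('bn,bnd->bd', beta, p)         # (B, D)

        # element-wise interaction between the global key and each value,
        # then a linear transform plus a residual connection to the queries
        u = v * global_k.unsqueeze(1)                          # (B, N, D)
        return self.to_out(u) + q


if __name__ == "__main__":
    x = torch.randn(2, 128, 64)      # (batch, seq_len, dim)
    attn = AdditiveAttention(dim=64)
    print(attn(x).shape)             # torch.Size([2, 128, 64])
```

Because each position is reduced to a single global query and global key vector, every step costs O(N·d) instead of the O(N²·d) of standard self-attention, which is where the linear complexity comes from.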