Introduces the Transformer, a novel architecture that dispenses with the recurrent layers used in previous sequence models. Its core innovation is self-attention, a mechanism that processes all positions of a sequence in parallel while capturing long-range dependencies. This design substantially improves training efficiency on sequence tasks such as machine translation and other natural language processing (NLP) problems. The Transformer has since become the foundation for subsequent advances in NLP, including models like BERT and GPT, reshaping how machines understand and generate human language.
Attention is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Date: 2017
Venue: Advances in Neural Information Processing Systems 30 (NIPS 2017)
classic data-science
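The paper's central operation is scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, where queries are compared against all keys in parallel and the results weight the values. A minimal NumPy sketch of a single attention head (shapes and the toy example are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    # Pairwise query-key similarities, scaled to keep softmax gradients stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a convex combination of the value vectors.
    return weights @ V

# Toy example: 3 query positions attend over 4 key/value positions.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8)
```

Because every query attends to every key in a single matrix product, the whole sequence is processed at once rather than step by step, which is what enables the parallel training the summary describes.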