Transformers

The transformer is a deep learning model that uses self-attention to process sequential input data, such as natural language, all at once rather than one element at a time. It was introduced in 2017 by a team at Google Brain and has since been used primarily in the fields of AI/NLP and AI/Computer Vision/Computer Vision. Because transformers see the entire input at once, unlike RNNs, they can learn long-range dependencies between input and output sequences more efficiently.
The transformer architecture follows an encoder-decoder structure but relies on neither recurrence nor convolutions to generate an output. Instead, it uses multi-headed attention mechanisms to directly model relationships between all words in a sentence, regardless of their positions. In machine translation, the encoder maps an input string in the source language to a sequence of vectors representing the words and their relations to each other, and the decoder transforms that encoded representation into a string of text in the destination language.
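The encoder side of this structure can be sketched in a few lines. This is a toy illustration only: it omits the learned projection matrices, multiple heads, positional encodings, and layer normalization that a real transformer layer has, keeping just the two sub-layers the paragraph describes (self-attention over all positions, then a position-wise feed-forward step, each with a residual connection).

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    # every position attends over all positions at once
    # (queries = keys = values = X; no learned projections in this toy)
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        w = softmax(scores)  # one weight per input position
        out.append([sum(wi * x[j] for wi, x in zip(w, X)) for j in range(d)])
    return out

def feed_forward(x):
    # toy position-wise feed-forward: element-wise ReLU
    # (a real layer uses two learned linear transformations)
    return [max(0.0, v) for v in x]

def encoder_layer(X):
    # self-attention sub-layer with a residual connection
    attended = self_attention(X)
    X = [[a + b for a, b in zip(x, att)] for x, att in zip(X, attended)]
    # feed-forward sub-layer with a residual connection
    return [[a + b for a, b in zip(x, feed_forward(x))] for x in X]
```

Stacking several such layers yields the encoder; the decoder adds a masked self-attention sub-layer plus cross-attention over the encoder's output vectors.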
Multi-headed attention is a module that runs the attention mechanism several times in parallel. Each of these parallel computations is called an attention head, and the independent head outputs are concatenated and linearly transformed into the expected dimension. Multi-head attention lets the network control how information is mixed between pieces of an input sequence, with each head free to attend to different relationships, which improves performance on natural language processing tasks such as machine translation and text summarization.
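A minimal sketch of the head-splitting idea, under simplifying assumptions: real multi-head attention projects Q, K, and V with separate learned weight matrices per head and applies a final learned output projection, whereas this toy simply gives each head its own slice of the feature dimension and concatenates the results.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # scaled dot-product attention; Q, K, V are lists of vectors
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # one weight per key/value position
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(Q, K, V, num_heads):
    # each head attends independently on its own slice of the feature
    # dimension; the head outputs are then concatenated
    d = len(Q[0])
    assert d % num_heads == 0, "feature dim must divide evenly across heads"
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        s = slice(h * hd, (h + 1) * hd)
        heads.append(attention([q[s] for q in Q],
                               [k[s] for k in K],
                               [v[s] for v in V]))
    # concatenate along the feature dimension, position by position
    return [[x for head in heads for x in head[i]] for i in range(len(Q))]
```

With a single key/value position the attention weight is 1, so the output reproduces the value vector; with several positions each output row is a convex mix of all value vectors.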

Resources

Courses

Code

References

For NLP

For Computer Vision

Self-supervised vision transformers

Vision transformers with convolutions

Multi-modal transformers

See AI/Deep Learning/Multimodal learning

For RL

See "Decision transformer" in AI/Reinforcement learning