What is self-attention in transformer neural networks? Self-attention is the mechanism transformers use to capture the context of words. The context of a word in a sentence depends on the words that come before it and on the words that come after it. With self-attention, we compare the words of a sentence against the words of that same sentence. From this comparison we build an attention matrix, and for every word this matrix determines which other words to pay attention to. The transformer encoder is essentially a self-attention unit: we take input word vectors, pass them through the trained attention unit, and get output word vectors. These output word vectors are more context-aware, and therefore better representations, than the input word vectors.
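To make that concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The random projection matrices and the toy dimensions are assumptions standing in for the trained weights described above; a real transformer learns these during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sentence.

    X: (seq_len, d_model) input word vectors.
    W_q, W_k, W_v: (d_model, d_k) projections (trained in a real model).
    Returns the context-aware output vectors and the attention matrix.
    """
    Q = X @ W_q  # queries: what each word is looking for
    K = X @ W_k  # keys: what each word offers
    V = X @ W_v  # values: the information to be mixed together
    d_k = Q.shape[-1]
    # Attention matrix: row i says how much word i attends to every word j.
    scores = Q @ K.T / np.sqrt(d_k)
    attn = softmax(scores, axis=-1)
    # Each output vector is a weighted mix of all value vectors.
    return attn @ V, attn

# Toy usage: 4 words with 8-dimensional embeddings, untrained random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
W_v = rng.normal(size=(8, 8))
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn.shape)  # (4, 4): one row of attention weights per word
print(out.shape)   # (4, 8): context-aware output word vectors
```

Note that every row of the attention matrix sums to 1, so each output word vector is simply a weighted average of information from the whole sentence, which is exactly how the output becomes more context-aware than the input.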