What is the difference between self-attention and multi-head attention? In self-attention, every word in a sequence pays attention to every other word to understand context. Take the word vectors, project each one into query, key, and value vectors, compute the attention matrix (the softmax of the scaled dot products between queries and keys), and use those attention weights to combine the value vectors into the final word vectors. These output vectors have good context awareness.

Multi-head attention performs the same operation several times in parallel. Take the input word vectors and project each one into query, key, and value vectors once per head, so with eight heads each word gets eight smaller sets of queries, keys, and values (each typically one eighth of the model dimension). Compute the attention matrix for each head, eight in total, and compute eight sets of output vectors. Concatenate the eight outputs for each word (and, in the standard formulation, apply a final linear projection) to get the final word vectors. Because each head can attend to a different kind of relationship, these word vectors have even better contextual understanding than single-head self-attention.
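A minimal NumPy sketch of both steps, assuming toy dimensions, randomly initialized projection matrices, and eight heads; the function and variable names here are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project into queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (seq_len, seq_len) scaled dot products
    weights = softmax(scores, axis=-1)         # attention matrix: each row sums to 1
    return weights @ V                         # context-aware output vectors

def multi_head_attention(X, heads, Wo):
    # heads: list of (Wq, Wk, Wv) tuples, one per head (eight of them here)
    # Wo: (num_heads * d_head, d_model) final output projection
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = np.concatenate(outputs, axis=-1)  # (seq_len, num_heads * d_head)
    return concat @ Wo                         # final word vectors

# Toy usage: 5 tokens, model dimension 64, 8 heads of size 8 each.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 64, 8
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
         for _ in range(num_heads)]
Wo = rng.normal(size=(num_heads * d_head, d_model)) * 0.1
out = multi_head_attention(X, heads, Wo)
print(out.shape)  # (5, 64): one context-aware vector per token
```

The key design point the sketch illustrates is that each head gets its own, smaller projection matrices, so the eight attention computations run independently and their outputs are simply concatenated back to the full model dimension at the end.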