Let's talk about multi-headed attention in Transformers. We take every word of an input sentence and add a positional encoding to get position-encoded vectors, where each word is 512 dimensions. For every word, we generate a query vector, a key vector, and a value vector, each also 512 dimensions. We then split each of these vectors into eight heads of 64 dimensions each. Within each head, we multiply the query and key vectors to get a max-sequence-length by max-sequence-length matrix of scores. We apply scaling (dividing by the square root of the head dimension) and softmax to get eight attention weight matrices, one per head. We then multiply these weights with the value vectors and concatenate the heads to get output vectors that carry a better representation of the meaning of each word in context.
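As a rough sketch of those steps, here is a minimal PyTorch version using the sizes from above (512-dimensional vectors split into eight heads of 64). The function name `multi_head_attention`, the random projection weights, and the toy 10-word input are illustrative assumptions; batching, masking, and the final output projection used in a real Transformer are left out.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, num_heads=8):
    """Split Q/K/V into heads, compute scaled dot-product attention per head,
    then concatenate the heads back into d_model-sized vectors."""
    seq_len, d_model = x.shape          # e.g. (max_seq_len, 512)
    head_dim = d_model // num_heads     # 512 / 8 = 64

    # Project the position-encoded inputs to queries, keys, and values (each 512-dim).
    q = x @ w_q                         # (seq_len, 512)
    k = x @ w_k
    v = x @ w_v

    # Split into 8 heads of 64 dimensions each: (num_heads, seq_len, head_dim).
    q = q.view(seq_len, num_heads, head_dim).transpose(0, 1)
    k = k.view(seq_len, num_heads, head_dim).transpose(0, 1)
    v = v.view(seq_len, num_heads, head_dim).transpose(0, 1)

    # Query x key^T gives a (seq_len, seq_len) score matrix per head;
    # scale by sqrt(head_dim) and apply softmax to get attention weights.
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (num_heads, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)

    # Multiply the weights with the values, then concatenate the heads back to 512 dims.
    out = weights @ v                                     # (num_heads, seq_len, head_dim)
    return out.transpose(0, 1).reshape(seq_len, d_model)  # (seq_len, 512)

# Example usage with hypothetical sizes: a 10-word sentence, d_model = 512.
x = torch.randn(10, 512)
w_q, w_k, w_v = (torch.randn(512, 512) for _ in range(3))
print(multi_head_attention(x, w_q, w_k, w_v).shape)       # torch.Size([10, 512])
```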