This is the code for the multi-head attention part of transformer neural networks. We first take the input words in parallel. For every word, we generate a key, a query, and a value vector. We split the key, query, and value vectors into eight heads, so we end up with each of these three vectors per head for every word. We perform scaled dot-product attention per head to get eight attention matrices and eight attention-weighted value vectors (a sketch of the full computation follows below). We then concatenate the eight attention heads to get the final value vector, and pass it through a feedforward (linear) layer so that all the heads can interact with each other. For every word, we end up with an output vector that best represents the context of that word in the sequence.
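Since the on-screen code isn't reproduced in this transcript, here is a minimal sketch of the computation just described, assuming PyTorch and the original Transformer sizes (d_model = 512 split across 8 heads of 64 dimensions each; the model dimension isn't stated above, only the head count). The class name MultiHeadAttention and all parameter names are illustrative, not the video's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One linear layer each to generate the query, key, and value vectors
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Final feedforward (output) projection that lets the heads interact
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) -- all input words processed in parallel
        batch, seq_len, d_model = x.shape

        # Generate key, query, and value vectors for every word, then split
        # each into num_heads heads of size d_head
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))  # (batch, heads, seq_len, d_head)
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention: one attention matrix per head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)  # (batch, heads, seq, seq)
        attn = F.softmax(scores, dim=-1)
        heads = attn @ v                                         # (batch, heads, seq, d_head)

        # Concatenate the eight heads back into one vector per word
        concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

        # Output projection so information from all heads can mix
        return self.w_o(concat)

# Usage: a batch of 2 sequences, 10 words each, 512-dimensional embeddings
x = torch.randn(2, 10, 512)
out = MultiHeadAttention()(x)  # (2, 10, 512): one context vector per word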