An input English sentence is first converted into embeddings that represent the meaning of each word. We then add a positional vector to each embedding so the model knows where the word sits in the sentence (a sketch of the standard sinusoidal encoding appears below). The attention block computes an attention vector for each word. The only problem here is that a single attention vector may not be strong enough: for every word, it tends to weight the word's relation with itself much higher than anything else. That's true, but not very useful, since we are more interested in a word's interactions with different words. So we compute several such attention vectors per word, eight in the original Transformer, and combine them through a learned weighted combination (the heads are concatenated and passed through a linear projection) to get the final attention vector for every word. Since we use multiple attention vectors, we call this the multi-head attention block. The resulting attention vectors are then passed through a feedforward net, one position at a time. The cool thing is that these per-position computations are independent of each other, so they can all run in parallel.
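To make the positional vector concrete, here is a minimal sketch of the sinusoidal encoding from the original Transformer paper. The function name and the `seq_len`/`d_model` parameter names are my own illustrative choices, not from any particular library.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Standard sinusoidal encoding from "Attention Is All You Need":
    #   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    #   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len).unsqueeze(1)          # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )                                                      # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)           # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)           # odd dimensions
    return pe

# The encoding is simply added to the word embeddings:
#   x = embedding(tokens) + sinusoidal_positional_encoding(seq_len, d_model)
```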
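And here is a minimal sketch of the multi-head attention block itself, assuming eight heads and a model width of 512 as in the original paper. The class and attribute names are illustrative; the key point is that each head attends over the whole sentence, and the heads are combined by concatenation plus a learned projection.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for queries, keys, and values,
        # plus the output projection that mixes the heads.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        # Project, then split into heads: (batch, num_heads, seq_len, d_head).
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))

        # Scaled dot-product attention: every word attends to every word,
        # not just to itself.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        attended = weights @ v                             # (batch, heads, seq, d_head)

        # Concatenate the heads and mix them with a learned projection:
        # the "weighted combination" of the per-head attention vectors.
        attended = attended.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(attended)
```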
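Finally, a sketch of the position-wise feedforward net. Because `nn.Linear` acts only on the last dimension, the same two-layer net is applied to every position's attention vector separately, which is exactly why the per-position computations are independent and can run in parallel. The hidden width `d_ff = 2048` follows the original paper; the names are again illustrative.

```python
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # The same weights are shared across all positions.
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). nn.Linear transforms only the last
        # dimension, so each position is processed independently.
        return self.net(x)
```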