How is multi-headed attention coded in a transformer? In the encoder, every word vector is projected into three vectors: a query, a key, and a value vector. In code, this amounts to passing the input word embedding through a feed-forward layer whose output has the Q, K, and V vectors stacked on top of each other. The comments give the shapes of the tensors at each step. We then reshape the QKV tensor, splitting the embedding dimension into eight heads, each with its own slice of the embedding, so the heads act like another batch dimension. For every word we then extract the query, key, and value vectors, perform attention, and obtain an output tensor in which each word vector better encapsulates the context of that word.
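Here is a minimal PyTorch sketch of that flow; the class name, dimensions, and layer names are illustrative assumptions rather than the exact code being described:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One feed-forward layer produces Q, K, and V stacked along the last dimension.
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        batch, seq_len, embed_dim = x.shape
        qkv = self.qkv(x)                                  # (batch, seq_len, 3 * embed_dim)
        # Split the embedding dimension into heads; the heads act like an extra batch dimension.
        qkv = qkv.reshape(batch, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)                   # (3, batch, heads, seq_len, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]                   # each (batch, heads, seq_len, head_dim)
        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (batch, heads, seq_len, seq_len)
        weights = F.softmax(scores, dim=-1)
        context = weights @ v                              # (batch, heads, seq_len, head_dim)
        # Merge the heads back into one context-aware vector per word.
        context = context.transpose(1, 2).reshape(batch, seq_len, embed_dim)
        return self.out_proj(context)                      # (batch, seq_len, embed_dim)

# Example usage with made-up sizes: 2 sentences, 10 words each, 512-dim embeddings.
attn = MultiHeadSelfAttention()
out = attn(torch.randn(2, 10, 512))                        # (2, 10, 512)
```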