How does multi-headed attention work? We have a vector that represents each word. This is mapped to three vectors: a query, a key, and a value. We split each of these into eight parts, one per head. Within each head, the query and key are compared to produce an attention matrix, so we get eight attention matrices in total. Each attention matrix is then used to take a weighted average of the values, which gives us eight output vectors. We concatenate those eight vectors back into a single vector that corresponds to the same word as before, typically followed by one more linear projection to mix the heads. But this new vector now captures the context of the word much better, because each head can attend to a different kind of relationship.
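To make the steps concrete, here is a minimal NumPy sketch of that flow: project to queries, keys, and values; split into eight heads; compute one attention matrix per head; weight the values; concatenate; and apply a final mixing projection. The function name `multi_head_attention` and the weight matrices `Wq`, `Wk`, `Wv`, `Wo` are illustrative choices, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads=8):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Map each word vector to a query, a key, and a value.
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Split each into num_heads parts: shape (num_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)

    # Query and key produce one attention matrix per head:
    # shape (num_heads, seq_len, seq_len), rows summing to 1.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)

    # Each head's output is an attention-weighted average of its values.
    heads = attn @ v  # (num_heads, seq_len, d_head)

    # Concatenate the eight head outputs back into one vector per word,
    # then mix them with a final linear projection.
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

# Example: five words, 64-dimensional embeddings, random toy weights.
rng = np.random.default_rng(0)
d_model, seq_len = 64, 5
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
x = rng.standard_normal((seq_len, d_model))
y = multi_head_attention(x, Wq, Wk, Wv, Wo)
print(y.shape)  # (5, 64): one context-aware vector per word, same shape as the input
```

Note that the output has the same shape as the input, which is what lets attention layers be stacked; only the contents of each word's vector change as context gets mixed in.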