What are the query, key, and value vectors in Transformers? We start with some word vectors, but the Transformer encoder's goal is to create better word vectors that understand context. To do this, every word is represented by three vectors: a query vector, which represents "what am I looking for"; a key vector, which represents "what I can offer"; and a value vector, which is "what I actually offer". Each word's query is compared against the key of every word, and this comparison is the dot product of query and key. We scale these scores to keep the values stable, add a mask if we're working with the decoder to prevent us from looking at tokens in the future, and then apply a softmax so the scores form a probability distribution. This gives us our attention weights, which are basically a transformation. We can apply this transformation to the value vectors to get new value vectors that are just better at understanding context.
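The steps above can be sketched in NumPy. This is a minimal, hypothetical illustration assuming a single attention head with Q, K, and V already computed (no learned projection matrices, no batching); the function names are my own, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) arrays; mask: (seq_len, seq_len) of 0 / -inf."""
    d_k = Q.shape[-1]
    # Compare every query against every key, scaled by sqrt(d_k) for stability.
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # -inf entries become 0 after softmax, blocking attention to the future.
        scores = scores + mask
    attn = softmax(scores, axis=-1)  # each row is a probability distribution
    # Apply the attention weights to the values: new, context-aware vectors.
    return attn @ V, attn

# Toy example: 4 tokens, d_k = 8, with a causal (decoder) mask.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
causal_mask = np.triu(np.full((4, 4), -np.inf), k=1)
out, attn = scaled_dot_product_attention(Q, K, V, causal_mask)
```

Here each row of `attn` sums to 1, and the causal mask zeroes out every position above the diagonal, so token 0 can only attend to itself while token 3 can attend to all four tokens.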