The code for the crux of this multi-head attention is pretty simple. Every word is represented by three vectors: the query vector, which is what I'm looking for; the key vector, which is what I can offer; and the value vector, which is what I actually offer. Every word takes its query and compares it against the key of every other word. We divide those scores by the square root of the key dimension to ensure that their variance doesn't get too large. A mask is an optional input to this function, used by the decoder to ensure that it doesn't look ahead at future words when generating, since that would be considered cheating. Then we apply a softmax to turn the scores for every single word into a probability distribution. The output attention variable here is essentially just a transformation: we take the old value vectors and apply the attention weights to them to get new value vectors out. The word vectors in the output are much more context aware, and hence higher quality, than the input value vectors.
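To make that concrete, here is a minimal sketch of this attention step in PyTorch. The function name, tensor shapes, and mask convention are my own assumptions for illustration, not the exact code from the source.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: tensors of shape (..., seq_len, d_k); mask broadcasts over the score matrix.
    d_k = q.size(-1)

    # Compare every query with every key, then scale by sqrt(d_k)
    # so the variance of the scores doesn't grow with the dimension.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)

    # The decoder passes a mask so a position can't look ahead at future words.
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))

    # Softmax turns each row of scores into a probability distribution.
    attention = F.softmax(scores, dim=-1)

    # Apply that distribution to the old value vectors to get the new,
    # more context-aware value vectors.
    out = torch.matmul(attention, v)
    return out, attention
```

For a causal (decoder) mask, you could pass something like `mask = torch.tril(torch.ones(seq_len, seq_len))`, which zeroes out every position to the right of the current word before the softmax.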