Why do we need masked attention in the decoder of a transformer, but not in the encoder? Masking prevents the model from cheating. The goal of the attention layers is to build context-aware word vectors. In the encoder, the entire input sequence is available at once, so every token can attend both backward and forward to gather context. In the decoder, the output is generated one token at a time, so at inference time a token has no access to the tokens that come after it. Any context the decoder learns during training must therefore come only from the tokens that precede the current position, which is why the later tokens are masked out during training.
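
To make this concrete, below is a minimal sketch (assuming PyTorch; `causal_self_attention` is a hypothetical standalone helper, not any library's API) of how a decoder's look-ahead mask is typically applied: future positions get a score of negative infinity before the softmax, so they receive exactly zero attention weight while the weights over past positions still sum to one.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    q, k, v: tensors of shape (seq_len, d_k). Illustrative sketch only.
    """
    seq_len, d_k = q.shape
    # Raw attention scores: how strongly each position attends to every other.
    scores = q @ k.T / d_k ** 0.5                      # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions j <= i.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))   # block future positions
    weights = F.softmax(scores, dim=-1)                # rows sum to 1 over the past
    return weights @ v                                  # context-aware vectors

# Toy usage: 4 tokens with 8-dimensional queries/keys/values.
x = torch.randn(4, 8)
out = causal_self_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```

An encoder would run the same computation without the `masked_fill` step, since every token is allowed to attend to the full input sequence.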