What are the different types of masks used in Transformers? In the encoder's multi-head self-attention, we use a padding mask. Sentences shorter than the maximum length are filled with padding tokens, but these tokens shouldn't carry any information, so we zero out their contribution by not attending to them; that is, we mask them. In the decoder's masked self-attention, we apply the same padding mask, but on top of it we also use a look-ahead mask. This ensures the decoder cannot cheat by looking ahead during training, so we mask the tokens that come after the current position. The reason is that the encoder has access to the whole input sequence, but at inference time the decoder only has access to the tokens it has generated so far, so training must mirror that. In the decoder's cross-attention, we use a padding mask only, applied over the encoder outputs. A sketch of both masks follows below. For more details, check the playlist where we build Transformers together.
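To make this concrete, here is a minimal PyTorch sketch of how these two masks could be built and combined. This is an illustration under assumptions, not the exact code from the videos: the pad token id of 0, the `(batch, heads, query, key)` score layout, and the helper names `padding_mask` and `look_ahead_mask` are all choices made for this example.

```python
import torch

def padding_mask(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Mask out positions holding the (assumed) padding token.

    token_ids: (batch, seq_len) integer tensor.
    Returns a boolean mask of shape (batch, 1, 1, seq_len) that
    broadcasts over heads and query positions: True = attend, False = masked.
    """
    return (token_ids != pad_id)[:, None, None, :]

def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may only attend to positions j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Decoder self-attention combines both: a key position is visible only
# if it is not padding AND not in the future relative to the query.
ids = torch.tensor([[5, 7, 9, 0]])            # one sentence; last token is padding
combined = padding_mask(ids) & look_ahead_mask(ids.size(1))

# Masked positions are set to -inf before the softmax,
# so they receive exactly zero attention weight.
scores = torch.randn(1, 1, 4, 4)              # (batch, heads, query, key) scores
scores = scores.masked_fill(~combined, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

The encoder would use `padding_mask` alone, and cross-attention would use a padding mask computed from the encoder-side token ids, since its keys come from the encoder output. Filling masked positions with `-inf` rather than zero is the key trick: after the softmax, those positions contribute a weight of exactly zero.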