Five concepts in Transformers, part one.

Attention. This determines how much focus each word in a sentence should pay to other words, either to words in the same sentence, as in self-attention, or to words in a different sentence, as in cross-attention.

Masking. This prevents attention from focusing on certain words. One example is the look-ahead mask, which ensures that a word can't get information from words that come after it. We also have padding masks, which ensure that no attention is paid to padding tokens.

Positional encoding. To every word in a sentence, we add some information that encodes its position.

Tokenization. This is the procedure of splitting a sentence up into tokens. Tokens can be characters, words, or byte pair encodings.

Loss. This is the function the neural network tries to optimize when learning its parameters. In the Transformer, this is the cross-entropy loss.

For more information, and to build a Transformer yourself, check out this playlist on the channel.
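As a rough illustration of these ideas, here is a minimal sketch of scaled dot-product self-attention in NumPy. The function and variable names, and the use of a single attention head, are assumptions for illustration, not the exact code from the playlist.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) word embeddings for one sentence.
    # Each word produces a query, a key, and a value vector.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # scores[i, j] says how much focus word i pays to word j.
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Each output is a weighted mix of the value vectors.
    return weights @ v
```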
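Next, a minimal sketch of the two masks. The names and the pad token id of 0 are chosen just for the example. Both masks mark positions whose attention scores should be set to negative infinity before the softmax, so those positions receive zero attention weight.

```python
import numpy as np

def look_ahead_mask(seq_len):
    # True above the diagonal: position i may not attend to positions j > i.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def padding_mask(token_ids, pad_id=0):
    # True where the token is padding, so no attention is paid to it.
    return token_ids == pad_id

def apply_mask(scores, mask):
    # Masked scores become -inf, so softmax assigns them zero weight.
    return np.where(mask, -np.inf, scores)
```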
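A sketch of one common way to encode position: the sinusoidal scheme from the original Transformer paper. The transcript only says that position information is added to each word, so using sinusoids (and assuming an even d_model) is an assumption here.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine.
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the word embeddings:
# x = embeddings + positional_encoding(seq_len, d_model)
```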
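Two toy tokenizers, character-level and word-level. Byte pair encoding additionally learns a table of frequent symbol merges, which is omitted here for brevity.

```python
def word_tokenize(sentence):
    # Word-level tokens: split on whitespace.
    return sentence.split()

def char_tokenize(sentence):
    # Character-level tokens: every character is its own token.
    return list(sentence)

print(word_tokenize("attention is all you need"))  # ['attention', 'is', 'all', 'you', 'need']
print(char_tokenize("need"))                       # ['n', 'e', 'e', 'd']
```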
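Finally, a sketch of the cross-entropy loss over a sequence of predictions. The shapes assumed here are (seq_len, vocab_size) logits and a (seq_len,) vector of correct token ids.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    # logits: (seq_len, vocab_size) raw scores for the token at each position.
    # target_ids: (seq_len,) index of the correct token at each position.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Average negative log-probability assigned to the correct tokens.
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```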