During the training phase for English-to-French translation, we feed the target French sentence to the decoder. But remember, computers don't understand language; they work with numbers, vectors, and matrices. So we first process the sentence with an input embedding to get the vector form of each word, and then add a positional vector to capture each word's position in the sentence.

Finally, we pass this vector into a decoder block, which has three main components, two of which are similar to those in the encoder block. The self-attention block generates attention vectors for every word in the French sentence, representing how much each word is related to every other word in the same sentence. (The third component, encoder-decoder attention, lets the decoder attend over the encoder's output.) Next, we pass each attention vector through a feed-forward unit, which makes the output vector more digestible by either the next decoder block or a linear layer. The softmax layer then transforms the linear layer's output into a probability distribution, and the final word is the one with the highest probability.
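The embedding-plus-position step can be sketched as follows. This is a minimal NumPy illustration, not the transcript's actual implementation: the sinusoidal encoding scheme is the one from the original Transformer paper (the transcript doesn't specify one), and the embedding table, vocabulary, and model dimension are made-up toy values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding: even dimensions use sine,
    # odd dimensions use cosine, at geometrically spaced frequencies.
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
    i = np.arange(d_model)[None, :]     # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy example: 4 French tokens, model dimension 8.
# The embedding table is random here; a real model learns it.
rng = np.random.default_rng(0)
vocab = {"<sos>": 0, "le": 1, "chat": 2, "dort": 3}
embed = rng.normal(size=(len(vocab), 8))

tokens = ["<sos>", "le", "chat", "dort"]
x = embed[[vocab[t] for t in tokens]]        # word embeddings
x = x + positional_encoding(len(tokens), 8)  # add position information
print(x.shape)                               # (4, 8)
```

Because the encoding depends only on position and dimension, the same matrix can be precomputed once and added to any sentence of that length.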
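The rest of the pipeline — self-attention, the feed-forward unit, and the final softmax — can be sketched the same way. Again a toy NumPy version with random weights standing in for learned parameters; the dimensions, the ReLU feed-forward form, and the single-head scaled dot-product attention are assumptions beyond what the transcript states.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention: each word's query is scored
    # against every word's key, and the scores weight the values.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v               # one attention vector per word

d_model, d_ff, vocab_size, seq_len = 8, 32, 4, 3
rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d_model))      # embeddings + positions

# Self-attention sublayer (random, untrained weights).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
h = self_attention(x, Wq, Wk, Wv)

# Position-wise feed-forward sublayer: two linear maps with a ReLU.
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
h = np.maximum(0, h @ W1) @ W2

# Final linear layer + softmax: a probability distribution over the
# French vocabulary; the predicted word has the highest probability.
logits = h @ rng.normal(size=(d_model, vocab_size))
probs = softmax(logits)
predicted = probs.argmax(axis=-1)
```

Each row of `probs` sums to 1, so taking the argmax per position is exactly the "word corresponding to the highest probability" step described above.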