Secondly, we need positional encoding. The problem is that attention models don't preserve sequence information: they look at all the other words in the input and decide which ones are important, but they forget which one was first or second or third. Now, you could try taking each word and just giving it a label, input one or two or three or four, but it turns out to work better to use the following slightly arcane encoding, which builds up a representation of where each word is in the sequence using a bunch of sines and cosines.

So if we have n tokens in a sequence, each of dimension d, say 300 like a word2vec embedding, we build an n-by-d matrix where each position, each word (the zeroth, first, second, up to, in this case, the 60th word) is encoded by some mixture: the even-numbered elements of that 300-dimensional vector are sines of something, the odd-numbered ones are cosines of it, and the arguments grow as you move through the sequence. What you're doing is having a slow sine, and a faster sine, and a faster sine still, with shorter and shorter wavelengths, and this turns out to give a nice representation that lets the neural net keep track of where it is. And in typically weird neural-net fashion, what they do is take the matrix X, the original n-by-d matrix of the actual encodings (word2vec, say) of all n words, add the positional encoding to it, which keeps roughly the same information, and then just pop that sum into the neural net.

The details don't matter; the fact that does matter is that you need to tell the model the order the words are in, so it can actually take advantage of that and compute, for example, how far apart two words are, or whether it's the beginning of the sentence or the end, where words can have slightly different meanings based on where they show up.
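To make that concrete, here is a minimal NumPy sketch, assuming the standard sinusoidal scheme from the Transformer paper ("Attention Is All You Need"): PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). The function name and the base 10000 come from that paper, not from anything said above, so treat this as an illustration rather than the speaker's exact recipe.

```python
import numpy as np

def positional_encoding(n: int, d: int) -> np.ndarray:
    """Build the n-by-d sinusoidal positional-encoding matrix, one row per position.

    Even columns hold sines and odd columns hold cosines, with each pair of
    columns oscillating at its own frequency (fast in the early dimensions,
    slow in the later ones), so every position gets a unique fingerprint.
    Assumes d is even, as in the d = 300 example above.
    """
    pos = np.arange(n)[:, None]          # positions 0..n-1, shape (n, 1)
    i = np.arange(0, d, 2)[None, :]      # even dimension indices, shape (1, d//2)
    angles = pos / (10000.0 ** (i / d))  # one frequency per pair of dimensions
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)         # even-numbered elements: sines
    pe[:, 1::2] = np.cos(angles)         # odd-numbered elements: cosines
    return pe

# Add the encodings to the word embeddings, e.g. 60 tokens of dimension 300:
X = np.random.randn(60, 300)            # stand-in for the word2vec embeddings
X = X + positional_encoding(60, 300)    # same shape as X, now order-aware
```

Because the positional matrix is simply added to X, the result keeps roughly the same information as the embeddings while also letting the network read off where each word sits in the sequence.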