Why do transformers use the sine and cosine functions for positional encoding? So the transformer architecture contains two parts, an encoder and a decoder. But before the word vectors are passed into these blocks, they are positionally encoded first. This means that for every word vector, we add a position vector of the same size. To generate these position vectors, we use sine and cosine functions, and we do this for two reasons. For one, the sine and cosine values are constrained between negative one and positive one, so the position vectors stay on the same scale as the word embeddings no matter how long the sequence gets. This allows a word in a sentence to pay attention to other words even when they are far away from it. A second reason to use these sine and cosine functions is that they are easy to extrapolate. Even if we never saw sequences of a specific length during training, we can still calculate the position vectors with a simple formula.
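
To make that "simple formula" concrete, here is a minimal NumPy sketch of the sinusoidal encoding from the original transformer paper, where even dimensions use sine and odd dimensions use cosine (the function name and parameter values here are just illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]         # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]        # shape (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)  # shape (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

# Every value is bounded in [-1, 1], and the same formula extends
# to positions longer than anything seen during training.
pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)              # (50, 512)
print(pe.min(), pe.max())    # both within [-1.0, 1.0]
```

Note that nothing in this computation depends on a maximum trained length: to encode position 5000, you just pass a larger seq_len and evaluate the same sines and cosines, which is exactly the extrapolation property described above.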