Computers don't understand words; they work with numbers, vectors, and matrices. The idea is to map every word to a point in a space where words with similar meanings sit physically closer to each other. This space is called an embedding space, and it maps each word to a vector. We can train this embedding space ourselves or, to save time, simply use one that has already been pre-trained. But an embedding alone carries no information about word order: the same word gets the same vector no matter where it appears in the sentence. This is where positional encoding comes in. It's a vector, added to each word's embedding, that encodes the word's position, and therefore the distances between words, in the sentence. The original paper uses sine and cosine functions to generate this vector, but it could be any reasonable function.
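To make this concrete, here is a minimal sketch of the sinusoidal positional encoding described in the original paper. The function and parameter names (`sinusoidal_positional_encoding`, `max_len`, `d_model`) are my own choices for illustration, not taken from any particular library, and it assumes an even embedding dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of positional encodings.

    Assumes d_model is even: sines fill the even dimensions,
    cosines fill the odd ones.
    """
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angle_rates = 1.0 / np.power(10000.0, dims / d_model) # one frequency per pair of dims
    angles = positions * angle_rates                       # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even indices
    pe[:, 1::2] = np.cos(angles)   # cosine on odd indices
    return pe

# The encoding is simply added to the word embeddings before the model sees them,
# e.g. embeddings = embed(tokens) + sinusoidal_positional_encoding(len(tokens), d_model)
```

Because each dimension oscillates at a different frequency, every position gets a unique pattern, and the encoding for position `pos + k` is a fixed linear function of the encoding for `pos`, which is what lets the model pick up on relative distances between words.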