Well, the initial embedding is constructed by summing three vectors. The token embeddings are the pre-trained WordPiece embeddings; the original paper uses a vocabulary of 30,000 tokens. The segment embeddings encode the sentence number (sentence A or B) as a vector. And the position embeddings encode the position of a token within the sequence as a vector. Adding these three vectors together gives the embedding vector that is used as input to BERT. The segment and position embeddings are needed to preserve ordering: because all the token vectors are fed into BERT simultaneously rather than sequentially, the model has no inherent notion of word order, and language modeling depends on that ordering being preserved.
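To make the summation concrete, here is a minimal PyTorch sketch of this input layer. The class name and the dimension defaults are illustrative assumptions (hidden size 768 and maximum sequence length 512 match BERT-base); this is not the reference implementation.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Illustrative sketch: sum of token, segment, and position embeddings."""

    def __init__(self, vocab_size=30000, hidden_size=768,
                 max_position=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)      # WordPiece ids
        self.segment = nn.Embedding(num_segments, hidden_size)  # sentence A / B
        self.position = nn.Embedding(max_position, hidden_size) # 0 .. seq_len-1

    def forward(self, token_ids, segment_ids):
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        # Element-wise sum of the three vectors yields the input to BERT
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

# Usage with made-up ids: a batch of one sequence of six WordPiece tokens,
# where the first four belong to sentence A and the last two to sentence B.
emb = BertInputEmbeddings()
token_ids = torch.tensor([[101, 7592, 2088, 102, 2331, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(token_ids, segment_ids).shape)  # torch.Size([1, 6, 768])
```

Note that all three embedding tables here are learned lookup tables; in particular, BERT learns its position embeddings rather than using the fixed sinusoidal encoding of the original Transformer.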