We pre-train BERT with masked language modeling and next-sentence prediction. For every token, we look up the pre-trained WordPiece embedding and add position and segment embeddings to account for the ordering of the inputs. These are then passed into BERT, which under the hood is a stack of transformer encoders, and it outputs a vector for every token (used for masked language modeling) plus a binary value for next-sentence prediction. Each token vector is converted into a distribution over the vocabulary and trained with cross-entropy loss. Once pre-training is complete, BERT has some notion of language. It's a language model. The next step is the fine-tuning phase, where we perform supervised training depending on the task we want to solve. A minimal sketch of these two pre-training heads follows.
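
The sketch below illustrates the setup described above in PyTorch: token, position, and segment embeddings are summed, fed through a stack of transformer encoder layers, and trained with a masked-language-modeling head and a next-sentence-prediction head under cross-entropy loss. It is a toy illustration, not the original BERT implementation: the sizes are placeholders, the inputs are random tensors, and real pre-training would only score the MLM loss at masked positions.

```python
import torch
import torch.nn as nn

class TinyBERTPretraining(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, layers=4, heads=4, max_len=128):
        super().__init__()
        # Token, position, and segment embeddings are summed per input position.
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.seg_emb = nn.Embedding(2, hidden)
        # Stack of transformer encoder layers ("BERT under the hood").
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # MLM head: project each token vector back onto the vocabulary.
        self.mlm_head = nn.Linear(hidden, vocab_size)
        # NSP head: binary classifier on the first ([CLS]) position.
        self.nsp_head = nn.Linear(hidden, 2)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions) + self.seg_emb(segment_ids)
        h = self.encoder(x)                      # (batch, seq_len, hidden)
        return self.mlm_head(h), self.nsp_head(h[:, 0])

# Toy usage: both objectives are trained jointly with cross-entropy.
model = TinyBERTPretraining()
token_ids = torch.randint(0, 30522, (2, 16))        # fake batch of token ids
segment_ids = torch.zeros(2, 16, dtype=torch.long)  # which sentence each token belongs to
mlm_labels = torch.randint(0, 30522, (2, 16))       # in practice, labels only at masked positions
nsp_labels = torch.tensor([0, 1])                   # "is the second segment the next sentence?"
mlm_logits, nsp_logits = model(token_ids, segment_ids)
loss = (nn.functional.cross_entropy(mlm_logits.reshape(-1, 30522), mlm_labels.reshape(-1))
        + nn.functional.cross_entropy(nsp_logits, nsp_labels))
loss.backward()
```

After pre-training, the same encoder (without these heads) is what gets reused and fine-tuned on the downstream task.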