So, pre-training. The goal of pre-training is to make BERT learn what language is and what context is. BERT learns language by training on two unsupervised tasks simultaneously: masked language modeling and next sentence prediction. For masked language modeling, BERT takes in a sentence with random words replaced by masks, and the goal is to output those masked tokens. This is kind of like fill in the blanks, and it helps BERT understand bidirectional context within a sentence. In the case of next sentence prediction, BERT takes in two sentences and determines whether the second sentence actually follows the first, in what is essentially a binary classification problem.
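To make this concrete, here is a minimal Python sketch of how training examples for the two tasks could be constructed. It is a simplification rather than BERT's actual data pipeline: the helper names `make_mlm_example` and `make_nsp_example` are made up for illustration, whole words stand in for WordPiece subwords, and real BERT masking also sometimes keeps or randomly replaces the selected token instead of always using `[MASK]`.

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15):
    """Masked language modeling (simplified): hide ~15% of tokens.
    The hidden originals become the labels the model must predict."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # predict the original token here
        else:
            inputs.append(tok)
            labels.append(None)   # no prediction needed at this position
    return inputs, labels

def make_nsp_example(sentences, idx):
    """Next sentence prediction (simplified): pair a sentence with either
    its true successor (label 1, "IsNext") or a random sentence (label 0)."""
    first = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return first, sentences[idx + 1], 1
    return first, random.choice(sentences), 0

sentences = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguins live in antarctica".split(),
]

print(make_mlm_example(sentences[0]))
print(make_nsp_example(sentences, 0))
```

During pre-training, both objectives are optimized at the same time: the model predicts the masked tokens and, from the same input pair, makes the binary IsNext/NotNext call.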