During BERT pre-training, we train on masked language modeling and next sentence prediction; in practice, both objectives are trained simultaneously. The input is a pair of sentences with some of the words masked, and each word is converted into an embedding using pre-trained embeddings, which gives BERT a good starting point to work with. For next sentence prediction, the model outputs 1 if sentence B follows sentence A in context and 0 if sentence B doesn't follow sentence A. In the fine-tuning phase, though, if we wanted to perform question answering, we would train the model by modifying the inputs and the output layer.
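
To make this concrete, here is a minimal sketch using the Hugging Face `transformers` library (an assumption; the text above doesn't name a specific implementation). It shows that the pre-trained model exposes two heads, one for masked language modeling and one for next sentence prediction, and that fine-tuning for question answering swaps the output layer for one that scores answer spans. The sentences, question, and passage are made-up examples.

```python
import torch
from transformers import BertTokenizer, BertForPreTraining, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Pre-training-style input: a pair of sentences with one token masked.
sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of [MASK]."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

pretraining_model = BertForPreTraining.from_pretrained("bert-base-uncased")
with torch.no_grad():
    outputs = pretraining_model(**inputs)

# Masked-language-modeling head: a vocabulary-sized score for every position.
print(outputs.prediction_logits.shape)        # (1, seq_len, vocab_size)
# Next-sentence-prediction head: two scores, "B follows A" vs. "B does not".
print(outputs.seq_relationship_logits.shape)  # (1, 2)

# Fine-tuning for question answering keeps the same encoder but replaces the
# output layer with heads that score the start and end of the answer span.
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
question = "What did the man buy?"
passage = "The man went to the store and bought a gallon of milk."
qa_inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    qa_outputs = qa_model(**qa_inputs)
print(qa_outputs.start_logits.shape, qa_outputs.end_logits.shape)  # (1, seq_len) each
```

Note how the fine-tuning case changes exactly what the text describes: the input becomes a question paired with a passage instead of two consecutive sentences, and the output layer becomes span start/end scores instead of the masked-word and next-sentence predictions.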