So there are a couple of different ways to use BERT. The most common way is to exploit the fact that BERT has a special CLS token at the very beginning of every input. An input to BERT looks like CLS, then "I went to the bank and bought ...", then a period, then a separator token before the next sentence. The output at the CLS position is a context-sensitive BERT embedding: it captures the context in which the CLS token shows up, and therefore captures the whole sentence. So you can take an off-the-shelf pre-trained BERT, feed it a sentence, read off the embedding for the CLS token, feed that into a really simple feedforward neural network, and predict your label. Looks great.

You could also, instead of that, note that BERT is just one great big neural net running a whole stack of calculations, so for whatever piece of text you put in as input, you don't have to take only the output that feeds the masked-token decoder. In the same style as you've done with CNNs, you could take the output of the next-to-last layer, or, as people often do, take the outputs of the last layer, the one before it, the one before that, and the one before that, so the last four layers' outputs, call that a bigger feature set, and pass that in as the features you use for your training set.

Note that these approaches all have the same property: take a pre-trained BERT, trained on an enormous data set, use it to generate an embedding, and then use that embedding on your small labeled data set to do the learning.

Finally, if you want to do a really good job and get high accuracy, you don't have to treat BERT as entirely fixed. Again, BERT is nothing more than a neural net. Take a pre-trained BERT and, for example, the mapping that goes from the CLS token's embedding through your small network to the label. But instead of only learning a new set of weights for that mapping from the CLS embedding to the label, let some of the weights in BERT be adjustable too. Take the pre-trained BERT as a starting initialization, freeze most of the model, because there are 100 million to 300 million parameters in there, but take the last four layers' worth of parameters, let them be free to be adjusted, and now run stochastic gradient descent on your data set, not just adjusting the dense model that sits on top of the BERT output, but actually changing the internal weights of BERT. That goes by the nice name of fine-tuning BERT.

You can take BERT and fine-tune it so that it will do a better job on whatever your end task is, or on whatever task is similar to your end task. For most of the commercial applications I know of, that's what people do: they grab one of the pre-trained BERTs and then fine-tune it to do a better job on the end task of interest. Right, classic semi-supervised learning. Rough code sketches of all three approaches follow below.
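To make the first approach concrete, here is a minimal sketch, assuming the Hugging Face transformers library, PyTorch, and the bert-base-uncased checkpoint; the cls_embedding helper, the 768-dimensional hidden size, and num_labels are illustrative choices rather than anything prescribed above.

```python
# Sketch: keep BERT fixed, use the CLS embedding as a feature vector,
# and train only a small feedforward classifier on top of it.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()  # BERT itself stays frozen; it only produces embeddings

def cls_embedding(sentence: str) -> torch.Tensor:
    """Return the context-sensitive embedding at the [CLS] position."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():                      # no gradients through BERT
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] sits at position 0

# A really simple feedforward classifier on the 768-dim CLS vector
# (num_labels is a placeholder; set it for your task).
num_labels = 2
classifier = torch.nn.Sequential(
    torch.nn.Linear(768, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, num_labels),
)

logits = classifier(cls_embedding("I went to the bank and bought ..."))
```

Only the classifier's weights would be trained on your small labeled data set; the BERT weights never change in this setup.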
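The second approach, taking the last four layers' outputs as a bigger feature set, might look roughly like this. Again this assumes transformers and bert-base-uncased; reading the features at the CLS position is just one reasonable option for pooling.

```python
# Sketch: use the outputs of the last four layers (at the [CLS] position)
# as a larger feature vector, instead of only the final layer.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()

def last_four_layer_features(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # hidden_states holds the embedding layer plus every encoder layer;
    # keep only the last four encoder layers.
    last_four = outputs.hidden_states[-4:]
    # Concatenate the [CLS] vector from each of those layers:
    # 4 layers x 768 dims = a 3072-dim feature vector per sentence.
    return torch.cat([layer[:, 0, :] for layer in last_four], dim=-1)

features = last_four_layer_features("I went to the bank and bought ...")
# These features are then what you feed to whatever simple model you
# train on your small labeled data set.
```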
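And a sketch of fine-tuning under the same assumptions: start from the pre-trained model, freeze most of it, leave the last four encoder layers plus a small head trainable, and run stochastic gradient descent on your labeled data. The learning rate, num_labels, and training_step helper are placeholders, not a recipe from the lecture.

```python
# Sketch: fine-tuning BERT with most parameters frozen.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Freeze everything first (bert-base has on the order of 100M parameters)...
for param in bert.parameters():
    param.requires_grad = False
# ...then unfreeze just the last four encoder layers.
for layer in bert.encoder.layer[-4:]:
    for param in layer.parameters():
        param.requires_grad = True
bert.train()  # training mode, since some BERT weights will now be updated

num_labels = 2  # placeholder; set for your task
head = torch.nn.Linear(768, num_labels)

# Optimize the head together with the unfrozen BERT layers.
trainable = [p for p in list(bert.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

def training_step(sentences, labels):
    """One gradient step on a small labeled batch (sentences: list of str)."""
    inputs = tokenizer(sentences, return_tensors="pt",
                       padding=True, truncation=True)
    cls_vecs = bert(**inputs).last_hidden_state[:, 0, :]  # gradients flow into BERT
    loss = loss_fn(head(cls_vecs), torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key difference from the first two sketches is that here the backward pass reaches into BERT itself, so the unfrozen internal weights move toward whatever helps on the end task.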