We've seen sentence embeddings before, but they're important enough to be worth a quick review. There are a bunch of different ways one can embed a sentence. The simplest is to embed each word, or each byte-pair-encoding token, in the sentence and then just average those embeddings; that works really well. One can also take all of the hidden states of a convolutional neural net or a recurrent neural net run over the sentence and again average them, or, often, take the element-wise maximum. We've seen that you can take a sequence-to-sequence model and use the final hidden state at the end of the first (encoder) sequence; that works well if the sentences aren't too long. And finally, we will see that one of the easiest and nicest ways to embed a sentence is to take a context-sensitive embedding, such as BERT's, of a special CLS token. Before every sentence we put a CLS token; that token shows up in the context of the sentence that follows it, so its contextual embedding represents that context and hence represents the sentence. Once you have an embedding of a sentence, such as a CLS embedding, you can pop it into your favorite task, for example labeling whether it's a positive sentence or a negative sentence.

So there are lots of ways you can do this, and depending upon the particular problem, different methods work better or worse. But let's think through what this looks like for something like a convolutional neural net. We take a sequence of words, and each word is embedded into a vector space. We run local convolutional filters over it, maybe over every two-word, three-word, or four-word window. Each of these convolutional filters walks down the sentence and gives a value at each position, so we have a feature for each window. We can then concatenate these into one longer feature vector. If the sentences are padded and truncated to a fixed length, we can feed this whole thing into a neural net; if not, we can do max pooling, finding the maximum over positions for each filter, run the result through another neural net, and then predict the label: is it a happy sentence or a sad sentence, a positive review or a negative review?

And if you look at this on a whole bunch of different problems, different methods win in different cases: sometimes an LSTM, as in a sequence-to-sequence encoder, gives the best results; sometimes a convolutional neural net over the embeddings with max pooling is the best one; sometimes a CLS token from BERT gives the best answer. So people who really care will typically try each of these and see which one works best for their particular problem. Minimal code sketches of a few of these approaches follow below.
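To make the first approach concrete, here is a minimal PyTorch sketch of averaging token embeddings into a sentence vector. The vocabulary size, embedding dimension, and token ids are all illustrative assumptions, not anything fixed by the discussion above.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a vocabulary of 10,000 BPE tokens, 128-dim embeddings.
vocab_size, embed_dim = 10_000, 128
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of two already-tokenized sentences (the token ids are made up).
token_ids = torch.tensor([[12, 845, 67, 3],
                          [901, 5, 44, 3]])

# Embed every token, then average over the sequence dimension:
# (batch, seq_len, embed_dim) -> (batch, embed_dim).
sentence_vecs = embedding(token_ids).mean(dim=1)
print(sentence_vecs.shape)  # torch.Size([2, 128])
```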
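The recurrent variants look similar. This sketch runs an LSTM over the embedded tokens and shows both choices mentioned above: taking the final hidden state, as a sequence-to-sequence encoder would, or mean/max pooling over all the hidden states. Again, all the sizes here are made-up assumptions.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10_000, 128)           # assumed vocab and dims
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

token_ids = torch.randint(0, 10_000, (2, 20))   # two 20-token sentences
hidden_states, (h_n, _) = lstm(embedding(token_ids))

# Final hidden state as the sentence vector (seq2seq-encoder style);
# works well as long as the sentences aren't too long.
final_vec = h_n[-1]                             # (batch, 256)

# Or pool over all hidden states instead: average or element-wise max.
mean_vec = hidden_states.mean(dim=1)            # (batch, 256)
max_vec = hidden_states.max(dim=1).values       # (batch, 256)
```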
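For the CLS approach, here is a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint. The tokenizer prepends the [CLS] token to each sentence automatically, and its contextual embedding sits at position 0 of the final layer's output.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["This movie was wonderful.", "This movie was terrible."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] is the first token of each sequence; its contextual embedding
# serves as the sentence vector for a downstream classifier.
cls_vecs = outputs.last_hidden_state[:, 0]      # (batch, 768)
```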
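And here is a sketch of the convolution-plus-max-pooling classifier walked through above, with filters of width two, three, and four and a final label prediction. The filter counts and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128,
                 widths=(2, 3, 4), n_filters=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per window width (2, 3, and 4 words).
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=w) for w in widths)
        self.classifier = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, embed_dim, seq_len) for Conv1d.
        x = self.embedding(token_ids).transpose(1, 2)
        # Each filter walks down the sentence; max-pool over positions
        # so sentences of any length give a fixed-size feature.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)   # concatenate per-width features
        return self.classifier(features)      # e.g. positive vs. negative

model = TextCNN()
logits = model(torch.randint(0, 10_000, (2, 20)))  # two 20-token sentences
print(logits.shape)  # torch.Size([2, 2])
```

Max pooling over positions is what makes the feature vector length-independent, which is why it is the natural choice when sentences aren't padded and truncated to a fixed length.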