Hello everyone. Welcome to another episode of Code Emporium, and this time we are going to talk about sentence embeddings. An embedding is a numerical representation of some language. We need this because computers don't understand language, but they do understand numbers. So we have to perform some transformation that turns language, in the form of a sentence, into a vector representation that a computer can work with. To see how exactly we can construct such high quality representations of sentences, we're going to cover a lot of topics, and with each one we will see some advancement that eventually leads to this world of understanding strings of words that we call sentences, and how computers can understand them too. So let's get to it.

We start the discussion with n-gram vectors. With n-gram vectors, we take a sentence, break it down into its corresponding unigrams, bigrams, trigrams, and so on, and then use those counts as the entries of a vector. So, for example, this first item here could be a bigram feature that says how many times the bigram "good day" occurs in the sentence. Similarly, we can have a unigram feature here: it's one if that word occurs once, two if it occurs twice, and so on. This was one of the earliest representations of text, introduced by the father of information theory, Claude Shannon, way back in 1948.

Another very early method of representing sentences was the TF-IDF vector. TF stands for term frequency, and IDF stands for inverse document frequency. When converting a sentence into a TF-IDF vector, we compute the TF-IDF score of every single word in the vocabulary with respect to the sentence, and then populate those scores into a vector. So, for example, this here would be the TF-IDF score of the word "good", and this here might be the TF-IDF score of the word "this". The size of this vector is the vocabulary size.

A cool thing about both the n-gram and TF-IDF representations is that they are super simple to understand and interpret. However, they have the drawback of being extremely sparse and extremely large. The TF-IDF vector is of the order of the vocabulary size, and the n-gram vector is also of the order of the vocabulary size and grows as the vocabulary itself grows. This means the vector representations of sentences end up so far apart that you can't really tell which sentence is closer to which. This is a super common problem that we call the curse of dimensionality, and it plagues many fields of artificial intelligence. I have a full video discussing the curse of dimensionality right here.

So clearly, sentences are a little tough for computers to understand. Let's try to break this down by considering word representations first. In 2001, a paper called "A Neural Probabilistic Language Model" was introduced, and it brought in the concept of dense word representations. These are individual word vectors of maybe 64, 128, or 256 dimensions; every word is a vector of that same size, and most of the numbers in these vectors are non-zero, which is why they are called dense word representations. The idea is that if we were to plot all of these vectors in some hypothetical space, then the closer two vectors are physically, the closer their meanings should be; at least, that's the ideal case.
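To make these two classic representations concrete, here is a minimal sketch using scikit-learn's CountVectorizer and TfidfVectorizer; the toy sentences are just illustrations, and in practice the vocabulary (and therefore the vector size) would be far larger.

```python
# A minimal sketch of n-gram count vectors and TF-IDF vectors with scikit-learn.
# The example sentences are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["have a good day", "this is a good day", "this day is long"]

# n-gram counts: each column counts a unigram or bigram such as "good day"
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectors = ngram_vectorizer.fit_transform(sentences)
print(ngram_vectorizer.get_feature_names_out())
print(ngram_vectors.toarray())  # sparse, vocabulary-sized rows

# TF-IDF: each column is the TF-IDF score of one vocabulary word
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(sentences)
print(tfidf_vectors.toarray())
```

Notice that every sentence vector has as many columns as there are vocabulary terms, which is exactly the sparsity and dimensionality problem described above.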
And so a lot of the later progress in NLP went into fine-tuning these vectors to be as high quality as possible, such that vectors that are close together actually correspond to meanings that are close together. For example, we have the word2vec framework that was introduced in 2013. Word2vec actually consists of two different architectures: one is the continuous bag of words (CBOW) architecture, and the other is the skip-gram architecture. Both are very similar in how they operate. In the CBOW case, during training we pass in the context words around a position in a sentence, aggregate them, and the model learns to predict the center word; the opposite is true for the skip-gram case, where the center word is used to predict the context words. Once these models were trained, we could essentially just look up the word vector for a given word, and it wouldn't change with respect to context. So, for example, the "queen" in "drag queen" and the "queen" in "king and queen", despite having different meanings, would still be mapped to the same vector representation.

So that's all about words. But what about sentences? Because sentences are what we care about in this context. Well, one way to think about sentences is that if we have a bunch of word vectors, we can just take their average, and the resulting vector becomes the representation of the sentence. This is a very simple architecture, and it is essentially the idea behind the neural bag of words. It is called a bag of words because we simply average the words in a sentence without any regard to their positions, so the sentence is treated literally as a bag of word vectors. And because this is a neural approach using neural networks, it is a neural bag of words. A sketch of this averaging idea follows below. But this approach clearly has the issue of ignoring contextual information.

Aside from the neural bag of words, there have been many strides toward actually preserving some form of contextual information when creating word vectors and sentence vectors. One example is the time delay neural network (TDNN). Time delay neural networks were introduced in 1989 to address phoneme detection. Say you have an input sequence of sound waves from someone speaking; using a bunch of convolution and pooling operations, we eventually treat this as a classification problem, where the sound spoken is classified into the phoneme spoken, for example "ba", "da", or "ga". Now, while these networks were introduced way back in 1989, it was only in the 2010s that they really had a resurgence and started being applied to more complex tasks in natural language processing. This happened for a variety of reasons. The first is software advancements: all of the progress on continuous, dense word representations and word2vec had not yet taken place, and once it did, it became a much easier lift to apply these networks to larger NLP tasks. Another reason was hardware: GPUs, and a lot more compute and storage. And finally, large amounts of data became available, which these networks need in order to learn complex patterns for much more complex tasks. The time delay neural network was also not the only one to use convolution to solve NLP tasks.
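Here is the neural-bag-of-words averaging idea as a minimal sketch. The tiny embedding table is made up; in practice you would look up pretrained word2vec or GloVe vectors instead.

```python
# A minimal sketch of the "bag of word vectors" idea: average the word vectors
# of a sentence to get one sentence vector, ignoring word order entirely.
import numpy as np

embedding_dim = 4
word_vectors = {  # hypothetical pretrained vectors, for illustration only
    "have": np.array([0.1, 0.3, -0.2, 0.5]),
    "a":    np.array([0.0, 0.1,  0.0, 0.1]),
    "good": np.array([0.7, 0.2,  0.4, -0.1]),
    "day":  np.array([0.6, -0.3, 0.5, 0.2]),
}

def sentence_embedding(sentence: str) -> np.ndarray:
    """Average the vectors of known words; word positions play no role."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(embedding_dim)
    return np.mean(vectors, axis=0)

print(sentence_embedding("have a good day"))
```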
We had the dynamic convolutional neural network (DCNN), which can take in a sentence and learn a representation of that sentence in order to solve some broader NLP task. In doing so, this specific DCNN actually has a dynamic behavior that depends on the length of the sentence: it performs a dynamic k-max pooling operation, so the longer the sentence, the more of these pooled values (the red squares here) are selected at every layer. It was another way to use the convolution operation to capture contextual information about a sentence and represent it as a single vector.

Another stride toward including contextual information in a sentence embedding came through recurrent neural networks. You can think of a recurrent neural network as a feed-forward neural network that is unrolled over time, which is why it works well with sequence data. Recurrent neural networks, despite being introduced in the 1980s, were notoriously hard to train, and that was until a 2013 PhD dissertation by Ilya Sutskever showed how they could be trained stably. A lot of the NLP research that came after 2013 had its state of the art relying on some form of recurrent neural network. The diagram shown here is that of a long short-term memory network, or LSTM. This network was able to handle much longer-term dependencies, so for processing very long sentences and documents with some contextual understanding, LSTM networks were the state of the art for quite a long time. But an issue with this network is that it is very slow to train: it learns its parameters with backpropagation through time (often in a truncated form), and this is exceptionally slow.

To combat this issue of slow training, we got the transformer neural network, introduced in 2017, with an encoder here and a decoder. The encoder takes some input sentence and generates word vectors, and the decoder takes the encoder's output along with the tokens generated so far and produces the next token. The decoder generates tokens one at a time, but the encoder can process its inputs in parallel, and this parallel processing is what makes training these networks so much faster than their recurrent counterparts. They also make use of attention mechanisms throughout, and with these attention mechanisms, long-term dependencies can be captured effectively. So after 2017, a lot of sequence-to-sequence tasks, such as machine translation, started to use transformer networks.

The issue with transformer networks, and with all of the networks that came before them, is that they need to be trained from scratch for every single problem, even though some of those problems might be similar to each other. For example, if I want to train a model for language translation, I might need, let's say, 100,000 examples, give or take. Now, even if I have a model that already knows language translation, if I want it to learn question answering, I essentially need to retrain it from scratch, and that would need, say, another 100,000 examples.
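To illustrate the recurrent approach from this section, here is a rough sketch (my own illustration, not code from the video) of an LSTM-based sentence encoder in PyTorch: embed the tokens, run them through an LSTM, and use the final hidden state as the sentence vector.

```python
# A minimal sketch of an LSTM sentence encoder; sizes and vocab are arbitrary.
import torch
import torch.nn as nn

class LSTMSentenceEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer token ids
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)                  # (batch, hidden_dim) sentence vectors

encoder = LSTMSentenceEncoder(vocab_size=10_000)
dummy_batch = torch.randint(0, 10_000, (2, 7))    # two "sentences" of 7 tokens each
print(encoder(dummy_batch).shape)                 # torch.Size([2, 256])
```

Because each step depends on the previous hidden state, the tokens have to be processed sequentially, which is exactly the slowness that transformers later addressed with parallel encoding.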
But in many cases, we just don't have that much data available for every single little NLP task. So it would be nice to leverage the concept of transfer learning, where a model first acquires a basic understanding of language, and then we fine-tune this pre-trained model to understand a specific task. This means we don't need that much data for a much wider variety of NLP tasks, and this is why BERT and GPT were introduced. In BERT, we have a pre-training phase and then a fine-tuning phase. The BERT model is pre-trained on masked language modeling, where it tries to fill in the blanks, predicting the masked words in a sentence. It is also trained on next sentence prediction: given two sentences, does sentence A contextually come before sentence B, or are they unrelated? That is more like a classification problem. Once we have a model pre-trained in this way, it has some understanding of language, and it can be fine-tuned for other kinds of tasks, for example natural language inference, question answering, and named entity recognition.

Up to this point, BERT is fantastic, but it still only deals with word embeddings, and in this video we're really concerned about sentence embeddings. So how do we go from learning representations of words to representations of sentences? This is where sentence transformers come into the picture. Sentence transformers are built on top of BERT in order to get really high quality sentence embeddings that ideally preserve meaning. The simplest sentence transformer you could come up with is basically this: pass a sentence into BERT, BERT generates high quality word vectors, you take the average of those vectors, and you get a final vector u that represents the sentence. This is very similar to the neural bag of words approach, but this time using BERT. The issue with this approach, though, is that the resulting sentence embedding is not very high quality if you simply average the word vectors; in fact, it can even be worse than averaging GloVe or word2vec embeddings. So this doesn't work out very well.

That said, we can actually train a similar BERT architecture, which can look like this or like this, to create better sentence embeddings. In this case, we can train these two architectures on two tasks. One of them is natural language inference: given two sentences, can we determine whether sentence A entails sentence B, contradicts sentence B, or whether they are just unrelated and neutral. Using this, we train what we call a Siamese network, because these are exactly the same networks with the same weights; we just pass two different sentences through the same architecture. We generate two sentence embeddings that represent sentence A and sentence B, then concatenate embedding A, embedding B, and the element-wise absolute difference between them into one long vector, and pass it through a softmax classifier to classify the pair as entailment, contradiction, or neutral. That's natural language inference. The second task is semantic textual similarity, where we try to determine how similar sentence A is to sentence B using a cosine similarity score.
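Here is a rough sketch of that Siamese classification objective (my own illustration under the description above, not the exact Sentence-BERT code): encode sentence A and sentence B with the same encoder, pool them into vectors u and v, and classify the concatenation [u, v, |u - v|] into the three NLI labels.

```python
# A minimal sketch of the Siamese NLI classification head described above.
import torch
import torch.nn as nn

class SiameseNLIHead(nn.Module):
    def __init__(self, embed_dim: int = 768, num_labels: int = 3):
        super().__init__()
        # one linear layer over [u, v, |u - v|], i.e. 3 * embed_dim input features
        self.classifier = nn.Linear(3 * embed_dim, num_labels)

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)  # logits for entailment / contradiction / neutral

# u and v would normally come from the shared BERT + pooling encoder;
# random tensors stand in for them here.
u = torch.randn(8, 768)   # batch of 8 sentence-A embeddings
v = torch.randn(8, 768)   # batch of 8 sentence-B embeddings
head = SiameseNLIHead()
logits = head(u, v)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
print(logits.shape, loss.item())
```

The classifier head is only needed during training; it is the shared encoder underneath that we actually keep, as the next part explains.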
Once we train this BERT-plus-pooling architecture using either or both of these objectives, we can strip out just the BERT and pooling section and use that as our final sentence transformer. Now, if you pass a sentence through it, you get a vector that much better represents what that sentence actually means. This can be used in so many applications: for example, in search, where you have a query and you want to make sure the results are semantically similar to what you searched for, or in question-and-answer sites like Quora or Stack Overflow. And a lot of this fundamental understanding is also behind how the modern language models we see burgeoning today are trained, and how they understand longer documents and longer sentences too. I have made resources on almost every one of these individual topics that you can check out in the description down below. And if you think I deserve it, please do give this video a good like, and subscribe for more amazing content. I will see you very soon. Bye bye.
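As a final concrete sketch of how such a trained encoder gets used for semantic search, here is a minimal example with the sentence-transformers library; the checkpoint name "all-MiniLM-L6-v2" is just one commonly used model and can be swapped for any other.

```python
# A minimal sketch of semantic search with a pretrained sentence transformer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "How do I reset my password?",
    "What is the capital of France?",
    "Steps to recover a forgotten login password",
]
query = "I forgot my password, how can I get back in?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# rank corpus sentences by cosine similarity to the query
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best], float(scores[best]))
```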