In this video, we're going to talk about word2vec. Computers don't understand words, but they do understand numbers, so we need to convert words into some numeric representation. These numeric representations are called embeddings. Ideally, the closer two of these representations are to each other, the closer the words are in meaning. So two words with similar meanings would sit close to each other over here, while a word like "apple" would be way up here, far away.

Early representations of words and sentences involved breaking them down into their corresponding n-grams. A sentence would be represented as a vector of word n-grams, where every single word is a unigram, two words form a bigram, and three words form a trigram. While this is a very interpretable vector, it is very large. So in 2003, neural probabilistic language models were introduced, which internally learn a continuous, dense representation of each word. This makes the vector a fixed size for every single word, and a much smaller one, on the order of hundreds of numbers per vector. These representations live in a continuous space: it's continuous because every single point in the space is a valid point, although not every point corresponds to a word.

About a decade later, in 2013, word2vec was introduced. Word2vec is a framework for creating those dense, continuous vectors, and it does so with two main architectures: the continuous bag-of-words (CBOW) architecture and the continuous skip-gram architecture.

To illustrate this with an example, let's say we have the sentence "the biggest lie ever told", and we want to train a single-layer neural network to learn the word embeddings. Let's say the window size is two. That means that in order to predict, say, the word "lie", we consider the two words that come before it and the two words that come after it. So "the", "biggest", "ever", and "told" are the inputs to the CBOW model. Each of these is initially a one-hot encoded vector whose size is the same as the vocabulary size. Let's say the vocabulary is 100 words, so each of these is a 100×1 vector.

Each one-hot vector now has to be projected onto a smaller, dense vector through a set of shared weights. Say we want the dense representation to be 256-dimensional; then the shared weight matrix has dimensions 100×256. If you multiply a 1×100 one-hot vector by the 100×256 matrix, you get a 1×256 vector, which is nice and dense, and this vector is the representation of the word "the". The weights are shared because we do the exact same thing over here: the exact same 100×256 matrix is multiplied with "biggest" to get another 1×256 vector, and the same with "ever" and "told". We then take the sum of these four dense vectors to get the final 1×256 projection.

From this dense representation, we now need to expand back to the vocabulary size of 100. So we multiply by a 256×100 matrix here in order to produce the output for the word "lie". This is a very simple network, which we can then train through backpropagation.
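To make the shapes concrete, here's a minimal NumPy sketch of that CBOW forward pass. It uses the toy numbers from the example (a 100-word vocabulary, 256-dimensional embeddings), and the word indices are made up purely for illustration:

```python
import numpy as np

vocab_size, embed_dim = 100, 256
rng = np.random.default_rng(0)

# The shared 100x256 input projection and the 256x100 output expansion.
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))

def one_hot(index):
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Hypothetical vocabulary ids for "the", "biggest", "ever", "told".
context_ids = [3, 17, 42, 58]

# Each 1x100 one-hot times the shared 100x256 matrix gives a 1x256 dense
# vector; summing the four context vectors gives the final projection.
projection = sum(one_hot(i) @ W_in for i in context_ids)

# Expand back to vocabulary size and softmax: P(center word | context).
scores = projection @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape)  # (100,) -- after training, peaked at the id of "lie"
```

Note that multiplying a one-hot vector by W_in just selects one row of the matrix, which is why the trained matrix can later serve as a lookup table.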
The idea here is to use a set of input context words to predict the central word. Skip-gram is very similar, except in reverse: we take the word "lie" as the input and predict the outer context words.

Now, I mentioned that the matrix we multiply with is the exact same. While it looks a little strange in this diagram, we're actually passing in training examples as pairs. For this sentence, the positive pairs are: "lie" is the input and "the" is the output; "lie" is the input and "biggest" is the output; "lie" is the input and "ever" is the output; and "lie" is the input and "told" is the output. So we're actually just processing one of these pairs at a time, and the position of the context word doesn't matter at all. And the matrices we use for each pair are the exact same overall. Both of these architectures optimize the same objective: is the output word in the context of the input word?

In the end, these weight matrices become a word-to-vector mapping. We can give in a word, and for that one word, get its corresponding 256-dimensional vector representation. Similarly here, once the model is trained, we can give in a word and get its corresponding vector. It's just a lookup table at that point.

So, the pros. First, this architecture overcomes the curse of dimensionality with simplicity. It's a super simple architecture, basically a feed-forward network with one hidden layer, so it's very quick and easy to train. And the issue we're overcoming here is the curse of dimensionality: n-gram vectors are very large, so if you were to represent them in an embedding space, that space is also very large, and distances between points in that space are very large. They're so large that determining which points are close to each other becomes almost impossible, because the distances are just that big. This is what we call the curse of dimensionality, and embedding representations become useless in such a large space. Here, on the other hand, we're able to learn a very compressed representation of words, so pairwise distances in the embedding space become more meaningful. And this transitions well into the second pro: closer meaning equals closer vectors.

Another cool thing is that pre-trained embeddings can be used in a host of applications. Even in industry today, there are a lot of applications that simply map words to their pre-trained vectors and use those representations directly.

Another pro I thought was pretty cool is that it's self-supervised. The input to this model and the output of this model come from the same example, just a sentence; the sentence itself serves as both the input and the output. Because there's an input and an output, it's supervised; and because both are provided by the same sentence, it's self-supervised. And because it's self-supervised, it's very easy to get more data: we really just need more sentences to train on.

That said, the word2vec framework does have some cons. First of all, global information is not accounted for. We're literally just using individual, local contexts to generate the word vectors, for both skip-gram and CBOW. This can be solved with global vector embeddings, GloVe embeddings, which we'll take a look at next.
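Here's a small sketch of how those training pairs can be generated. This is plain Python illustrating the windowing described above, not the actual word2vec tooling:

```python
def skipgram_pairs(tokens, window=2):
    """For each center word, emit (input, output) pairs in which the
    center word predicts every word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the biggest lie ever told".split())
# With "lie" as the center word, this yields exactly the pairs above:
# ('lie', 'the'), ('lie', 'biggest'), ('lie', 'ever'), ('lie', 'told')
```

Flipping each pair around gives you the CBOW direction, where a context word is the input and the center word is the output.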
A second con is that it doesn't work very well for morphologically rich languages. This can be solved with fastText, which we'll also look at. And the third is that it lacks broader context awareness, which we've looked at in the past with LSTMs, BERT, and GPT; I'll link those videos in the description.

GloVe, or global vectors, has the same objective of learning these word representations. However, instead of using just local contextual information from a single sentence, it uses global information across different sentences to learn the word representations and embeddings. More specifically, we construct a word-word co-occurrence matrix: for every word, we count how often other words appear in its context. So, for example, this entry over here is the number of times the word "hoax" appears in the context of the word "COVID", and this one over here is the number of times the word "believe" appears in the context of the word "hoax". Once we have this word-word co-occurrence matrix, we can construct vector representations of words from it, and they work just as well as in the CBOW and skip-gram case.

To illustrate what we're doing here a bit more: let's say we have two words, "ice" and "steam", and we want to determine which of the words "solid", "gas", "water", and "fashion" are closer to each of them in meaning. With global vectors, we do this via a probability ratio. This here is the probability that a word like "solid" appears in the context of "ice". This here is the probability that "solid" appears in the context of "steam". And this is the ratio of those two probabilities. If the probability ratio is very large, we know the word is more related to ice than to steam. If the probability ratio is very small, then we know it is more related to steam than to ice. But if the probability ratio is somewhere around one, then it's not really relevant to either ice or steam.

And what we can see here is that "solid" has a very high probability ratio, so it's more associated with ice. "Gas" has a low probability ratio, closer to zero, and hence is more associated with steam. "Water" and "fashion" are somewhere in the middle, not really associated with either of them as much. And this is exactly what we would want in the embedding space: we want the vector for "solid" to be closer to "ice", we want the vector for "gas" to be closer to "steam", but we don't want "water" to be closer to either of them, nor do we want something like "fashion" to be close to "ice" or "steam". And because this is exactly what we want, GloVe builds a neural-net architecture that optimizes this probability-ratio objective. So this is just another way to learn word representations.
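Here's a tiny NumPy sketch of that probability ratio. The co-occurrence counts are completely made up for illustration; only the pattern of the ratios matters:

```python
import numpy as np

# Toy word-word co-occurrence counts (rows: "ice", "steam").
probes = ["solid", "gas", "water", "fashion"]
counts = np.array([
    [190,   7, 300,  2],   # counts in the context of "ice"
    [ 10, 160, 310,  2],   # counts in the context of "steam"
], dtype=float)

# P(k | ice) and P(k | steam): normalize each row into probabilities.
p_ice = counts[0] / counts[0].sum()
p_steam = counts[1] / counts[1].sum()

# The ratio P(k | ice) / P(k | steam) is what GloVe's objective is built
# around: >> 1 means related to ice, << 1 means related to steam,
# and around 1 means not particularly related to either.
for word, ratio in zip(probes, p_ice / p_steam):
    print(f"{word:8s} {ratio:6.2f}")
# solid comes out >> 1, gas << 1, water and fashion near 1.
```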
Another issue we had with word2vec is that it cannot handle morphologically rich languages very well. By morphologically rich, I mean that certain words change their form because of gender, prepositions, or the actions involved. This is something you see more in non-English languages, like Finnish, Turkish, and Arabic with their inflections, and also in a lot of Indian languages. I'm going to illustrate this with an Indian language called Kannada, which is spoken in South India, using this example here. Let's say we have "house", "from the house", and "in the house". All of these technically describe the same thing: a house.

But if you were to translate these into Kannada, "house" is written as "mane". "From the house", with the preposition "from", becomes "maneyinda". And "in the house", with the preposition "in", becomes "maneyalli". So technically each of these is a single word, and each looks different, but they all have the exact same underlying meaning. Because of this, we want the vector representations to reflect it too, by making sure the vectors for each of these words are as close to each other as possible. Unfortunately, though, with word2vec we treat each of these words as completely independent vectors. And there are cases where one of these words, say "mane", occurs in the corpus, but "maneyinda" probably doesn't; we should still be able to infer its representation even though it's not in the corpus at all. In English, you can see it's all just "house", with no differences, so English here is not a morphologically rich language.

The way that fastText handles this situation is by considering subword information. We break each word into its corresponding character n-grams: trigrams, 4-grams, 5-grams, and 6-grams. We then create vectors for those subwords and aggregate them to represent the word as a whole, and we can use those representations in, say, the skip-gram architecture to learn appropriate word embeddings. Because these words share so many subword units, whether it's the common root here or the similar stems in Finnish, Turkish, and Arabic where only the inflections differ, considering character-level information makes their vector compositions very similar. So they end up closer to each other in the vector space, and hence closer in meaning. FastText is a great way to handle morphologically rich languages via subwords.

All four of the word-embedding architectures we've mentioned still have a major downside: they lack context-aware embeddings. Currently, with word2vec, we give it a word and always get the same vector back, regardless of the context. But context does matter. For example, the queen in "that drag queen slays" and the queen in "she has a queen and ace for a perfect hand" are both "queen", but they are very different in meaning. In order to address this we have LSTMs, then later BERT and GPT, and of course the modern large language models of today, which account for context a lot better, all in different ways though. I've created videos on those, and you can check the links in the description below.

But that's all I have for today. Thank you all so much for watching. If you liked the video, please give it a like, subscribe, and share, and I will see you in the next one as we continue our discussion through the world of natural language processing.