Hey everyone! In this video, we're going to talk about word embeddings and see how embeddings have evolved over the course of NLP history. In natural language processing we deal with words, but computers don't understand words directly, so they need to be converted into numbers. Words are typically converted into numeric representations called vectors, and the individual numbers in these vectors may or may not be interpretable by humans. So for example, let's consider the case of n-grams. Here a sentence is converted into a numeric representation so that it can be understood by a computer. This representation has one position for every possible unigram, bigram, trigram, and so on that could be used to represent a sentence in English. For example, this first position counts the number of times the bigram "good day" occurs in the original sentence. It occurs once, and hence we have the number one here as its value, and we can do the same for every single unigram, bigram, and trigram. The great thing is that we now have a numeric representation that's understood by a computer, but there are several downsides. This is a massive vector, so if we're going to use it in our statistical modeling, we need to cherry-pick a few of its features to keep it usable. Because of this, it's also a very manual process: we might miss features that actually matter in the general case when constructing a model, and we just won't be aware of it. So it's definitely a time-consuming, laborious task that also may not produce very high-quality features. To solve these shortcomings, a neural probabilistic language model was introduced in 2003. Language modeling involves predicting which word will come next given a previous set of words. For example: "I want some French ___?"
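As a quick sketch of the n-gram representation described here, this toy function counts every unigram, bigram, and trigram in a sentence. The function name and example sentence are my own illustration; a real n-gram vector would have a fixed position for every possible n-gram in the vocabulary, while this sketch only stores the non-zero counts, which is how such sparse vectors are usually kept in practice.

```python
from collections import Counter

def ngram_vector(sentence, ngram_sizes=(1, 2, 3)):
    """Count every unigram, bigram, and trigram in the sentence.

    Each distinct n-gram corresponds to one position of the (huge)
    n-gram vector; we store only the non-zero entries.
    """
    tokens = sentence.lower().split()
    counts = Counter()
    for n in ngram_sizes:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

vec = ngram_vector("Have a good day")
print(vec["good day"])  # → 1 (the bigram "good day" occurs once)
```

Note how even this four-word sentence already produces nine distinct n-gram features; over a full vocabulary the vector becomes enormous, which is exactly the downside discussed above.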
A language model will take the sentence in and predict the next most probable word, which in this case is "toast". So this language model takes in the n previous words in order to predict the next word. What happens is that for every single word, we learn a dense representation: a vector with some fixed number of values representing that word. Unlike n-grams, the individual numbers in each of these vectors are not human-interpretable, but they can account for all kinds of patterns that humans might not be aware of. Another cool thing is that since this is a neural network, we can train it end to end to learn language modeling and, in the process, learn all of the word vector representations. The bad part is that this can be computationally expensive to train. For example, if every word is represented by 100 numbers and we concatenate the n context words together, the input is on the order of thousands of numbers, and the output layer spans a vocabulary of tens of thousands of words at the very least. Do the multiplication and you're looking at millions, or even tens of millions, of parameters. So it's not great for large vocabularies, for large numbers of training examples, or for larger dimensions per word, and ideally we'd want larger dimensions so we can capture more of the complexity of language, since language is very complex. Over the following decade, many other architectures were introduced to improve the quality of these embeddings. For example, one paper introduced the concept of a position for every single word to improve the embeddings, but it suffers from the same downside: it's pretty computationally expensive to train just to learn the word embeddings.
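The "millions of parameters" claim above is easy to verify with back-of-the-envelope arithmetic. The specific sizes below are illustrative assumptions on my part, not numbers from the 2003 paper, but they match the rough magnitudes mentioned in the text:

```python
# Rough parameter count for a Bengio-style neural language model.
# All sizes are assumed for illustration.
vocab_size = 20_000   # tens of thousands of words at the very least
embed_dim  = 100      # 100 numbers per word vector
context    = 5        # n previous words fed into the model
hidden     = 500      # hidden layer width (assumed)

embedding_table = vocab_size * embed_dim           # 2,000,000
input_to_hidden = (context * embed_dim) * hidden   # 250,000
hidden_to_vocab = hidden * vocab_size              # 10,000,000

total = embedding_table + input_to_hidden + hidden_to_vocab
print(f"{total:,} parameters")  # → 12,250,000 parameters
```

Even with these modest sizes the model lands in the tens of millions of parameters, and the `hidden_to_vocab` output layer grows linearly with vocabulary size, which is the scaling problem described above.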
In 2013, one of the larger breakthroughs in generating word embeddings occurred: word2vec. The paper introduced two models for learning word embeddings while preserving the idea of simplicity: the continuous bag-of-words (CBOW) model and the skip-gram model. In the CBOW case, we take the two previous words and the two next words in order to predict the word in the middle, and the projection layer is going to be the word embedding for that specific word. The skip-gram case does the exact same thing but in reverse: it takes a word and tries to predict its surrounding context words, and the projection, again, is the vector representation of the current word. Once you're done training either of these networks, you end up with a table mapping each word to its corresponding embedding. This is a much simpler architecture with far fewer parameters, and so began the age of pre-trained word embeddings and the idea of word2vec. However, this also had some cons. First of all, every single word gets the exact same vector regardless of its context. For example, "queen" in "drag queen" and "queen" in "king and queen" would have the exact same word embedding. But these word embeddings are supposed to represent meaning, so that doesn't really work out very well. Also, when generating these embeddings, we're only looking at a very limited context window: just the previous two words and the next two words during the training phase, which further limits contextual awareness. To further increase the quality of the embeddings generated, we next got ELMo, that is, Embeddings from Language Models. ELMo is basically a bidirectional LSTM model that is trained on the task of language modeling.
And hence it also learns word embeddings in the process of training the network. A cool thing about this is that the bidirectional LSTM captures some form of context during the training phase itself, including long-range dependencies; that's why it uses an LSTM, a long short-term memory cell. But the cons here are the cons we typically see with LSTMs. First of all, they are very slow to train, so slow that in practice they're trained with a truncated version of backpropagation through time. And they're also not truly bidirectional: they learn a forward context and a backward context separately and then concatenate them, rather than learning both at the same time, so some contextual information may be lost. Now, several months before the ELMo paper, which came out in early 2018, the Attention Is All You Need paper was published, and it introduced the architecture of the transformer neural network. The transformer consists of two parts: an encoder and a decoder. The encoder takes in the input words, adds positional encodings, and generates vectors that have contextual awareness. So if a three-word input is passed in, it will generate three word vectors here, all of them dense embedding representations that are supposed to preserve the meaning of each word. These are pretty good because they take care of the two downsides of LSTMs. LSTMs are slow to train, but transformers are quicker to train because we can pass in data in parallel and take advantage of GPUs. And LSTMs are not truly bidirectional because they learn forward and backward context separately and then concatenate them.
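As a concrete picture of the self-attention computation inside the encoder, here is a minimal sketch. It deliberately omits the learned query/key/value projections, multiple heads, and positional encodings of a real transformer; it only shows how every word mixes in information from every other word, before and after it, in one parallel step.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over word vectors X
    (shape: seq_len x dim). Every position attends to every other
    position, before *and* after it, all at the same time."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)     # pairwise similarity of words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ X                # context-mixed word vectors

X = np.random.default_rng(1).normal(size=(3, 4))  # three input words
out = self_attention(X)
print(out.shape)  # (3, 4): one contextualized vector per input word
```

Because the whole thing is matrix multiplication over the full sequence, it parallelizes well on GPUs, which is exactly the training-speed advantage over LSTMs mentioned above.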
But in transformers, we have an attention mechanism that allows a word to pay attention to the words before it and the words after it in order to generate context, and all of this happens at the same time. So the context of each word can be generated really well, and very fast too. Building on top of this architecture, we have BERT and then GPT. BERT is Bidirectional Encoder Representations from Transformers, where we pre-train the model on masked language modeling and next-sentence prediction. All of this helps the model develop contextual awareness of every word so it can generate better word embeddings, and we can then fine-tune it on much less data for tasks like question answering or translation. Because they have the transformer architecture and are also learning these language tasks, the embeddings that BERT and GPT produce for every single word are just better than anything we had seen before. And this is why GPT especially is right now at the core of many of the language models we see today, like ChatGPT. So I hope this helped you understand the concept of embeddings, the history of embeddings, and how they've evolved over time. If you liked this video, please do give it a like, share, and subscribe, and I will see you in the next one. Now, videos like this are fun yet challenging to make, so I want to take some time to talk about our sponsor, Taro. This is a social platform to help software engineers grow in their careers. Say you land a software job: then what? It can be really hard to navigate your career, and it's tough to get good career advice. Taro facilitates these discussions whether you are entry level or senior. You can be part of discussions and get advice from software engineers across many companies. There are many non-technical questions that I wish I could have asked someone in the past to advance my career, but I never really found a good forum to do so.
But I think Taro is that good place. I'm a machine learning engineer, which does overlap with software engineering, and while the platform doesn't have too many machine learning engineering questions at the moment, I'm doing my best to answer any that are there whenever I can. I think this community is a really nice one to be part of. So if you're looking for a premium community of software engineers, consider signing up for Taro using my link in the description to get 20% off your annual purchase.