In this video, we're going to go through 20 of the most important research papers that made language models as powerful as they are today, covering research all the way back to the 1940s. So let's get to it.

First off, language models predict the next word in a sentence given its previous words. These models don't actually process words, though; they process numbers. So we need a way to represent language with numbers so that the models can understand it and eventually perform intelligent tasks with it. This is the basis of language processing, and the 20 papers we're going to go through are all about representing language with numbers as effectively as possible.

We start off with A Mathematical Theory of Communication, written in 1948 by Claude Shannon, who is regarded as the father of information theory. Since we want to represent language with numbers, one way to do it is to represent a sentence with an n-gram vector: break the sentence down into its n-grams and mark which n-grams are present. This was one of the earliest ways computers understood sentences (there's a small code sketch of this idea a little further down).

The second paper is A Neural Probabilistic Language Model. An issue with the n-gram approach is that the vectors are sparse and extremely high-dimensional, which makes them unwieldy for computers to process. This paper introduced the idea of using a neural network for language modeling, and during training the model also learns to represent each word as a dense, fixed-size vector, maybe 64, 128, or 256 dimensions. That's a lot smaller than the n-gram vectors.

The next paper is Natural Language Processing (Almost) from Scratch. Until now, we learned word vectors by training a neural network on language modeling or some other specific task, and if the task changed, we would have to train those word vectors from scratch all over again. So instead of training a neural network on one specific problem, why not train it on multiple problems with a unifying architecture, so that different problems share the same word vector representation? That's a philosophy we still follow today. In this research, the core architecture uses convolution to learn the vectors, inspired by earlier work on time delay neural networks introduced in 1989 for phoneme recognition. Overall, this paper introduced the idea of learning shared representations of words.

Next up, we have Efficient Estimation of Word Representations in Vector Space. This is the paper that introduced word2vec. Keep in mind that the goal, ever since the 1948 paper, has been to represent text, meaning words and sentences, with vectors that preserve meaning and are tractable for computers. The architecture in NLP (Almost) from Scratch does learn word representations of reasonably high quality, but it needs a lot of data and a fairly complex architecture to do so. word2vec learns word vectors with a much simpler setup. word2vec itself is a framework with two architectures: one is CBOW (continuous bag of words) and the other is skip-gram. We train the network by passing one-hot word vectors in and out, and the hidden layer in the middle ends up holding the dense word representation we're after. Once it's trained, we have a dictionary mapping every word to its vector representation.
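Going back to the Shannon-era n-gram idea for a second, here's the small sketch I mentioned: a sentence represented by which bigrams from a toy vocabulary it contains. The bigram vocabulary here is made up purely for illustration; a real system would build it from a corpus.

```python
# Minimal sketch: represent a sentence by which word bigrams it contains.
# The bigram vocabulary below is a toy example, not from any real corpus.

def bigrams(sentence):
    words = sentence.lower().split()
    return list(zip(words, words[1:]))

vocab = [("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "mat")]

def ngram_vector(sentence):
    present = set(bigrams(sentence))
    # 1 if the bigram occurs in the sentence, 0 otherwise (a sparse presence vector)
    return [1 if bg in present else 0 for bg in vocab]

print(ngram_vector("The cat sat on the mat"))   # [1, 1, 1, 1, 1]
print(ngram_vector("The dog sat on the mat"))   # [0, 0, 1, 1, 1]
```

With a realistic vocabulary, this vector has tens of thousands of entries and almost all of them are zero, which is exactly the sparsity problem the neural approaches set out to fix.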
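And here's a minimal sketch of what training word2vec looks like in practice, assuming the gensim library; the tiny corpus is just a toy example.

```python
# Minimal sketch of training word2vec on a toy corpus, assuming gensim is installed.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects the skip-gram architecture; sg=0 would give CBOW.
model = Word2Vec(corpus, vector_size=64, window=2, min_count=1, sg=1, epochs=50)

# Once trained, every word maps to a dense, fixed-size vector.
print(model.wv["cat"].shape)        # (64,)
print(model.wv.most_similar("cat")) # nearest words by cosine similarity
```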
A similar paper is GloVe: Global Vectors for Word Representation. It's similar to word2vec, but instead of relying only on local context windows, it uses global co-occurrence statistics of words to form the final dense word representations.

Another closely related paper is Enriching Word Vectors with Subword Information. This is the paper that introduced fastText, which is also similar to word2vec but is especially useful for morphologically rich languages like Finnish, Arabic, and some Indian languages, where words change their form slightly based on things like gender or case. If you want more information on word2vec, GloVe, and fastText, I have a full video as well.

Next is A Convolutional Neural Network for Modelling Sentences. We've talked about representing words as vectors, but sometimes we also want to represent whole sentences as fixed-length vectors in a way that preserves their meaning. The DCNN in this paper does that using convolutions and k-max pooling operations, where k depends on the size of the input, which is what makes the network dynamic (there's a rough sketch of k-max pooling just after this section). This way, longer sequences of words can be represented pretty well. For more on dynamic convolutional neural networks and time delay neural networks, you can check out my companion video here.

The next paper I want to introduce is Learning Internal Representations by Error Propagation. This was written in 1985 and introduced recurrent neural networks. Like dynamic convolutional neural networks, recurrent neural networks can process sentences, build representations of them, and even solve tasks involving them. The paper, however, is laced with a ton of math, so I'd actually recommend reading chapter 10 of the Deep Learning book instead, because it gives a good intuition about recurrent neural networks from scratch and has very nice visuals to go with it.

Next up is Long Short-Term Memory. The main issue with traditional recurrent neural networks is that their gradients either explode or vanish, and the LSTM architecture was introduced to combat this for longer inputs. The outline of this paper is fairly clear, but there's a lot of math laced into it as well, so here I'd also recommend the popular blog post on the topic by Chris Olah, which explains LSTMs very well with some stunning visuals. For additional information, I have a full companion video on this topic too. And yes, I make a lot of videos.

I also want to call attention to a PhD thesis by Ilya Sutskever. The premise here is that the original RNNs and LSTMs, introduced in the 80s and 90s, didn't really gain popularity until after 2010. A big reason was that they were very tricky to train, and this thesis showed how they can be trained stably using techniques for parameter initialization, optimization, and regularization. This is why recurrent neural networks for NLP became so popular around 2013 to 2016.

Next up, we have Deep Contextualized Word Representations. Let's pivot back to word representations like the ones from word2vec. The goal is to get really high-quality embeddings for words, and the main issue with word2vec is that it doesn't care about context: the same vector is generated no matter where the word appears in the sentence. That's a problem, because a word's meaning can change while its vector doesn't.
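Quickly circling back to the dynamic CNN paper, here's the rough sketch of k-max pooling I mentioned. The helper that picks k from the sentence length is a simplified stand-in; the paper's actual rule also depends on the layer depth.

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """Keep the k largest values along the sequence axis, preserving their original order."""
    # feature_map: (num_filters, sequence_length)
    idx = np.argsort(feature_map, axis=1)[:, -k:]   # indices of the k largest values per filter
    idx = np.sort(idx, axis=1)                      # keep them in their original word order
    return np.take_along_axis(feature_map, idx, axis=1)

def dynamic_k(sequence_length, k_top=3):
    # Toy version of the "dynamic" part: longer inputs keep more values,
    # but never fewer than k_top (the paper's formula is more involved).
    return max(k_top, sequence_length // 2)

fm = np.random.randn(4, 10)                 # 4 convolutional filters over a 10-word sentence
pooled = k_max_pooling(fm, dynamic_k(10))   # -> shape (4, 5)
print(pooled.shape)
```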
To solve this, ELMo takes context into account: every word vector it generates is a function of all the words in the sentence. It does this with a bidirectional LSTM architecture. The LSTM lets it train on long sequences, and the bidirectional part lets ELMo look at the words that come before a word as well as the words that come after it. The result is a deep contextualized word representation, hence the paper's title, and we can fine-tune this on other NLP problems.

The next paper I want to talk about is a very famous one called Attention Is All You Need. Another issue with LSTMs and RNNs is that they are slow to train, because data has to be passed into them sequentially. The Transformer drops recurrence entirely: the architecture consists of an encoder and a decoder that use attention to handle long-range dependencies, kind of like LSTMs do, but the whole sequence can be processed in parallel. This Transformer architecture was the turning point from recurrence-based architectures to attention-based methods, and it set the stage for modern language models. For a brief overview of this topic I have a video, and to code it out completely from scratch I have a playlist of videos. More fun, more homework.

Next up, we have BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT is essentially a stack of Transformer encoders. The main issue with plain Transformers is that we need to train the architecture from scratch every single time we take on a new language task. For language modeling, question answering, and machine translation, we need a lot of examples of each, and those may not be easy to get your hands on. To deal with this, BERT splits its training into two phases: a pre-training phase, where it trains on masked language modeling and next sentence prediction, and a fine-tuning phase on whatever NLP task we actually care about. The idea is that with fine-tuning, you don't need nearly as much data (there's a tiny fine-tuning sketch a little further down).

Next up, we have Improving Language Understanding by Generative Pre-Training. This is the paper that introduced GPT. GPT is a stack of Transformer decoders that also leverages transfer learning to learn language tasks, following the same pre-training and fine-tuning recipe we saw with BERT.

Next is the paper Language Models are Unsupervised Multitask Learners, which introduced GPT-2. The older versions of GPT and BERT still required quite a bit of data for fine-tuning, overfitting is easy, and honestly the fine-tuning process isn't really how humans learn. We don't need hundreds of thousands of fine-tuning examples just to understand how to translate from one language to another or how to complete a sentence; we might need just one or a few examples. This is where meta-learning, things like zero-shot, one-shot, and few-shot learning, comes into play. Now, while GPT-2 wasn't super successful at this, the paper Language Models are Few-Shot Learners, which introduced GPT-3, was. GPT-3 is essentially a scaled-up version of GPT-2, trained as a suite of model sizes whose largest has 175 billion parameters. For more information on GPT-1, 2, and 3 and their progression, I suggest you check out this video.
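To make the zero-, one-, and few-shot idea concrete, here's a tiny sketch of what a few-shot prompt might look like. The model call at the end is a hypothetical placeholder, not a real API.

```python
# A few-shot prompt: a couple of worked examples, then the query we want completed.
prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""

# Hypothetical call -- in practice you'd send this prompt to a GPT-3-style model
# and read off its completion (ideally " merci").
# completion = some_language_model.generate(prompt)
```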
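And going back to BERT's pre-train-then-fine-tune idea for a moment, here's the minimal sketch I promised: loading a pre-trained BERT and attaching a small classification head, assuming the Hugging Face transformers library. The actual training loop and dataset are left out.

```python
# Minimal sketch: reuse a pre-trained BERT and fine-tune only a small classification head,
# assuming the Hugging Face transformers library is installed.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("this movie was great", return_tensors="pt")
outputs = model(**inputs)          # logits for the 2 classes (head is untrained, so random-ish)
print(outputs.logits.shape)        # torch.Size([1, 2])

# From here you'd run a normal training loop on your labeled examples --
# far fewer than you'd need to train the whole model from scratch.
```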
Next up is Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. This is the paper behind what we now call sentence transformers. BERT and GPT produce high-quality word embeddings, but how do you get sentence embeddings from them? If you simply take the average of the word embeddings and use that as the sentence embedding, the result isn't very high quality. Instead, we train a Siamese network of BERT plus a pooling layer on a couple of tasks: natural language inference, where we check whether one sentence entails, contradicts, or is neutral with respect to another, and semantic textual similarity, where we score how similar two sentences are. Once this is trained, we can feed the words of a sentence through that BERT-plus-pooling stack, the pooling layer does some form of weighted averaging, and the resulting vector is a much higher-quality representation of the sentence (I've left a small usage sketch at the very end).

The next recommendation is the ChatGPT blog post. ChatGPT is probably one of the most well-known items on this list. It's a fine-tuned chatbot that makes use of reinforcement learning from human feedback. The original GPT architecture is fine-tuned to answer questions; we then build a reward model that, given a question and an answer, outputs a score for how acceptable that answer is; and we then use reinforcement learning to fine-tune the GPT model against that reward. This is done to make ChatGPT safe, factual, and non-toxic. I have a playlist of videos explaining this infographic too.

And next up, we have LLaMA 2, the main paper for the LLaMA 2 family. LLaMA 2 is to GPT-3 as LLaMA 2-Chat is to ChatGPT. LLaMA itself is a pretrained language model with an openly available architecture, trained on publicly available data. Compared to GPT, it has a smaller architecture but is trained on much more data, and because the architecture is smaller, inference is faster. For more information on LLaMA, I have another full companion video.

That's all for today. I hope this helped you navigate which papers are important in the field of language modeling and natural language processing in general. If you liked the video, please give it a like and subscribe, and I will see you very soon in another one. Bye bye.
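Here's the small usage sketch mentioned in the Sentence-BERT section: getting sentence embeddings from a Siamese-BERT-style model, assuming the sentence-transformers library and one of its standard pre-trained checkpoints.

```python
# Minimal sketch: sentence embeddings with a Siamese-BERT-style model,
# assuming the sentence-transformers library is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sits on the mat.",
    "A cat is sitting on a mat.",
    "Stocks fell sharply today.",
]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

# Similar sentences end up close together in the vector space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```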