Greetings, fellow learners! Now before we get into this linguistic world of NLP and neural networks, I have a thought-provoking question for you. Did you ever play educational games growing up? And if so, what was your favorite? I know mine was Reader Rabbit. I remember solving a bunch of problems, especially ones that involved counting bananas. Now, I'm not sure if my love of bananas came from this game, but it surely fueled it from then on. I played it on a CD in my parents' office room, so those were pretty fun times. So please comment down below what your favorite educational game was; I would love to get to know you better.

This video is going to be divided into three parts. In the first part, we're going to start with the storyline of how NLP evolved alongside neural networks, followed by some more details, and see how we got to this world of large language models today. And then in the final part, we're going to look at some code. So let's get to it.

Computers cannot process words, but they can process numbers. So words need to be converted into some numerical representation, and typically they are converted into vectors, or embeddings. These vectors are n × 1 matrices, that is, column vectors. We can convert a sentence into a vector known as an n-gram vector, where every position in the vector is a count of the unigrams, bigrams, trigrams, and so on that are present in the sentence. But here we run into a problem: these vectors are so large that they are expensive for computers to process at scale.

So instead of representing a whole sentence as a vector, we can represent individual words as fixed-size vectors. These word vectors are typically on the order of tens or hundreds of dimensions, instead of the thousands of dimensions we see with n-gram vectors. These word vectors, or word embeddings, can be trained to encapsulate the meaning of a word using architectures like word2vec, GloVe (Global Vectors), and fastText. But here we run into another problem: words have different meanings depending on their context, and these architectures give the same numerical representation for a word even though the context might be different. Hence we want a neural network architecture that can take in and return a sequence of words, which is a sentence. And so recurrent neural networks become useful.

Recurrent neural networks are essentially feed-forward neural networks that are rolled out over time, and they can solve three classes of problems related to sequences. The first is sequence-to-vector problems, where the input is a sequence, or a sentence, and the output is a vector; this is like sentiment analysis. The second is vector-to-sequence problems, where the input is a vector and the output is a sequence, or a sentence; this is kind of like image captioning. And the third is sequence-to-sequence problems, where the input is a sentence and the output is a sentence; this is like translation, question answering, and many more.

But recurrent neural networks have problems. First, they are slow to train, and we even have to use a truncated version of backpropagation through time to train them. Second, they are susceptible to exploding and vanishing gradients during training, and this is particularly true when dealing with longer sequences. Replacing the simple neurons with the more complex LSTM cell helps address that second point about long sequences pretty well, because the cell's connection to its previous iterations is used to retain memory over longer sequences. But the first point about slow training still remains true.
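To make that sequence-to-vector idea concrete, here is a minimal sketch of an LSTM-based sentiment classifier in PyTorch. This isn't code from the video; the class name, vocabulary size, dimensions, and toy batch are all made-up values purely for illustration.

```python
import torch
import torch.nn as nn

class LSTMSentimentClassifier(nn.Module):
    """Sequence-to-vector: a sentence (a sequence of token ids) goes in,
    a single sentiment score comes out."""
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word ids -> word vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)             # final hidden state -> sentiment logit

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)         # (batch, seq_len, embed_dim)
        _, (last_hidden, _) = self.lstm(embedded)    # last_hidden: (1, batch, hidden_dim)
        return self.classifier(last_hidden.squeeze(0))  # (batch, 1) sentiment logits

# Toy usage: a batch of 2 "sentences", each 5 made-up token ids long.
model = LSTMSentimentClassifier()
fake_batch = torch.randint(0, 10_000, (2, 5))
print(model(fake_batch).shape)  # torch.Size([2, 1])
```

Notice that the LSTM still walks through the five tokens one at a time, which is exactly the slowness we're about to discuss.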
Now, the reason recurrent neural networks are slow to train is that they process inputs sequentially, that is, one word at a time. So they don't take advantage of parallelization and GPUs very well. To deal with this, transformer neural networks were introduced, and we will talk about transformers more in part two.

Quiz time! Have you been paying attention? Let's quiz you to find out. What are word embeddings? A: vectors that capture the syntactic structure of a word. B: vectors that represent the number of occurrences of a word in a sentence. C: vectors that encapsulate the meaning of a word. Or D: none of the above. Comment your answer down below and let's have a discussion. And if you think I deserve it at this point, please do give this video a like, because it will help me out a lot. Now that's going to do it for quiz time in part one, but keep paying attention, because I will be back to quiz you.

Now, the problem with LSTM networks is that they are slow to train, and to deal with this, transformer neural networks were introduced. LSTM networks process inputs sequentially, that is, one word at a time, but transformers can take in words in parallel. So let's look at the architecture to see how this is possible. The transformer is an encoder-decoder architecture. The architecture has positional encoding: even though the words are passed in together, the positional encoding ensures that the position of each word is encoded into its word vector. The architecture also makes use of attention mechanisms, which tell the network how important a word is with respect to the other words in the sequence. And so the vectors of words learned internally better represent the meaning of each word.

This is great, but we have a problem here. Transformers have millions of parameters, and hence training a transformer from scratch can require millions of training examples, which can be pretty hard to get. To deal with this, we can use transfer learning. To set the stage, the encoder and the decoder of the transformer can be picked apart. We can stack the encoder layers to get BERT, and we can stack the decoder layers to get GPT.

Now BERT and GPT learn via transfer learning, where the training process is broken down into two phases. The first is the pre-training phase, where the model is made to understand language; this is done by training the model on some form of language modeling. Language modeling is essentially the task of predicting the next word, or some masked words, given their context. The second phase is the fine-tuning phase, where the pre-trained model is trained further on a specific task. The idea here is that if we start with a model pre-trained on language modeling, we just need to fine-tune it, and fine-tuning doesn't require millions of examples of data; we can probably make do with something like 100,000 examples or so.

BERT and GPT are large language models: they are language models, and they are large, with their earliest forms having hundreds of millions of parameters. And today we have many LLMs based on these architectures that are really only getting bigger. ChatGPT, for example, is an evolution of the GPT architecture; it's a chatbot built on top of GPT, fine-tuned on answers to questions. And for more of a deep dive into the technical details of ChatGPT itself, you can check out this full playlist right here.
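To give a flavor of what that fine-tuning phase looks like in code, here is a minimal sketch using the Hugging Face transformers library with PyTorch, assuming both are installed. The two-example "dataset" and the single training step are placeholders for illustration, not a real fine-tuning run.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from pre-trained BERT and put a fresh classification head on top.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A made-up, two-example sentiment "dataset" purely for illustration.
texts = ["I loved this movie!", "This was a waste of time."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: only a small, task-specific dataset is needed,
# because the general language understanding was already learned during pre-training.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss after one step: {outputs.loss.item():.4f}")
```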
Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Which of the following statements are true? A: BERT and GPT are large language models. B: BERT and GPT are attention-based architectures. C: BERT and GPT have over 100 million parameters. Or D: BERT and GPT don't need much data to pre-train. Now note that more than one statement here can be correct. So comment your answer down below and let's have a discussion. Now that's going to do it for quiz time in part two, but keep paying attention, because I will be back to quiz you.

If you want to get started with coding out large language models, where do you start? Well, I think a good place to start is the Hugging Face Hub. This page here shows an example of a pre-trained BERT model. As we discussed before, BERT is pre-trained on two tasks: a form of language modeling known as masked language modeling, and next sentence prediction. We can now use this model in one of two ways. The first is to make inferences directly on the language modeling task, kind of like what we see here: given "The goal of my life is [MASK]," if you hit compute, it will try to infer what the [MASK] should be. The second way is to download the pre-trained model and fine-tune it so that it caters to the specific NLP task you want to solve. And to play around with this model, they give some starter code right over here, where from the transformers library you can import a pipeline. Using this pipeline, you can make use of the bert-base-uncased model and start playing around with examples. A minimal sketch of that starter code appears after the quiz below.

Now, a cool thing here is that it's not just BERT; you can also search for any other model, such as, let's just say, Llama. You can go to any one of these models, look at the model card, and in some cases you might need to request access to the repository in order to get access to the code itself, which I have done. But this is how you can get started: either play around with the models and make inferences online, or download a model and fine-tune it for your own purposes.

Quiz time! Ooh, this is gonna be a fun one. What is the problem being solved by the model shown here on screen? A: language modeling. B: language translation. C: sentiment analysis. Or D: question answering. Comment your answer down below and let's have a discussion. And if you think I deserve it at this point, please do give this video a like, because it will help me out a lot. Now that's going to do it for quiz time and part three of the explanation.
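As promised above, here is roughly what that Hugging Face starter code looks like: a minimal sketch assuming the transformers library is installed, with an example sentence that is just an illustration.

```python
from transformers import pipeline

# Load pre-trained BERT for the masked language modeling ("fill-mask") task.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT tries to infer what the [MASK] token should be and returns the top candidates.
for prediction in unmasker("The goal of my life is [MASK]."):
    print(f"{prediction['token_str']:>12}  (score: {prediction['score']:.3f})")
```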
But before we go, let's generate a summary. Computers can't process words, but they can process numbers, so words need to be converted into numbers. N-gram vectors convert a sentence to a vector; they are simple, but they are large. Word2vec and similar models can learn to represent a word as a fixed-size vector, and these are more manageable in size, but a specific word has different meanings in different contexts, so these vectors should be different, and they are not. Recurrent neural networks can take in the context and internally learn to better represent these words, but they are slow to train and have trouble with longer sequences. The issue with longer sequences can be addressed with LSTM cells, but these are even slower to train. Transformers can be quicker to train because they can process input words in parallel, but transformers have millions of parameters and hence potentially require millions of examples to train from scratch. This is where transfer learning can help, with two architectures: BERT, which is a stack of transformer encoders, and GPT, which is a stack of transformer decoders. Both are pre-trained on some version of language modeling, and each can then be fine-tuned to solve a specific language task using less data. And this is the start of the large language models that are exploding today. You can get started with these pre-trained language models on the Hugging Face Hub today.

And that's all we've got for you today. Now, I just brushed across a bunch of topics and concepts in natural language processing, but you can take a better look at each of them using the Natural Language Processing 101 playlist right here. Thank you all so much for watching. If you liked the video and think I deserve it, please do give this video a like, and I will see you in the next one. Bye-bye.