In this video, we are going to talk about ChatGPT and all of its technical details. With ChatGPT, you ask a question and it gives a wonderful response. The way I want to structure this video is that we will first cover some fundamental concepts required to understand ChatGPT, and then go through every detailed step so that you understand how ChatGPT constructs the answers that it does, in a way that is safe, non-toxic and also quite factual. So let's get to it. Also, thank you all for 100,000 subscribers. If you can get this channel to 150,000 subscribers, that would be absolutely amazing, and we'll be posting more machine learning, deep learning and AI content in general, just like this. So with that, let's get started with some ChatGPT lore.

To understand ChatGPT, we need to understand some more fundamental concepts. ChatGPT is built on top of GPT as well as the entire paradigm of reinforcement learning, and the GPT models themselves are essentially language models, which are in turn built on top of transformer neural networks. So let's take a look at each individually.

First, language models. Language models are models that have some inherent understanding of language in a mathematical sense, and I say in a mathematical sense because they understand a probability distribution over sequences of words. Given some context, that is, the words that have preceded it, a language model can determine the most appropriate word or word token to generate next. Depending on the type of data used to train the language model and also on the architecture of the model itself, we get different probability distributions over these word sequences, which means these language models will generate different kinds of words under different circumstances. Because of this, we can adapt language models to handle very specific tasks like question answering, text summarization and language translation, among others.

Now let's talk about transformer neural networks. Transformers are a sequence-to-sequence architecture: they take in a sequence and output another sequence. A sequence in this case can be a sequence of words in a language. The transformer architecture consists of two parts, an encoder and a decoder. To translate from English to French, the encoder takes all the words of the English sentence simultaneously and generates word vectors for every word simultaneously. These word vectors are then passed into the decoder, and the decoder generates the French translation one word at a time. Every time a word is generated, it is provided as context back to the decoder itself. I've explained in much more detail how this training works in my video on transformer neural networks, so please do check it out for more information.

Now, what's really cool about this architecture is that we have two components, an encoder and a decoder, that each have some contextual understanding of language and can be used as a base for language models. If we stack the encoders, we get Bidirectional Encoder Representations from Transformers, or BERT. And if we take the decoder parts and stack them together, we get the Generative Pre-trained Transformer, or GPT. These are popular language models, which are typically pre-trained on just general language data.
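To make "a probability distribution of a sequence of words" slightly more concrete before moving on, here is a minimal toy sketch of the idea in Python. The context, candidate words and probabilities are all made up for illustration; a real language model computes this distribution with a neural network over tens of thousands of tokens.

```python
import random

# A minimal sketch of what "a probability distribution over the next word" means.
# The context and probabilities here are made up for illustration; a real language
# model computes this distribution with a neural network over a huge vocabulary.
toy_language_model = {
    ("today", "i", "want", "to"): {"play": 0.40, "eat": 0.25, "sleep": 0.20, "code": 0.15},
}

def next_word(context):
    """Look up the distribution for this context and sample one word from it."""
    dist = toy_language_model[tuple(context)]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(next_word(["today", "i", "want", "to"]))  # most often "play", but not always
```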
These pre-trained models are then fine-tuned by us depending on the task that we want to solve. ChatGPT is a GPT model that is fine-tuned to respond to a user's request, and then it is fine-tuned further using reinforcement learning.

Reinforcement learning is a method of achieving some goal via rewards. I'm going to explain reinforcement learning in general using a typical example and the concepts involved, and right after I'll explain how these concepts relate to ChatGPT. At first we have an agent over here, and the goal is to make the agent reach this end state. In order to entice the agent to make certain moves, we use rewards. Rewards are the scalar values that you see in these squares. A high reward is used to entice the agent towards the goal, and every other square has a lower reward so that the agent goes as fast as possible, taking as few steps as possible. The state is the representation of the current situation; for example, this agent is at position 1-1, so that position is the state. An action is what the agent does, for example moving left, right, up or down within the boundaries here in order to reach the final goal. And the policy thus becomes the sequence, or one particular sequence, of actions that the agent takes in order to try to achieve the goal. One such policy, for example, could be the sequence of actions down, down, right, right, right, and in this case the total reward given would be 10 − 1 − 1 − 1 − 1, which is 6. Another policy could be the sequence of actions right, down, right, up, right, down, down, and in this case the total reward would be 10 − 6, which is 4. Now we have two policies, each with its own reward, so we can tell which policy was better.

Now, relating this to ChatGPT: the agent is the model itself. The reward depends on the response given by ChatGPT: if the entire response is a good response, it gets a high reward; if it's not a good response, it gets a negative reward. To talk about state, every single action taken by an agent happens at a time step. In the context of ChatGPT, a time step occurs every time a word or word token is generated, so we can define a state as the combination of the user's input prompt plus every word that has been generated up to this point. This state can be used to infer what action to take next, which is what word to generate next, because this is inherently a language model. The overall policy is then the sequence of actions taken, that is, the sequence of words generated, so different policies correspond to different responses. And if we have multiple responses, each of them is a policy, which means each has its own reward, and we can start comparing them to see which response was better and which was worse. We would then fine-tune the model to help it generate the better responses.

Now that we have a holistic idea of these foundational pieces, how do they all come together in ChatGPT? I took a screenshot of this from the OpenAI blog, so let's walk through it. The entire process can be divided into three major steps. In the first step, we take a GPT model that has been pre-trained to understand language itself, and we fine-tune it to take in a user prompt and actually generate a response according to that prompt.
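Before going further into step one, a quick aside to make the grid-world arithmetic above concrete. This is a minimal sketch under assumed details: a 3×4 grid, the agent starting in the top-left corner, the bottom-right square being the +10 goal, and every other square entered costing −1; the exact grid in the figure isn't specified here.

```python
# A minimal sketch of the reward bookkeeping in the grid example above.
# Assumptions for illustration: a 3x4 grid, the agent starts in the top-left
# corner, the bottom-right square is the goal worth +10, and every other
# square entered is worth -1.
GOAL = (2, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def total_reward(start, policy):
    """Walk the policy from the start state, summing the reward of each square entered."""
    row, col = start
    total = 0
    for action in policy:
        d_row, d_col = MOVES[action]
        row, col = row + d_row, col + d_col
        total += 10 if (row, col) == GOAL else -1
    return total

# The two candidate policies from the example; the better policy earns more reward.
print(total_reward((0, 0), ["down", "down", "right", "right", "right"]))                 # 6
print(total_reward((0, 0), ["right", "down", "right", "up", "right", "down", "down"]))   # 4
```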
So, back to step one: we get the data with labels. We'll have a few labelers who write a prompt and also write a response showing how they want that prompt answered. Because we have both the input and the output, this becomes supervised fine-tuning of the GPT model, hence SFT. Next, we take this supervised fine-tuned model, pass a single prompt through it, and generate a few responses. Now the labeler is going to rank how good these responses are. For each of the generated responses, the labeler assigns some reward, and this data is used to train another GPT model called the rewards model. Because this is a model, it is a function that takes some input and generates an output: the input is the initial prompt together with one of the responses, and the output is the reward that quantifies how good that response was. In step three, we take an unseen prompt, pass it through a copy of the supervised fine-tuned model, and generate a response. This response is passed through the rewards model to get a score that quantifies how good the response was. That score is then used to further fine-tune our fine-tuned model: it enters the loss function of this model in order to backpropagate updates to the parameters.

Now, what's really interesting is that this process actually helps the model incorporate non-toxic behavior as well as create factual responses. That's because of how the reward itself was generated: responses that were non-toxic and factual were given a higher reward, so incorporating the reward into the model in this way helps it generate responses that are less toxic and also more coherent and factual. That's the overview of the entire process of how ChatGPT works. Now that we've looked at the full infographic, in the next three sections we will dive into these three steps.

So let's start with step one by identifying what the GPT in ChatGPT is, why we use it, and how it was trained. GPT comes from the transformer neural network architecture, which was introduced in 2017 and is a sequence-to-sequence architecture: it takes in some sequence as input and outputs another sequence. In the field of natural language processing this is super useful, because sentences are sequences of words, and so we started to use these transformers for NLP problems like translation. I'm going to walk through exactly how that works here. The transformer architecture has two parts, an encoder and a decoder. The encoder takes all the input simultaneously, and within it, it creates a vector for each of these words, or I should say word pieces; since there are four of them, we have four vectors over here. These four vectors are then passed simultaneously into the decoder. When starting out, the decoder receives a start token, and from there it outputs one word at a time. In this case, let's say the problem we're solving is translation from English to an Indian language, specifically a South Indian language called Kannada. So it's going to be fun.
I'm going to teach you a few things about Kannada here. So, my name is AJ. The first word that would be generated by this model, hypothetically, if it's fully trained and working properly, is going to be "Nanna", which is the Kannada word for "my", and then "name is AJ" follows. After this first word is generated, it becomes part of the input for the next pass, so we feed it in here, and in this next pass we generate the next word, which should be "Hesaru"; this actually means "name". Then on the third pass, it generates the next word, which is "AJ". What this shows is that this overall architecture has some semblance of an understanding of language. In fact, we figured out that the encoder part up here and the decoder part individually also have some understanding of language. So we can pick them apart and stack them up in order to have them understand more and more of the intricacies of language. If you stack just the encoder pieces together, you get Bidirectional Encoder Representations from Transformers, which is BERT, and hence the entire research field that's grown around it. And if you take the decoder parts and stack them up, you get Generative Pre-trained Transformers, which is GPT. We're going to be focusing more on these GPT architectures moving forward. So I hope the "what is GPT" part makes more sense now.

Now that we have the "what" out of the way, why exactly are we using GPT architectures over, say, recurrent neural networks or any other typical modeling strategy that we used in the past? Well, if we wanted to train a supervised model in the traditional way, we would need to train all of the model parameters from scratch, and we do this by collecting a lot of labeled data. Unfortunately, for each and every one of the tasks mentioned, you need to get a lot of labeled data, feed it into your model, and train it to learn those parameters. Labeled data like that is extremely hard to find in very large quantities, and even if you do find vast stores of it, the resulting model will probably only be able to handle one of these major domains at a time. To solve this issue, we want to adopt a modeling strategy with a generative pre-training phase followed by a discriminative fine-tuning phase. Generative pre-training is the unsupervised approach to learn language modeling; language modeling here is the problem we're actually optimizing for. Then we have discriminative fine-tuning, which is a supervised approach to learn very specific tasks: in this case it could be question answering, document classification, or simply user response generation, like a chatbot.

Now that we have some understanding of why we need a GPT architecture, let's lay out what generative pre-training and discriminative fine-tuning actually mean in practice. In generative pre-training, the goal is to optimize for the problem of language modeling. If we want to make GPT a language model, remember that language models have an understanding of word sequences: the main objective is to predict what word is going to come next, given the context of all the previous words that have come before it.
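As a reference for the objective just described, this is the standard language modeling objective written out in the style of the GPT-1 paper, using the w and theta notation from the video; the fixed context window size k is a detail not discussed here.

```latex
% Maximize the (log) probability of each word given the words before it.
% w_i is the i-th token of the training text, k is the context window size,
% and \Theta are the parameters of the GPT network.
L(\Theta) = \sum_{i} \log P\left(w_i \mid w_{i-k}, \ldots, w_{i-1}; \Theta\right)
```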
Mathematically represented, that's exactly what this is, but we'll get to that. Let's say we have one training example, where we basically scour the internet for random sentences and one of those sentences is "today I want to play". We have a start token, which we input to this untrained GPT architecture, and right here we want it to generate the word "today". Once we generate "today", we put that in as the input for the next time step, pass it into GPT, and it should do its little magic in here and try to predict "I". Every single time we do this, we want to tune the parameters such that the model is more likely to produce that word. In the next time step we want to generate the word "want", then in the next time step "to", and then in the next time step "play".

Now, mathematically speaking, we want GPT to optimize some objective; this is typical of all of machine learning and deep learning. The objective we're optimizing is the language modeling objective, where we try to predict the next word w_i (in this case, if the whole sentence is w, then w_i would be "play") using all of the previous words that came before it, "today I want to". Theta here are the parameters of this GPT architecture, and this overall expression, summed across all of the words, is what we want to maximize. So theta will end up holding the parameters that maximize this objective. I hope this relationship between what's happening intuitively and what's happening mathematically makes more sense now.

At the end of this generative pre-training phase, we get a model that has some inherent understanding of language. More practically speaking, given a word sequence, it will be able to figure out what word to generate next. In the discriminative fine-tuning phase, we take this general model and make it solve a very specific problem. For example, in document classification, simply giving the next word is not enough: we actually want to understand the overall sentiment of the document, what it represents, or a categorization. Is it a sports document? Is it a news document? Is it some other type of document? Typically we would take this pre-trained architecture, add a very simple, randomly initialized linear layer on top, and then use a small amount of actual training data for document classification to learn that small set of parameters and further fine-tune the rest of the model. Because there are only a few new parameters, and because most of the parameters in the GPT model already have a good understanding of language, you don't need too many of these supervised pairs of examples for document classification. This is what makes it so much easier to get started with so many facets of natural language processing using a simple pre-training plus fine-tuning approach. Now, with a chatbot like ChatGPT, the input is some user prompt and the output is a response, and that's already essentially the format in which GPT architectures are fine-tuned.
And so all we need to do here is this: we don't even need to add new parameters as we did for document classification. We can get extra examples of user prompts and their corresponding responses, and use them to further fine-tune this specific model, where we're just tuning these existing parameters. That's exactly what we see in the first step of ChatGPT, where the labeler demonstrates the desired output behavior and the model is fine-tuned on it via supervised learning.

To get a more concrete idea of exactly what's going on here, let's look at this figure. In normal GPT, we would typically pass in one token, and GPT will generate one token at a time. Let's say we've generated three words, "Today I will". This is passed to GPT at the fourth time step, and it eventually generates a vector of size vocab_size × 1. This vocab size is the number of possible tokens that GPT could generate in this language. These tokens are not exactly words; they are word pieces, so that the vocab size doesn't skyrocket to a near-infinite number of values, because there are just that many words. The vocab size is around 50,000 or so. This vector is then passed through a softmax, and the reason we do this is that we want to turn the whole thing into a probability distribution, so the values sum to one. As a probability distribution, it signifies the probability that each of these tokens will be used as the next word. Obviously, we can only choose one of them, and we don't typically choose the one with the highest probability, because that ends up sounding less natural and less human. Instead, we use a sampling technique; this can be temperature sampling, nucleus sampling, or top-k sampling, and you sample from this softmax output in order to determine what the next word should be. Let's say the word we picked was "play". So "play" is the next word generated, and hence you get the response "Today, I will play" from ChatGPT.

This is also why you see ChatGPT generate one word at a time: that ability comes from the underlying GPT architecture. In fact, just to make it even more concrete, let's do a direct comparison with ChatGPT. Say you ask the question, "What will you do today?". ChatGPT might respond with "Today, I will" as the first three words, and then for the fourth word it again goes through the GPT architecture, creates this huge vector of size vocab_size × 1, converts it into a probability distribution with a softmax, and then samples from that distribution to get the next word. I hope this paints an even better picture of what is going on behind ChatGPT and how big of a deal GPT is within ChatGPT. I also want to hammer home the point that ChatGPT didn't just come out of thin air; it's very clearly based on many concepts of language modeling, transformer neural networks, GPT architectures and so much more that came before it. Let's now move on to step two of this infographic to get a much more detailed picture.
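As a quick aside before step two, here is a minimal sketch of the softmax-and-sample step just described. The tiny five-token vocabulary and the raw scores are made up for illustration; a real GPT produces one score per token over roughly 50,000 word pieces.

```python
import numpy as np

# A minimal sketch of the softmax-and-sample step described above.
# The tiny vocabulary and the raw scores (logits) are made up for illustration.
vocab = ["play", "eat", "sleep", "code", "run"]
logits = np.array([2.1, 0.3, -0.5, 1.0, -1.2])  # model output before softmax

# Softmax turns arbitrary scores into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Instead of always taking the highest-probability token (argmax),
# sample from the distribution so the output isn't identical every time.
next_token = np.random.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```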
So, just to reiterate at a very high level: we have a supervised fine-tuned model that can take one question and generate multiple responses, and these responses might be slightly different from each other. A human labeler then decides which of these responses is actually better for this specific question, so they're ranked, and we use this data as training data to train a rewards model. The rewards model is of a similar architecture. That's the gist of it, but there are some questions we want to answer here. My big question was: why can GPT generate different outputs for just one input?

Let's say we have a GPT model taking an input; we'll call the input u, and it's going to be "What's for breakfast?". The output of this model is generated one word per time step: it first generates "today", then "we", then "will", then "have", then "French", and now we're at the stage of generating the sixth word. Let's call this entire response w. GPT here is already trained and already fine-tuned as well, so it already has a notion of language, just like a language model, which means it has an understanding of the probability distribution of word sequences. What it's going to predict here is the sixth word, which we'll call w_5, given all the words that have come before it, w_0 through w_4, and the entire input context u. GPT determines this distribution, figures out which word it corresponds to, and outputs that word.

Typically in machine learning models, you would just output whichever option has the highest probability value, but that's not exactly what we want in a language model. Had we done that, then for the same input we would always generate the exact same output every single time: if "toast" is the highest-probability word here, we would always say "Today we will have French toast" for the same input "What's for breakfast?". But this isn't exactly human behavior; humans don't say the single most "optimal" word at every position when we speak. To circumvent this, we use decoding strategies to make the decisions more stochastic and more human-like. When we pass the input context "What's for breakfast?" into a GPT model, GPT goes through a decoding strategy, and this decoding strategy determines which word we generate. There are many kinds of decoding strategies: for example, nucleus sampling, temperature sampling, and top-k sampling. The main goal is not just to take the top word, but to sample from the top few words of the distribution in order to generate the next word, which gives it some element of stochasticity. Let's take the example "Today we will have French ___", where GPT is supposed to determine what goes in the blank. GPT is already trained, like I mentioned, so it has this knowledge of word sequences and probabilities, and it determines that at this stage, this is the probability distribution of words that can go into this spot: there's a 31% probability that the word should be "toast", then 19% "bread", 7% "fries", and so on in descending order.
Now, with greedy sampling, which is the traditional case of "just pick the most probable word", the model will predict "toast" every single time. However, if we use something like top-k sampling with, say, k = 10, then at every step it takes the 10 words with the highest probabilities, samples from those 10, and uses the sampled word as the next word. In this case, let's just say we pick "fries", so "Today we will have French fries" would be the prediction if we used top-k decoding with k = 10. Nucleus sampling is very similar, but instead of always picking a fixed number of words like 10 at every step, we pick a variable number depending on the probability distribution at that point. For example, p = 0.9 means we take the top words whose probabilities add up to 90% of the total probability. If you add the top three numbers here, you get 0.57, so hypothetically, if p were 0.57, we would only take these top three words, sample from them, and use the result as the word after "French". In this case, let's say we get "bread": "Today we will have French bread" would be the outcome with nucleus sampling.

Then we have temperature sampling, where we change the overall distribution itself and skew it depending on a temperature value. This temperature value can range, for example, from 0 to 1. As it approaches 0, the highest probabilities get skewed much higher and the lower probabilities get skewed much lower, so "toast" tends towards 100% and everything else tends towards 0%; if you then perform this temperature sampling, it's effectively the same as the greedy approach because you'll always get "toast" anyway. However, as you increase the temperature to something like 0.7, the probability for "toast" decreases and the smaller probabilities increase, so that when you sample, you get a higher chance of variability in the next word. As you get closer and closer to 1, the randomness and variability of the generated word increases. In this case, let's just say it was "toast"; it could have been something else, but greedy will always give "toast", like I mentioned before. Had we run this again in another world, we might have gotten something like this: the top-k, nucleus and temperature samplings could give different words here, but greedy will always give you "toast".

You can actually see all of this math in action by going to the OpenAI Playground; it's a beta version. You can type in a specific prompt and you'll get a response, and this response can differ depending on how you set the temperature, or, if you'd rather use nucleus sampling instead of temperature sampling, you can set the top-p value, among other options. I'd highly recommend checking it out. So I hope that how GPT can take the same prompt and generate multiple responses makes more sense now. In the next phase, we have labelers who need to rank the different responses we get here, and we'll dig into that right after the quick sketch below.
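To tie the three strategies together, here is a minimal sketch that applies top-k, nucleus and temperature sampling to the toy "Today we will have French ___" distribution above. Only the top three probabilities (31%, 19%, 7%) come from the example; the rest of the table is made up so it sums to one, and real decoders of course work over the full vocabulary.

```python
import numpy as np

# Toy next-word distribution for "Today we will have French ___".
# Only the top three probabilities come from the example above; the tail is
# made up so the whole table sums to 1.
words = np.array(["toast", "bread", "fries", "eggs", "waffles",
                  "pancakes", "cereal", "fruit", "coffee", "tea"])
probs = np.array([0.31, 0.19, 0.07, 0.07, 0.07, 0.07, 0.07, 0.05, 0.05, 0.05])

def top_k_sample(words, probs, k):
    """Keep only the k most probable words, renormalize, then sample."""
    idx = np.argsort(probs)[::-1][:k]
    p = probs[idx] / probs[idx].sum()
    return np.random.choice(words[idx], p=p)

def nucleus_sample(words, probs, p_threshold):
    """Keep the smallest set of top words whose probabilities reach p_threshold."""
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p_threshold) + 1
    idx = order[:cutoff]
    p = probs[idx] / probs[idx].sum()
    return np.random.choice(words[idx], p=p)

def temperature_sample(words, probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) the distribution before sampling."""
    scaled = probs ** (1.0 / temperature)
    return np.random.choice(words, p=scaled / scaled.sum())

print(words[np.argmax(probs)])                         # greedy: always "toast"
print(top_k_sample(words, probs, k=3))                 # one of toast / bread / fries
print(nucleus_sample(words, probs, p_threshold=0.57))  # smallest set covering ~57%
print(temperature_sample(words, probs, 0.7))           # usually toast, sometimes others
```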
By ranking, the labelers also have to assign some actual reward value, because this reward is going to be used quantitatively in a loss function; it has to be a number and not just an arrangement like the one shown here. But how exactly do we correctly quantify the quality of a response? Every labeler is given a screen that looks something like this: they have the input prompt over here, the output for that prompt over here, and they're asked to rate it on a scale of one to seven. They're also asked a bunch of binary-choice questions over here. One of my first thoughts looking at this screen was: why are we asking them so many extraneous questions? Don't we just care about the rating itself, and isn't that rating going to be used as the reward? Well, that's partially true; it can be used as a reward. But let's say I am a labeler and I choose three for a specific user prompt and output. How good is my rating of three? To get more meta about it: how high quality is my rating of three? You can't really determine that, because what's a three for me might be a two for you or someone else, and because of that, it becomes harder to get very high quality ratings. To combat that issue, we use something called a scale. A scale is essentially a set of questions with categorical responses. All of these questions are designed to ascertain how sensitive the labeler is to the issues being presented; after all, we want ChatGPT to have some understanding of the nuances of language as well as of sensitive topics. There's a bunch of labelers, and they all fill this out for the same question, so we can aggregate all of these responses for the specific instruction-output pair below. Then we can say: it looks like this labeler rated it a three, but they didn't answer the questionnaire similarly to how other people answered it, so I'm not going to consider their rating of three to be of high value, and I'm only going to use the responses from the people who rated this in a very similar way. By filling out this questionnaire and only using ratings that correspond to how the bulk of people filled it out, we keep only the ratings that are good ratings and can be used to train a rewards model. Also, a little tidbit here: the typical type of scale used is called a Likert scale, a common type of scale for questions of a psychological nature.

So, like I mentioned before, we now have good labels, and that's our first step to training a good rewards model. The follow-up question is: how do we train this rewards model? Our rewards model is the same supervised fine-tuned model but with a scalar output, and hence I've connected all the neurons to just a single output neuron here. The input is a prompt and the corresponding response, and the output is a reward that tells us how high quality this prompt-response combination is. This architecture here is a duplicate of this rewards model architecture here, so you can treat this as a Siamese network, where we have prompt and response one here, prompt and response two here, and each gets its own reward.
We then compare these rewards to the actual labels we just generated, that is, which response was better. Once we have rewards, we can use them in a loss function, backpropagate, and further tune this rewards model. This loss function assumes that response one is always the better response (that's the label), and it's a very interesting function that we'll try to build some intuition for. Here's the loss function: R1 is the reward for the first response and R2 is the reward for the second response, and we assume the true label is that the first response is definitely better than the second. If our model predicts that response one's reward is greater than response two's, and by a larger and larger margin, that's a good thing, and it's reflected in the loss: the model is getting it right, so the loss gets lower and lower. On the other hand, if the model predicts that the second response is better than the first, that's wrong, and you can see that the more the model favors the second response, the higher the loss gets. So this loss function is actually quite effective for training our rewards model. If you're wondering why there's a sigmoid function here, it's because this loss is proportional to the log odds that the first response is better than the second. I've taken a screenshot of this exact loss function from the InstructGPT paper, and it shows how the training actually happens.

You might think that in normal training we would take all of these pairs, compute a loss value for every single pair, and just start updating our network with them in a shuffled order. But the problem is that there are, for example, four responses from the same prompt, and if we take pairs of those, we have 4 choose 2 different pairs, which is six pairs in this case. If we keep passing each of these pairs into the model separately and shuffling, it may lead to overfitting. So what we do instead is batch all of a single prompt's response pairs together, and all six of the losses we get from that single prompt are used together to make only one update to the model instead of six. This has two benefits: one, it decreases computation time, because there are fewer backpropagation updates to make; and two, it helps prevent the model from overfitting, especially as the number of responses per prompt grows. K corresponds to that number of responses; I've taken it as four, but it could be as high as nine or ten or whatever you decide. And so that's how we train our rewards model. We can then use it, as I mentioned before, to truly assess the quality of an unseen response, and use the output reward to further fine-tune our fine-tuned model so that it generates more human responses that are factual and non-toxic. So I hope you now have a better understanding of how this step two truly works: how GPT generates multiple responses from a single input, how labelers rank them, how we get rankings that are of high quality, and how we create and train a rewards model, along with some very cool loss function ideas.
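For reference, the reward-model loss being described is the pairwise loss from the InstructGPT paper; written out, with x the prompt, y_w the labeler-preferred response, y_l the other response, and r_θ the scalar output of the rewards model:

```latex
% Pairwise reward-model loss from the InstructGPT paper:
% x is the prompt, y_w the labeler-preferred response, y_l the other response,
% r_\theta the scalar output of the rewards model, and K the number of responses
% ranked per prompt (so there are \binom{K}{2} pairs per prompt).
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}
  \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]
```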
Let's now dive into the details of step three, and this is where the reinforcement learning piece really ties in. We have an unseen prompt that we pass through the supervised fine-tuned model, which generates a response. Now I want to go through the exact mechanism of how the GPT model generates this response. So what does a response from GPT look like? We have the supervised fine-tuned GPT model, specifically GPT-3.5. The input to this is the user prompt plus all the words generated so far for this specific prompt. In this case we're at the very beginning, so we have a start token. We input this to GPT at the first time step, and it generates our actual first word, the next word, which is "today" in this case for "What is for breakfast?". At the next time step, "today" becomes part of the input along with everything that came before it, as mentioned, and the model generates the next word in the sequence, say "I". This process repeats: we take the word that was just generated as the input for the next time step and generate the next word, which is "will", then "have", then "French" and "toast". So you can see that the GPT model generates one word at a time, using all of the previous words as input context. We have a user prompt, we pass it into the supervised fine-tuned model to generate one word at a time until all of the words of the response have been generated. Then this entire response is passed, along with the input prompt, into the rewards model that's already trained, and what it's going to do is tell us how good this response was for this input prompt. We get a reward, and we use this reward to fine-tune our original supervised fine-tuned GPT model.

Now, in order to make updates to the parameters, the reward has to be used somehow in the loss function, and this is exactly what we do with the PPO technique. How is the GPT model updated? It is updated via proximal policy optimization. Policy optimization techniques in general are a class of techniques that try to maximize the total reward seen, and in the proximal policy optimization case specifically, this is accomplished by using the reward in the loss function itself. You can consider this function to be like the negative of the loss function, so we want to maximize whatever this value is. The product of this ratio function R and this advantage function A is going to be proportional to the reward, so the higher the reward, the higher the value of this whole function, and that is going to influence the direction in which we update the parameters of our GPT network. These theta are the parameters of our original GPT model. I'm going to explain the individual terms in my next pass, but for now I hope you understand what proximal policy optimization is trying to do in general: it is trying to maximize the total reward seen by our network. Now let's get to pass three, where we dive into further details. For pass three, I want to look a little closer at this model itself: we just saw in pass two how it generates one word at a time, but how exactly does it select which word to generate at every single time step?
So let's consider our example: we have our original GPT model, we have an input prompt, and it has been generating words one at a time, first "today", then "I", then "will", then "have", then "French". Now it's at the step where it should generate what comes next, based on this input prompt as well as everything that's come before it, which is also part of the input here. Every single time it needs to make this decision, it has something like a table in its brain, and this table is a probability distribution: a table of words along with the probability that each will occur next, for this given input prompt and the response generated until now. In this case it's going to say, okay, there's a 38% chance that "toast" is going to come next and fill in this blank, a 27% chance it's going to be "fries", an 8% chance it's going to be "bread", and maybe some other words that come after these in the table. But to decide the exact word, GPT is not going to just choose the top one and say "toast" is definitely it, because in language specifically, we as humans don't always say the most "optimal" word that could come next; it just doesn't sound very human. So what happens instead is that we have this probability distribution, and according to this distribution we sample a word. There's a higher probability that "toast" gets selected, but it's not guaranteed. Let's say that in this case the word we happen to pick is "toast"; great, then "toast" fills this blank, but it could very well have been "fries" or "bread" at any point. And because this table is generated anew every single time, the sampled words can be different every single time, even for the same request prompt. This is why GPT, as you've probably seen in the initial example, can actually create different responses despite having the same input prompt. And because the response can be different every single time, even for the same prompt, we pass it into the rewards model to check how good it actually was as a response. Like I mentioned in previous passes, we use this reward in our fine-tuned model to compute gradients and make a gradient update every time we have an input prompt.

Now let's actually get into the details of the loss function that's used to make these gradient updates. Like I mentioned before, this is the loss function, or rather the negative of the loss function, that we want to maximize. Theta are the parameters of our original GPT architecture, and they're going to become the parameters of the ChatGPT model to come. T over here indexes time steps: every time we have one complete response, that's one time step. R is a ratio, a rewards ratio: it's the reward under the new parameters for the given input prompt, divided by the reward under the old parameters for the same input prompt. If this is a number well over one, it means that whatever parameter updates we're thinking of making now will actually be better, at least for this specific input prompt, than the old GPT was at the previous time step. And so, ideally, we would want to make these gradient updates in general.
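For reference, here is the clipped surrogate objective as it appears in the PPO paper. Note that there, r_t(θ) is defined as the ratio of the new policy's probability for the chosen action to the old policy's probability, π_θ(a_t | s_t) / π_θ_old(a_t | s_t), which the video describes more loosely as a ratio of rewards; the advantage term A_t is discussed next.

```latex
% Clipped surrogate objective from the PPO paper (maximized, so it acts as the
% negative of a loss). r_t(\theta) is the probability ratio of new to old policy,
% \hat{A}_t is the advantage estimate, and \epsilon bounds how far one update may move.
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]
```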
Now, A is called the advantage function. In reinforcement learning, the advantage function is a value that assesses how good the output was with respect to the input, which in our case amounts to a number proportional to the reward. So overall, the product of R and A is going to be very high if the response is very good, and very low, possibly even negative (the advantage function can be negative), if the response is very bad, at least for this given input. We want this product to be as high as possible, and that's exactly what these policy optimization techniques do: they try to maximize the total reward. Now, this is great, but we also don't want to make any one gradient update too large, because typically in learning we want to learn step by step. So to make sure the gradient update isn't too large, we clip the upper and lower bounds of this ratio, and in this case we clip it using an arbitrary value epsilon. We choose one as the center, because if the rewards ratio is one, the reward under the new parameters equals the reward under the old parameters, which downstream means the parameters themselves don't change. How large epsilon is quantifies how much we allow the gradient updates to change after looking at a single example user prompt. For example, if epsilon were 0.1, the rewards ratio would be confined between 0.9 and 1.1, which means we only make minor gradient updates overall. In fact, that's also why we take the minimum of these two values: so that we take the smaller, more conservative update. Now, why do we have an expectation over here? Well, this goes back to the fact I mentioned before that, for the same input, we can generate multiple kinds of responses. So we want to feed the same input into GPT multiple times over and take an average of those values, so that this whole thing isn't reliant on a single arbitrary output that ChatGPT happened to produce. We take this final value and make gradient updates via gradient ascent, because this is not a loss function but the negative of a loss function, something we're trying to maximize. We keep performing this update for every time step T, that is, every time we see a user input and response pair, and over time this value of theta just gets better and better until eventually we get a model that is non-toxic to a very high degree, factual to a very high degree, and sounds more human to a very high degree. And that is the ChatGPT that we see today.

Now, that's going to do it for the video. I hope that with this infographic ChatGPT is demystified, and that you see that while there's an explosion of language models on the scene today, they are still based on the same fundamental principles. So please do like and subscribe, thanks again for 100,000 subscribers, love you all, and we will see you very soon for another video. Bye bye.