Hello everyone, and welcome to a special episode of Code Emporium, where we're going to talk about ChatGPT. ChatGPT is a language model that takes in a user prompt and generates a textual response, and the responses we've been seeing are some of the most realistic to date. So how exactly is ChatGPT doing so well, and how does it learn to avoid generating nonfactual and toxic statements? In this video, we're going to look at the technical details to see exactly how.

To understand ChatGPT, we first need to understand some more fundamental concepts. ChatGPT is built on top of GPT as well as the paradigm of reinforcement learning, and the GPT models themselves are language models built on top of transformer neural networks. So let's take a look at each individually.

First, language models. Language models are models that have some inherent understanding of language in a mathematical sense, and I say "in a mathematical sense" because what they capture is a probability distribution over sequences of words. Given some context, that is, the words that have come before, a language model can determine the most appropriate word (or word token) to generate next. Depending on the data used to train the language model and on the architecture of the model itself, we get different probability distributions over word sequences, which means these models will generate different kinds of words under different circumstances. Because of this, we can build language models to handle very specific tasks like question answering, text summarization, and language translation, among others.

Now let's talk about transformer neural networks. Transformers are a sequence-to-sequence architecture: they take in one sequence and output another. For language, a sequence is a sequence of words. The transformer architecture consists of two parts, an encoder and a decoder. Take English-to-French translation as an example: the encoder takes in all the words of the English sentence simultaneously and generates a word vector for every word, also simultaneously. These word vectors are then passed to the decoder, and the decoder generates the French translation one word at a time, with every generated word fed back as context to the decoder itself. I've explained in much more detail how this training works in my video on transformer neural networks, so please do check it out for more information.

Now, what's really cool about this architecture is that we have two components, an encoder and a decoder, that each have some contextual understanding of language and can be used as a base for language models. If we stack the encoders, we get Bidirectional Encoder Representations from Transformers, or BERT. If we take the decoder parts and stack them together, we get a Generative Pre-trained Transformer, or GPT. These are popular language models that are typically pre-trained on general language data and then fine-tuned for the specific task we want to solve. ChatGPT is a GPT model that is fine-tuned to respond to a user's request, and then fine-tuned further using reinforcement learning. Reinforcement learning is a method of achieving some goal via rewards.
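To make the idea of "a probability distribution over word sequences" concrete, here is a toy sketch in Python. Everything in it is made up for illustration: a real GPT learns these probabilities with stacks of transformer decoder blocks rather than a hand-coded lookup table, but the autoregressive loop, generating one token at a time conditioned on the words so far, is the same.

```python
import random

# A toy "language model": a mapping from a context (the last two words)
# to a probability distribution over possible next words. A real GPT
# learns this mapping from data; here it is hand-coded for illustration.
TOY_LM = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "is": 0.1},
    ("cat", "sat"): {"on": 0.8, "down": 0.2},
    ("sat", "on"): {"the": 0.9, "a": 0.1},
    ("on", "the"): {"mat": 0.7, "chair": 0.3},
}

def generate(context, max_new_words=4):
    """Autoregressively extend `context`, one word per time step."""
    words = list(context)
    for _ in range(max_new_words):
        dist = TOY_LM.get(tuple(words[-2:]))  # condition on the preceding words
        if dist is None:  # no distribution for this context: stop
            break
        # Sample the next word according to the model's probabilities.
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return " ".join(words)

print(generate(["the", "cat"]))  # e.g. "the cat sat on the mat"
```

Changing the training data or the architecture changes the distributions in that table, which is exactly why the same recipe can be pointed at question answering, summarization, or translation.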
I'm going to explain reinforcement learning in general using a typical example, and just after, I'll explain how these concepts relate to ChatGPT. First, we have an agent, and the goal is to make the agent reach a certain end state. To entice the agent to make certain moves, we use rewards: the scalar values you see in the squares of the grid. A high reward on the goal square entices the agent toward it, and every other square carries a lower reward, so the agent is encouraged to reach the goal in as few steps as possible. The state is the representation of the current step; for example, if the agent is at position (1, 1), that position is the state. An action is what the agent does, for example moving left, right, up, or down within the grid's boundaries, in order to get to the final goal. A policy is then one sequence of actions the agent takes to try to achieve that goal. One such policy could be the sequence of actions down, down, right, right, right; with −1 on ordinary squares and +10 on the goal, its total reward would be −1 − 1 − 1 − 1 + 10 = 6. Another policy could be the sequence right, down, right, up, right, down, down, and in this case the total reward would be −6 + 10 = 4. Now we have two policies, each with its own reward, so we can tell which policy was better.

Relating this to ChatGPT: the agent is the model itself. The reward depends on the response ChatGPT gives; if the entire response is a good response, it gets a high reward, and if not, it gets a low reward. To talk about state, note that every action an agent takes happens at a time step, and in the context of ChatGPT, a time step occurs whenever a single word (or word token) is generated. So we can define the state as the combination of the user's input prompt and every word generated up to this point. That state is used to infer the next action, namely which word to generate next, because this is inherently a language model. The overall policy is then the sequence of actions taken, that is, the sequence of words generated, so different policies correspond to different responses. If we have multiple responses, each of them a policy with its own reward, we can compare them to see which response was better and which was worse, and then fine-tune the model to help it generate the better responses.

Now that we have a holistic idea of these foundational pieces, how do they all come together in ChatGPT? I took a screenshot of this from the OpenAI blog, so let's walk through it. The entire process can be divided into three major steps. In the first step, we take a GPT model that has been pre-trained to understand language itself, and we fine-tune it to take in a user prompt and generate a response to that prompt. We get the data from labelers: each labeler writes a prompt and also writes a response showing how they want that prompt answered. Because we have both the input and the output, this is supervised fine-tuning of the GPT model, hence the name SFT model.
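To make step 1 a bit more concrete, here is a toy sketch of supervised fine-tuning. All names and numbers are illustrative: the tiny recurrent network stands in for a pre-trained GPT (just to keep the sketch short), the token ids are made up, and a real SFT run uses many thousands of labeler-written pairs. The part worth noticing is the loss: plain next-token cross-entropy, which I've masked so it is computed only on the response tokens. That masking is a common instruction-tuning practice and an assumption on my part, since the video doesn't go into this detail.

```python
import torch

torch.manual_seed(0)
VOCAB_SIZE, DIM = 100, 32

class ToyLM(torch.nn.Module):
    """Stand-in for a pre-trained GPT: embeddings -> GRU -> vocab logits."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB_SIZE, DIM)
        self.body = torch.nn.GRU(DIM, DIM, batch_first=True)  # not a real transformer
        self.lm_head = torch.nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, token_ids):
        hidden, _ = self.body(self.embed(token_ids))
        return self.lm_head(hidden)  # next-token logits at every position

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

prompt = [11, 42, 7]       # labeler-written prompt (made-up token ids)
response = [3, 19, 25, 2]  # labeler-written ideal response
tokens = torch.tensor([prompt + response])

for _ in range(50):
    logits = model(tokens[:, :-1])        # predict token t+1 from tokens up to t
    targets = tokens[:, 1:].clone()
    targets[:, : len(prompt) - 1] = -100  # skip positions that predict prompt tokens
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1), ignore_index=-100
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because both the input (the prompt) and the desired output (the response) come from a human, this step is ordinary supervised learning; the reinforcement learning only enters in the next two steps.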
Next, we take this supervised fine-tuned model, pass a single prompt through it, and generate a few responses. A labeler then ranks these responses from best to worst, and those rankings are used to train another GPT-based model called the reward model. Because this is a model, it is a function that takes some input and generates some output: the input is the initial prompt along with one of the responses, and the output is a scalar reward that quantifies how good that response was.

In step three, we take an unseen prompt and pass it through a copy of the supervised fine-tuned model to generate a response. That response is passed through the reward model to get a scalar reward quantifying how good the response was, and this reward is then used to further fine-tune our fine-tuned model: it enters the loss function (OpenAI uses the reinforcement learning algorithm PPO, Proximal Policy Optimization, for this step) and backpropagates updates to the model's parameters.

What's really interesting is that this process helps the model behave less toxically and produce more factual responses. And that's because of how the reward was generated in the first place: responses that were non-toxic and factual were given higher rewards by the labelers, so incorporating the reward into the model this way helps it generate responses that are less toxic and also more coherent and factual. And that's the overview of this entire process of how ChatGPT works. I'm going to end it here, but I'll be making more videos on this topic as well as other topics in deep learning and machine learning. So if you liked what you saw, please give it a like, subscribe, and I will see you very soon. Bye bye.
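Since the video stays at the diagram level, here is one last toy sketch: how the reward model in step 2 can be trained from a labeler's ranking. The pairwise loss below is the one described in OpenAI's InstructGPT paper, ChatGPT's closest published relative, so treat it as a reasonable assumption rather than a confirmed ChatGPT detail; the tiny bag-of-embeddings scorer is a made-up stand-in for a GPT backbone with a scalar output head.

```python
import torch

torch.manual_seed(0)
VOCAB_SIZE, DIM = 100, 16

class ToyRewardModel(torch.nn.Module):
    """Maps (prompt + response) token ids to a single scalar reward."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB_SIZE, DIM)
        self.head = torch.nn.Linear(DIM, 1)  # the scalar "how good" score

    def forward(self, token_ids):
        # Mean-pool token embeddings, then map to a single scalar.
        return self.head(self.embed(token_ids).mean(dim=0)).squeeze()

reward_model = ToyRewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Two (prompt + response) sequences for the same prompt; the labeler
# ranked the first above the second. Token ids are made up.
better = torch.tensor([1, 5, 9, 12])  # prompt + preferred response
worse = torch.tensor([1, 5, 42, 7])   # prompt + dispreferred response

for _ in range(100):
    # -log sigmoid(r_better - r_worse) is small when the model scores
    # the labeler-preferred response higher: a pairwise ranking loss.
    loss = -torch.nn.functional.logsigmoid(reward_model(better) - reward_model(worse))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(reward_model(better).item() > reward_model(worse).item())  # True
```

In step 3, this trained scorer is frozen, and its scalar output becomes the reward signal that the PPO update pushes the fine-tuned model to maximize.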